Author Profiling in Informal and Formal Language Scenarios via Transfer Learning

Keywords: Author profiling, Gender and language variety recognition, Transfer learning, Natural language processing

Abstract

Interest in author profiling has grown in the scientific community because its applications have proven successful in sectors such as security, marketing, and healthcare. Recognizing traits such as gender, age, dialect, or personality from text data can help improve marketing strategies. This technology has been widely studied on social media documents. However, the methods have received little attention on data with a more formal structure, where emoticons, mentions, and other linguistic phenomena found only in social media are unavailable. This work proposes recurrent and convolutional neural networks, together with a transfer learning strategy, to recognize two demographic traits, gender and language variety, in documents written in informal and formal language. The models are tested on two databases consisting of tweets (informal) and call-center conversations (formal). Accuracies of 75% and 68% are obtained for gender recognition in documents with informal and formal structure, respectively. For language variety recognition, accuracies of 92% and 72% are obtained in documents with informal and formal structure, respectively. The results indicate that, for the traits considered, knowledge can be transferred from a system trained on one specific type of expression to another where data are scarcer and the structure is completely different.
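The transfer idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture: a tiny one-hidden-layer classifier stands in for the recurrent and convolutional networks, and synthetic numeric features stand in for the tweet and call-center text. All names (`make_data`, `init_model`, `train`, `freeze_extractor`) and all hyperparameters are assumptions chosen for illustration. The model is first trained on a large "source" set, then only its output layer is retrained on a small, shifted "target" set while the feature extractor stays frozen.

```python
import copy
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n, shift, rng):
    # Synthetic two-class data; `shift` mimics a change of domain.
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0
        x = [rng.gauss(center + shift, 1.0), rng.gauss(-center, 1.0), rng.gauss(0.0, 1.0)]
        data.append((x, y))
    return data


def init_model(n_in, n_hidden, rng):
    # "W" plays the role of the feature extractor, "v" and "b" the classifier head.
    return {
        "W": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "v": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b": 0.0,
    }


def forward(model, x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in model["W"]]
    p = sigmoid(sum(vj * hj for vj, hj in zip(model["v"], h)) + model["b"])
    return h, p


def train(model, data, epochs, lr, freeze_extractor=False):
    # Plain stochastic gradient descent on the binary cross-entropy loss.
    for _ in range(epochs):
        for x, y in data:
            h, p = forward(model, x)
            err = p - y  # gradient of the loss with respect to the logit
            if not freeze_extractor:
                for j, row in enumerate(model["W"]):
                    grad_h = err * model["v"][j] * (1.0 - h[j] ** 2)
                    for i, xi in enumerate(x):
                        row[i] -= lr * grad_h * xi
            for j, hj in enumerate(h):
                model["v"][j] -= lr * err * hj
            model["b"] -= lr * err
    return model


def accuracy(model, data):
    return sum((forward(model, x)[1] >= 0.5) == (y == 1) for x, y in data) / len(data)


rng = random.Random(0)
source = make_data(400, 0.0, rng)  # large "informal" stand-in
target = make_data(40, 0.5, rng)   # small, shifted "formal" stand-in

# Step 1: train the whole model on the source domain.
model = init_model(3, 4, rng)
train(model, source, epochs=20, lr=0.05)
pretrained_W = copy.deepcopy(model["W"])
pretrained_v = list(model["v"])

# Step 2: reuse the extractor and retrain only the head on the target domain.
train(model, target, epochs=20, lr=0.05, freeze_extractor=True)
print("target accuracy after fine-tuning:", accuracy(model, target))
```

Freezing the extractor is only one possible transfer scheme; fine-tuning every layer with a small learning rate is a common alternative, and the abstract does not say which variant the paper uses.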

Author biographies

Daniel Escobar-Grisales*, Universidad de Antioquia, Colombia

Universidad de Antioquia, Medellín-Colombia, daniel.esobar@udea.edu.co

Juan Camilo Vásquez-Correa, Friedrich Alexander Universität, Germany

Universidad de Antioquia, Medellín-Colombia; Friedrich Alexander Universität, Erlangen Nürnberg-Germany; Pratech Group, Medellín-Colombia, jcvasquez@pratechgroup.com

Juan Rafael Orozco-Arroyave, Friedrich Alexander Universität, Germany

Universidad de Antioquia, Medellín-Colombia; Friedrich Alexander Universität, Erlangen Nürnberg-Germany, rafael.orozco@udea.edu.co

How to cite
[1]
D. Escobar-Grisales, J. C. Vásquez-Correa, and J. R. Orozco-Arroyave, "Author Profiling in Informal and Formal Language Scenarios via Transfer Learning," TecnoL., vol. 24, no. 52, p. e2166, Dec. 2021.

Published
2021-12-17
Section
Research articles