Scientific Programming, Volume 2019, 2019-01-16
Design and Implementation of a Machine Learning-Based Authorship Identification Model
Research Article
Waheed Anwar, Imran Sarwar Bajwa, Shabana Ramzan
DOI:10.1155/2019/9431073
Received 2018-10-29, accepted for publication 2018-12-18, Published 2018-12-18
Abstract

This paper presents a novel approach to authorship identification in English and Urdu text that combines the latent Dirichlet allocation (LDA) model over n-grams of authors’ texts with cosine similarity. The approach uses similarity metrics to identify learned representations of stylometric features and uses these to recognize the writing style of a particular author. It emphasizes both instance-based and profile-based classification of an author’s text. LDA is well suited to high-dimensional, sparse data because it allows a more expressive representation of text. The approach is an unsupervised computational methodology that can handle the heterogeneity of the dataset, diversity in writing, and the inherent ambiguity of the Urdu language. A large corpus was used to test its performance. Experimental results show that the approach outperforms state-of-the-art representations and other algorithms used for authorship identification. The contributions of this work are the use of cosine similarity with n-gram-based LDA topics to measure similarity between vectors of text documents, and an overall accuracy of 84.52% on the PAN12 datasets and 93.17% on Urdu news articles, achieved without using any labels for the authorship identification task.
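
The profile-based comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the LDA step is left out, and raw word n-gram counts stand in for the LDA topic vectors so that the example runs with the standard library alone. The function names (`word_ngrams`, `cosine_similarity`, `attribute`) and the toy texts are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n=2):
    """Return a Counter of word n-grams (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def attribute(unknown_text, known_texts_by_author, n=2):
    """Profile-based attribution: build one vector per author, pick the closest."""
    unknown_vec = word_ngrams(unknown_text, n)
    scores = {}
    for author, texts in known_texts_by_author.items():
        profile = Counter()
        for t in texts:
            profile.update(word_ngrams(t, n))  # merge all known texts into one profile
        scores[author] = cosine_similarity(unknown_vec, profile)
    best = max(scores, key=scores.get)
    return best, scores


# Toy data, invented for illustration only (not from the paper's corpora).
known = {
    "author_a": ["the cat sat on the mat", "the cat ran to the door"],
    "author_b": ["stock prices fell on monday", "stock prices rose on friday"],
}
best_author, similarity = attribute("the cat sat by the door", known)
```

In an LDA-based variant, each Counter would be replaced by the document's topic distribution from a trained topic model, while the cosine comparison and the choice of the highest-scoring author would stay the same. An instance-based variant would score the unknown text against each known document separately instead of against a merged profile.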

License

Copyright © 2019 Waheed Anwar et al.
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Corresponding author

Waheed Anwar, Department of Computer Science & IT, The Islamia University of Bahawalpur, Bahawalpur, Pakistan (iub.edu.pk). Email: waheed@iub.edu.pk

Recommended citation

Waheed Anwar, Imran Sarwar Bajwa, Shabana Ramzan. Design and Implementation of a Machine Learning-Based Authorship Identification Model. Scientific Programming, Vol. 2019 (2019).
