A new statistics-based approach to improve Word2Vec's sentiment classification success
Abstract
Sentiment classification is the task of predicting the emotion a text conveys by analyzing its content. Studies that estimate the emotion of a sentence or a document, rather than the meaning of a single word, have increased in recent years. This study presents statistical approaches as alternatives to the way Word2Vec is currently used in sentiment classification. At present, when Word2Vec is used for sentiment classification, a document is represented by the arithmetic mean of the vectors created for all of its words. The proposed statistical methods were compared with the arithmetic mean using 5 different machine learning methods on 2 different data sets. The same experiments were also carried out with Doc2Vec and BoW, and the results were compared with those of Word2Vec.
Among the proposed approaches, the median achieved better results than both the mean and the other two proposed methods. A likely reason is that the median represents the center of the distribution better. Although the Word2Vec-CBOW approach obtained values similar to those of Skip-Gram (SG), it produced more stable results. Word2Vec achieved better results than both Doc2Vec and BoW. Among the proposed statistical approaches, the median has a positive effect on the success of the system when used with Word2Vec and can serve as an alternative to the mean approach used in the literature.
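The difference between the two aggregation strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 3-dimensional vectors, the `TOY_VECTORS` table, and the `document_vector` helper are assumptions made for illustration, and a real experiment would take the vectors from a trained Word2Vec model. The sketch only shows how a document vector is built dimension by dimension with either the mean or the median.

```python
from statistics import mean, median

# Hypothetical 3-dimensional word vectors (illustration only). Real Word2Vec
# vectors come from a trained model and usually have 100 or more dimensions.
TOY_VECTORS = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "movie": [0.1, 0.5, 0.2],
    "terrible": [-0.9, 0.1, 5.0],  # third component is a deliberate outlier
}


def document_vector(tokens, vectors, reducer=mean):
    """Collapse the word vectors of one document into a single vector.

    `reducer` is applied to each dimension separately, so passing
    `statistics.mean` gives the usual averaged representation and
    `statistics.median` gives the median-based alternative.
    """
    # Tokens without a vector (out-of-vocabulary words) are skipped.
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        raise ValueError("no token of the document has a vector")
    # zip(*rows) groups the values of each dimension across all words.
    return [reducer(column) for column in zip(*rows)]


doc = ["good", "great", "movie", "terrible", "unknown"]
mean_vec = document_vector(doc, TOY_VECTORS, reducer=mean)
median_vec = document_vector(doc, TOY_VECTORS, reducer=median)

print("mean  :", mean_vec)
print("median:", median_vec)
```

In this toy document the outlying third component of "terrible" pulls the mean of that dimension to about 1.475, while the median stays at about 0.35. Either vector could then be passed to a classifier as the document's feature vector.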
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.