"Emotion Classification on Software Engineering Q&A Websites", e-Informatica Software Engineering Journal, vol. 19, no. 1, p. 250104, 2025.
DOI: 10.37190/e-Inf250104.
Authors
Didi Awovi Ahavi-Tete, Sangeeta Sangeeta
Abstract
Background: With the rapid proliferation of question-and-answer websites for software developers, such as Stack Overflow, there is a growing need to discern developers’ emotions from their posts and to assess the influence of those emotions on aspects of productivity such as bug-fixing efficiency.
Aim: We aimed to develop a reliable emotion classification tool capable of accurately categorizing emotions on Software Engineering (SE) websites, using data augmentation techniques to address the data scarcity problem, since previous research has shown that tools trained on other domains can perform poorly when applied directly to the SE domain.
Method: We utilized four machine learning techniques, namely BERT, CodeBERT, RFC (Random Forest Classifier), and LSTM. Taking an innovative approach to dataset augmentation, we employed word substitution, back translation, and easy data augmentation (EDA). Using these, we developed sixteen unique emotion classification models: EmoClassBERT-Original, EmoClassRFC-Original, EmoClassLSTM-Original, EmoClassCodeBERT-Original, EmoClassBERT-Substitution, EmoClassRFC-Substitution, EmoClassLSTM-Substitution, EmoClassCodeBERT-Substitution, EmoClassBERT-Translation, EmoClassRFC-Translation, EmoClassLSTM-Translation, EmoClassCodeBERT-Translation, EmoClassBERT-EDA, EmoClassRFC-EDA, EmoClassLSTM-EDA, and EmoClassCodeBERT-EDA. We compared the performance of these models against state-of-the-art techniques (Multi-label SO BERT and EmoTxt) on a gold-standard dataset.
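The paper's actual augmentation pipeline is not reproduced here; as an illustration only, word-substitution augmentation of the kind named above can be sketched with a hand-written synonym table (a real pipeline would draw synonyms from a lexical resource such as WordNet; the `SYNONYMS` mapping and `substitute_words` helper below are hypothetical):

```python
import random

# Toy synonym table for illustration; not the resource used in the paper.
SYNONYMS = {
    "bug": ["defect", "fault"],
    "fix": ["repair", "resolve"],
    "slow": ["sluggish", "laggy"],
}

def substitute_words(sentence, n=1, seed=0):
    """Return a copy of `sentence` with up to `n` words replaced by synonyms."""
    rng = random.Random(seed)
    words = sentence.split()
    # Positions of words that have an entry in the synonym table.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

augmented = substitute_words("this bug makes the app slow", n=2)
```

Each augmented sentence keeps the original label, so the labeled training set grows without new annotation effort, which is the point of augmentation in a data-scarce domain.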
Results: An initial investigation showed that models trained on the augmented datasets outperformed those trained on the original dataset. The EmoClassLSTM-Substitution, EmoClassBERT-Substitution, EmoClassCodeBERT-Substitution, and EmoClassRFC-Substitution models improved average F1 score by 13%, 5%, 5%, and 10% over EmoClassLSTM-Original, EmoClassBERT-Original, EmoClassCodeBERT-Original, and EmoClassRFC-Original, respectively. EmoClassCodeBERT-Substitution performed best, outperforming Multi-label SO BERT and EmoTxt by 2.37% and 21.17%, respectively, in average F1 score. A detailed investigation over 100 runs of the dataset showed that the BERT-based and CodeBERT-based models performed best, but revealed no significant differences between models trained on the augmented datasets and those trained on the original dataset.
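The average F1 comparisons above presume a per-class F1 averaged over emotion classes (macro-averaging). As a self-contained illustration, assuming macro-averaging and hypothetical emotion labels, such a score can be computed as follows (the paper's own evaluation code is not shown here):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average equally over classes."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall, guarding against division by zero.
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["joy", "anger", "joy", "fear"],
                 ["joy", "anger", "fear", "fear"])
```

Macro-averaging weights every emotion class equally regardless of frequency, which matters for imbalanced emotion datasets where rare classes would otherwise be drowned out.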
Conclusion: This research not only underlines the strengths and weaknesses of each architecture but also highlights the pivotal role of data augmentation in refining model performance, especially in the software engineering domain.
Keywords
empirical and experimental studies in software engineering, data mining in software engineering, prediction models in software engineering, AI and knowledge based software engineering