[1] | "ACoRA – A Platform for Automating Code Review Tasks", In e-Informatica Software Engineering Journal, vol. 19, no. 1, pp. 250102, 2025.
DOI: , 10.37190/e-Inf250102. Download article (PDF)Get article BibTeX file |
Authors
Mirosław Ochodek, Miroslaw Staron
Abstract
Background: Modern Code Reviews (MCR) are frequently adopted to assure code and design quality in continuous integration and deployment projects. Although tiresome, they also serve a secondary purpose: learning about the software product.
Aim: Our objective is to design and evaluate a support tool to help software developers focus on the most important code fragments to review and provide them with suggestions on what should be reviewed in this code.
Method: We used design science research to develop and evaluate a tool that automates code review tasks by providing recommendations to code reviewers. The tool is based on Transformer-based machine learning models for natural language processing, applied both to the programming language code (patch content) and to the review comments. We evaluated the ability of the language model to match similar lines, as well as its ability to correctly indicate the nature of the potential problems, encoded as a set of categories; a minimal sketch of this line-matching-and-annotation idea follows the abstract. We evaluated the tool on two open-source projects and one industry project.
Results: The proposed tool correctly annotated (true positives only) 35%–41%, and partially correctly annotated 76%–84%, of the code fragments to be reviewed with labels corresponding to the different aspects of the code the reviewer should focus on.
Conclusion: By comparing our study to similar solutions, we conclude that indicating the lines to be reviewed and suggesting the nature of the potential problems in the code achieves higher accuracy than suggesting entire code changes, as considered in other studies. We also found that the differences depend more on the consistency of commenting than on the ability of the model to find similar lines.
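Below is a minimal sketch of the line-matching-and-annotation idea described in the Method above. It assumes a generic pretrained sentence encoder (the sentence-transformers library with the all-MiniLM-L6-v2 model); the historical review data, the category labels, and the similarity threshold are illustrative placeholders, not the authors' ACoRA implementation or its models.

```python
# Minimal sketch (not the ACoRA implementation): embed changed lines of a
# patch with a pretrained Transformer encoder, retrieve the most similar
# previously reviewed lines, and reuse their review-comment category as a
# suggestion of what the reviewer should focus on.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical history of reviewed lines and the category of the review
# comment each one received.
REVIEWED_LINES = [
    ("if (ptr != NULL) { free(ptr); }", "memory management"),
    ("for (int i = 0; i <= n; i++) {", "off-by-one"),
    ("catch (Exception e) {}", "error handling"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a code-aware encoder

def suggest_focus(patch_lines, threshold=0.6):
    """Return (line, suggested category, similarity) for lines worth reviewing."""
    corpus_emb = model.encode([line for line, _ in REVIEWED_LINES],
                              normalize_embeddings=True)
    patch_emb = model.encode(patch_lines, normalize_embeddings=True)
    suggestions = []
    for line, emb in zip(patch_lines, patch_emb):
        sims = corpus_emb @ emb                 # cosine similarity of unit vectors
        best = int(np.argmax(sims))
        if sims[best] >= threshold:             # flag only sufficiently similar lines
            suggestions.append((line, REVIEWED_LINES[best][1], float(sims[best])))
    return suggestions

if __name__ == "__main__":
    patch = ["for (int j = 0; j <= len; j++) {", "int x = 42;"]
    for line, category, sim in suggest_focus(patch):
        print(f"{line!r} -> review for {category} (similarity {sim:.2f})")
```

Normalizing the embeddings makes the dot product equal to cosine similarity, so retrieval reduces to one matrix–vector product per patch line; a real deployment would replace the toy history with the project's review archive and a code-pretrained encoder such as CodeBERT [21].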
Keywords
Code Reviews, Continuous Integration, BERT, Machine Learning
References
1. P.C. Rigby and C. Bird, “Convergent contemporary software peer review practices,” in Proceedings of the 9th Joint Meeting on Foundations of Software Engineering, 2013, pp. 202–212.
2. L. MacLeod, M. Greiler, M.A. Storey, C. Bird, and J. Czerwonka, “Code reviewing in the trenches: Challenges and best practices,” IEEE Software, Vol. 35, No. 4, 2017, pp. 34–42.
3. M.E. Fagan, “Design and code inspections to reduce errors in program development,” IBM Systems Journal, Vol. 15, No. 3, 1976, pp. 182–211.
4. L. Milanesio, Learning Gerrit Code Review. Packt Publishing Ltd, 2013.
5. M. Meyer, “Continuous integration and its tools,” IEEE Software, Vol. 31, No. 3, 2014, pp. 14–16.
6. M. Staron, W. Meding, O. Söder, and M. Bäck, “Measurement and impact factors of speed of reviews and integration in continuous software engineering,” Foundations of Computing and Decision Sciences, Vol. 43, No. 4, 2018, pp. 281–303.
7. J. Czerwonka, M. Greiler, and J. Tilford, “Code reviews do not find bugs. How the current code review best practice slows us down,” in IEEE/ACM 37th International Conference on Software Engineering, Vol. 2. IEEE, 2015, pp. 27–28.
8. F. Huq, M. Hasan, M.M.A. Haque, S. Mahbub, A. Iqbal et al., “Review4Repair: Code review aided automatic program repairing,” Information and Software Technology, Vol. 143, 2022, p. 106765.
9. M. Hasan, A. Iqbal, M.R.U. Islam, A. Rahman, and A. Bosu, “Using a balanced scorecard to identify opportunities to improve code review effectiveness: An industrial experience report,” Empirical Software Engineering, Vol. 26, No. 6, 2021, pp. 1–34.
10. M. Staron, M. Ochodek, W. Meding, and O. Söder, “Using machine learning to identify code fragments for manual review,” in 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2020, pp. 513–516.
11. D.S. Mendonça and M. Kalinowski, “An empirical investigation on the challenges of creating custom static analysis rules for defect localization,” Software Quality Journal, 2022, pp. 1–28.
12. A. Bacchelli and C. Bird, “Expectations, outcomes, and challenges of modern code review,” in 2013 35th International Conference on Software Engineering (ICSE). IEEE, 2013, pp. 712–721.
13. N. Fatima, S. Nazir, and S. Chuprat, “Knowledge sharing, a key sustainable practice is on risk: An insight from modern code review,” in IEEE 6th International Conference on Engineering Technologies and Applied Sciences (ICETAS). IEEE, 2019, pp. 1–6.
14. A. Hindle, E.T. Barr, M. Gabel, Z. Su, and P. Devanbu, “On the naturalness of software,” Communications of the ACM, Vol. 59, No. 5, 2016, pp. 122–131.
15. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones et al., “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
16. M. Ochodek, M. Staron, W. Meding, and O. Söder, “Automated code review comment classification to improve modern code reviews,” in International Conference on Software Quality. Springer, 2022, pp. 23–40.
17. R. Wieringa, Design Science Methodology for Information Systems and Software Engineering. Springer, 2014. [Online]. http://portal.acm.org/citation.cfm?doid=1810295.1810446
18. M. Allamanis, E.T. Barr, P. Devanbu, and C. Sutton, “A survey of machine learning for big code and naturalness,” ACM Computing Surveys (CSUR), Vol. 51, No. 4, 2018, pp. 1–37.
19. U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” Proceedings of the ACM on Programming Languages, Vol. 3, No. POPL, 2019, pp. 1–29.
20. J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
21. Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
22. B. Roziere, M.A. Lachaux, M. Szafraniec, and G. Lample, “DOBF: A deobfuscation pre-training objective for programming languages,” arXiv preprint arXiv:2102.07492, 2021.
23. Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser et al., “Competition-level code generation with AlphaCode,” arXiv preprint arXiv:2203.07814, 2022.
24. F. Huq, M. Hasan, M.M.A. Haque, S. Mahbub, A. Iqbal et al., “Review4Repair: Code review aided automatic program repairing,” Information and Software Technology, Vol. 143, 2022, p. 106765. [Online]. https://www.sciencedirect.com/science/article/pii/S0950584921002111
25. R. Tufano, S. Masiero, A. Mastropaolo, L. Pascarella, D. Poshyvanyk et al., “Using pre-trained models to boost code review automation,” in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 2291–2302.
26. C. Sadowski, E. Söderberg, L. Church, M. Sipko, and A. Bacchelli, “Modern code review: A case study at Google,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, 2018, pp. 181–190.
27. A. Ram, A.A. Sawant, M. Castelluccio, and A. Bacchelli, “What makes a code change easier to review: An empirical investigation on code change reviewability,” in Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 201–212.
28. Y. Arafat, S. Sumbul, and H. Shamma, “Categorizing code review comments using machine learning,” in Proceedings of Sixth International Congress on Information and Communication Technology. Springer, 2022, pp. 195–206.
29. Z. Li, Y. Yu, G. Yin, T. Wang, Q. Fan et al., “Automatic classification of review comments in pull-based development model,” in International Conference on Software Engineering and Knowledge Engineering. KSI Research Inc. and Knowledge Systems Institute Graduate School, 2017.
30. Z.X. Li, Y. Yu, G. Yin, T. Wang, and H.M. Wang, “What are they talking about? Analyzing code reviews in pull-based development model,” Journal of Computer Science and Technology, Vol. 32, 2017, pp. 1060–1075.
31. L. Yang, J. Xu, Y. Zhang, H. Zhang, and A. Bacchelli, “EvaCRC: Evaluating code review comments,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 275–287.
32. J.K. Siow, C. Gao, L. Fan, S. Chen, and Y. Liu, “CORE: Automating review recommendation for code changes,” in IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2020, pp. 284–295.
33. R. Brito and M.T. Valente, “RAID: Tool support for refactoring-aware code reviews,” in IEEE/ACM 29th International Conference on Program Comprehension (ICPC). IEEE, 2021, pp. 265–275.
34. R. Tufano, O. Dabić, A. Mastropaolo, M. Ciniselli, and G. Bavota, “Code review automation: Strengths and weaknesses of the state of the art,” IEEE Transactions on Software Engineering, 2024.
35. Y. Hong, C. Tantithamthavorn, P. Thongtanunam, and A. Aleti, “CommentFinder: A simpler, faster, more accurate code review comments recommendation,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 507–519.
36. D. Badampudi, R. Britto, and M. Unterkalmsteiner, “Modern code reviews – Preliminary results of a systematic mapping study,” in Proceedings of the Evaluation and Assessment on Software Engineering, 2019, pp. 340–345.
37. D. Badampudi, M. Unterkalmsteiner, and R. Britto, “Modern code reviews – Survey of literature and practice,” ACM Transactions on Software Engineering and Methodology, Vol. 32, No. 4, 2023, pp. 1–61.
38. N. Davila and I. Nunes, “A systematic literature review and taxonomy of modern code review,” Journal of Systems and Software, Vol. 177, 2021, p. 110951.
39. H.A. Çetin, E. Doğan, and E. Tüzün, “A review of code reviewer recommendation studies: Challenges and future directions,” Science of Computer Programming, Vol. 208, 2021, p. 102652.
40. B. Roziere, M.A. Lachaux, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” Advances in Neural Information Processing Systems, Vol. 33, 2020, pp. 20601–20611.
41. Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
42. M. Ochodek, M. Staron, D. Bargowski, W. Meding, and R. Hebig, “Using machine learning to design a flexible LOC counter,” in 2017 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE). IEEE, 2017, pp. 14–20.
43. M. Ochodek, R. Hebig, W. Meding, G. Frost, and M. Staron, “Recognizing lines of code violating company-specific coding guidelines using machine learning,” Empirical Software Engineering, Vol. 25, No. 1, 2020, pp. 220–265.
44. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
45. I. Turc, M.W. Chang, K. Lee, and K. Toutanova, “Well-read students learn better: On the importance of pre-training compact models,” arXiv preprint arXiv:1908.08962, 2019.
46. H. Akoglu, “User’s guide to correlation coefficients,” Turkish Journal of Emergency Medicine, Vol. 18, No. 3, 2018, pp. 91–93.
47. J.N. Mandrekar, “Receiver operating characteristic curve in diagnostic test assessment,” Journal of Thoracic Oncology, Vol. 5, No. 9, 2010, pp. 1315–1316.
48. M. Staron, Action Research in Software Engineering. Springer, 2020.
49. S.K. Pandey, M. Staron, J. Horkoff, M. Ochodek, N. Mucci et al., “TransDPR: Design pattern recognition using programming language models,” in ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 2023, pp. 1–7.
50. D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang et al., “GraphCodeBERT: Pre-training code representations with data flow,” arXiv preprint arXiv:2009.08366, 2020.
51. Y. Wang, H. Le, A.D. Gotmare, N.D. Bui, J. Li et al., “CodeT5+: Open code large language models for code understanding and generation,” arXiv preprint arXiv:2305.07922, 2023.
52. V. Antinyan, M. Staron, A. Sandberg, and J. Hansson, “Validating software measures using action research: A method and industrial experiences,” in Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering, 2016, pp. 1–10.
53. M. Chen, J. Tworek, H. Jun, Q. Yuan, H.P.d.O. Pinto et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
54. C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell et al., Experimentation in Software Engineering. Springer Science & Business Media, 2012.
55. K.W. Al-Sabbagh, M. Staron, M. Ochodek, R. Hebig, and W. Meding, “Selective regression testing based on big data: Comparing feature extraction techniques,” in IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2020, pp. 322–329.
56. M. Staron, W. Meding, O. Söder, and M. Ochodek, “Improving quality of code review datasets – Token-based feature extraction method,” in International Conference on Software Quality. Springer, 2021, pp. 81–93.