"A Comparative Analysis of Metaheuristic Feature Selection Methods in Software Vulnerability Prediction," e-Informatica Software Engineering Journal, Vol. 19, No. 1, 2025, p. 250103. DOI: 10.37190/e-Inf250103.
Authors
Deepali Bassi, Hardeep Singh
Abstract
Background: Early identification of software vulnerabilities is an essential step toward achieving software security. In the era of artificial intelligence, software vulnerability prediction models (VPMs) are built using machine learning and deep learning approaches. Effective models help improve software quality. Handling imbalanced datasets and reducing dimensionality are important factors that affect the performance of VPMs.
Aim: The current study applies novel metaheuristic approaches for feature subset selection.
Method: This paper performs a comparative analysis of forty-eight combinations of eight machine learning techniques and six metaheuristic feature selection methods on four public datasets.
Results: The experimental results reveal that the performance of VPMs improves after applying the feature selection methods to both metrics-based and text-mining-based datasets. Additionally, the study applies the Wilcoxon signed-rank test to the results of metrics-based and text-features-based VPMs to determine which outperforms the other. Furthermore, it identifies the best-performing feature selection algorithm for each dataset based on AUC. Finally, the proposed approach outperforms the benchmark studies in terms of F1-score.
Conclusion: The results show that the Grey Wolf Optimizer (GWO) performs satisfactorily across all datasets.
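The paired comparison described in the Results can be sketched with SciPy's Wilcoxon signed-rank test. The AUC values below are hypothetical placeholders for illustration only, not figures from the study:

```python
from scipy.stats import wilcoxon  # assumes SciPy is available

# Hypothetical paired AUC scores for eight classifiers: one value per
# classifier when trained on software metrics vs. text-mining features.
# (Illustrative numbers only -- not taken from the paper.)
auc_metrics = [0.71, 0.74, 0.69, 0.78, 0.73, 0.76, 0.70, 0.75]
auc_text = [0.76, 0.78, 0.72, 0.84, 0.75, 0.83, 0.71, 0.83]

# Wilcoxon signed-rank test: a non-parametric paired comparison that
# does not assume normally distributed differences between the scores.
stat, p_value = wilcoxon(auc_metrics, auc_text)
print(f"statistic={stat}, p-value={p_value:.4f}")
```

A p-value below the conventional 0.05 threshold would indicate that one feature representation significantly outperforms the other on the paired scores.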
Keywords
software vulnerability prediction, metaheuristic feature selection, SMOTE, machine learning algorithms