Labelled vulnerability dataset on Android source code (LVDAndro) to develop AI-based code vulnerability detection models
Conference paper
Senanayake, J., Kalutarage, H., Al-Kadri, M.O., Piras, L. and Petrovski, A. 2023. Labelled vulnerability dataset on Android source code (LVDAndro) to develop AI-based code vulnerability detection models. Vimercati, S. and Samarati, P. (eds.) International Conference on Security and Cryptography (SECRYPT) 2023. Rome, Italy, 10-12 Jul 2023. SCITEPRESS - Science and Technology Publications. pp. 659-666. https://doi.org/10.5220/0012060400003555
Type | Conference paper |
---|---|
Title | Labelled vulnerability dataset on Android source code (LVDAndro) to develop AI-based code vulnerability detection models |
Authors | Senanayake, J., Kalutarage, H., Al-Kadri, M.O., Piras, L. and Petrovski, A. |
Abstract | Ensuring the security of Android applications is a vital and intricate task that requires careful consideration during development. Unfortunately, many apps are published without sufficient security measures, possibly because vulnerabilities are not identified early. One possible solution is to employ machine learning models trained on a labelled dataset, but currently available datasets are suboptimal. This study creates a sequence of datasets of Android source code vulnerabilities, named LVDAndro, labelled based on Common Weakness Enumeration (CWE). Three datasets were generated through app scanning by altering the number of apps and their sources. LVDAndro includes over 2,000,000 unique code samples, obtained by scanning over 15,000 apps. The AutoML technique was then applied to each dataset as a proof of concept to evaluate the applicability of LVDAndro in detecting vulnerable source code using machine learning. The AutoML model trained on the dataset achieved an accuracy of 94% and an F1-Score of 0.94 in binary classification, and an accuracy of 94% and an F1-Score of 0.93 in CWE-based multi-class classification. The LVDAndro dataset is publicly available and continues to expand as more apps are scanned and added to the dataset regularly. The LVDAndro GitHub Repository also includes the source code for dataset generation and model training. |
Keywords | Android Application Security; Code Vulnerability; Labelled Dataset; Artificial Intelligence; Auto Machine Learning. |
Sustainable Development Goals | 9 Industry, innovation and infrastructure |
Middlesex University Theme | Creativity, Culture & Enterprise |
Research Group | Software Engineering, Theory & Algorithms (SETA) |
Conference | International Conference on Security and Cryptography (SECRYPT) 2023 |
Page range | 659-666 |
Proceedings Title | Proceedings of the 20th International Conference on Security and Cryptography, SECRYPT - Volume 1 |
Series | SECRYPT |
Editors | Vimercati, S. and Samarati, P. |
ISSN | 2184-7711 |
ISBN | 9789897586668 |
Publisher | SCITEPRESS - Science and Technology Publications |
Publication dates | 10 Jul 2023 |
Publication process dates | |
Accepted | 23 Apr 2023 |
Deposited | 18 Jul 2023 |
Output status | Published |
Publisher's version | File access level: Open |
Copyright Statement | Senanayake, J., Kalutarage, H., Al-Kadri, M., Piras, L. and Petrovski, A., Labelled Vulnerability Dataset on Android Source Code (LVDAndro) to Develop AI-Based Code Vulnerability Detection Models. DOI: 10.5220/0012060400003555 |
Digital Object Identifier (DOI) | https://doi.org/10.5220/0012060400003555 |
Web of Science identifier | WOS:001072829100063 |
Web address (URL) of conference proceedings | https://doi.org/10.5220/0000167900003555 |
Language | English |
https://repository.mdx.ac.uk/item/8q739