Large-scale and language-oblivious code authorship identification

Mohammed Abuhamad, Tamer AbuHmed, Aziz Mohaisen, Dae Hun Nyang

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

64 Scopus citations

Abstract

Efficient extraction of code authorship attributes is key for successful identification. However, the extraction of such attributes is very challenging, due to various programming language specifics, the limited number of available code samples per author, and the average number of code lines per file, among others. To this end, this work proposes a Deep Learning-based Code Authorship Identification System (DL-CAIS) for code authorship attribution that facilitates large-scale, language-oblivious, and obfuscation-resilient code authorship identification. The deep learning architecture adopted in this work includes TF-IDF-based deep representation using multiple Recurrent Neural Network (RNN) layers and fully-connected layers dedicated to authorship attribution learning. The deep representation then feeds into a random forest classifier for scalability to de-anonymize the author. Comprehensive experiments are conducted to evaluate DL-CAIS over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1987 public repositories on GitHub. The results show high accuracy despite requiring a smaller number of files per author. Namely, we achieve an accuracy of 96% when experimenting with 1,600 authors for GCJ, and 94.38% for the real-world dataset for 745 C programmers. Our system also allows us to identify 8,903 authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Moreover, our technique is resilient to language specifics, and thus it can identify authors of four programming languages (C, C++, Java, and Python), and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors.
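To make the TF-IDF starting point of the pipeline concrete, the following is a minimal sketch and not the DL-CAIS implementation. It assumes a simple regex tokenizer and a smoothed IDF formula, and it swaps in nearest-neighbour cosine similarity where the paper uses RNN layers, fully-connected layers, and a random forest classifier. All function names and the toy samples are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(source):
    # Assumed tokenizer: identifiers, integer literals, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", source)

def fit_idf(token_lists):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    # Term frequency times IDF, L2-normalised; terms unseen during fitting are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def attribute(sample, labelled_samples):
    # Stand-in classifier (NOT the paper's RNN + random forest):
    # return the author of the most similar labelled sample.
    token_lists = [tokenize(src) for _, src in labelled_samples]
    idf = fit_idf(token_lists)
    query = tfidf_vector(tokenize(sample), idf)
    scored = [(cosine(query, tfidf_vector(tokens, idf)), author)
              for (author, _), tokens in zip(labelled_samples, token_lists)]
    return max(scored)[1]

# Hypothetical toy corpus: two authors with visibly different habits.
TRAIN = [
    ("alice", "for (int i = 0; i < n; ++i) { total += values[i]; }"),
    ("alice", "for (int j = 0; j < m; ++j) { total += weights[j]; }"),
    ("bob", "while idx < len(items):\n    acc = acc + items[idx]\n    idx = idx + 1"),
    ("bob", "while pos < len(rows):\n    acc = acc + rows[pos]\n    pos = pos + 1"),
]

if __name__ == "__main__":
    print(attribute("for (int k = 0; k < n; ++k) { sum += data[k]; }", TRAIN))
```

Because the features are token statistics rather than parsed syntax, the same code runs unchanged on samples in different languages, which loosely mirrors the language-oblivious goal stated in the abstract. A toy similarity lookup like this says nothing about the accuracy figures reported for the full system.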

Original language: English
Title of host publication: CCS 2018 - Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security
Publisher: Association for Computing Machinery
Pages: 101-114
Number of pages: 14
ISBN (Electronic): 9781450356930
State: Published - 15 Oct 2018
Event: 25th ACM Conference on Computer and Communications Security, CCS 2018 - Toronto, Canada
Duration: 15 Oct 2018 → …

Publication series

Name: Proceedings of the ACM Conference on Computer and Communications Security
ISSN (Print): 1543-7221

Conference

Conference: 25th ACM Conference on Computer and Communications Security, CCS 2018
Country/Territory: Canada
City: Toronto
Period: 15/10/18 → …

Bibliographical note

Publisher Copyright:
© 2018 Association for Computing Machinery.

Keywords

  • Code Authorship Identification
  • Deep learning identification
  • Program features
  • Software forensics
