- Title
- Preprocessing for Source Code Similarity Detection in Introductory Programming
- Creator
- Karnalim, Oscar; Simon; Chivers, William
- Relation
- Koli Calling '20. Proceedings of the 20th Koli Calling International Conference on Computing Education Research (Koli, Finland 19-22 November, 2020)
- Publisher Link
- http://dx.doi.org/10.1145/3428029.3428065
- Publisher
- Association for Computing Machinery (ACM)
- Resource Type
- conference paper
- Date
- 2020
- Description
- It is well documented that some students either work together on programming assessments when required to work individually (collusion) or make unauthorised use of existing code from external sources (plagiarism). One approach used in the detection of these violations of academic integrity is source code similarity detection, the automatic checking of student programs for unduly high levels of similarity. Preprocessing of source code files has the potential to increase the effectiveness, the efficiency, or both, of the source code comparison process. There are many possible steps in the preprocessing, and examination of the literature suggests that these steps are selected and implemented without any empirical evidence as to their value. This paper lists 19 preprocessing steps that have been used in code similarity detection, and assesses the effectiveness and the efficiency of 16 of these steps on data sets of student programs from introductory programming courses. The results should help researchers to decide what preprocessing steps to include when designing source code similarity detection techniques or software. According to the study, identifier removal increases both effectiveness and efficiency. Token renaming and syntax tree linearisation increase effectiveness at a cost of efficiency. Other preprocessing steps are dependent upon characteristics of the data set and should ideally be empirically tested before being applied. The paper should also help alert programming educators to the sorts of disguise that students can apply to copied programs.
- Subject
- collusion; computing education; plagiarism; programming; source code similarity detection
- Identifier
- http://hdl.handle.net/1959.13/1439091
- Identifier
- uon:40820
- Identifier
- ISBN:9781450389211
- Language
- eng
- Reviewed
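
The Description names identifier removal and token renaming among the preprocessing steps that were assessed. This record does not give the paper's exact definitions, so the following is only a minimal sketch of one plausible reading, written in Python against Python source using the standard `tokenize` module. The function name `preprocess`, the `mode` parameter, and the placeholder `"ID"` are assumptions made for illustration and are not taken from the paper.

```python
import io
import keyword
import tokenize

# Token types that carry only layout or commentary, not program logic.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def preprocess(source, mode="rename"):
    """Return a flat list of token strings for `source`.

    mode="rename": every identifier becomes the placeholder "ID"
                   (an assumed form of token renaming).
    mode="remove": identifiers are dropped entirely
                   (an assumed form of identifier removal).
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        # Keywords are also reported as NAME tokens, so filter them out.
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            if mode == "rename":
                tokens.append("ID")
            continue
        tokens.append(tok.string)
    return tokens


original = "total = 0\nfor n in nums:\n    total += n\n"
disguised = "s = 0\nfor x in values:  # sum\n    s += x\n"

# After preprocessing, the renamed and commented copy yields the same
# token sequence as the original.
same_after_rename = preprocess(original) == preprocess(disguised)
same_after_remove = (
    preprocess(original, "remove") == preprocess(disguised, "remove")
)
```

In this sketch the removal mode produces a shorter token sequence than the renaming mode, which gives a rough intuition for why removing identifiers might reduce comparison cost. Whether either step helps on a particular data set is an empirical question of the kind the paper examines.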