<%@ page language="java" contentType="text/html" %> <%-- Include common initialisation code --%> <%@ include file="/arch/common.jsp" %> <%-- The current tab --%> <% String currentTab = "Research"; %> <%-- Content of navigation pane --%> <%@ include file="nav.jsp" %> <% showCurrentLink=true; %> <%-- Current navigation location --%> <% String currentNav = "Reports and Theses"; %> <%-- Include the code for the document header --%> <%@ include file="/arch/header.jsp" %>

Research Report CS-RR-440

<%-- Include the code for the lines and navigation --%> <%@ include file="/arch/middle.jsp" %>

Georgina Cosma, An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis

This thesis looks at three aspects of source-code plagiarism. The first aspect of the thesis is concerned with creating a definition of source-code plagiarism; the second aspect is concerned with describing the findings gathered from investigating the Latent Semantic Analysis information retrieval algorithm for source-code similarity detection; and the final aspect of the thesis is concerned with the proposal and evaluation of a new algorithm that combines Latent Semantic Analysis with plagiarism detection tools.

A recent review of the literature revealed that there is no commonly agreed definition of what constitutes source-code plagiarism in the context of student assignments. This thesis first analyses the findings from a survey carried out to gather an insight into the perspectives of UK Higher Education academics who teach programming on computing courses. Based on the survey findings, a detailed definition of source-code plagiarism is proposed.

Secondly, the thesis investigates the application of an information retrieval technique, Latent Semantic Analysis, to derive semantic information from source-code files. Various parameters drive the effectiveness of Latent Semantic Analysis. The performance of Latent Semantic Analysis using various parameter settings and its effectiveness in retrieving similar source-code files when optimising those parameters are evaluated.

Finally, an algorithm for combining Latent Semantic Analysis with plagiarism detection tools is proposed and a tool is created and evaluated. The proposed tool, PlaGate, is a hybrid model that allows for the integration of Latent Semantic Analysis with plagiarism detection tools in order to enhance plagiarism detection. In addition, PlaGate has a facility for investigating the importance of source-code fragments with regards to their contribution towards proving plagiarism. PlaGate provides graphical output that indicates the clusters of suspicious files and source-code fragments.

Download

cs-rr-440.pdf

<%-- Include the code for the document footer --%> <%@ include file="/arch/footer.jsp" %>