SID - Plagiarism Detection



Home | Login | Account Application



   What is SID?

SID stands for Shared Information Distance or Software Integrity Detection. It detects similarity between programs by computing the shared information between them. The underlying algorithm was originally developed for comparing how similar or dissimilar genomes are [1]. It was later realized that the algorithm could be extended to many other applications, including tracing chain letter history [5] and detecting plagiarism. We have applied it to the detection of plagiarism in source code, and we hope to extend it to other applications in the future.

   Why use SID?

SID is the easiest software to use for detecting plagiarism in source code and has been shown to be the most effective at catching cheaters [4]. Here are some key comparisons between SID and our closest rival, MOSS:

Some Key Differences Between SID and MOSS

        Web Interface   Approximate Matching   Maintain Account Info Online
SID     Yes             Yes                    Yes
MOSS    No              No                     No

   Who is SID for?

SID is for anybody looking for an easy way to detect plagiarism in source code. We currently support Java and C++ source code.

   How Do We Do It?

Whether you are a teacher or a student whose program is going to be tested by your teacher, we have no intention of hiding our methods from you. Here is what we do: to compare two programs, we compute the amount of information they share. The shared information distance between two programs X and Y is defined as:
D(X,Y) = (K(X) - K(X|Y)) / K(XY)


where K(X|Y) is the Kolmogorov complexity [2,3] of X given Y. When Y is empty, it becomes K(X). K(XY) is the Kolmogorov complexity of X and Y concatenated. K(X) - K(X|Y) is the mutual information between X and Y; that is, it measures the amount of information Y "knows" about X. It is well known that mutual information is symmetric. However, it does not satisfy the triangle inequality and so is not a distance. D(X,Y) is a well-defined distance, and it is provably universal. Universality is what allows us to advertise our methods openly: in theory, if we could compute this distance exactly, it could not be cheated. Kolmogorov complexity is not computable, so in practice we work with an approximation. We are confident that, even with our approximation of this distance, cheating our system will be very difficult.
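
As a rough illustration of the formula above, the sketch below substitutes the length of zlib-compressed data for K. This is an assumption made for illustration only: this page does not say which approximation SID uses, and a general-purpose compressor is a crude stand-in. The function names and the sample inputs are hypothetical. K(X|Y) is estimated as the extra compressed bytes needed for X once Y has already been seen.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of zlib-compressed data; a rough stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def shared_information_ratio(x: bytes, y: bytes) -> float:
    """Evaluate (K(X) - K(X|Y)) / K(XY) with compressed sizes in place of K.

    K(X|Y) is estimated as C(YX) - C(Y): the extra bytes needed to
    describe X after the compressor has already processed Y.
    """
    k_x = compressed_size(x)
    k_x_given_y = compressed_size(y + x) - compressed_size(y)
    k_xy = compressed_size(x + y)
    return (k_x - k_x_given_y) / k_xy


# Hypothetical inputs, made up for this illustration.
original = b"""public class Sum {
    public static int total(int[] values) {
        int result = 0;
        for (int i = 0; i < values.length; i++) {
            result += values[i];
        }
        return result;
    }
}
"""
# A copy with two identifiers renamed, as a simple disguise.
renamed = original.replace(b"result", b"acc").replace(b"values", b"items")
unrelated = b"""The quick brown fox jumps over the lazy dog while seven wizards
quietly juggle boxes of frozen plums beside a very calm mountain lake.
"""

print("original vs renamed:  ", shared_information_ratio(original, renamed))
print("original vs unrelated:", shared_information_ratio(original, unrelated))
```

With these inputs the renamed copy should score well above the unrelated text, because the compressor can reuse most of the original when encoding the copy. The ratio as written in the formula grows as the two programs share more information.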

   Using SID

Step 1.

To use our service, you first need to create an account. To do this, click the Account Application link at the top or bottom of this page and fill in the required information. Note that if you wish to receive email confirmations, you must provide a valid email address when you create the account, because the system does not currently support changing account information afterwards.

Step 2.

Enter your login and password in the two corresponding text fields at the upper right of the home page and click the login button. The following page lists your previous submissions sorted by date; click a submission's link to view it.

Step 3.

Click the link at the top of the page, which will take you to the file submission page. Use the file submission page to submit a carefully formatted zip file. A detailed description of how the zip file should be formatted is available on the file upload page.

Step 4.

Wait for the comparison of the projects to finish. If you have provided a valid email address, you will receive an email notification with a link to your results page as soon as the results are ready. You can also click the "View Results" button at the top of the page to see all of your submission results. That page links to the results pages for all of your processed submissions and displays the status of all submissions that are still processing. Click the submission whose results you want to view, and you will be taken to a self-explanatory results display page.

   Experiment Results

If you wish to download some of the experiments that we used when comparing our system to others, you can do so by clicking here. The results of the experiments are discussed in [4].

   Authors

The system has been developed by X. Chen, B. Francia, B. McKinnon, A. Seker, and M. Li of the UCSB and UW Bioinformatics groups, Computer Science Department. All rights reserved.

   Contact Information

Send bug reports to d57wang@uwaterloo.ca and general questions to mli@uwaterloo.ca.

   References

[1] M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, An information based sequence distance and its application to whole (mitochondria) genome distance. Bioinformatics 17: 149-154, 2001. [ http://monod.uwaterloo.ca/papers/01infodist.pdf ]

[2] M. Li and P. Vitanyi, An introduction to Kolmogorov complexity and its applications, Springer-Verlag, 1st Ed. 1993, 2nd Ed. 1997.

[3] Dana MacKenzie, On a roll, New Scientist, Nov. 6, 1999, 44-47.

[4] X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker, Shared Information and Program Plagiarism Detection, IEEE Trans. Information Theory, July 2004, 1545-1550. [ http://monod.uwaterloo.ca/papers/04sid.pdf ]

[5] C. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6, June 2003, 76-81.



SID version 1.1, released on December 10, 2003