What is SID?
SID stands for Shared Information Distance or Software Integrity Detection.
It detects similarity between programs by computing the amount of information
they share. The underlying algorithm was originally developed to compare how
similar or dissimilar genomes are [1]. It was later realized that the algorithm
could be extended to many other applications, including reconstructing the
history of chain letters [5] and detecting plagiarism. We have applied it to
detecting plagiarism in source code, and we hope to extend it to other
applications in the future.
Why use SID?
SID is the easiest software to use for detecting plagiarism in source code and has been shown to be the
most effective at catching cheaters [4]. Here are some key comparisons between SID and our closest rival, MOSS:
Some Key Differences Between SID and MOSS
| | Web Interface | Approximate Matching | Maintain Account Info Online |
|---|---|---|---|
| SID | Yes | Yes | Yes |
| MOSS | No | No | No |
Who is SID for?
SID is for anybody looking for an easy way to detect plagiarism in source code. We
currently support Java and C++ source code.
How Do We Do It?
Whether you are a teacher or a student whose program is going
to be tested by your teacher, we have no intention of hiding our methods
from you. Here is what we do: to compare two programs, we compute
the amount of information they share. The shared information distance
between two programs X and Y is defined as:
$$D(X,Y) = 1 - \frac{K(X) - K(X \mid Y)}{K(XY)}$$
where K(X|Y) is the Kolmogorov complexity [2,3] of X given Y, and K(XY) is
the Kolmogorov complexity of X and Y taken together. When Y is
empty, K(X|Y) becomes K(X). K(X)-K(X|Y) is the mutual information between X and
Y, i.e. it measures the amount of information Y "knows" about X. It
is well known that mutual information is symmetric. However, it does not
satisfy the triangle inequality and so is not a distance. D(X,Y) is a well-defined
distance, and it is provably universal. Universality
allows us to advertise our methods openly: in theory, if we could compute this distance precisely, it could not be cheated. Kolmogorov complexity is not computable, so in practice we work with an approximation of this distance, and we are confident that cheating our system will still be very difficult.
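As a rough illustration of how such a distance can be approximated, the sketch below replaces K with the output size of an ordinary compressor. This is an assumption made for the example, not a description of the compressor SID itself uses: `zlib` is a stand-in, the function names are hypothetical, and the inputs are compared as raw bytes. The sketch computes the ratio (K(X) - K(X|Y)) / K(XY) and reports one minus that ratio as the distance, so near-identical inputs score close to 0.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Stand-in for K(.): size in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def shared_information_ratio(x: bytes, y: bytes) -> float:
    """Approximate (K(X) - K(X|Y)) / K(XY) using compressed sizes."""
    k_x = compressed_size(x)
    k_xy = compressed_size(x + y)
    # Assumption: K(X|Y) is approximated by the extra bytes needed to
    # compress X after the compressor has already seen Y.
    k_x_given_y = compressed_size(y + x) - compressed_size(y)
    return (k_x - k_x_given_y) / k_xy


def shared_information_distance(x: bytes, y: bytes) -> float:
    """One minus the shared-information ratio; lower means more similar."""
    return 1.0 - shared_information_ratio(x, y)


# Hypothetical inputs, for illustration only.
original = b"""
public static int sum(int[] values) {
    int total = 0;
    for (int i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}
"""
unrelated = b"""
The quick brown fox jumps over the lazy dog while seven wizards
quietly judge a boxing match held by the frozen harbour at dusk.
"""

same = shared_information_distance(original, original)
different = shared_information_distance(original, unrelated)
print(f"identical: {same:.2f}  unrelated: {different:.2f}")
```

Because a general-purpose compressor only approximates Kolmogorov complexity, the values can fall slightly outside the range from 0 to 1 on short inputs. Only the ordering of the scores is meaningful in this sketch.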
Using SID
Step 1.
To use our service, you first need to create an account. To do this,
simply click on the Account Application link at the top or bottom of
this page and fill in the required information. Note that if you wish
to receive email confirmations, you must include a valid email address
when you create the account, because the system does not currently support
changes to account information.
Step 2.
Enter your login and password in the two corresponding text fields
at the upper right of the home page and click the login button. The
following page lists your previous submissions sorted by date;
click a submission's link to view it.
Step 3.
Click the link at the top of the page, which will take you to the file
submission page. Use the file submission page to submit a carefully formatted
zip file. A detailed description of how the zip file should be formatted is
available on the file upload page.
Step 4.
Wait for the comparison of the projects to finish. If you have provided a valid
email address, you will receive an email notification with a link to the page
with your results as soon as the results are ready. You can also click the
"View Results" button at the top of the page to go to a page with
all of your submission results. That page provides links
to the results pages for all of your processed submissions and displays the status
of all submissions that are still processing. Click the submission whose results you want to view, and you will be taken to a self-explanatory results display page.
Experiment Results
If you wish to download some of the experiments that we used when comparing our system
to others, you can do so by clicking here.
The results of the experiments are discussed in [4].
Authors
The system has been developed by X. Chen, B. Francia, B. McKinnon, A. Seker, and
M. Li of the UCSB and UW Bioinformatics groups, Computer Science Department. All
rights reserved.
Contact Information
Send bug reports to d57wang@uwaterloo.ca and general questions to
mli@uwaterloo.ca.
References
[1] M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang,
An information based sequence distance and its application to
whole (mitochondria) genome distance. Bioinformatics 17: 149-154, 2001.
[ http://monod.uwaterloo.ca/papers/01infodist.pdf ]
[2] M. Li and P. Vitanyi, An introduction to Kolmogorov complexity
and its applications, Springer-Verlag, 1st Ed. 1993, 2nd Ed. 1997.
[3] Dana MacKenzie, On a roll, New Scientist, Nov. 6, 1999, 44-47.
[4] X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker, Shared Information and Program Plagiarism Detection. IEEE Trans. Information Theory, July 2004, 1545-1550.
[ http://monod.uwaterloo.ca/papers/04sid.pdf ]
[5] C. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6, June 2003, 76-81.