Sunday, April 7, 2013

Detect Exact and Similar Code Duplicates Using ConQAT

[Update: this post applies to ConQAT version 2011.9. The ConQAT blocks provided in this post will not work on the latest version, 2014.1. Dr. Hummel (one of the ConQAT developers) has kindly shared the set of blocks for the newer version here. According to Dr. Hummel, the equality and similarity parameters (described below) will be part of all future releases.]

Part of my work on refactoring legacy code was to find a way to identify code clones (duplicates) in large data repositories. During my refactoring experiments, I found that it is much faster to work in this order:
  1. Remove dead code (CSRS ~ 600 LPH, also reduces 10% of duplicate code)
  2. Remove exact clones (CSRS ~ 300-400 LPH)
  3. Remove similar clones 
Where:
CSRS is the Cost Size Reduction Speed, measured in LOC per Hour (LPH).
Exact clones are clones that are identical, or that differ only in white space.
Similar clones are exact clones except for renamed variables.
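To make the two definitions concrete, here is a minimal Python sketch that classifies a pair of snippets. It is not how ConQAT works internally; the regex tokenizer, the tiny `KEYWORDS` set, and the function names are all assumptions made for illustration.

```python
import re

# Small, illustrative keyword set (an assumption); a real tool would use
# the full grammar of the language being analyzed.
KEYWORDS = {"int", "for", "if", "else", "return", "while"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(code):
    """Split code into tokens, discarding all white space."""
    return TOKEN.findall(code)


def normalize(toks):
    """Replace each identifier with a placeholder numbered by first appearance."""
    names = {}
    out = []
    for tok in toks:
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            out.append(names.setdefault(tok, f"id{len(names)}"))
        else:
            out.append(tok)
    return out


def classify(a, b):
    """Return 'exact', 'similar', or 'different' for two code snippets."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:
        return "exact"      # same tokens, white space may differ
    if normalize(ta) == normalize(tb):
        return "similar"    # same structure, variables renamed
    return "different"


original = "int total = 0; for (int i = 0; i < n; i++) { total += i; }"
reindented = "int total = 0;\n  for (int i = 0; i < n; i++) {\n    total += i;\n  }"
renamed = "int sum = 0; for (int k = 0; k < size; k++) { sum += k; }"

print(classify(original, reindented))
print(classify(original, renamed))
```

The first pair differs only in white space, so it counts as an exact clone; the second pair has the same structure with different variable names, so it counts as a similar clone.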
The best tool I found was ConQAT, an excellent, very flexible open source tool. However, I had to adjust it a bit so that it could detect exact and similar clones.

Detecting Exact and Similar Code Duplicates:

If you would like to try this, download this adjusted set of blocks, then follow these steps to replace the old ConQAT blocks on your machine.
  1. Unzip the file “ConQat blocks - vX.Y.rar” and use it to replace the following folder: <conqat eclipse folder>\conqat-2011.9\conqat\bundles\org.conqat.engine.code_clones\blocks
  2. Press the “Enforce Full Model Rebuild” button.
  3. Open the ConQAT Runtime view and press “New”. You will find new parameters called “equality” and “similarity”.
  4. Add the “equality” parameter and set the equality threshold to 1 (meaning that the clones should be 100% identical).
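Step 1 can also be scripted. The following Python sketch assumes the archive has already been unpacked with a separate tool (the standard library does not read .rar files) and that keeping a backup of the old folder is wanted. The function name `replace_blocks` and the backup name `blocks.bak` are hypothetical; only the folder layout comes from the steps above.

```python
import shutil
from pathlib import Path


def replace_blocks(conqat_root, new_blocks):
    """Back up the code_clones blocks folder, then copy the adjusted blocks in its place.

    conqat_root: the ConQAT folder, e.g. <conqat eclipse folder>\\conqat-2011.9
    new_blocks:  the already-unpacked folder holding the adjusted blocks
    """
    target = (Path(conqat_root) / "conqat" / "bundles"
              / "org.conqat.engine.code_clones" / "blocks")
    backup = target.with_name("blocks.bak")  # hypothetical backup name
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    if target.exists():
        shutil.move(str(target), str(backup))  # keep the original blocks
    shutil.copytree(str(new_blocks), str(target))
    return target
```

After running it, continue with step 2 so that ConQAT picks up the replaced blocks.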

See the Result!

With the equality threshold set to 1, the result lists only exact clones.


Works for Which Language?

This applies to (and has been tested against):
  • Java code using the JavaCloneAnalysis block
  • .NET code using the CsCloneAnalysis block
  • C/C++ code using the StatementCloneAnalysis block. In fact, StatementCloneAnalysis can be applied to 10 different languages. See the ConQAT documentation for more information. 
Enjoy!

10 comments:

Oleksandr said...

Great tutorial. Great addition. Works for me. Thanks.

Oleksandr said...

And what is the 'similarity' parameter for?

Amr Noaman said...

Thanks!

Regarding similarity, it's the ratio of renamed and equal tokens to the total number of tokens in the code chunk.

So, suppose you have two code chunks of 100 tokens, of which 30 are equal, 20 are renamed, and 50 are different. The similarity is then (30 + 20) / 100 = 0.5. If you set the similarity threshold to 0.5, the pair will be detected by the clone analysis. If you set it higher than 0.5, it will not be detected.
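The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the ratio as described in the comment, not ConQAT's own code; the function names are hypothetical, and treating the threshold as a minimum score is an assumption based on the statement that a threshold of 0.5 detects this pair.

```python
def similarity(equal, renamed, total):
    """Share of tokens that are either equal or renamed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (equal + renamed) / total


def is_detected(score, threshold):
    """Assumption: a pair is reported when its score reaches the threshold."""
    return score >= threshold


# The numbers from the example: 100 tokens, 30 equal, 20 renamed, 50 different.
score = similarity(equal=30, renamed=20, total=100)
print(score)                    # 0.5
print(is_detected(score, 0.5))  # True
```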

Anonymous said...

I installed ConQAT-2013.10 and had the parameters “equality” and “similarity” there by default.

No need to copy your files over to "\conqat\bundles\org.conqat.engine.code_clones\blocks".

Maybe they included your changes - or did I miss something?

Amr Noaman said...

I tried the latest version of ConQAT, but the equality and similarity parameters didn't work.

It seems that this needs more investigation. If anyone has some time to investigate, please post here to let us know.

Benjamin Hummel said...

Sorry for the late reply, but I was notified of this issue only today from Amr. I am one of the developers of ConQAT and wasn't aware of this application of our analysis tool, but really like the idea.

Indeed, the JavaCloneAnalysis block should already have the correct parameters in the current release, but the other blocks mentioned do not yet. I updated all blocks to expose the equality/similarity parameters, so our next release should provide this out of the box. For the current release (2014.1) I prepared an updated set of blocks (same layout as described in the post) here.

If this does not work, please feel free to contact me.

Benjamin Hummel said...

Sorry, I used the wrong (internal) URL to my profile. If you want to contact me, find my details here.

Amr Noaman said...

Thanks Dr. Benjamin for your valuable update!

Unknown said...

Hi,

I have conqat-2015.2 and I am trying to figure out the differences between the equality and similarity parameters (at 0, 0.5, and 1). I checked the posts, but it is still not clear to me.

Regards,

Unknown said...

Hi Benjamin,

I have been working on feature-oriented programming (FOP). Now I want to study the behaviour of code clones in FOP, but I am unable to understand how to create .cqb and .cqr files for FOP projects. Could you provide a small example to make the tool easier to use?