Thursday 3 December 2009

Type V clones

Clone detection has been an active research topic for decades by now, but it’s among those that never wither. We all know the basic classification of clone types: Type I is for two pieces of code that are identical in all aspects except perhaps for whitespace (indentation and comments), Type II is for two structurally identical pieces of code with variations only in whitespace and naming, Type III is for two pieces of code that have syntactically mapping constructs but can bear additional statements/expressions somewhere in the middle, and Type IV is for two semantically equivalent pieces of code that have the same functional behaviour but can be implemented differently.


Copy-paste programming is by far not the only cause for clones, we all know that too. And recently there has been another cause evolving: syndication and aggregation. There are just too many web services and RIAs, no-one can register on each one of them. (In fact, very rare ones go half as far as I do). Thus, in order to broaden one's potential audience, the users let the services propagate the same pieces of data: blog posts are fed into twitter updates, they become facebook status updates, etc. These updates are hyperlinked and heavily annotated, so I can’t help thinking about them as strictly structured grammar-abiding data (better known as “code”). The rules for propagation vary from bi-directional synchronisation to quite obfuscated schemes of one-directional non-information-preserving transformations. One the other hand, front-end grammarware (web-2.0-ware) like TweetDeck allows end users to aggregate updates from different sources on one screen (in the case of TweetDeck, we’re talking about Twitter, Facebook, MySpace and LinkedIn). In this case, the end users can receive the same information multiple times through different paths.


This leads us to the necessity of introducing Type V clones as two pieces of differently structured data representing the same information. The main difference is that such clones will most of the time be non-equivalent, with one derived from the other in a known (or partially unknown) manner. Some other scenarios exemplifying the non-triviality of this, follow:



  • “Identity X is connected to identity Y” coming from service A does not mean “identity X is connected to identity Y” on service B as well. However, these identities will appreciate being notified about the possibility to connect on service B as well (if not to be automatically connected).

  • “Identity X posted text T” is the same as “identity X posted text T with link L”, if L links to one of the clones, otherwise the second one is more complete.

  • “Identity X posted text T1 with link L” is a neglectable clone of “identity X posted text T2”, if T1 is a truncated version of T2 and L links to the second one.

  • If “identity X posted text T” often occurs together with “identity Y posted text T”, then X and Y might be the same entity.

  • When we have two streams which are known to be clones, we can try to establish the mapping by automated inference.

  • If we know the transformation R that makes an update U' on service B from an update U on service A, and we have U' at hand but U is unavailable (security issues, service is down, etc), we need to [partially] reverse R, as we did in our hacking days.


There is much more than that to be done, I’m just providing you with the most obvious raw ideas. Of more advanced topics one can immediately name identity clone detection, data mining, topic analysis and coverage metrics.

No comments:

Post a Comment