Sunday 27 December 2009

Webdesign and Supercompilation

With supercompilation being a long-forgotten technique invented decades ago, and with “webdesign” term being usurped by graphic artists & HCI experts, I doubt this post will be anything close to popular, but as always, it will not stop me from expressing my opinion. But let’s take it slowly now.

Let us assume the “web design” in a good, broad sense now: not just the omnipresent “logo on the right vs logo on the left” & “10 tips to get more clicks”. Just as software design comprises multiple heterogeneous activities concerning the making of a piece of software, just as language design is about how to create a good language suited for the target domain, web design is in general about how to make a web site, a web service or a web app well.

Super-compilation is a program transformation method of aggressive optimisation: it refactors the code based on the most possible assumptions, throwing away all dead code, unused options and inactivated functionalities. If was irrelevant or at least unproductive during the structured programming epoch, but the results of super-compilation were promising before that and remain promising in our time, during the epoch of multi-purpose factory factory frameworks.

The current (at least since 1999) trend in web design is dynamics and more dynamics. The content and its presentation is separated, and most of the time what the end-user sees is what is being generated from the actual content stored somewhere in a database by using the representation rules expressed in anything from XSL to AJAX (in software we would call such process “pretty-printing”). However, this is necessary only for truly dynamic applications such as Google Wave. In most of the other rich internet applications the content is being accessed (pretty-printed) much more often than being changed. When the super-compilation philosophy is applied here, we quickly see that it is possible to store the pre-generated data ready for immediate end-user demonstration. If the dependencies are known, one can easily design an infrastructure that would respond to any change of data with re-generation of all the visible data that depend on it. And that is the way it can and should be — I’m running several websites, ranging from my personal page to a popular contest syndication facility, all completed with this approach: the end-user always sees the statically generated solid XHTML, which is being updated on the server whenever the need arises, be it once a minute or once a month. Being static is not necessarily a bad thing, especially if you provide the end-user with all the expected buttons and links. Saves time and computational effort on all the on-the-fly processing requests.

When will it not work: for web apps that are essentially front-ends for a volatile database access; for web apps that are truly dynamic in nature; for web apps where user preferences are inexpressible in CSS & access rights. When will it work: pretty much everywhere else. Think about it. Have fun.

Thursday 3 December 2009

Type V clones

Clone detection has been an active research topic for decades by now, but it’s among those that never wither. We all know the basic classification of clone types: Type I is for two pieces of code that are identical in all aspects except perhaps for whitespace (indentation and comments), Type II is for two structurally identical pieces of code with variations only in whitespace and naming, Type III is for two pieces of code that have syntactically mapping constructs but can bear additional statements/expressions somewhere in the middle, and Type IV is for two semantically equivalent pieces of code that have the same functional behaviour but can be implemented differently.


Copy-paste programming is by far not the only cause for clones, we all know that too. And recently there has been another cause evolving: syndication and aggregation. There are just too many web services and RIAs, no-one can register on each one of them. (In fact, very rare ones go half as far as I do). Thus, in order to broaden one's potential audience, the users let the services propagate the same pieces of data: blog posts are fed into twitter updates, they become facebook status updates, etc. These updates are hyperlinked and heavily annotated, so I can’t help thinking about them as strictly structured grammar-abiding data (better known as “code”). The rules for propagation vary from bi-directional synchronisation to quite obfuscated schemes of one-directional non-information-preserving transformations. One the other hand, front-end grammarware (web-2.0-ware) like TweetDeck allows end users to aggregate updates from different sources on one screen (in the case of TweetDeck, we’re talking about Twitter, Facebook, MySpace and LinkedIn). In this case, the end users can receive the same information multiple times through different paths.


This leads us to the necessity of introducing Type V clones as two pieces of differently structured data representing the same information. The main difference is that such clones will most of the time be non-equivalent, with one derived from the other in a known (or partially unknown) manner. Some other scenarios exemplifying the non-triviality of this, follow:



  • “Identity X is connected to identity Y” coming from service A does not mean “identity X is connected to identity Y” on service B as well. However, these identities will appreciate being notified about the possibility to connect on service B as well (if not to be automatically connected).

  • “Identity X posted text T” is the same as “identity X posted text T with link L”, if L links to one of the clones, otherwise the second one is more complete.

  • “Identity X posted text T1 with link L” is a neglectable clone of “identity X posted text T2”, if T1 is a truncated version of T2 and L links to the second one.

  • If “identity X posted text T” often occurs together with “identity Y posted text T”, then X and Y might be the same entity.

  • When we have two streams which are known to be clones, we can try to establish the mapping by automated inference.

  • If we know the transformation R that makes an update U' on service B from an update U on service A, and we have U' at hand but U is unavailable (security issues, service is down, etc), we need to [partially] reverse R, as we did in our hacking days.


There is much more than that to be done, I’m just providing you with the most obvious raw ideas. Of more advanced topics one can immediately name identity clone detection, data mining, topic analysis and coverage metrics.