Spoonful of Hacking

Friday 31 August 2012

Wiki Loves Monuments

September is about to start in a few minutes. It is going to be a busy month: several important conference deadlines, speaking at Software Freedom Day, remotely participating in SLE workshops, traveling to MoDELS, putting together the programme of WCN, being on “holidays in a broad sense” for half of the month, etc. However, this all pales in comparison to last year’s September, when in addition to all the WCN emailing, I was taking some time almost every day to go out on the street and take pictures of Dutch state monuments.

WLM, or Wiki Loves Monuments, was an initiative that I learned about by chance while hanging out with wikipedians at Wikimania 2011. It sounded nice: you could pick up a camera, walk around and create some information that is: (1) useful, (2) structured and (3) free. Being @grammarware, I am obviously a fan of structured data (noblesse oblige); being a hacker, I hold that all information must be free; entertaining myself with dreams of being a practical person, I should like the idea of creating something useful as well. And indeed, I found myself doing something I usually don’t (i.e., taking pictures); engaging in it regularly, through rain, wind, cold and tooth pain; taking away precious time from hobbies, girlfriend, and even from my own birthday. At night, instead of hacking, I was spending lots of time on classifying and uploading said pictures to Wikimedia Commons. In the morning, instead of watching an episode of Dr.Who, I was walking around the Jordaan neighbourhood and documenting the monuments there. In the evening, instead of hurrying home to have a nice gratifying dinner, I was taking lengthy detours on my bike to visit places I had never been to, and to take pictures of houses I had never seen.

In fact, participating in WLM 2011 turned out to be much more fun than I could have anticipated. I was gaining incredible amounts of knowledge about Amsterdam and its history: way more than I could consume on the spot, but all of it was fresh and tangible, since I was physically present at all places of interest. I started to distinguish better between clock gables, neck gables, stepped gables and the like. I improved immensely on my orientation skills. My admiration for gable stones deepened significantly as well. I started to notice so many little things that I had been missing before. I knew it was cool to live in the city centre before, but I learned to truly appreciate it only after taking a camera with me on long dedicated walks.

I did not become a photographer after this, and I never will (even though that may sound strange in our era, when everyone thinks of him or herself as one due to having a camera). I did not become a GLAM expert, and I never will (even though I probably did visit more museums than restaurants in Amsterdam). It is probably even safe to assume that I enjoyed it that much exactly because I was a total newbie in all the things WLM stood for, and still took up the challenge.

In the end, I uploaded more than a thousand pictures and won the 3rd prize in quantity (I would’ve won the 1st if they had judged by the number of pictures and not the number of monuments: the Tropenmuseum counts as one even though it took me half a day and a hundred photographs). This activity utterly overshadowed everything I had been doing on Wikimedia Commons for years and years before that (creating vector graphics illustrations, retouching old scans and only occasionally taking pictures of rare objects). Still, I regret nothing.

Sunday 27 December 2009

Webdesign and Supercompilation

With supercompilation being a long-forgotten technique invented decades ago, and with the term “webdesign” being usurped by graphic artists & HCI experts, I doubt this post will be anything close to popular, but as always, that will not stop me from expressing my opinion. But let’s take it slowly now.

Let us take “web design” in a good, broad sense now: not just the omnipresent “logo on the right vs logo on the left” & “10 tips to get more clicks”. Just as software design comprises multiple heterogeneous activities concerning the making of a piece of software, just as language design is about how to create a good language suited for the target domain, web design is in general about how to make a web site, a web service or a web app well.

Super-compilation is a program transformation method of aggressive optimisation: it refactors the code based on as many assumptions as possible, throwing away all dead code, unused options and inactivated functionalities. It was irrelevant or at least unproductive during the structured programming epoch, but the results of super-compilation were promising before that and remain promising in our time, during the epoch of multi-purpose factory factory frameworks.

The current (at least since 1999) trend in web design is dynamics and more dynamics. The content and its presentation are separated, and most of the time what the end-user sees is generated from the actual content stored somewhere in a database by using representation rules expressed in anything from XSL to AJAX (in software we would call such a process “pretty-printing”). However, this is necessary only for truly dynamic applications such as Google Wave. In most other rich internet applications the content is accessed (pretty-printed) much more often than it is changed. When the super-compilation philosophy is applied here, we quickly see that it is possible to store the pre-generated data ready for immediate display to the end-user. If the dependencies are known, one can easily design an infrastructure that responds to any change of data by re-generating all the visible data that depend on it. And that is the way it can and should be: I’m running several websites, ranging from my personal page to a popular contest syndication facility, all built with this approach. The end-user always sees statically generated solid XHTML, which is updated on the server whenever the need arises, be it once a minute or once a month. Being static is not necessarily a bad thing, especially if you provide the end-user with all the expected buttons and links. It saves the time and computational effort otherwise spent on processing every request on the fly.
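To make the idea of regenerating on change rather than on request a bit more concrete, here is a minimal sketch in Python. It is not the code behind the websites mentioned above: the class `StaticSite`, its methods and the page names are all hypothetical, and the only assumption is that each page declares which pieces of content it depends on.

```python
from collections import defaultdict


class StaticSite:
    """Serve pre-generated pages; re-render only what a data change affects."""

    def __init__(self):
        self.data = {}                      # content key -> current value
        self.templates = {}                 # page name -> (keys it reads, render function)
        self.dependents = defaultdict(set)  # content key -> pages depending on it
        self.pages = {}                     # page name -> stored, ready-to-serve markup
        self.render_count = 0               # how many times any page was regenerated

    def add_page(self, name, keys, render):
        """Register a page together with the content keys it depends on."""
        self.templates[name] = (tuple(keys), render)
        for key in keys:
            self.dependents[key].add(name)

    def update(self, key, value):
        """Change one piece of content and regenerate every page that depends on it."""
        self.data[key] = value
        for name in sorted(self.dependents.get(key, ())):
            keys, render = self.templates[name]
            if all(k in self.data for k in keys):  # skip pages whose inputs are incomplete
                self.pages[name] = render(*(self.data[k] for k in keys))
                self.render_count += 1

    def serve(self, name):
        """A request costs one dictionary lookup: nothing is rendered on the fly."""
        return self.pages[name]


# Hypothetical pages and content, for illustration only.
site = StaticSite()
site.add_page("index.html", ["title", "news"], lambda t, n: f"<h1>{t}</h1><p>{n}</p>")
site.add_page("about.html", ["title"], lambda t: f"<h1>About {t}</h1>")
site.update("title", "Contests")
site.update("news", "New round opens on Monday")
print(site.serve("index.html"))
```

In this sketch the rendering cost is paid once per change of data, no matter how many times a page is requested afterwards. A real setup would write the generated markup to files served by the web server instead of keeping it in a dictionary.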

When will it not work: for web apps that are essentially front-ends for volatile database access; for web apps that are truly dynamic in nature; for web apps where user preferences are inexpressible in CSS & access rights. When will it work: pretty much everywhere else. Think about it. Have fun.

Thursday 3 December 2009

Type V clones

Clone detection has been an active research topic for decades by now, but it’s among those that never wither. We all know the basic classification of clone types: Type I is for two pieces of code that are identical in all aspects except perhaps for whitespace (indentation and comments), Type II is for two structurally identical pieces of code with variations only in whitespace and naming, Type III is for two pieces of code that have syntactically mapping constructs but can bear additional statements/expressions somewhere in the middle, and Type IV is for two semantically equivalent pieces of code that have the same functional behaviour but can be implemented differently.

Copy-paste programming is by far not the only cause of clones; we all know that too. And recently another cause has been evolving: syndication and aggregation. There are just too many web services and RIAs, and no-one can register on each one of them. (In fact, very few people go half as far as I do.) Thus, in order to broaden their potential audience, users let the services propagate the same pieces of data: blog posts are fed into twitter updates, they become facebook status updates, etc. These updates are hyperlinked and heavily annotated, so I can’t help thinking about them as strictly structured grammar-abiding data (better known as “code”). The rules for propagation vary from bi-directional synchronisation to quite obfuscated schemes of one-directional non-information-preserving transformations. On the other hand, front-end grammarware (web-2.0-ware) like TweetDeck allows end users to aggregate updates from different sources on one screen (in the case of TweetDeck, we’re talking about Twitter, Facebook, MySpace and LinkedIn). In this case, the end users can receive the same information multiple times through different paths.

This leads us to the necessity of introducing Type V clones as two pieces of differently structured data representing the same information. The main difference is that such clones will most of the time be non-equivalent, with one derived from the other in a known (or partially unknown) manner. Some scenarios exemplifying the non-triviality of this follow:

“Identity X is connected to identity Y” coming from service A does not mean “identity X is connected to identity Y” on service B as well. However, these identities will appreciate being notified about the possibility to connect on service B as well (if not to be automatically connected).

“Identity X posted text T” is the same as “identity X posted text T with link L”, if L links to one of the clones, otherwise the second one is more complete.

“Identity X posted text T1 with link L” is a negligible clone of “identity X posted text T2”, if T1 is a truncated version of T2 and L links to the second one.

If “identity X posted text T” often occurs together with “identity Y posted text T”, then X and Y might be the same entity.

When we have two streams which are known to be clones, we can try to establish the mapping by automated inference.

If we know the transformation R that makes an update U' on service B from an update U on service A, and we have U' at hand but U is unavailable (security issues, service is down, etc), we need to [partially] reverse R, as we did in our hacking days.
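Two of the scenarios above (the same text with or without a link, and a truncated text linking to the full one) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Update` record, its fields, the labels returned by `classify` and the example URLs are hypothetical, and real services would need far more forgiving text comparison than a prefix check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Update:
    identity: str               # who posted it
    text: str                   # what was posted
    url: str                    # where this update itself lives
    link: Optional[str] = None  # hyperlink carried inside the update, if any


def is_truncation(short: str, full: str) -> bool:
    """True if `short` is a proper prefix of `full`, ignoring a trailing ellipsis."""
    stem = short.rstrip(".… ")
    return short != full and bool(stem) and len(stem) < len(full) and full.startswith(stem)


def classify(a: Update, b: Update) -> str:
    """Label the relation between two updates posted by the same identity."""
    if a.identity != b.identity:
        return "unrelated"
    if a.text == b.text:
        if a.link == b.link:
            return "equivalent"
        for linked, other, label in ((a, b, "a"), (b, a, "b")):
            if linked.link is not None and other.link is None:
                # A link that only points at one of the clones adds no information.
                points_at_clone = linked.link in (other.url, linked.url)
                return "equivalent" if points_at_clone else f"{label} is more complete"
        return "undecided"  # same text, two different links
    for short, full in ((a, b), (b, a)):
        if short.link == full.url and is_truncation(short.text, full.text):
            return "negligible clone"
    return "unrelated"


# Hypothetical blog post and the tweet generated from it.
post = Update("X", "Type V clones are differently structured data", "https://blog.example/type-v")
tweet = Update("X", "Type V clones are differently…", "https://tw.example/1",
               link="https://blog.example/type-v")
print(classify(tweet, post))
```

The remaining scenarios (inferring that two identities are the same entity, or reversing a known transformation) need statistics over whole streams rather than a pairwise check, so they are left out of this sketch.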

There is much more than that to be done; I’m just providing you with the most obvious raw ideas. Among the more advanced topics one can immediately name identity clone detection, data mining, topic analysis and coverage metrics.

Friday 25 September 2009

SCAM/ICSM/Twitter mapping

@abramh — Abram Hindle, PhD student, University of Waterloo, Canada
@avandeursen — Arie van Deursen, Software Engineering Research Group, Delft University of Technology, The Netherlands
@SebDanicic — Sebastian Danicic, Goldsmiths College, University of London, UK
@davema — David Ma, Calgary, Canada
@frama_c — Pascal Cuoq, INRIA, France
@grammarware — Vadim Zaytsev, PhD student, Koblenz, Germany
@ICSMconf — consolidated account set up by Jamie Starke
@jamiestarke — Jamie Starke, University of Calgary, Canada
@j_ham3 — James Hamilton, PhD student, University of London, UK
@JurgenVinju — Jurgen Vinju, CWI, Amsterdam, The Netherlands
@nicbet — Nicolas Bettenburg, PhD student, Software Analysis and Intelligence Lab, Queen’s University, Canada
@quinndupont — Quinn DuPont, Algorithmics Inc., Canada
@ssepotsdam — ?
@tkobabo — Takashi Kobayashi, Nagoya, Japan
@taoxiease — Tao Xie, North Carolina State University, USA
@tiagomlalves — Tiago Alves, PhD student, SIG, Amsterdam, The Netherlands
@tomzimmermann — Thomas Zimmermann, Microsoft Research, University of Calgary, Canada
@yk2805 — Yiannis Kanellopoulos, SIG, Amsterdam, The Netherlands

Please send updates or post them as comments here if necessary.

Thursday 24 September 2009

Architecture Evaluation

During Eric Bouwers’ ICSM presentation about criteria for assessing implemented architectures, I asked a question that started a discussion which Yuanfang Cai proposed to take off-line. Since I have already left Edmonton, I’m taking it on-line instead. There is no doubt that Eric has made a considerable contribution by analysing SIG expert opinions, reports and interviews; my question was more about the relation between architecture evaluation and architecture evolution, and about his proposal to integrate regular architecture re-evaluation into maintenance activities.

One of the definitions of architecture that I remember from the time I worked in the same department as Hans van Vliet is that it comprises those components, dependencies, properties, configuration elements, etc. (in other words, those parts of a system design) that do not change with time or are the most reluctant to change with time. That is, the easier it is to discard or to change something, the less place it deserves in the architecture. If you think the problem is purely terminological, please direct me to a perfect definition, and I will shut up. However, I believe there are some deeper issues here.

Can architecture re-evaluation be used as a system analysis tool that can deliver useful and non-trivial results?

So far I can imagine three scenarios: (1) the software system evolves without changing its architecture; hence, re-evaluation is redundant since it will provide the same results we already obtained; (2) the software system is redesigned in the meantime in such a way that its architecture changes as well; hence, re-evaluation is needed since we can no longer rely on the outdated data; (3) the software system evolves in such a way that the properties of its architecture shift unnoticed; hence, the answer to the question from the previous paragraph is definitely "YES". The first two scenarios are trivial, the third one is not, and I call for examples. So far I can think of only external ones, like when a new technology is introduced and makes parts of the existing system outdated/obsolete/incompatible/… Are there internal ones?

Thursday 9 July 2009

GTTSE/Twitter mapping

@1TTechnologies — Denis Avrilionis, One Tree Technologies, Luxembourg
@BBasten — Bas Basten, CWI, Amsterdam, The Netherlands
@Elsvene — Sven Jörges, Ruhrpott, Germany
@Felienne — Felienne Hermans, PhD student, Delft, The Netherlands
@GorelHedin — Görel Hedin, Lund University, Sweden
@GorkaZubia — Gorka Puente, PhD student, University of the Basque Country, Spain
@grammarware — Vadim Zaytsev, PhD student, Koblenz, Germany
@inkytonik — Anthony Sloane, Macquarie University, Sydney, Australia
@JeanMarieFavre — Jean-Marie Favre, University of Grenoble, France
@JurgenVinju — Jurgen Vinju, CWI, Amsterdam, The Netherlands
@MedeaMelana — Martijn van Steenbergen, MSc student, Utrecht, The Netherlands
@MichalPise — Michal Pise, Czech Technical University, Czech Republic
@notquiteabba — Ralf Lämmel, Koblenz, Germany
@PaulKlint — Paul Klint, CWI, Amsterdam, The Netherlands
@PauloBorba — Paulo Borba, Software Productivity Group, Pernambuco, Brazil
@radkat — Ekaterina Pek, PhD student, Koblenz, Germany
@TerjeGj — Terje Gjøsæter, PhD student, Grimstad, Norway
@TvdStorm — Tijs van der Storm, CWI, Amsterdam, The Netherlands

Please send updates or post them as comments here if necessary.

Wednesday 10 June 2009

Floating code snippets in LaTeX

Last week I decided to separate “figures” that contain diagrams, graphs and parse trees from “figures” that contain source code snippets. In LaTeX it means the former kind stays in the figure environment, while the latter needs to reside within its own. I googled for a solution and was quite surprised that the web was full of cluttered random hacks done without any understanding of TeX internals. My solution is 9 lines long, and it solves three problems: the environment itself, the list of them and referencing issues.

\usepackage{float}
\usepackage{tocloft}
\newcommand{\listofsnippetname}{List of Listings}
\newlistof{snippet}{lol}{\listofsnippetname}
\floatstyle{boxed}
\newfloat{snippet}{thp}{lol}[chapter]
\floatname{snippet}{Listing}
\newcommand{\snippetautorefname}{Listing}
\renewcommand{\thesnippet}{\thechapter.\arabic{snippet}}

The first two lines load two packages: one providing a mechanism for defining new floating object types, one for toc-ish lists. The next two lines define the new list; at this point the new LaTeX counter is already created but not used anywhere. floatstyle can be plain, boxed or ruled; I went for boxed since I was using boxedminipage inside the old-style figures anyway. Then we define a new float type, which does not define a new counter of its own and has to use the one already made for the list, just as planned. We finish up by giving the new floating environment some names.

That’s it, we’re done. Just use \begin{snippet}…\end{snippet} and \autoref{…} as you usually would with figures and tables. I see no need to create more counters, to brutally mess with @addtoreset and theHsnippet, etc. Hacks must be simple, effective and beautiful.