October 5, 2005
I have finally found an interesting topic for my main research. or rather, an area from which I will be able to select an interesting topic. The area is information extraction from web documents. Although I might be getting ahead of myself - it might be a little too specific for now. What’s interesting is how I arrived at this topic.
I wrote my first line of executable code only four years ago (2001), so until recently I thought that still had to learn a lot of basic stuff before moving on to really interesting things. However, somewhere in 2004/05 I realized that in fact I knew enough to start playing around with some advanced topics. All my little victories suddenly fell into place: understanding of data structures enabled me to easily design any data representations for my projects, programming languages and, later, the compilers class taught me about writing context free grammars, parsing text and translating input into a different language, etc… many classes taught me many things; but the point is - I suddenly had the tools at my disposal to build interesting stuff and to research concepts which are not yet fully studied and problems which are not yet solved. (Very soon after that I decided to get a PhD in computer science.)
So, on to uncharted territory!
An area I was interested in was managing content. Why - I couldn’t tell. Maybe because I build web-based content management systems for a living; maybe because I had significant domain knowledge up-front. Whatever it was that triggered my interest, I spent about a year experimenting with text-based storage in XML-like structures and outputting the data in different formats. As a learning exercise it was great - I got quite a bit of practice with parsing tagged input and traversing the constructed tree to produce different kinds of output. I suppose I can say that I got very close to discovering some big ideas on my own - ideas which have been implemented countless number of times, as I later discovered. However, the fact that I was thinking of template-based processing with a syntax like XSLT and a query language to extract data from XML structures like XPath - before knowing about the existence of XSLT and XPath - means that I was on the right track.
However, I found working in this direction to be very frustrating. Every time I would come up with an idea which pointed me in a promising direction, I would discover that it has been explored inside and out, and the results have been implemented in every language imaginable. Do I, honestly, hope to discover something new at this stage of my education? Hell, yes! Will I? Most likely, not at this time; but at least I want an open field in front of me - something to explore and to watch others explore it. My plan for the content management project was to come up with a simple language which could describe storage and the output formats of structured data. Eventually I put it on hold - the excitement was not there anymore.
Luckily, I found a direction in which I can expand the original concept. My original project was about describing existing data; but what if I were to collect this data?
This idea came from my studies in information storage and retrieval. I’m working on a project with Dr. Kevin O’Kane (for more information check my ISR research diary). Initially the project topic was very broad, but after reducing it to building a Wikipedia category browser and information retrieval system, I was left with at least half of my ideas unclaimed! that’s when I realized that I could easily connect the two - my content management experiments with structured data and syntax describing it - and ways to retrieve this structured data from existing information collections. Right now I see it as a web site analyzing and generation tool: the program will explore a web site (or a web domain) and construct a hierarchy of its content, describing it in a syntax I design. After extracting and analyzing the content, the system will be capable of producing output based on templates or other description specified by the user (again, in a syntax I will design for that). In other words, it may become a web site refactoring machine: taking an existing web site as input, processing it into a well-structured, manually editable form and producing output in a different format. Want to see what cnn.com will look like in the format of mugglenet.com? - no problem!
The body of knowledge is huge! I have found at least 30 papers which are directly related to this area. However, most of them (at least the ones I’ve read so far) focus on content extraction from a single web page (or the Web in general). None of them explore the idea of extracting structure from a web site (or generating it). Can a program discover an hierarchy in a web collection? I think it can. Am I on to something?
Posted by Sergei on October 5, 2005 at 12:10 am under Information Extraction, Research. Comments Off.
October 4, 2005
Everybody knows that 42 is the Answer to Life, the Universe and Everything. Even Google Calculator knows that. But who can boast having their picture taken in deep space on board of the Heart of Rust (the predecessor of the Heart of Gold)!

do click for full picture
I must post several pages of research diaries tonight… I just had to give myself a break
Posted by Sergei on October 4, 2005 at 11:10 pm under Personal. Comments Off.
October 4, 2005
Yesterday I came up with an idea which should help me post more often. As I have mentioned on several occasions, writing about my classes and research helps me better understand the concepts I’m trying to master in computer science. The more I write about it, the easier it gets. However, I don’t write as often as I’d like to (big surprise!). The reason for that is not the absence of material - there’s plenty! - but rather my fear to post poorly structured entries, entries I could’ve worked on and eventually perfected (to a certain extent of course). That’s the journalist in me speaking.
Another reason which holds me back from posting is quite ridiculous. Often my thoughts float in area only a few nerds like myself care about. True, there’re millions of us somewhere out there; but still, I fear that students like myself have their own problems to solve and my entry would be just too subjective to provide any kind of interest to anyone other than myself. I think I’m wrong.
All this forces me to reconsider every time I think of posting a random thought on this or that area of my studies. So I’ve found a compromise. I’ve added Research Diaries to my blog – it’s a blog within a blog. These entries will be random, they will be geared towards …hmm… myself, they will be related to my classes and research. To spare my readers from encounting my streams of thought in the raw (what a dreadful thing that is!), I’ll be posting the full text version of my diaries in the Research section of this site. Only the first paragraph will appear in this blog (with a link to the fuill text), and the posting will be tagged as a “research diaries entry.”
It is possible that I am entirely wrong, and I do indeed have readers who come to this site to read about my research. If so, this new feature might be annoying (two clicks instead of one!) – in which case, please let me know, and I’ll continue posting full-text entries, no matter whether it’s an essay about Life, the Universe and Everything, or a frustraited post about my inability to cope with the Pumping lemma. By the way, over the past month I’ve had a little over 500 unique visitors (excluding known robots) from all over the world. Thanks for reading, folks! I’ll post more often.
Update from 12/6/2007: it didn’t work. Posting regular entries still makes most sense.
Posted by Sergei on October 4, 2005 at 9:10 pm under Personal. Comments Off.
July 31, 2005
For the past several weeks I have been working on a project in Information Storage and Retrieval (ISR). The problem is… I haven’t selected a specific project yet. I have an idea; but first I’ll write about the class in ISR I took last semester - that will help me organize my thoughts on the subject a little better before I go into details about my further studies.

The class I took last semester was my first introduction to the field - I learned quite a bit about the vector space model, document indexing, building document-term matrices, stop lists, automatic clustering, etc… Quite interesting and quite complicated. Gone were the times when I saw retrieving information as a simple matter of sending a query to a database… The vector space model was particularly hard to grasp at first, but then - it became clear and obvious: think of it as multi-dimensional hyperspace with an axis for each term in the collection. Documents are plotted as points in this hyperspace whose location is determined by the terms (including their relative weights) that occur in them. Very cool… Understanding it was in way like understanding how language translation works: there was this area I knew nothing about, then after being introduced to it I tried to grasp it but failed miserably; but eventually, after much effort, it revealed itself. Very satisfying; it’s like mastering an impossibly difficult spell…
I also learned a new language - MUMPS. It’s a weird hairy beast which takes forever getting used to; but it has some advantages, and it does offer some unique opportunities, like providing the means to efficiently storing virtually unlimited amounts of data in multi-dimensional arrays, with relatively little code involved. The language tends to be annoying at times, but, quite honestly, I have no idea how I would’ve dealt with a 300MB text-based database in Java…
There was a class project at the end of the semester. The winner was excused from taking the final exam. Two projects were selected as winners - one was by my fellow student Andrew Friedley, the other one was built by me and my partner Alex Stevenson.
Out project was an ISR system called Giggle (Alex thinks it’s a ridiculous name). Giggle is a searching/browsing interface for the MEDLINE database (we used a 30MB subset of the document collection) implemented in MUMPS, C, JavaScript and HTML. The indexing was done mostly by using existing code provided in the lecture notes. The system used data from several generated matrices (term-term, doc-term, etc…), a stop list, the Medline text file …and quite possibly, more stuff which I’m forgetting but have no inclination in searching for in the 700+ lines of ugly code we wrote. The system was, actually, not bad at all. It was my first serious attempt at MUMPS and C, so there was much learning along the way - that caused some ugly design decisions, and yet, as my professor observed, “surprisingly, it worked.”
Giggle provided a rich search and browsing experience. It included the following:
- a standard search interface with a soundex feature (converting words that sound like one another into common codes: “did you mean ‘xyz’?”);
- a document view page which highlighted the query terms, converted other terms in the document to links to further queries and a ranked list of related documents;
- a browsing page with a list of terms, expanding in some cases to related terms combinations;
- a dynamic clustering feature that took the top 100 results and generated clusters, identifying most common terms for each cluster;
- a search history page which displayed a list of past searches and a list of viewed documents - both implemented as lists of links.

The best part of it was - it was all on one web page. There were 5 tile-shaped links (like on amazon.com) which, when clicked, displayed alternative views of the data. The data in the hidden views was preserved across the session. No, we didn’t store anything on the server - that would be prohibitive in a real-world application. The state management model was borrowed from .NET - the entire html page was treated as a form with the persistent data being passed through a hidden form field. Click events were managed with JavaScript which dynamically changed the values of 2 other hidden fields, setting the action and the target of the event. Now that I think of it, the concept was elementary, I have no idea why nobody had thought of it before asp.net.I like what we did with Giggle, and I’d like to further develop that system, eventually building a simple state/event-management framework. However, for my ISR Project class I’d like to focus more on the subject of information storage and retrieval - not state management in web applications. I suppose I might be able to do both.
Giggle is temporarily available at http://neamh.cns.uni.edu/~wizard/giggle.html.
Posted by Sergei on July 31, 2005 at 1:07 am under Computer Science Classes, Information Retrieval. Comments Off.
July 26, 2005
In one of my previous posts I mentioned that my posts will become less entertaining and more focused on my daily studies. So, “here come one now!”
While building a parsing table, I’ve discovered that the algorithm for building a table described in the Dragon book does not work for right-recursive grammars! Consider this simple example:
expression -> number
expression -> expression + expression
Eliminating left recursion will result in the following grammar:
expression -> number expression-tail
expression-tail -> + expression expression-tail
expression-tail -> Ø
Now let’s build the first and follows functions:
first(expression) = { number }
first(expression-tail) = { + null }
follows(expression) = { + $ }
follows(expression-tail) = { + $ }
A predictive parser doesn’t backtrack: it uses the combination of its current state and the next token to determine what production to apply. Therefore, if the parser is in state ‘A’ and the next token is ‘b’ there should be only one production to follow. Let’s construct a parsing table for the grammar we’re looking at using the algorithm from the book:
For each production A := a, do the following:
- For each terminal a in FIRST(a), add A := a to M[A, a].
- If Ø is in FIRST(a), add A := a to M[A, b] for each terminal b in FOLLOW(A).
This algorithm will result in 2 entries for the combination of expression-tail and the token ‘+’. However, since our parser allows only one production per combination, the first will be overridden by the second. As a result, whenever the parser sees a ‘+’ and is in the expression-tail state, it will use the “expression-tail -> Ø” production.
I think the problem lies in right recursion. Our expression-tail can start with ‘+’, derive to nothing and be followed by ‘+’. Since the following symbol can be only an expression-tail, the grammar isn’t ambiguous. However, to build a correct parsing table, the algorithm needs a minor addition:
- If Ø is in FIRST(a), add A := a to M[A, b] for each terminal b in FOLLOW(A) only if b is not in FIRST(A)
Is it correct? I don’t know for sure. Adding a production like
expression-tail -> + anything
would kill it. But it works for the Player grammar.
Posted by Sergei on July 26, 2005 at 11:07 pm under Computer Science Classes. Comments Off.
July 24, 2005
Do you know why we didn’t see any tigers or dragons in that movie? Because they were all crouching and hiding
I’m currently taking a class on compilers. I tried taking it a couple years ago, but it was a bad idea: a data structures class should come before language translation, or else you fail miserably. I failed miserably. But this time, having taken data structures, programming languages and algorithms I feel much better about this class: for one thing, I know what I’m doing this time, for another – I like what I’m learning.

I’m writing a compiler for “Player”, a small imperative language with boolean, integer, real, and string values; arrays and records with implicit pointers; user-defined types; nested procedures; and a few simple control structures. It’s relatively small, yet not trivial. The book I’ve been using until recently was Appel’s “Modern Compiler Implementation in Java,” also known among my fellow students as “the tiger book” (it has a tiger on the cover). Two years ago this book left me with the feeling that my quest for knowledge is beyond hopeless – it would take me weeks to finish a chapter, and yet I wouldn’t understand half of it. Now that I know a lot more, I thought it would be different – but it’s not.

So I tried another book.
The lecture notes (provided on the course website) often refer to “the dragon book” - “Compilers” by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman (with a dragon on the cover). My professor warned me that the book is a lot more detailed than what we need for this class. I tried it. More in-depth? Certainly. More complicated? Hell no! It does provide a very detailed analysis of each phase of the compiler-building process. However, these details build-up a very clear picture of how the compiler works.
I’m not even halfway through – I’ve just finished building a parser for the language. But I could’ve never built it relying on Appel’s book. I’ve been getting an impression that the “tiger book,” even when offering a detailed analysis, sort of jumps to conclusions; it’s like a bad derivation – it claims that A -> D without showing the intermediate steps: B -> C and C -> D. I may be wrong, but that’s exactly the impression I got from this book.
My vote is for the dragon book. Not only did it help me with this class (or should I say – it made the class assignments possible for me!); it also enabled me to take a step further and write my first parser-generator, and that’s way beyond anything I even understood a year ago. Thank you, Dragon Book!
Posted by Sergei on July 24, 2005 at 6:07 pm under Books, Computer Science Classes. Comments Off.
July 11, 2005
It’s been three months since my last entry. I’m concerned that I don’t post more often, for I was hoping that my blog would help me in my studies. Reflecting on my day-to-day progress in computer science would cause me to rethink the concepts I have learned (or failed to learn), thus helping me get a better understanding of the complicated subjects I’m trying to master. And yet, I hardly ever post an entry.
My problem is not time. My problem lies in my expectations. Maybe I should blame my background in journalism, maybe - the high quality essays I read on other blogs; but the fact is - every time I sit down to write a post, I write something, deem it not good enough – and that’s the end of it – for it goes to the “drafts” folder where it is forgotten and eventually quietly dies, alone, in the rain…
Therefore, I have to remind myself, that this blog is mine, that “this task has been appointed to me, and if I do not find a way – no one will…” (naturally, being a Tolkien geek, I can’t help inserting quotes from The LOTR, both where they belong, and where they don’t). I apologize to my readers (believe it or not, I’ve had a few over the past months!), for this blog from now on will be, most likely, less entertaining. I certainly promise to stay away from the “daily routine” style of blogging; but I’ll try to focus a lot more on my daily studies, discussing projects only myself and my professors care about, describing programs of great value to me (and zero value for the rest of the world), and sharing my frustrations about simple algorithms which the rest of the world understands – but I don’t.
There is a reason for me being so concerned about my studies, a reason besides me liking one class or disliking another. There has been a change of plans recently – I have decided to shoot for a PhD in computer science. Therefore, my studies have taken a completely new meaning: instead of trying to learn this or that language or technology which will help me get a job, I’m now trying to understand this or that concept – so that sometime later I can invent my own language or technology.
I’ve toyed with the idea of getting a PhD every once in a while over the past several years, never seriously enough to give it a second thought. My former fellow students were enjoying “the real world” out in Chicago, New York, DC, London – I was too busy dreaming of a “real life,” I couldn’t afford spending extra years in school. Then it happened – I was about to be offered a chance to escape into the world. For reasons I wrote about in my previous entry, I decided to stay in school.
Then interesting things started happening.
First, it became clear to me that what I was doing on my University job was more interesting than what I would’ve ended up doing had I taken the offer from the “real world.” I work for the University; I build multi-tier data-driven applications, mostly using the .Net framework. Sometime soon I’ll blog about the projects I’m working on - I find most of them to be very exciting. My point is this: I’ve noticed that the more I study, the more interesting my work becomes, the more complex systems I’m entrusted to design. Several years ago I designed web sites. Then I started building simple script-based content management systems. Then I moved on to writing real programs. Today I build accounting applications, online stores, communities; I design frameworks for building other applications. Thinking back, it is a bit more exciting than writing html. Thinking even more back, I think I find it way more interesting than negotiating contracts or managing strategies (unless of course we’re not talking of the strategy pattern!) Anyway, the more I learn - the more exciting my day-to-day life becomes. That was a discovery!
I also teach training sessions in .Net for my colleagues (which reminds me that I should have posted the next assignment 2 weeks ago!) - and that probably had an even greater impact on my decision. You see, up until now (or maybe until a few months ago) I did not really know that I knew enough programming to become a professional (not just write code of questionable quality). I certainly didn’t expect I had enough knowledge to share it with others… I may be the worst instructor my colleagues have ever had (teaching OOP and .Net is much more challenging than teaching advertising and PR, which I briefly taught in the past), but the mere fact that people whom I respect as professionals come to class to listen to what I have to say, is a great inspiration for me.
My decision not to look into the possibility of moving on to a “real job” right now, the training sessions I teach, as well as my little personal victories in class led me to the realization that I can, in fact, pursue further studies; that I can, in fact, become a “master programmer” someday, invent my own language, write my own books, teach advanced classes, maybe “launch a google or two” - and in any case, spend my time doing very interesting things in the company of very smart people. Pursuing a PhD in computer science sounds quite fitting.

…That is, until I’m stopped in my tracks by the next algorithm I can’t understand. Then I’ll pack my bags and will run off to New York City to become a cabbie
Posted by Sergei on July 11, 2005 at 3:07 am under Personal. Comments Off.
April 8, 2005
It’s been a very long time since I posted my last entry. I have been very busy both with my studies and my many jobs, at least too busy to write a meaningful entry. I meant for this blog to be an interesting diary of sorts, which primary purpose would be to reflect on my progress in computer science, certainly with some diversions to other topics which I find interesting. As time goes by, I realize that if I try to make each entry a masterpiece in one way or another, there’ll be no entries. On the other hand, I don’t want to fall to the level of “today was Wednesday, I ate breakfast, went to class, etc…” I’d like to find the perfect, or as Russians say - the Golden Middle, a compromise between writing great stuff and writing at all.

The other day I was looking at a job opportunity. To be quite honest, I’m not even sure it was as close as an actual offer – it looks like I turned down the opportunity before it materialized as a specific set of options; but it was damn close. However, the job was mostly about updating web sites. HTML, scripting, occasional programming here and there… but who knows where it would’ve taken me? You know, I would’ve jumped on that job three, maybe even two years ago, no questions asked. I remember once I was so terrified with my worse than pathetic progress in my studies, I even ran away to Chicago to seek my fortune. Very soon I got hungry, but most importantly – bored – and before too long returned back to my studies. But had I been offered this job at that time – I would’ve taken it no matter what!
I didn’t take it.
There are still areas in computer science I shy away from, some I find boring (well, not really boring, I just can’t catch that spark of excitement which I did in areas like design patterns, programming languages, OOP… In fact, one of my professors who teaches computer systems/architecture has all the right to grin and say “progress indeed!” – especially after grading the test I took today.
And yet, I suppose I’ve been making progress. And no matter how much time and effort my jobs and my classes demand, I end up enjoying most of it (once I get it of course…) I suspect that if I stay on it, I might have a chance of having a rather exciting life, at least more exciting than arranging text on a webpage.
Back to the job opportunity. Have you read The Hitchhiker’s Guide to the Galaxy? In the last book, Mostly Harmless, Arthur Dent visits an oracle on planet Hawalius (the entire planet is inhabited by prophets and oracles). He asks her for advice - “…just sort of general advice… what to do with life, that sort of thing…” So she photocopies some notes for him. Is that the advice?
“No,” said the old lady. “It’s the story of my life. You see, the quality of any advice anybody has to offer has to be judged against the quality of life they actually lead. Now, as you look through this document you’ll see that I’ve underlined all the major decisions I ever made to make them stand out. They’re all indexed and cross-referenced. See? All I can suggest is that if you take decisions that are exactly opposite to the sort of decisions that I’ve taken, then maybe you won’t finish up at the end of your life” — she paused, and filled her lungs for a good shout — “in a smelly old cave like this!”
What about the turn I didn’t take a few days ago? Will it be part of a similar story of MY life?
Posted by Sergei on April 8, 2005 at 1:04 am under Personal. Comments Off.
February 14, 2005
The other day I was discussing my blog with my academic advisor. He suggested that seeing the evolving thinking of a graduate student about his research might make a good serial read. So, I suppose instead of trying to impress my readers with ground-breaking ideas, I’ll simply reflect on my learning process. I know for certain that I will find this both interesting and useful. Maybe I’ll be not the only one
Speaking of “evolving thinking” - here’s how I got here: ages ago, when I was young and hopeful, after experiencing major disappointment with my career (as well as Life, the Universe and Everything), I decided to switch from web design and public relations to something completely new - computer science. Why? Because I wanted to live my life doing something cool. Since I wasn’t about to go to Mars, become James Bond or find a secret passage to Middle-earth (I’m still looking though), I figured becoming a computer scientist might be as close to “cool” as I can get.
Here’s my application essay which I submitted to the Department of Computer Science at UNI in June 2001. It’ll give you an idea of just how clueless I was about the field at that time; but hey - my “thinking has evolved” 
Today, in 2005, after retaking numerous classes, spending what seems like centuries in front of the computer screen and watching generations of college students come and go, I’m terrified with the possibility of NOT having made that switch. It has become clear to me that most likely I will be studying/learning new stuff all my life. It has also become clear to me that that’s okay.
Posted by Sergei on February 14, 2005 at 2:02 am under Personal. Comment on this post.
February 11, 2005
I chose this quote from Socrates for today’s title because it reflects exactly the way I’ve been feeling over the past several days. I was even going to use a slightly different wording before I realized that I was about to paraphrase a 2400 year old quote.
Last semester I had a class “Agile software development”. It was a very interesting class, I’ll definitely blog about it some time soon. During that semester we built a simple content management system for processing xml-like text files. I’ve continued my work on this system taking an independent studies class with my advisor.
Speaking of which, independent studies are hard. You don’t have classes, you don’t have homework, you don’t have lectures – you only have regular meetings with your professor and usually a semester-long project you’re working on. The hard part is not motivating yourself to keep going when there’s no homework. The hard part is to know when to stop reading - to stop researching an area and move on to the next chapter or next book or move on to the implementation. I never stop in time, and as a result, my total ignorance is revealed to me over and over again.
The CMS we built last semester was supposed to be able to process files of this kind:
<book>
<chapter source=”ch1.txt”>
<title>Chapter 1</title>
<section>
<para>some text</para>
</section>
</chapter>
<chapter>… and so on.
The tags had various attributes; a book had chapters, chapters had titles and sections, sections had images, paragraphs, code references and so on and so forth. Our CMS was supposed to read a collection of source files and generate a very simple web site for the book. I prided myself with the fact that I was among the very few who decided to build an object tree while parsing the input, thus having a flexible structure which could accommodate any output formats (I didn’t get to thinking of modification which would be so very simple with an object tree). My system read the input and created a book object, populated it with chapter objects, each of them contained section objects, etc. – so my tree was a replication of the input source syntax. Want a web site? No problem! A different web site? A cvs or pdf file? That’s easy! Changes to section 3 of chapter 5? Piece of cake!
Now that I’m digging deeper into the subject, here’s what I’ve found out:

“Of the main approaches to processing, most developers find tree-based processing most obvious and easiest to understand… Tree-based processing probably seems so natural because it presents the document as something that can be “seen” and “touched,” almost as if it were a physical object.” Okay, so I didn’t invent anything new. Actually, had I bothered doing some research on XML parsing, I would’ve known it last semester. But here’s more: it turns out that the main and very significant disadvantage of this approach is performance. According to the book I’m reading, it is not unusual for a document to require 10 times as much memory in a tree-based representation as an un[processed document does on disk. So, if my parser is processing a program, that should not be a problem. However, if it is working on an XML document with multiple entities which represent a 1000 page book with thousands of tags which must be implemented as objects in my tree, I think I might be very short of memory.
So what did I learn today? I have learned something about XML parsing, but I still know nothing except the fact of my ignorance.
Posted by Sergei on February 11, 2005 at 2:02 am under Computer Science Classes. Comments Off.
|
|