reflections of a wizard in training

Archive for the ‘Computer Science Classes’ Category

For the past several weeks I have been working on a project in Information Storage and Retrieval (ISR). The problem is… I haven’t selected a specific project yet. I have an idea; but first I’ll write about the class in ISR I took last semester - that will help me organize my thoughts on the subject a little better before I go into details about my further studies.

vector space model

The class I took last semester was my first introduction to the field - I learned quite a bit about the vector space model, document indexing, building document-term matrices, stop lists, automatic clustering, etc… Quite interesting and quite complicated. Gone were the times when I saw retrieving information as a simple matter of sending a query to a database… The vector space model was particularly hard to grasp at first, but then - it became clear and obvious: think of it as multi-dimensional hyperspace with an axis for each term in the collection. Documents are plotted as points in this hyperspace whose location is determined by the terms (including their relative weights) that occur in them. Very cool… Understanding it was in way like understanding how language translation works: there was this area I knew nothing about, then after being introduced to it I tried to grasp it but failed miserably; but eventually, after much effort, it revealed itself. Very satisfying; it’s like mastering an impossibly difficult spell…

I also learned a new language - MUMPS. It’s a weird hairy beast which takes forever getting used to; but it has some advantages, and it does offer some unique opportunities, like providing the means to efficiently storing virtually unlimited amounts of data in multi-dimensional arrays, with relatively little code involved. The language tends to be annoying at times, but, quite honestly, I have no idea how I would’ve dealt with a 300MB text-based database in Java…

There was a class project at the end of the semester. The winner was excused from taking the final exam. Two projects were selected as winners - one was by my fellow student Andrew Friedley, the other one was built by me and my partner Alex Stevenson.

Out project was an ISR system called Giggle (Alex thinks it’s a ridiculous name). Giggle is a searching/browsing interface for the MEDLINE database (we used a 30MB subset of the document collection) implemented in MUMPS, C, JavaScript and HTML. The indexing was done mostly by using existing code provided in the lecture notes. The system used data from several generated matrices (term-term, doc-term, etc…), a stop list, the Medline text file …and quite possibly, more stuff which I’m forgetting but have no inclination in searching for in the 700+ lines of ugly code we wrote. The system was, actually, not bad at all. It was my first serious attempt at MUMPS and C, so there was much learning along the way - that caused some ugly design decisions, and yet, as my professor observed, “surprisingly, it worked.”

Giggle provided a rich search and browsing experience. It included the following:

  • a standard search interface with a soundex feature (converting words that sound like one another into common codes: “did you mean ‘xyz’?”);
  • a document view page which highlighted the query terms, converted other terms in the document to links to further queries and a ranked list of related documents;
  • a browsing page with a list of terms, expanding in some cases to related terms combinations;
  • a dynamic clustering feature that took the top 100 results and generated clusters, identifying most common terms for each cluster;
  • a search history page which displayed a list of past searches and a list of viewed documents - both implemented as lists of links.

giggle1s.gif giggle2s.gif giggle3s.gif giggle4s.gif giggle5s.gif
The best part of it was - it was all on one web page. There were 5 tile-shaped links (like on amazon.com) which, when clicked, displayed alternative views of the data. The data in the hidden views was preserved across the session. No, we didn’t store anything on the server - that would be prohibitive in a real-world application. The state management model was borrowed from .NET - the entire html page was treated as a form with the persistent data being passed through a hidden form field. Click events were managed with JavaScript which dynamically changed the values of 2 other hidden fields, setting the action and the target of the event. Now that I think of it, the concept was elementary, I have no idea why nobody had thought of it before asp.net.I like what we did with Giggle, and I’d like to further develop that system, eventually building a simple state/event-management framework. However, for my ISR Project class I’d like to focus more on the subject of information storage and retrieval - not state management in web applications. I suppose I might be able to do both.

Giggle is temporarily available at http://neamh.cns.uni.edu/~wizard/giggle.html.

In one of my previous posts I mentioned that my posts will become less entertaining and more focused on my daily studies. So, “here come one now!”

While building a parsing table, I’ve discovered that the algorithm for building a table described in the Dragon book does not work for right-recursive grammars! Consider this simple example:

expression -> number
expression -> expression + expression

Eliminating left recursion will result in the following grammar:

expression -> number expression-tail
expression-tail -> + expression expression-tail
expression-tail -> Ø

Now let’s build the first and follows functions:

first(expression) = { number }
first(expression-tail) = { + null }

follows(expression) = { + $ }
follows(expression-tail) = { + $ }

A predictive parser doesn’t backtrack: it uses the combination of its current state and the next token to determine what production to apply. Therefore, if the parser is in state ‘A’ and the next token is ‘b’ there should be only one production to follow. Let’s construct a parsing table for the grammar we’re looking at using the algorithm from the book:

For each production A := a, do the following:

  • For each terminal a in FIRST(a), add A := a to M[A, a].
  • If Ø is in FIRST(a), add A := a to M[A, b] for each terminal b in FOLLOW(A).

This algorithm will result in 2 entries for the combination of expression-tail and the token ‘+’. However, since our parser allows only one production per combination, the first will be overridden by the second. As a result, whenever the parser sees a ‘+’ and is in the expression-tail state, it will use the “expression-tail -> Ø” production.

I think the problem lies in right recursion. Our expression-tail can start with ‘+’, derive to nothing and be followed by ‘+’. Since the following symbol can be only an expression-tail, the grammar isn’t ambiguous. However, to build a correct parsing table, the algorithm needs a minor addition:

- If Ø is in FIRST(a), add A := a to M[A, b] for each terminal b in FOLLOW(A) only if b is not in FIRST(A)

Is it correct? I don’t know for sure. Adding a production like

expression-tail -> + anything

would kill it. But it works for the Player grammar.

Do you know why we didn’t see any tigers or dragons in that movie? Because they were all crouching and hiding :-)

I’m currently taking a class on compilers. I tried taking it a couple years ago, but it was a bad idea: a data structures class should come before language translation, or else you fail miserably. I failed miserably. But this time, having taken data structures, programming languages and algorithms I feel much better about this class: for one thing, I know what I’m doing this time, for another – I like what I’m learning.

dragonbook.jpg

I’m writing a compiler for “Player”, a small imperative language with boolean, integer, real, and string values; arrays and records with implicit pointers; user-defined types; nested procedures; and a few simple control structures. It’s relatively small, yet not trivial. The book I’ve been using until recently was Appel’s “Modern Compiler Implementation in Java,” also known among my fellow students as “the tiger book” (it has a tiger on the cover). Two years ago this book left me with the feeling that my quest for knowledge is beyond hopeless – it would take me weeks to finish a chapter, and yet I wouldn’t understand half of it. Now that I know a lot more, I thought it would be different – but it’s not.

tigerbook.jpg

So I tried another book.

The lecture notes (provided on the course website) often refer to “the dragon book” - “Compilers” by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman (with a dragon on the cover). My professor warned me that the book is a lot more detailed than what we need for this class. I tried it. More in-depth? Certainly. More complicated? Hell no! It does provide a very detailed analysis of each phase of the compiler-building process. However, these details build-up a very clear picture of how the compiler works.

I’m not even halfway through – I’ve just finished building a parser for the language. But I could’ve never built it relying on Appel’s book. I’ve been getting an impression that the “tiger book,” even when offering a detailed analysis, sort of jumps to conclusions; it’s like a bad derivation – it claims that A -> D without showing the intermediate steps: B -> C and C -> D. I may be wrong, but that’s exactly the impression I got from this book.

My vote is for the dragon book. Not only did it help me with this class (or should I say – it made the class assignments possible for me!); it also enabled me to take a step further and write my first parser-generator, and that’s way beyond anything I even understood a year ago. Thank you, Dragon Book!

I chose this quote from Socrates for today’s title because it reflects exactly the way I’ve been feeling over the past several days. I was even going to use a slightly different wording before I realized that I was about to paraphrase a 2400 year old quote.

Last semester I had a class “Agile software development”. It was a very interesting class, I’ll definitely blog about it some time soon. During that semester we built a simple content management system for processing xml-like text files. I’ve continued my work on this system taking an independent studies class with my advisor.

Speaking of which, independent studies are hard. You don’t have classes, you don’t have homework, you don’t have lectures – you only have regular meetings with your professor and usually a semester-long project you’re working on. The hard part is not motivating yourself to keep going when there’s no homework. The hard part is to know when to stop reading - to stop researching an area and move on to the next chapter or next book or move on to the implementation. I never stop in time, and as a result, my total ignorance is revealed to me over and over again.

The CMS we built last semester was supposed to be able to process files of this kind:


<book>
  <chapter source=”ch1.txt”>
    <title>Chapter 1</title>
      <section>
        <para>some text</para>
      </section>
  </chapter>
  <chapter>… and so on.

The tags had various attributes; a book had chapters, chapters had titles and sections, sections had images, paragraphs, code references and so on and so forth. Our CMS was supposed to read a collection of source files and generate a very simple web site for the book. I prided myself with the fact that I was among the very few who decided to build an object tree while parsing the input, thus having a flexible structure which could accommodate any output formats (I didn’t get to thinking of modification which would be so very simple with an object tree). My system read the input and created a book object, populated it with chapter objects, each of them contained section objects, etc. – so my tree was a replication of the input source syntax. Want a web site? No problem! A different web site? A cvs or pdf file? That’s easy! Changes to section 3 of chapter 5? Piece of cake!

Now that I’m digging deeper into the subject, here’s what I’ve found out:

xmlbook.jpg

“Of the main approaches to processing, most developers find tree-based processing most obvious and easiest to understand… Tree-based processing probably seems so natural because it presents the document as something that can be “seen” and “touched,” almost as if it were a physical object.” Okay, so I didn’t invent anything new. Actually, had I bothered doing some research on XML parsing, I would’ve known it last semester. But here’s more: it turns out that the main and very significant disadvantage of this approach is performance. According to the book I’m reading, it is not unusual for a document to require 10 times as much memory in a tree-based representation as an un[processed document does on disk. So, if my parser is processing a program, that should not be a problem. However, if it is working on an XML document with multiple entities which represent a 1000 page book with thousands of tags which must be implemented as objects in my tree, I think I might be very short of memory.

So what did I learn today? I have learned something about XML parsing, but I still know nothing except the fact of my ignorance.

contact me
blog
research
about