reflections of a wizard in training

As I mentioned in one of my previous posts, last semester I learned the basics of information storage and retrieval and implemented a web-based retrieval system for searching through the medline.osu database. The collection contained well-structured data, so I was able to build a rather interesting application, which even included a dynamic clustering feature. However, to determine whether my system was actually effective, especially at generating clusters of related documents, I would need significant domain knowledge, and the medline database is... medical. I wanted something where I could see the result and judge its correctness, at least to some extent.

The solution was obvious - use the Web! However, I didn’t want to implement yet another search engine - I wanted something relatively new. No, not because I wanted to be the first or the best - it’s only a learning exercise; I wanted something new because it is more exciting to explore a topic you can’t read about in every textbook. Or maybe a search engine simply didn’t spark my interest. So I decided to try to implement an online category browser for a large hierarchy of documents. My inspiration came from Cat-a-Cone (Marti Hearst, Chandu Karadi) and Cha-Cha (Marti Hearst, Michael Chen).

Cat-a-Cone

Cat-a-Cone is an example of a category browser (if I may put it that way). According to the project authors, “one key insight is the separation of the representation of category labels from documents, which allows the display of multiple categories per document. Another key component is the display of multiple selected categories simultaneously, complete with their hierarchical context.” In other words, before searching the collection, the user can explore the collection’s categories and their hierarchical structure. In a large hierarchy this helps to disambiguate ambiguous category names, and because the categories are displayed in their hierarchical context, it may also help the user identify related terms and improve the search query.
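To make the "labels separate from documents" idea concrete, here is a minimal sketch in Python. It is my own illustration, not Cat-a-Cone's actual data model: the `Category` class, the sample labels, and the `documents` mapping are all made up. Categories live in their own tree, documents only hold references into it, and a label's full path is what disambiguates two categories with the same name.

```python
class Category:
    """A node in the category hierarchy (hypothetical, for illustration only)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the label together with its hierarchical context."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.label)
            node = node.parent
        return " > ".join(reversed(parts))


# Two categories share the label "Viruses" but sit in different branches.
root = Category("Root")
science = Category("Science", root)
computing = Category("Computing", root)
viruses_bio = Category("Viruses", Category("Biology", science))
viruses_cs = Category("Viruses", Category("Security", computing))

# Documents are stored apart from the tree and may point at several categories.
documents = {
    "doc1": [viruses_bio],
    "doc2": [viruses_cs, computing],
}


def categories_in_context(doc_id):
    """List every category of a document with its full path."""
    return sorted(category.path() for category in documents[doc_id])


print(categories_in_context("doc2"))
```

Because a document only references category nodes, adding a second or third category to it does not touch the hierarchy at all, which is the property the quote above is pointing at.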

Cha-Cha

Cha-Cha is a “system for organizing intranet search results,” or a search engine. However, it also “organizes web search results in such a way as to reflect the underlying structure of the intranet.” It is really neat - in addition to displaying a list of results, it can display a hierarchy of the underlying web collection and list the search results in the context of this hierarchy. Very cool!

So, I decided I would combine the two in the following way. I would use an existing web crawler to explore a limited web domain (for example, the web space of a university) and build the document collection. Then I would explore the collection (or explore it while crawling) and try to identify a hierarchy. Once the hierarchy was created, I would implement a web-based Cat-a-Cone to browse this hierarchy. Certainly, it would be based on a completely different user interface, but the key concept of a browsable hierarchy would remain.
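As a rough sketch of that plan, the snippet below builds a hierarchy from a list of crawled URLs and then shows search hits in the context of that hierarchy. Everything here is an assumption for illustration: I use URL path segments as a stand-in for the "underlying structure" (which is not necessarily how Cha-Cha derives it), the `example.edu` URLs are made up, and the helper names `segments`, `build_hierarchy` and `render` are hypothetical.

```python
from urllib.parse import urlparse


def segments(url):
    """Split a URL into host plus non-empty path segments."""
    parsed = urlparse(url)
    return (parsed.netloc, *[s for s in parsed.path.split("/") if s])


def build_hierarchy(urls):
    """Build a nested dict where each key is one segment of a crawled URL."""
    tree = {}
    for url in urls:
        node = tree
        for seg in segments(url):
            node = node.setdefault(seg, {})
    return tree


def render(tree, hits, prefix=(), depth=0):
    """Return an indented outline of the tree, marking nodes that matched a search."""
    lines = []
    for name in sorted(tree):
        path = prefix + (name,)
        mark = "  <- match" if path in hits else ""
        lines.append("  " * depth + name + mark)
        lines.extend(render(tree[name], hits, path, depth + 1))
    return lines


# Made-up crawl of a small university web space.
crawled = [
    "http://www.example.edu/cs/courses/ir.html",
    "http://www.example.edu/cs/people/index.html",
    "http://www.example.edu/library/hours.html",
]
tree = build_hierarchy(crawled)

# Pretend a search returned a single page; show it inside the hierarchy.
hits = {segments("http://www.example.edu/cs/courses/ir.html")}
outline = render(tree, hits)
print("\n".join(outline))
```

A real system would probably need something smarter than URL paths (link structure, or explicit categories), but even this toy version shows how a flat result list turns into something you can browse.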

This project is a little too large to be implemented in one semester. After I discussed it with Dr. O’Kane, we decided to focus on a specific web collection that already has a rather well-defined structure behind it and a large enough homogeneous collection of documents - Wikipedia. The system will explore the information space and, in addition to indexing it, will attempt to build a hierarchy. The retrieval system will be, as initially planned, web-based, with a category browser analogous to Cat-a-Cone (without the 3D features, of course). That’s the plan.
