reflections of a wizard in training

Archive for the ‘Information Retrieval’ Category

ABOUT MY RESEARCH
I am interested in finding new ways of extracting content from web documents, discovering their logical structure, and mapping them to existing or automatically generated ontologies. This has led me to the study of different models of information retrieval, machine learning and natural language processing techniques applied to text summarization and classification.

Coming from a web development background, I am also interested in building better tools to help people understand and use complex information environments. Up to now, my work in this area has been limited mostly to application development: I have been building better content management systems for bigger web sites; however, as I explore ways to write programs that “understand” web sites, my interest is shifting towards building intelligent user interfaces and web sites, which can adapt their structure to different navigation patterns – i.e., systems that “understand” their users.

ADVENTURES IN INFORMATION RETRIEVAL
I acquired an interest in information retrieval while working on a project, which implemented the vector space model and provided a web-based searching/browsing interface for the Medline database, with dynamic clustering of search results based on the scatter/gather method (similar to the Grouper project, implemented at the University of Washington).

As an additional benefit of working on this project, I had the opportunity to experiment with the back-end of web-based user interfaces. Having only C in my toolbox, I implemented the basic parts of the .Net event-handling and state-management models, thus creating my own framework from scratch and learning how .Net really works, without looking at its sources.

The project was voted best in class and later was demonstrated at the College of Natural Sciences Preview Day; but more importantly – it triggered my interest in the field and led me to my new work: a system which maps an existing ontology to a set of 1.3 million documents from Wikipedia (or any other set) and provides a web-based searching/browsing interface for the generated hierarchy. My category browser is loosely based on the Cat-a-Cone project by Marti Hearst (Xerox PARC, now UC Berkeley) and Chandu Karadi (Stanford University). Although Dr. Hearst (whom I contacted with a question regarding this program) suggested that I consider going the “Flamenco” way, which is a newer browsing interface project, I am still convinced that the Cat-a-Cone concept has some untapped possibilities, which I am determined to uncover.

WEB SITE STORY
My interest in web sites and their logical organization is based on my past work as a web developer. Two years after launching my first web site, I discovered the concept of content management; since then I have built content/web site management systems for more than 20 web sites, including online stores, accounting applications, career and community web sites. However, as I kept adding more features to my systems, I became concerned, that my “all-in-one” solution was trying to be “everything” – which is usually a sign of terrible design. That led me to thinking about a web site in more general terms: what exactly am I trying to manage?

I concluded that any web site, whether static or dynamic, is, in fact, a structured collection of information. Unlike most literature suggests, this structure is not a tree. Consider a faculty web page: we have a set of courses, a set of lecture notes for each course, a set of research interests, graduate students, projects, etc., with most of these collections interrelated and, therefore, interlinked, which constitutes a set of connected hierarchies, i.e. – a graph. It is my strong opinion, that a web site management system should be primarily concerned with managing this graph, not the content within its nodes.

My coding experiments led me to attempting the design of a simple XML-based language for maintaining a set of connected hierarchical collections of data. However, once I got to the point of inventing template-based processing and a syntax to query my tag-based hierarchy, I discovered that I was about to reinvent XSLT and XPath. Nevertheless, my frustration was short-lived: I turned to the ideas from my project in information retrieval, but applied them in a different way: instead of mapping an existing ontology to a set of documents, why not use similar methods to derive a brand new ontology from a given set? Would it be possible to discover a web site’s logical organization? Moreover, would it be possible to derive structures for collections combined from multiple sources, making the boundaries of a single web site irrelevant?

DISCOVERING STRUCTURAL PATTERNS …OR MAYBE CREATING THE SEMANTIC WEB?
The starting point of my new research became analyzing the content and structure of individual web pages, which, according to current research, may provide information about the page semantics. I found that most of the suggested information extraction models depended heavily on specific syntax; however, inferring a grammar from the HTML source and then using it to parse the pages was applicable only to a small subset of my target information space. Yet, other researchers (like William Cohen of Carnegie Mellon University) argue that people employ general-purpose, page-independent strategies for recognizing structure in documents – which is exactly the approach I am looking for.

My goal at this stage is to automatically recognize the logical parts of a single web page: its content area, title and menus, which, together with the set of its external and internal links and any meta-information, will be used to assign it a place in the collection’s logical structure. I am applying a combination of existing techniques, including mining for repetitive elements (lists, tables, headings, etc…), using a large set of heuristics based on formatting patterns, and analyzing inbound links together with the surrounding textual context in the parent pages. However, I am convinced that there is another method, which is to let the machine “observe” the rendered page as a two-dimensional image, applying heuristics based on its geometry, color and a set of other visual characteristics – thus, emulating, to some extent, a human approach. This idea may be far-fetched; but it’s a new approach, I haven’t seen it used and I am excited to explore it.

At this point, I am not sure about the practical application of my research. In general, if a logical structure can be automatically derived from any collection of web documents, that means we have a machine, which is capable of “understanding” web pages – which, besides the obvious applications, like adaptive and auto-generated web sites, may help in creating the Semantic Web – so the sky is the limit.

As I mentioned in one of my previous posts, last semester I learned the basics of information storage and retrieval and implemented a web-based retrieval system for searching through the medline.osu database. The collection contained well-structured data, so I was able to build a rather interesting application, which even included a dynamic clustering feature. However, to determine whether my system was indeed efficient, especially in generating clusters of related documents, I would need significant domain knowledge, but the medline database is …medical. I wanted something where I could see the result and judge its correctness, at least to some extent.

The solution was obvious - use the Web! However, I didn’t want to implement yet another search engine - I wanted something relatively new. No, not because I wanted to be the first or the best - it’s only a learning exercise; I wanted something new because it is more exciting to explore a topic you can’t read about in every textbook. Or maybe a search engine simply didn’t sparkle my interest. So I decided to try and implement an online category browser for a large hierarchy of documents. My inspiration came from the Cat-a-Cone (Marti Hearst, Chandu Karadi) and Cha-Cha (Marti Hearst, Michael Chen).

Catacone

Cat-a-Cone is an example of a category browser (if I may put it that way). According to the project authors, “one key insight is the separation of the representation of category labels from documents, which allows the display of multiple categories per document. Another key component is the display of multiple selected categories simultaneously, complete with their hierarchical context.” In other words, before searching the collection, the user can explore the collection’s categories and their hierarchical structure. In a large hierarchy that helps both to disambiguate ambiguous category names and, since the categories are displayed in an hierarchy, it may help to identify related terms and improve the search query.

Cha-cha

Cha-Cha is a “system for organizing intranet search results,” or a search engine. However, it also “organizes web search results in such a way as to reflect the underlying structure of the intranet.” It is really neat - in addition to displaying list of results, it can display an hierarchy of the underlying web collection and list the search results in the context of this hierarchy. Very cool!
So, I decided I would combine the two in the following way. I would use an existing web crawler to explore a limited web domain (for example, the web space of a university) and build the document collection. Then I would explore the collection (or explore it while crawling) and would try to identify an hierarchy. Once the hierarchy was created, I would implement a web-based Cat-a-Cone to browse this hierarchy. Certainly, it would be based on a completely different user interface; but the key concept of a browseable hierarchy would remain.

This project is a little too large to be implemented in one semester. After discussing it with Dr. O’Kane, we decided to focus on specific web collection which already had a rather well-defined structure behind it and a large enough homogeneous collection of documents - Wikipedia. The system will explore the information space and in addition to indexing it will attempt to build an hierarchy. the retrieval system will be, as initially planned, web-based with a category browser analogous to Cat-a-Cone (without the 3D features of course). That’s the plan.

For the past several weeks I have been working on a project in Information Storage and Retrieval (ISR). The problem is… I haven’t selected a specific project yet. I have an idea; but first I’ll write about the class in ISR I took last semester - that will help me organize my thoughts on the subject a little better before I go into details about my further studies.

vector space model

The class I took last semester was my first introduction to the field - I learned quite a bit about the vector space model, document indexing, building document-term matrices, stop lists, automatic clustering, etc… Quite interesting and quite complicated. Gone were the times when I saw retrieving information as a simple matter of sending a query to a database… The vector space model was particularly hard to grasp at first, but then - it became clear and obvious: think of it as multi-dimensional hyperspace with an axis for each term in the collection. Documents are plotted as points in this hyperspace whose location is determined by the terms (including their relative weights) that occur in them. Very cool… Understanding it was in way like understanding how language translation works: there was this area I knew nothing about, then after being introduced to it I tried to grasp it but failed miserably; but eventually, after much effort, it revealed itself. Very satisfying; it’s like mastering an impossibly difficult spell…

I also learned a new language - MUMPS. It’s a weird hairy beast which takes forever getting used to; but it has some advantages, and it does offer some unique opportunities, like providing the means to efficiently storing virtually unlimited amounts of data in multi-dimensional arrays, with relatively little code involved. The language tends to be annoying at times, but, quite honestly, I have no idea how I would’ve dealt with a 300MB text-based database in Java…

There was a class project at the end of the semester. The winner was excused from taking the final exam. Two projects were selected as winners - one was by my fellow student Andrew Friedley, the other one was built by me and my partner Alex Stevenson.

Out project was an ISR system called Giggle (Alex thinks it’s a ridiculous name). Giggle is a searching/browsing interface for the MEDLINE database (we used a 30MB subset of the document collection) implemented in MUMPS, C, JavaScript and HTML. The indexing was done mostly by using existing code provided in the lecture notes. The system used data from several generated matrices (term-term, doc-term, etc…), a stop list, the Medline text file …and quite possibly, more stuff which I’m forgetting but have no inclination in searching for in the 700+ lines of ugly code we wrote. The system was, actually, not bad at all. It was my first serious attempt at MUMPS and C, so there was much learning along the way - that caused some ugly design decisions, and yet, as my professor observed, “surprisingly, it worked.”

Giggle provided a rich search and browsing experience. It included the following:

  • a standard search interface with a soundex feature (converting words that sound like one another into common codes: “did you mean ‘xyz’?”);
  • a document view page which highlighted the query terms, converted other terms in the document to links to further queries and a ranked list of related documents;
  • a browsing page with a list of terms, expanding in some cases to related terms combinations;
  • a dynamic clustering feature that took the top 100 results and generated clusters, identifying most common terms for each cluster;
  • a search history page which displayed a list of past searches and a list of viewed documents - both implemented as lists of links.

giggle1s.gif giggle2s.gif giggle3s.gif giggle4s.gif giggle5s.gif
The best part of it was - it was all on one web page. There were 5 tile-shaped links (like on amazon.com) which, when clicked, displayed alternative views of the data. The data in the hidden views was preserved across the session. No, we didn’t store anything on the server - that would be prohibitive in a real-world application. The state management model was borrowed from .NET - the entire html page was treated as a form with the persistent data being passed through a hidden form field. Click events were managed with JavaScript which dynamically changed the values of 2 other hidden fields, setting the action and the target of the event. Now that I think of it, the concept was elementary, I have no idea why nobody had thought of it before asp.net.I like what we did with Giggle, and I’d like to further develop that system, eventually building a simple state/event-management framework. However, for my ISR Project class I’d like to focus more on the subject of information storage and retrieval - not state management in web applications. I suppose I might be able to do both.

Giggle is temporarily available at http://neamh.cns.uni.edu/~wizard/giggle.html.

contact me
blog
research
about