reflections of a wizard in training

ABOUT MY RESEARCH
I am interested in finding new ways of extracting content from web documents, discovering their logical structure, and mapping them to existing or automatically generated ontologies. This has led me to study different models of information retrieval, as well as machine learning and natural language processing techniques applied to text summarization and classification.

Coming from a web development background, I am also interested in building better tools to help people understand and use complex information environments. Up to now, my work in this area has been limited mostly to application development: building content management systems for increasingly large web sites. However, as I explore ways to write programs that “understand” web sites, my interest is shifting towards intelligent user interfaces and web sites that can adapt their structure to different navigation patterns – i.e., systems that “understand” their users.

ADVENTURES IN INFORMATION RETRIEVAL
I acquired an interest in information retrieval while working on a project that implemented the vector space model and provided a web-based searching/browsing interface for the Medline database, with dynamic clustering of search results based on the scatter/gather method (similar to the Grouper project at the University of Washington).
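
For readers unfamiliar with the vector space model, the following is a minimal sketch of the idea, not the code of the project described above. It assumes whitespace tokenization and a simple TF-IDF weighting (term count times log(N/df) + 1); the function names and the weighting variant are illustrative choices only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a demo.
    return text.lower().split()


def weigh(tokens, idf):
    # TF-IDF weights as a sparse vector; terms unseen in the corpus are dropped.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def build_index(docs):
    # Returns (idf table, list of sparse document vectors).
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return idf, [weigh(toks, idf) for toks in tokenized]


def cosine(a, b):
    # Cosine of the angle between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def search(query, docs):
    # Rank document indices by similarity to the query, best first.
    idf, vectors = build_index(docs)
    q = weigh(tokenize(query), idf)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for s, i in sorted(scores, reverse=True) if s > 0]
```

Documents and queries live in the same term space, so ranking reduces to comparing angles between vectors. The same document vectors could, in principle, also feed a clustering step such as scatter/gather.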

As an additional benefit of working on this project, I had the opportunity to experiment with the back-end of web-based user interfaces. Having only C in my toolbox, I implemented the basic parts of the .NET event-handling and state-management models, thus creating my own framework from scratch and learning how .NET really works without looking at its sources.

The project was voted best in class and was later demonstrated at the College of Natural Sciences Preview Day. More importantly, it triggered my interest in the field and led me to my new work: a system that maps an existing ontology to a set of 1.3 million documents from Wikipedia (or any other set) and provides a web-based searching/browsing interface for the generated hierarchy. My category browser is loosely based on the Cat-a-Cone project by Marti Hearst (Xerox PARC, now UC Berkeley) and Chandu Karadi (Stanford University). Dr. Hearst, whom I contacted with a question regarding this program, suggested that I consider going the “Flamenco” way, a newer browsing interface project. Still, I am convinced that the Cat-a-Cone concept has some untapped possibilities, which I am determined to uncover.

WEB SITE STORY
My interest in web sites and their logical organization is based on my past work as a web developer. Two years after launching my first web site, I discovered the concept of content management; since then I have built content/web site management systems for more than 20 web sites, including online stores, accounting applications, and career and community web sites. However, as I kept adding more features to my systems, I became concerned that my “all-in-one” solution was trying to be “everything”, which is usually a sign of terrible design. That led me to think about a web site in more general terms: what exactly am I trying to manage?

I concluded that any web site, whether static or dynamic, is, in fact, a structured collection of information. Contrary to what most of the literature suggests, this structure is not a tree. Consider a faculty web page: we have a set of courses, a set of lecture notes for each course, a set of research interests, graduate students, projects, etc. Most of these collections are interrelated and, therefore, interlinked, which makes them a set of connected hierarchies, i.e., a graph. It is my strong opinion that a web site management system should be primarily concerned with managing this graph, not the content within its nodes.
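
The faculty page example can be made concrete with a small sketch. Everything here is hypothetical: the `SiteGraph` class, its method names, and the page names are invented for illustration and do not come from any existing system. The point is only that a node with more than one parent cannot be placed in a tree.

```python
from collections import defaultdict


class SiteGraph:
    """Hypothetical model of a web site as a set of connected hierarchies."""

    def __init__(self):
        self.children = defaultdict(set)
        self.parents = defaultdict(set)

    def link(self, parent, child):
        # Record a directed "contains / points to" relation.
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def shared_nodes(self):
        # Nodes with several parents are what break the tree assumption.
        return sorted(n for n, ps in self.parents.items() if len(ps) > 1)


# Invented faculty site: three hierarchies hanging off the home page.
site = SiteGraph()
for section in ("courses", "research", "students"):
    site.link("home", section)
site.link("courses", "ir-course")
site.link("ir-course", "lecture-notes")
site.link("research", "web-mining-project")
site.link("students", "alice")
# Cross-links between hierarchies:
site.link("web-mining-project", "alice")      # alice works on the project
site.link("ir-course", "web-mining-project")  # the course references it

print(site.shared_nodes())
```

In this toy site, the student and the project each belong to two hierarchies at once, which is exactly the situation a tree-shaped content model has to work around.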

My coding experiments led me to attempt the design of a simple XML-based language for maintaining a set of connected hierarchical collections of data. However, once I got to the point of inventing template-based processing and a syntax to query my tag-based hierarchy, I discovered that I was about to reinvent XSLT and XPath. Nevertheless, my frustration was short-lived: I returned to the ideas from my information retrieval project and applied them in a different way. Instead of mapping an existing ontology to a set of documents, why not use similar methods to derive a brand new ontology from a given set? Would it be possible to discover a web site’s logical organization? Moreover, would it be possible to derive structures for collections combined from multiple sources, making the boundaries of a single web site irrelevant?

DISCOVERING STRUCTURAL PATTERNS… OR MAYBE CREATING THE SEMANTIC WEB?
My new research began with analyzing the content and structure of individual web pages, which, according to current research, may provide information about the page semantics. I found that most of the suggested information extraction models depended heavily on specific syntax; inferring a grammar from the HTML source and then using it to parse the pages was applicable only to a small subset of my target information space. Yet other researchers (like William Cohen of Carnegie Mellon University) argue that people employ general-purpose, page-independent strategies for recognizing structure in documents, which is exactly the approach I am looking for.

My goal at this stage is to automatically recognize the logical parts of a single web page: its content area, title and menus, which, together with the set of its external and internal links and any meta-information, will be used to assign it a place in the collection’s logical structure. I am applying a combination of existing techniques, including mining for repetitive elements (lists, tables, headings, etc.), using a large set of heuristics based on formatting patterns, and analyzing inbound links together with the surrounding textual context in the parent pages. However, I am convinced that there is another method: letting the machine “observe” the rendered page as a two-dimensional image and applying heuristics based on its geometry, color and a set of other visual characteristics, thus emulating, to some extent, a human approach. This idea may be far-fetched, but it is a new approach that I haven’t seen used, and I am excited to explore it.
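
As a rough illustration of what mining for repetitive elements can look like, the sketch below counts how often each tag path occurs in a page, using Python's standard `html.parser`. It is one simple heuristic under stated assumptions, not the system described above: the class name, the `min_repeats` threshold, and the short list of void tags are all illustrative choices.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: a short, incomplete list of tags that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class RepetitionFinder(HTMLParser):
    """Counts occurrences of each tag path, e.g. 'html/body/ul/li'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: unwind to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def repeated_paths(html, min_repeats=3):
    # Paths that occur at least min_repeats times are candidate menus or lists.
    finder = RepetitionFinder()
    finder.feed(html)
    return {p: c for p, c in finder.counts.items() if c >= min_repeats}
```

A path that repeats many times, such as a run of list items that each contain a link, is a candidate menu or index. Note that this sketch does not distinguish separate lists that share the same path, and telling a menu apart from an ordinary list would still require the formatting and link-context heuristics mentioned above.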

At this point, I am not sure about the practical applications of my research. In general, if a logical structure can be automatically derived from any collection of web documents, then we have a machine that is capable of “understanding” web pages. Besides the obvious applications, such as adaptive and auto-generated web sites, such a machine may help in creating the Semantic Web, so the sky is the limit.
