reflections of a wizard in training

Archive for the ‘Information Extraction’ Category

ABOUT MY RESEARCH
I am interested in finding new ways of extracting content from web documents, discovering their logical structure, and mapping them to existing or automatically generated ontologies. This has led me to the study of different models of information retrieval, machine learning and natural language processing techniques applied to text summarization and classification.

Coming from a web development background, I am also interested in building better tools to help people understand and use complex information environments. Up to now, my work in this area has been limited mostly to application development: I have been building better content management systems for bigger web sites; however, as I explore ways to write programs that “understand” web sites, my interest is shifting towards building intelligent user interfaces and web sites, which can adapt their structure to different navigation patterns – i.e., systems that “understand” their users.

ADVENTURES IN INFORMATION RETRIEVAL
I acquired an interest in information retrieval while working on a project, which implemented the vector space model and provided a web-based searching/browsing interface for the Medline database, with dynamic clustering of search results based on the scatter/gather method (similar to the Grouper project, implemented at the University of Washington).

As an additional benefit of working on this project, I had the opportunity to experiment with the back-end of web-based user interfaces. Having only C in my toolbox, I implemented the basic parts of the .Net event-handling and state-management models, thus creating my own framework from scratch and learning how .Net really works, without looking at its sources.

The project was voted best in class and later was demonstrated at the College of Natural Sciences Preview Day; but more importantly – it triggered my interest in the field and led me to my new work: a system which maps an existing ontology to a set of 1.3 million documents from Wikipedia (or any other set) and provides a web-based searching/browsing interface for the generated hierarchy. My category browser is loosely based on the Cat-a-Cone project by Marti Hearst (Xerox PARC, now UC Berkeley) and Chandu Karadi (Stanford University). Although Dr. Hearst (whom I contacted with a question regarding this program) suggested that I consider going the “Flamenco” way, which is a newer browsing interface project, I am still convinced that the Cat-a-Cone concept has some untapped possibilities, which I am determined to uncover.

WEB SITE STORY
My interest in web sites and their logical organization is based on my past work as a web developer. Two years after launching my first web site, I discovered the concept of content management; since then I have built content/web site management systems for more than 20 web sites, including online stores, accounting applications, career and community web sites. However, as I kept adding more features to my systems, I became concerned, that my “all-in-one” solution was trying to be “everything” – which is usually a sign of terrible design. That led me to thinking about a web site in more general terms: what exactly am I trying to manage?

I concluded that any web site, whether static or dynamic, is, in fact, a structured collection of information. Unlike most literature suggests, this structure is not a tree. Consider a faculty web page: we have a set of courses, a set of lecture notes for each course, a set of research interests, graduate students, projects, etc., with most of these collections interrelated and, therefore, interlinked, which constitutes a set of connected hierarchies, i.e. – a graph. It is my strong opinion, that a web site management system should be primarily concerned with managing this graph, not the content within its nodes.

My coding experiments led me to attempting the design of a simple XML-based language for maintaining a set of connected hierarchical collections of data. However, once I got to the point of inventing template-based processing and a syntax to query my tag-based hierarchy, I discovered that I was about to reinvent XSLT and XPath. Nevertheless, my frustration was short-lived: I turned to the ideas from my project in information retrieval, but applied them in a different way: instead of mapping an existing ontology to a set of documents, why not use similar methods to derive a brand new ontology from a given set? Would it be possible to discover a web site’s logical organization? Moreover, would it be possible to derive structures for collections combined from multiple sources, making the boundaries of a single web site irrelevant?

DISCOVERING STRUCTURAL PATTERNS …OR MAYBE CREATING THE SEMANTIC WEB?
The starting point of my new research became analyzing the content and structure of individual web pages, which, according to current research, may provide information about the page semantics. I found that most of the suggested information extraction models depended heavily on specific syntax; however, inferring a grammar from the HTML source and then using it to parse the pages was applicable only to a small subset of my target information space. Yet, other researchers (like William Cohen of Carnegie Mellon University) argue that people employ general-purpose, page-independent strategies for recognizing structure in documents – which is exactly the approach I am looking for.

My goal at this stage is to automatically recognize the logical parts of a single web page: its content area, title and menus, which, together with the set of its external and internal links and any meta-information, will be used to assign it a place in the collection’s logical structure. I am applying a combination of existing techniques, including mining for repetitive elements (lists, tables, headings, etc…), using a large set of heuristics based on formatting patterns, and analyzing inbound links together with the surrounding textual context in the parent pages. However, I am convinced that there is another method, which is to let the machine “observe” the rendered page as a two-dimensional image, applying heuristics based on its geometry, color and a set of other visual characteristics – thus, emulating, to some extent, a human approach. This idea may be far-fetched; but it’s a new approach, I haven’t seen it used and I am excited to explore it.

At this point, I am not sure about the practical application of my research. In general, if a logical structure can be automatically derived from any collection of web documents, that means we have a machine, which is capable of “understanding” web pages – which, besides the obvious applications, like adaptive and auto-generated web sites, may help in creating the Semantic Web – so the sky is the limit.

I have been studying research papers all weekend (nobody seems to believe though). A few were a waste of time, but the rest were really good. Before reflecting on papers I like, I will summarize once again what I am trying to achieve.

The system I am trying to design will explore a web site (or a collection of web sites) and will try to identify a semantic structure for it (like a site map including all web documents). After that the user will have the option to change the structure, provide a layout template (or 100 templates) and have the web site generated based on the adjusted structure and provided templates. At this point, I have no idea what would be the practical application of such a system, but I’m sure that if I build one - they will come :-) Although as a veteran web designer, I can see an obvious application right away: instead of having your web development team spend time building sample web sites to illustrate the new design and/or site navigation model - have a machine do it for you, just specify a structure and provide some templates. Or maybe you’re just curious to see what your college web site would look like if it used the design template of cnn.com - my system will help.

Far fetched? Maybe. But I’ve found that there is a lot of research taking place in the area of information extraction from web documents. A significant part of it is targeted at identifying content blocks on a web page in order to reformat it to be displayed on smaller screens. Other researchers are trying to find a way to extract specific information from web documents for storing it in structured format for further processing. Both of these topics are already providing me with a great deal of information and ideas which may help me. I will be extracting content from a collection of web documents, in order to find similarities between these documents which I will then use to discover a semantic structure of the entire collection.

Another body of knowledge deals with web document clustering. Most of this research is about improving search engine results and usability (a clue for the uninitiated: browsing a categorized set of results is much easier than looking at a list; think about a table of contents vs. the entire text of the book - that’s what clustering is about, more or less). A related topic is summarizing the web page content - that is often discussed in web clustering papers. Both of these areas - web clustering, summarization of content (as well as all those I haven’t discovered yet) - will aid me in discovering the semantic structure of my information space.

On a side note, clustering is somewhat a controversial area for me: on the one hand, it’s full of math, which scares the living jeebies out of me (I know about 10 times less math than I’d like to); on the other - for some strange reason which I can’t explain yet, I find it fascinating. Dynamically clustering a web collection, refactoring a program, organizing my desk - I have this strange attraction to structuring and organizing stuff. (I mentioned it here) Others have an attraction to art, or music, or poetry… Why structuring? Why me? Oh well…

Back to the topic. Existing research is focused on information extraction and building structures for improving search engines and displaying the content on other devices. I am hoping to learn from both and take my research in a new direction. So, back to the papers. One of them was a surprise: it was a short paper which I expected to be very general, yet while being general it contained some ideas and observations I found particularly useful for my research. The paper is Structuring web pages based on repetition of elements by Tomoyuki Nanno, Suguru Saito and Manabu Okumura of the Tokio Institute of Technology.

The paper grabbed my attention because the authors start out by asking exactly the same question I have been pondering:

“When people see a web page, they can easily understand the segmentation and structure of the page. What is the key to understanding the segment and structure? We consider that it is the uniformity of certain information… We consider that such a “uniformity” can be useful for detecting the repetition of elements in the web page.”

Exactly! If we, as human being, can identify elements in the page source which are of the same type, we might be able to teach a machine to do it - it’s string pattern matching (another area I’ll have to study). And if our machine can recognize these elements, we could, maybe, come up with some rules, or a large set of features which would determine the probability of a set of elements to represent the web site navigational menu.

But wait, we don’t even need an explicit menu on every page. Say, after analyzing a web collection, we discover a repetitive element, which has a high probability of being the page title (and that goes for all pages within the collection - not just one page). Having a collection of pages with possible titles we can proceed to analyzing the links between these pages, as well as the physical file structure of the collection - and there you go - we might have enough material to come up with a structure. That’s the basic idea.

The paper I am writing about discusses ways to automatically identify repetitive elements on a page, which is almost the foundation for my work. The authors suggest using a bottom-up approach: we start with identifying the most primitive repetitions (for example, a set of links using the same font style), then we replace them with tokens of the same type, after which we proceed recursively to identify more complex structures. The authors suggest an interesting approach, which I cannot use: to replace all text with a generic “text” token. The benefit of this approach is that the system may be implemented as a language-independent framework. In my case, however, I am interested in the linguistic information, because I will be using it to analyze the page semantics in order to find an appropriate place for it in the site structure.

There are parts in the paper with which I respectfully disagree. For example, when analyzing repetition structures with separators, like:

A
b
A
b
A

The authors suggest that “b” must be smaller than “A” to qualify as a separator. Wrong: with complex table-based layouts, a separator in a menu or some sort of list can in fact take much more code than the actual list/menu item.

Nevertheless, the paper turned out to be a great resource for me. It has some specific suggestions on dealing with complex repetitive sequences, transposing html tables to expose the correct flow of data, and more. As a future possibility, the paper suggests combining this bottom-up approach with a top-down approach based on analyzing the DOM tree behind the html code. That’s a very neat approach - I’ve already read several very interesting papers on it - I will discuss them in my next posts.

I have a stack of 59 research papers related to web content extraction. I have read through some very intersting papers which I will describe in my following posts, but I start with a rant (hey, what are blogs for! :-) )

Understanding the Flow of Content in Summarizing HTML Documents by A.F.R. Rahman, H. Alam and R. Hartoro of BCL Computers Inc. This is a paper about extracting content from a single web page in order to reformat it for a smaller screen (like a PDA or a cell phone). The paper focuses on finding the structural parts of the document, then creating a table of contents which provides an easy way to access to the structural parts of the document. The paper mentions the relative importance of different content objects (or structural parts) within the document but says nothing about the way to calculate it. The flow of content is supposedly understood based on these objects (or page segments) and their types. The types could be “story” (if the segment contains a lot of text), “links” (if primarily composed of links); other types could be navigation, forms and images. In other words, a paper about nothing; nothing new, or remotely interesting in any case. Why am I even writing about it? I wonder if that’s what industry papers are like - some general concepts with no substance (or all hidden from the competition?).

Found another paper by the same folks, Content Extraction from HTML Documents. It’s basically the same thing: let’s separate the page into zones based on its HTML structure, then analyze the relationship of these zones based on their proximity, content classification (Is it a bird? No! Is it a plane? No! Is it a table tag? YES!!! Let’s research again - like we did in our last paper…) Once the relationship is established, the paper suggests reflowing the content into a more meaningful and efficient manner based on the requirements of the target device. That’s basically it. Wait, I have an idea for my own paper! My paper will be about time travel. It’ simple: my key thesis will be that time travel is important and to implement it we need a time machine. I’ll even suggest an implementation technique: first we carry out a structural analysis of our target time frame. We decompose the time frame into measurable time chunks based on the results of our structural analysis. We label the chunks and then use content recognition methods… whoops, I meant time traveling methods to test our solution… Sorry guys, I’m afraid both of your papers are lacking something.

I have finally found an interesting topic for my main research. or rather, an area from which I will be able to select an interesting topic. The area is information extraction from web documents. Although I might be getting ahead of myself - it might be a little too specific for now. What’s interesting is how I arrived at this topic.

I wrote my first line of executable code only four years ago (2001), so until recently I thought that still had to learn a lot of basic stuff before moving on to really interesting things. However, somewhere in 2004/05 I realized that in fact I knew enough to start playing around with some advanced topics. All my little victories suddenly fell into place: understanding of data structures enabled me to easily design any data representations for my projects, programming languages and, later, the compilers class taught me about writing context free grammars, parsing text and translating input into a different language, etc… many classes taught me many things; but the point is - I suddenly had the tools at my disposal to build interesting stuff and to research concepts which are not yet fully studied and problems which are not yet solved. (Very soon after that I decided to get a PhD in computer science.)

So, on to uncharted territory!

An area I was interested in was managing content. Why - I couldn’t tell. Maybe because I build web-based content management systems for a living; maybe because I had significant domain knowledge up-front. Whatever it was that triggered my interest, I spent about a year experimenting with text-based storage in XML-like structures and outputting the data in different formats. As a learning exercise it was great - I got quite a bit of practice with parsing tagged input and traversing the constructed tree to produce different kinds of output. I suppose I can say that I got very close to discovering some big ideas on my own - ideas which have been implemented countless number of times, as I later discovered. However, the fact that I was thinking of template-based processing with a syntax like XSLT and a query language to extract data from XML structures like XPath - before knowing about the existence of XSLT and XPath - means that I was on the right track.

However, I found working in this direction to be very frustrating. Every time I would come up with an idea which pointed me in a promising direction, I would discover that it has been explored inside and out, and the results have been implemented in every language imaginable. Do I, honestly, hope to discover something new at this stage of my education? Hell, yes! Will I? Most likely, not at this time; but at least I want an open field in front of me - something to explore and to watch others explore it. My plan for the content management project was to come up with a simple language which could describe storage and the output formats of structured data. Eventually I put it on hold - the excitement was not there anymore.

Luckily, I found a direction in which I can expand the original concept. My original project was about describing existing data; but what if I were to collect this data?

This idea came from my studies in information storage and retrieval. I’m working on a project with Dr. Kevin O’Kane (for more information check my ISR research diary). Initially the project topic was very broad, but after reducing it to building a Wikipedia category browser and information retrieval system, I was left with at least half of my ideas unclaimed! that’s when I realized that I could easily connect the two - my content management experiments with structured data and syntax describing it - and ways to retrieve this structured data from existing information collections. Right now I see it as a web site analyzing and generation tool: the program will explore a web site (or a web domain) and construct a hierarchy of its content, describing it in a syntax I design. After extracting and analyzing the content, the system will be capable of producing output based on templates or other description specified by the user (again, in a syntax I will design for that). In other words, it may become a web site refactoring machine: taking an existing web site as input, processing it into a well-structured, manually editable form and producing output in a different format. Want to see what cnn.com will look like in the format of mugglenet.com? - no problem!

The body of knowledge is huge! I have found at least 30 papers which are directly related to this area. However, most of them (at least the ones I’ve read so far) focus on content extraction from a single web page (or the Web in general). None of them explore the idea of extracting structure from a web site (or generating it). Can a program discover an hierarchy in a web collection? I think it can. Am I on to something?

contact me
blog
research
about