I have been studying research papers all weekend (nobody seems to believe it, though). A few were a waste of time, but the rest were really good. Before reflecting on the papers I like, I will summarize once again what I am trying to achieve.
The system I am trying to design will explore a web site (or a collection of web sites) and will try to identify a semantic structure for it (like a site map including all web documents). After that, the user will have the option to change the structure, provide a layout template (or 100 templates) and have the web site generated based on the adjusted structure and the provided templates. At this point, I have no idea what the practical application of such a system would be, but I’m sure that if I build one - they will come.
Although, as a veteran web designer, I can see an obvious application right away: instead of having your web development team spend time building sample web sites to illustrate a new design and/or site navigation model - have a machine do it for you, just specify a structure and provide some templates. Or maybe you’re just curious to see what your college web site would look like if it used the design template of cnn.com - my system will help.
Far-fetched? Maybe. But I’ve found that there is a lot of research taking place in the area of information extraction from web documents. A significant part of it is targeted at identifying content blocks on a web page in order to reformat the page for display on smaller screens. Other researchers are trying to find a way to extract specific information from web documents and store it in a structured format for further processing. Both of these topics are already providing me with a great deal of information and ideas which may help me. I will be extracting content from a collection of web documents in order to find similarities between these documents, which I will then use to discover a semantic structure of the entire collection.
Another body of knowledge deals with web document clustering. Most of this research is about improving search engine results and usability (a clue for the uninitiated: browsing a categorized set of results is much easier than looking at a list; think about a table of contents vs. the entire text of the book - that’s what clustering is about, more or less). A related topic is summarizing the web page content - that is often discussed in web clustering papers. Both of these areas - web clustering, summarization of content (as well as all those I haven’t discovered yet) - will aid me in discovering the semantic structure of my information space.
On a side note, clustering is a somewhat controversial area for me: on the one hand, it’s full of math, which scares the living jeebies out of me (I know about 10 times less math than I’d like to); on the other - for some strange reason which I can’t explain yet, I find it fascinating. Dynamically clustering a web collection, refactoring a program, organizing my desk - I have this strange attraction to structuring and organizing stuff. (I mentioned it here) Others have an attraction to art, or music, or poetry… Why structuring? Why me? Oh well…
Back to the topic. Existing research is focused on information extraction and on building structures for improving search engines and displaying content on other devices. I am hoping to learn from both and take my research in a new direction. So, back to the papers. One of them was a surprise: it was a short paper which I expected to be very general, yet while being general it contained some ideas and observations I found particularly useful for my research. The paper is “Structuring web pages based on repetition of elements” by Tomoyuki Nanno, Suguru Saito and Manabu Okumura of the Tokyo Institute of Technology.
The paper grabbed my attention because the authors start out by asking exactly the same question I have been pondering:
“When people see a web page, they can easily understand the segmentation and structure of the page. What is the key to understanding the segment and structure? We consider that it is the uniformity of certain information… We consider that such a “uniformity” can be useful for detecting the repetition of elements in the web page.”
Exactly! If we, as human beings, can identify elements in the page source which are of the same type, we might be able to teach a machine to do it - it’s string pattern matching (another area I’ll have to study). And if our machine can recognize these elements, we could, maybe, come up with some rules, or a large set of features, which would determine the probability that a set of elements represents the web site’s navigational menu.
But wait, we don’t even need an explicit menu on every page. Say, after analyzing a web collection, we discover a repetitive element, which has a high probability of being the page title (and that goes for all pages within the collection - not just one page). Having a collection of pages with possible titles we can proceed to analyzing the links between these pages, as well as the physical file structure of the collection - and there you go - we might have enough material to come up with a structure. That’s the basic idea.
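To make the title idea a bit more concrete, here is a rough sketch in Python. Everything in it is my own assumption rather than something taken from a paper: the input format (each page reduced to a dictionary mapping a hypothetical element signature such as "h1" to its text), the function name title_candidates, and the heuristic itself, which says that a title-like element shows up on nearly every page but carries different text on each one.

```python
from collections import defaultdict


def title_candidates(pages, min_coverage=0.9, min_distinct=0.9):
    """Return element signatures that might hold the page title.

    pages: list of dicts mapping an element signature (hypothetical,
    e.g. "h1" or "div.footer") to the text found in that element.
    Heuristic (an assumption): a title element appears on most pages
    and its text differs from page to page.
    """
    if not pages:
        return []

    # Collect every text value seen for each signature across the collection.
    texts_by_signature = defaultdict(list)
    for page in pages:
        for signature, text in page.items():
            texts_by_signature[signature].append(text)

    candidates = []
    for signature, texts in texts_by_signature.items():
        coverage = len(texts) / len(pages)       # share of pages with the element
        distinct = len(set(texts)) / len(texts)  # share of unique text values
        if coverage >= min_coverage and distinct >= min_distinct:
            candidates.append(signature)
    return sorted(candidates)


sample_pages = [
    {"h1": "Home", "div.footer": "Copyright notice", "p.intro": "Welcome"},
    {"h1": "About us", "div.footer": "Copyright notice"},
    {"h1": "Contact", "div.footer": "Copyright notice"},
]
print(title_candidates(sample_pages))
```

In this toy collection the footer is rejected because its text never changes, and the intro paragraph is rejected because it only appears on one page. A real system would need far more features than these two ratios, but the shape of the computation would probably be similar.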
The paper I am writing about discusses ways to automatically identify repetitive elements on a page, which is almost the foundation for my work. The authors suggest using a bottom-up approach: we start with identifying the most primitive repetitions (for example, a set of links using the same font style), then we replace them with tokens of the same type, after which we proceed recursively to identify more complex structures. The authors suggest an interesting approach, which I cannot use: to replace all text with a generic “text” token. The benefit of this approach is that the system may be implemented as a language-independent framework. In my case, however, I am interested in the linguistic information, because I will be using it to analyze the page semantics in order to find an appropriate place for it in the site structure.
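Here is how I picture the bottom-up idea, as a minimal Python sketch. This is my own simplification, not the authors' algorithm: I assume the page has already been turned into a flat list of tokens (the token names below are made up), and I only detect exact, directly adjacent repetitions of short units. Each run of repetitions is replaced with a single ("repeat", unit) token, and the process starts over so that larger structures built from those tokens can be found.

```python
def collapse_once(tokens, unit_len):
    """Replace each run of 2+ adjacent copies of a unit with one token."""
    out = []
    changed = False
    i = 0
    while i < len(tokens):
        unit = tokens[i:i + unit_len]
        count = 1
        if len(unit) == unit_len:
            j = i + unit_len
            # Count how many times the unit repeats back to back.
            while tokens[j:j + unit_len] == unit:
                count += 1
                j += unit_len
        if count >= 2:
            # The repetition count is dropped on purpose, so a list of
            # 3 links and a list of 5 links become the same token type.
            out.append(("repeat", tuple(unit)))
            i += unit_len * count
            changed = True
        else:
            out.append(tokens[i])
            i += 1
    return out, changed


def collapse_repeats(tokens, max_unit_len=4):
    """Bottom-up: collapse the shortest repetitions first, then retry."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for unit_len in range(1, max_unit_len + 1):
            tokens, did_change = collapse_once(tokens, unit_len)
            if did_change:
                changed = True
                break  # go back to the most primitive units
    return tokens


# Two table rows, each holding a list of links (hypothetical token names).
page = ["tr", "a", "a", "a", "tr", "a", "a", "a"]
print(collapse_repeats(page))
```

The links collapse first, then the two rows become identical and collapse as well, which leaves a single nested token. In my case the tokens would have to carry the actual text along instead of a generic "text" placeholder, since I need the linguistic information later.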
There are parts in the paper with which I respectfully disagree. For example, when analyzing repetition structures with separators, like:
| A |
| b |
| A |
| b |
| A |
The authors suggest that “b” must be smaller than “A” to qualify as a separator. Wrong: with complex table-based layouts, a separator in a menu or some sort of list can in fact take much more code than the actual list/menu item.
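A small Python sketch of what I have in mind instead: recognize the A b A b A pattern purely from its alternation, with no constraint on how big the separator is. The function name and the token strings are made up for illustration, and the tokens are assumed to be plain strings that have already been normalized.

```python
def find_separated_list(tokens):
    """Return (item, separator) if tokens look like A b A b ... A, else None.

    No size check is made: the separator may be longer than the item.
    """
    # The pattern needs at least A b A and must start and end with an item.
    if len(tokens) < 3 or len(tokens) % 2 == 0:
        return None
    items = tokens[0::2]       # positions 0, 2, 4, ...
    separators = tokens[1::2]  # positions 1, 3, 5, ...
    all_items_equal = len(set(items)) == 1
    all_separators_equal = len(set(separators)) == 1
    if all_items_equal and all_separators_equal and items[0] != separators[0]:
        return items[0], separators[0]
    return None


# A short menu item separated by a much longer chunk of table markup.
menu = [
    "a",
    "td-img-spacer-td-img-spacer",
    "a",
    "td-img-spacer-td-img-spacer",
    "a",
]
print(find_separated_list(menu))
```

Here the separator token is several times longer than the item, and the pattern is still found. Deciding which of the two is the item and which is the separator relies only on position, since the pattern starts and ends with an item.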
Nevertheless, the paper turned out to be a great resource for me. It has some specific suggestions on dealing with complex repetitive sequences, transposing html tables to expose the correct flow of data, and more. As a future possibility, the paper suggests combining this bottom-up approach with a top-down approach based on analyzing the DOM tree behind the html code. That’s a very neat approach - I’ve already read several very interesting papers on it - I will discuss them in my next posts.




