reflections of a wizard in training

I have a stack of 59 research papers related to web content extraction. I have read through some very interesting papers, which I will describe in my following posts, but I'll start with a rant (hey, what are blogs for! :-) )

Understanding the Flow of Content in Summarizing HTML Documents by A.F.R. Rahman, H. Alam and R. Hartoro of BCL Computers Inc. This is a paper about extracting content from a single web page in order to reformat it for a smaller screen (like a PDA or a cell phone). The paper focuses on finding the structural parts of the document, then creating a table of contents that provides easy access to those parts. The paper mentions the relative importance of different content objects (or structural parts) within the document but says nothing about how to calculate it. The flow of content is supposedly understood based on these objects (or page segments) and their types. The types could be “story” (if the segment contains a lot of text) or “links” (if it is primarily composed of links); other types could be navigation, forms and images. In other words, a paper about nothing; nothing new or remotely interesting, in any case. Why am I even writing about it? I wonder if that’s what industry papers are like - some general concepts with no substance (or is it all hidden from the competition?).
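To be fair to the idea, here is roughly what such segment typing could look like. This is my own minimal sketch, not the paper's method (the paper gives no formulas): the `classify_segment` function, the `SegmentStats` helper, the 0.5 link ratio, the 200-character "story" threshold and the "other"/"empty" labels are all assumptions I made up for illustration. It uses only Python's standard `html.parser` module and does not try to tell "navigation" apart from plain "links".

```python
from html.parser import HTMLParser


class SegmentStats(HTMLParser):
    """Collects simple counts for one HTML segment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0   # all visible text characters
        self.link_chars = 0   # text characters inside <a> ... </a>
        self.images = 0       # number of <img> tags
        self.has_form = False
        self._a_depth = 0     # > 0 while inside an anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1
        elif tag in ("form", "input"):
            self.has_form = True
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.text_chars += n
        if self._a_depth:
            self.link_chars += n


def classify_segment(html, link_ratio=0.5, min_story_chars=200):
    """Guess a segment type; both thresholds are arbitrary assumptions."""
    stats = SegmentStats()
    stats.feed(html)
    stats.close()
    if stats.has_form:
        return "forms"
    if stats.text_chars == 0:
        return "images" if stats.images else "empty"
    if stats.link_chars / stats.text_chars >= link_ratio:
        return "links"
    if stats.text_chars >= min_story_chars:
        return "story"
    return "other"


if __name__ == "__main__":
    menu = '<ul><li><a href="/a">Home</a></li><li><a href="/b">News</a></li></ul>'
    print(classify_segment(menu))
    print(classify_segment("<p>" + "word " * 100 + "</p>"))
```

Even a toy like this has to pick thresholds, which is exactly the part the paper leaves out.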

Found another paper by the same folks, Content Extraction from HTML Documents. It’s basically the same thing: let’s separate the page into zones based on its HTML structure, then analyze the relationship of these zones based on their proximity and content classification (Is it a bird? No! Is it a plane? No! Is it a table tag? YES!!! Let’s research again - like we did in our last paper…). Once the relationship is established, the paper suggests reflowing the content in a more meaningful and efficient manner based on the requirements of the target device. That’s basically it. Wait, I have an idea for my own paper! My paper will be about time travel. It’s simple: my key thesis will be that time travel is important and that to implement it we need a time machine. I’ll even suggest an implementation technique: first we carry out a structural analysis of our target time frame. We decompose the time frame into measurable time chunks based on the results of our structural analysis. We label the chunks and then use content recognition methods… whoops, I meant time traveling methods to test our solution… Sorry guys, I’m afraid both of your papers are lacking something.
