reflections of a wizard in training

Archive for the ‘Research’ Category

I have finished my thesis for a master’s degree in communication studies from the University of Northern Iowa. It is available as a set of web pages and a pdf for downloading. The data used in this study is available here.

This is my first real attempt to apply computer science to research in communication studies. I will appreciate any feedback on this study. Please use this entry for comments.

Here’s the abstract:

The focus of this study is the general public and its significance in a public relations context. The study questions one of the fundamental principles of public relations, according to which organizations focus their communication efforts on publics which have the potential to cause them negative consequences. The general public, within this context, is considered insignificant and even nonexistent.

This study’s hypothesis is that new communication technologies have given the general public the power to cause direct negative consequences for organizations. To evaluate this hypothesis, the study examines the October 2006 Edelman/Wal-Mart crisis caused by the “Wal-Marting Across America” blog. The blog was launched by Edelman as a publicity stunt on behalf of Wal-Mart, yet it was presented as an independent blog maintained by a couple traveling in their RV and writing stories about happy Wal-Mart employees. The truth was revealed in a Business Week article, which triggered a massive discussion in the blogosphere and mainstream media.

The methodology of this study involves quantitative and qualitative analysis of blog posts and mainstream media articles. Relevant blog posts were collected and processed with the help of a computer science approach, which consisted of automatically exploring the Web and discovering blogs participating in the conversation, and constructing a chronological model of this conversation for further analysis, represented as a graph, with nodes denoting blog posts, and the edges – the links between these posts. The final data set consists of 18 mainstream media articles, and 156 blogs containing 201 relevant posts, connected by 1.548 links.

The results of the study demonstrate that Edelman suffered significant negative consequences which were caused by the blogosphere. The negative consequences included significant negative publicity, as well as a negative public opinion on Edelman, formed through the discussion on the blogosphere. The study demonstrates that the consequences were caused by collective action on behalf of all blogs involved in the conversation. It is argued that the individuals and groups behind these blogs represent multiple publics, which can be described in the context of this study as the general public. Therefore, the hypothesis of this study is supported: new communication technologies, such as the blogosphere, have given the general public the power to cause direct negative consequences for organizations. The study also describes some implications these results may have for existing public relations theory.

If any of this makes sense, here is the entire thesis. Enjoy!

I have a close-to-infinite list of unanswered emails in my inbox. As Paul Graham suggests, some errands, like replying to letters, go away if you ignore them (perhaps taking friends with them). I wonder how many friends that’ll cost me… I have been, actually, following Paul’s advise for the good part of my life. I tend to set aside “less important things” in favor of devoting all my time to “more important things.” It happens, that most of the time I am very preoccupied with one of these “more important things” - like studying, doing research, and applying to PhD programs, most of which are way out of my league. Naturally, replying to my friends’ emails, reading the novels from my long list of “things to enjoy,” going to the gym and running more than once a month - all that usually goes down the drain with lots of other “less important stuff.” Is it worth it? I had better be winning the Nobel prize some time soon to justify all these sacrifices…

My friends to whom I haven’t replied in ages: I apologize. I have been racing against time this semester: I was busy applying to doctoral programs. Filling out an application form is relatively easy. Having some research experience and interesting ideas to back your personal statement is the tough one. Research becomes possible once you have the programming and theoretical background, so learning this or that language/framework/approach changes from being the goal to being only the means to something by far more complicated and time consuming. The second requirement is interesting ideas. These come (as they tell me) after A LOT of thinking about your subject. You think about it day and night - and eventually they may come. Chances are, they won’t. But if you don’t even try - they never will. No, I’m not complaining - it’s all fun… but it can take all your time - what happens to me too often.

No great ideas so far. However, for those who are interested, here’s part of my research statement …which I have to post as a separate entry. It appears that when I coded my blog software, I didn’t anticipate entries longer than 8000 characters. Whoops… How embarassing… Must recode.

ABOUT MY RESEARCH
I am interested in finding new ways of extracting content from web documents, discovering their logical structure, and mapping them to existing or automatically generated ontologies. This has led me to the study of different models of information retrieval, machine learning and natural language processing techniques applied to text summarization and classification.

Coming from a web development background, I am also interested in building better tools to help people understand and use complex information environments. Up to now, my work in this area has been limited mostly to application development: I have been building better content management systems for bigger web sites; however, as I explore ways to write programs that “understand” web sites, my interest is shifting towards building intelligent user interfaces and web sites, which can adapt their structure to different navigation patterns – i.e., systems that “understand” their users.

ADVENTURES IN INFORMATION RETRIEVAL
I acquired an interest in information retrieval while working on a project, which implemented the vector space model and provided a web-based searching/browsing interface for the Medline database, with dynamic clustering of search results based on the scatter/gather method (similar to the Grouper project, implemented at the University of Washington).

As an additional benefit of working on this project, I had the opportunity to experiment with the back-end of web-based user interfaces. Having only C in my toolbox, I implemented the basic parts of the .Net event-handling and state-management models, thus creating my own framework from scratch and learning how .Net really works, without looking at its sources.

The project was voted best in class and later was demonstrated at the College of Natural Sciences Preview Day; but more importantly – it triggered my interest in the field and led me to my new work: a system which maps an existing ontology to a set of 1.3 million documents from Wikipedia (or any other set) and provides a web-based searching/browsing interface for the generated hierarchy. My category browser is loosely based on the Cat-a-Cone project by Marti Hearst (Xerox PARC, now UC Berkeley) and Chandu Karadi (Stanford University). Although Dr. Hearst (whom I contacted with a question regarding this program) suggested that I consider going the “Flamenco” way, which is a newer browsing interface project, I am still convinced that the Cat-a-Cone concept has some untapped possibilities, which I am determined to uncover.

WEB SITE STORY
My interest in web sites and their logical organization is based on my past work as a web developer. Two years after launching my first web site, I discovered the concept of content management; since then I have built content/web site management systems for more than 20 web sites, including online stores, accounting applications, career and community web sites. However, as I kept adding more features to my systems, I became concerned, that my “all-in-one” solution was trying to be “everything” – which is usually a sign of terrible design. That led me to thinking about a web site in more general terms: what exactly am I trying to manage?

I concluded that any web site, whether static or dynamic, is, in fact, a structured collection of information. Unlike most literature suggests, this structure is not a tree. Consider a faculty web page: we have a set of courses, a set of lecture notes for each course, a set of research interests, graduate students, projects, etc., with most of these collections interrelated and, therefore, interlinked, which constitutes a set of connected hierarchies, i.e. – a graph. It is my strong opinion, that a web site management system should be primarily concerned with managing this graph, not the content within its nodes.

My coding experiments led me to attempting the design of a simple XML-based language for maintaining a set of connected hierarchical collections of data. However, once I got to the point of inventing template-based processing and a syntax to query my tag-based hierarchy, I discovered that I was about to reinvent XSLT and XPath. Nevertheless, my frustration was short-lived: I turned to the ideas from my project in information retrieval, but applied them in a different way: instead of mapping an existing ontology to a set of documents, why not use similar methods to derive a brand new ontology from a given set? Would it be possible to discover a web site’s logical organization? Moreover, would it be possible to derive structures for collections combined from multiple sources, making the boundaries of a single web site irrelevant?

DISCOVERING STRUCTURAL PATTERNS …OR MAYBE CREATING THE SEMANTIC WEB?
The starting point of my new research became analyzing the content and structure of individual web pages, which, according to current research, may provide information about the page semantics. I found that most of the suggested information extraction models depended heavily on specific syntax; however, inferring a grammar from the HTML source and then using it to parse the pages was applicable only to a small subset of my target information space. Yet, other researchers (like William Cohen of Carnegie Mellon University) argue that people employ general-purpose, page-independent strategies for recognizing structure in documents – which is exactly the approach I am looking for.

My goal at this stage is to automatically recognize the logical parts of a single web page: its content area, title and menus, which, together with the set of its external and internal links and any meta-information, will be used to assign it a place in the collection’s logical structure. I am applying a combination of existing techniques, including mining for repetitive elements (lists, tables, headings, etc…), using a large set of heuristics based on formatting patterns, and analyzing inbound links together with the surrounding textual context in the parent pages. However, I am convinced that there is another method, which is to let the machine “observe” the rendered page as a two-dimensional image, applying heuristics based on its geometry, color and a set of other visual characteristics – thus, emulating, to some extent, a human approach. This idea may be far-fetched; but it’s a new approach, I haven’t seen it used and I am excited to explore it.

At this point, I am not sure about the practical application of my research. In general, if a logical structure can be automatically derived from any collection of web documents, that means we have a machine, which is capable of “understanding” web pages – which, besides the obvious applications, like adaptive and auto-generated web sites, may help in creating the Semantic Web – so the sky is the limit.

I have been studying research papers all weekend (nobody seems to believe though). A few were a waste of time, but the rest were really good. Before reflecting on papers I like, I will summarize once again what I am trying to achieve.

The system I am trying to design will explore a web site (or a collection of web sites) and will try to identify a semantic structure for it (like a site map including all web documents). After that the user will have the option to change the structure, provide a layout template (or 100 templates) and have the web site generated based on the adjusted structure and provided templates. At this point, I have no idea what would be the practical application of such a system, but I’m sure that if I build one - they will come :-) Although as a veteran web designer, I can see an obvious application right away: instead of having your web development team spend time building sample web sites to illustrate the new design and/or site navigation model - have a machine do it for you, just specify a structure and provide some templates. Or maybe you’re just curious to see what your college web site would look like if it used the design template of cnn.com - my system will help.

Far fetched? Maybe. But I’ve found that there is a lot of research taking place in the area of information extraction from web documents. A significant part of it is targeted at identifying content blocks on a web page in order to reformat it to be displayed on smaller screens. Other researchers are trying to find a way to extract specific information from web documents for storing it in structured format for further processing. Both of these topics are already providing me with a great deal of information and ideas which may help me. I will be extracting content from a collection of web documents, in order to find similarities between these documents which I will then use to discover a semantic structure of the entire collection.

Another body of knowledge deals with web document clustering. Most of this research is about improving search engine results and usability (a clue for the uninitiated: browsing a categorized set of results is much easier than looking at a list; think about a table of contents vs. the entire text of the book - that’s what clustering is about, more or less). A related topic is summarizing the web page content - that is often discussed in web clustering papers. Both of these areas - web clustering, summarization of content (as well as all those I haven’t discovered yet) - will aid me in discovering the semantic structure of my information space.

On a side note, clustering is somewhat a controversial area for me: on the one hand, it’s full of math, which scares the living jeebies out of me (I know about 10 times less math than I’d like to); on the other - for some strange reason which I can’t explain yet, I find it fascinating. Dynamically clustering a web collection, refactoring a program, organizing my desk - I have this strange attraction to structuring and organizing stuff. (I mentioned it here) Others have an attraction to art, or music, or poetry… Why structuring? Why me? Oh well…

Back to the topic. Existing research is focused on information extraction and building structures for improving search engines and displaying the content on other devices. I am hoping to learn from both and take my research in a new direction. So, back to the papers. One of them was a surprise: it was a short paper which I expected to be very general, yet while being general it contained some ideas and observations I found particularly useful for my research. The paper is Structuring web pages based on repetition of elements by Tomoyuki Nanno, Suguru Saito and Manabu Okumura of the Tokio Institute of Technology.

The paper grabbed my attention because the authors start out by asking exactly the same question I have been pondering:

“When people see a web page, they can easily understand the segmentation and structure of the page. What is the key to understanding the segment and structure? We consider that it is the uniformity of certain information… We consider that such a “uniformity” can be useful for detecting the repetition of elements in the web page.”

Exactly! If we, as human being, can identify elements in the page source which are of the same type, we might be able to teach a machine to do it - it’s string pattern matching (another area I’ll have to study). And if our machine can recognize these elements, we could, maybe, come up with some rules, or a large set of features which would determine the probability of a set of elements to represent the web site navigational menu.

But wait, we don’t even need an explicit menu on every page. Say, after analyzing a web collection, we discover a repetitive element, which has a high probability of being the page title (and that goes for all pages within the collection - not just one page). Having a collection of pages with possible titles we can proceed to analyzing the links between these pages, as well as the physical file structure of the collection - and there you go - we might have enough material to come up with a structure. That’s the basic idea.

The paper I am writing about discusses ways to automatically identify repetitive elements on a page, which is almost the foundation for my work. The authors suggest using a bottom-up approach: we start with identifying the most primitive repetitions (for example, a set of links using the same font style), then we replace them with tokens of the same type, after which we proceed recursively to identify more complex structures. The authors suggest an interesting approach, which I cannot use: to replace all text with a generic “text” token. The benefit of this approach is that the system may be implemented as a language-independent framework. In my case, however, I am interested in the linguistic information, because I will be using it to analyze the page semantics in order to find an appropriate place for it in the site structure.

There are parts in the paper with which I respectfully disagree. For example, when analyzing repetition structures with separators, like:

A
b
A
b
A

The authors suggest that “b” must be smaller than “A” to qualify as a separator. Wrong: with complex table-based layouts, a separator in a menu or some sort of list can in fact take much more code than the actual list/menu item.

Nevertheless, the paper turned out to be a great resource for me. It has some specific suggestions on dealing with complex repetitive sequences, transposing html tables to expose the correct flow of data, and more. As a future possibility, the paper suggests combining this bottom-up approach with a top-down approach based on analyzing the DOM tree behind the html code. That’s a very neat approach - I’ve already read several very interesting papers on it - I will discuss them in my next posts.

I have a stack of 59 research papers related to web content extraction. I have read through some very intersting papers which I will describe in my following posts, but I start with a rant (hey, what are blogs for! :-) )

Understanding the Flow of Content in Summarizing HTML Documents by A.F.R. Rahman, H. Alam and R. Hartoro of BCL Computers Inc. This is a paper about extracting content from a single web page in order to reformat it for a smaller screen (like a PDA or a cell phone). The paper focuses on finding the structural parts of the document, then creating a table of contents which provides an easy way to access to the structural parts of the document. The paper mentions the relative importance of different content objects (or structural parts) within the document but says nothing about the way to calculate it. The flow of content is supposedly understood based on these objects (or page segments) and their types. The types could be “story” (if the segment contains a lot of text), “links” (if primarily composed of links); other types could be navigation, forms and images. In other words, a paper about nothing; nothing new, or remotely interesting in any case. Why am I even writing about it? I wonder if that’s what industry papers are like - some general concepts with no substance (or all hidden from the competition?).

Found another paper by the same folks, Content Extraction from HTML Documents. It’s basically the same thing: let’s separate the page into zones based on its HTML structure, then analyze the relationship of these zones based on their proximity, content classification (Is it a bird? No! Is it a plane? No! Is it a table tag? YES!!! Let’s research again - like we did in our last paper…) Once the relationship is established, the paper suggests reflowing the content into a more meaningful and efficient manner based on the requirements of the target device. That’s basically it. Wait, I have an idea for my own paper! My paper will be about time travel. It’ simple: my key thesis will be that time travel is important and to implement it we need a time machine. I’ll even suggest an implementation technique: first we carry out a structural analysis of our target time frame. We decompose the time frame into measurable time chunks based on the results of our structural analysis. We label the chunks and then use content recognition methods… whoops, I meant time traveling methods to test our solution… Sorry guys, I’m afraid both of your papers are lacking something.

As I mentioned in one of my previous posts, last semester I learned the basics of information storage and retrieval and implemented a web-based retrieval system for searching through the medline.osu database. The collection contained well-structured data, so I was able to build a rather interesting application, which even included a dynamic clustering feature. However, to determine whether my system was indeed efficient, especially in generating clusters of related documents, I would need significant domain knowledge, but the medline database is …medical. I wanted something where I could see the result and judge its correctness, at least to some extent.

The solution was obvious - use the Web! However, I didn’t want to implement yet another search engine - I wanted something relatively new. No, not because I wanted to be the first or the best - it’s only a learning exercise; I wanted something new because it is more exciting to explore a topic you can’t read about in every textbook. Or maybe a search engine simply didn’t sparkle my interest. So I decided to try and implement an online category browser for a large hierarchy of documents. My inspiration came from the Cat-a-Cone (Marti Hearst, Chandu Karadi) and Cha-Cha (Marti Hearst, Michael Chen).

Catacone

Cat-a-Cone is an example of a category browser (if I may put it that way). According to the project authors, “one key insight is the separation of the representation of category labels from documents, which allows the display of multiple categories per document. Another key component is the display of multiple selected categories simultaneously, complete with their hierarchical context.” In other words, before searching the collection, the user can explore the collection’s categories and their hierarchical structure. In a large hierarchy that helps both to disambiguate ambiguous category names and, since the categories are displayed in an hierarchy, it may help to identify related terms and improve the search query.

Cha-cha

Cha-Cha is a “system for organizing intranet search results,” or a search engine. However, it also “organizes web search results in such a way as to reflect the underlying structure of the intranet.” It is really neat - in addition to displaying list of results, it can display an hierarchy of the underlying web collection and list the search results in the context of this hierarchy. Very cool!
So, I decided I would combine the two in the following way. I would use an existing web crawler to explore a limited web domain (for example, the web space of a university) and build the document collection. Then I would explore the collection (or explore it while crawling) and would try to identify an hierarchy. Once the hierarchy was created, I would implement a web-based Cat-a-Cone to browse this hierarchy. Certainly, it would be based on a completely different user interface; but the key concept of a browseable hierarchy would remain.

This project is a little too large to be implemented in one semester. After discussing it with Dr. O’Kane, we decided to focus on specific web collection which already had a rather well-defined structure behind it and a large enough homogeneous collection of documents - Wikipedia. The system will explore the information space and in addition to indexing it will attempt to build an hierarchy. the retrieval system will be, as initially planned, web-based with a category browser analogous to Cat-a-Cone (without the 3D features of course). That’s the plan.

I have finally found an interesting topic for my main research. or rather, an area from which I will be able to select an interesting topic. The area is information extraction from web documents. Although I might be getting ahead of myself - it might be a little too specific for now. What’s interesting is how I arrived at this topic.

I wrote my first line of executable code only four years ago (2001), so until recently I thought that still had to learn a lot of basic stuff before moving on to really interesting things. However, somewhere in 2004/05 I realized that in fact I knew enough to start playing around with some advanced topics. All my little victories suddenly fell into place: understanding of data structures enabled me to easily design any data representations for my projects, programming languages and, later, the compilers class taught me about writing context free grammars, parsing text and translating input into a different language, etc… many classes taught me many things; but the point is - I suddenly had the tools at my disposal to build interesting stuff and to research concepts which are not yet fully studied and problems which are not yet solved. (Very soon after that I decided to get a PhD in computer science.)

So, on to uncharted territory!

An area I was interested in was managing content. Why - I couldn’t tell. Maybe because I build web-based content management systems for a living; maybe because I had significant domain knowledge up-front. Whatever it was that triggered my interest, I spent about a year experimenting with text-based storage in XML-like structures and outputting the data in different formats. As a learning exercise it was great - I got quite a bit of practice with parsing tagged input and traversing the constructed tree to produce different kinds of output. I suppose I can say that I got very close to discovering some big ideas on my own - ideas which have been implemented countless number of times, as I later discovered. However, the fact that I was thinking of template-based processing with a syntax like XSLT and a query language to extract data from XML structures like XPath - before knowing about the existence of XSLT and XPath - means that I was on the right track.

However, I found working in this direction to be very frustrating. Every time I would come up with an idea which pointed me in a promising direction, I would discover that it has been explored inside and out, and the results have been implemented in every language imaginable. Do I, honestly, hope to discover something new at this stage of my education? Hell, yes! Will I? Most likely, not at this time; but at least I want an open field in front of me - something to explore and to watch others explore it. My plan for the content management project was to come up with a simple language which could describe storage and the output formats of structured data. Eventually I put it on hold - the excitement was not there anymore.

Luckily, I found a direction in which I can expand the original concept. My original project was about describing existing data; but what if I were to collect this data?

This idea came from my studies in information storage and retrieval. I’m working on a project with Dr. Kevin O’Kane (for more information check my ISR research diary). Initially the project topic was very broad, but after reducing it to building a Wikipedia category browser and information retrieval system, I was left with at least half of my ideas unclaimed! that’s when I realized that I could easily connect the two - my content management experiments with structured data and syntax describing it - and ways to retrieve this structured data from existing information collections. Right now I see it as a web site analyzing and generation tool: the program will explore a web site (or a web domain) and construct a hierarchy of its content, describing it in a syntax I design. After extracting and analyzing the content, the system will be capable of producing output based on templates or other description specified by the user (again, in a syntax I will design for that). In other words, it may become a web site refactoring machine: taking an existing web site as input, processing it into a well-structured, manually editable form and producing output in a different format. Want to see what cnn.com will look like in the format of mugglenet.com? - no problem!

The body of knowledge is huge! I have found at least 30 papers which are directly related to this area. However, most of them (at least the ones I’ve read so far) focus on content extraction from a single web page (or the Web in general). None of them explore the idea of extracting structure from a web site (or generating it). Can a program discover an hierarchy in a web collection? I think it can. Am I on to something?

contact me
blog
research
about