computer science

Application of K-Nearest Neighbor Algorithm to Classification of Wikipedia Articles

This was a project for a class in data mining, which I took in Spring 2007 at Arizona State University. In this project I experimented with applying the K-nearest neighbor algorithm to a text classification problem.

The text corpus used for classification consisted of 10.000 articles from the English set of Wikipedia. The categories assigned to these documents were part of a set of 19.043 categories manually assigned to these documents by Wikipedia users.

The entire set of 10.000 documents was used as the training set, after which the k-nn algorithm was applied to a much smaller set of Wikipedia documents to automatically assign the categories.

The purpose of this experiment was to find a way to automatically classify Wikipedia articles using the Wikipedia ontology, which, potentially, might be used to simplify categorizing new Wikipedia documents.

Here's all the code, including the data (zip, 15.3MB).

Motivation

Wikipedia has been growing at a very fast rate: in 2006 the English Wikipedia consisted of about 745.000 articles, a year later it had 1.618.00 articles! Wikipedia Categories is a manually-maintained ontology of categories dealing with “everything and everyone.” The purpose of this ontology is to better organize Wikipedia and provide its users with better browsing options.

Why bother with auto-classification?
Assigning categories from the Wikipedia ontology to its articles is done manually, which creates the following challenges:

  1. The ontology is huge (currently more than 70.000 categories). Manually selecting the best fitting category is both very difficult and time consuming. Besides, with such a huge selection to choose from, it is highly subjective, which means that there is no one best way to categorize an article: given the same article, different editors would assign different categories to it, while different users would search for it under different categories. Therefore, a standardized, or at least a consistent way of categorizing articles would bring more order to the process of categorizing, and would simplify category-based search and browsing.

  2. There is barely any incentive for an editor to spend time “mining” the ontology for the most appropriate category, so in the absolute majority of cases, the assignment will be done in a hurry, if at all. A machine, on the other hand, is capable of giving the same amount of attention to all categorization tasks, so automatically assigned categories would be a product of a more systematic and thorough approach, which should significantly improve the result.

In this project, I experimented with automatically assigning categories to articles, which, ideally, might solve (or reduce) these problems.

Learning How to Do Research: Change of Scale

In the course of the semester, I made two significant changes to the concept of the project.

Scaling down the text corpus
The original goal was to classify 1.600.000 Wikipedia documents. However, after a semester of studying data mining, I understood that the scale of the project, or rather the size of the text corpus and category set do not have any particular significance. The purpose of the project is to experiment with a classification algorithm applying it to a specific domain to discover better or optimal implementation strategies. Running these kind of experiments on a text corpus of the size of 10.000 or 1.600.000 will, most likely, prove or disprove the same concept – the specific size is less important (as long as it is reasonably large).

Had it been a real-world application, meant for real use instead of research purposes, the size of the text corpus would matter. I decided to scale down the text corpus size to make the project manageable and to be able to run much more test cases with different weights and implementation details.

A different set of categories
The second significant change was to the category set. The original goal was to use 700.000 categories from the Open Directory project. This was an ambitious goal which proved to be beyond the scope of a semester’s work. Instead, I selected the Wikipedia ontology (approximately 70.000 categories). The primary reason for this decision was the fact that these categories have already been manually assigned to Wikipedia documents by thousands of Wikipedia users. That provided a useful benchmark: after automatically assigning categories to a document, I could evaluate the result by comparing the automatic assignment to the previously manually assigned set of categories.

Implementation steps

Algorithm design
Representations of text are very high dimensional (one feature for each word). For most text categorization tasks, there are many irrelevant and many relevant features, so methods that sum evidence from many or all features (e.g. NBC, K-NN, neural networks, SVM) tend to work better than ones that try to isolate just a few relevant features (decision-tree or rule induction). Therefore, I chose the K-NN algorithm for this task.

KNN is a lazy learner, which means that it leaves most of the work until the last moment: instead of building a classification model, it organizes the training set and runs a comparison function on an item to be classified only when the item is submitted for classification. I used the cosine similarity metric. For each document in the corpus I pre-computed the document TF x IDF vector. Then, for each item to be classified, the algorithm computed its similarity to each document in the corpus and returned the top k similar documents. It then selected a set of categories from this set of k results and assigned the categories which appeared at least c times. So, for example if we use 10-nearest neighbors and c is 9, then we'll assign all categories which appear at least in 9 documents.

Data transformation
I downloaded the raw data dumps (thank you Wikipedia!) and parsed them into manageable XML-based format:

		
<page> <id>42</id> <title>Interesting Document</title> <text>Fascinating text………</text> </page>

After that, I generated the following source files for each set:

  • list of document titles
  • existing document/category assignments
  • list of categories based on selected assignments

Data cleaning and feature selection
The original number of terms was 382.122. To reduce this set, I did the following:

  1. Applied a stoplist which improved accuracy of results, but not the feature set size
  2. Removed terms with digits and special characters
  3. Applied Porter stemmer, which reduced the set to 320.428 terms
  4. Removed terms occurring in more than 50% of docs, which reduced the set to 320.388 terms
  5. Removed terms occurring in 2 docs, which reduced the set to 77.721 terms.

Indexing
I generated various binary & text files for storing:

  • document norms, external IDs, titles
  • term IDFs, term document counts
  • document term vectors, term document vectors
  • categories and doc/cat assignments, etc…

I did not modify the vector space model to include term weights based on Wikipedia markup due to time constraints. I plan to do that in this project

Results

The newly assigned categories were compared to the manually assigned categories in terms of precision, recall and their F-measure (harmonic mean). For example, if the system assigned category A to 5 of the 10 documents in the testing set (3 correct, 2 incorrect, and 4 - missed), that would give recall = 3/(3+4)=3/7 and precision = 3/(3+2)=3/5.

The weights of k and c were systematically reassigned and the algorithm rerun on multiple combinations of testing documents until the optimal combination was identified.

Based on the evaluation results, the optimal weights were: k=50, c=5. However, the overall results were dissapointing:

For future work, I’ll consider more elaborate methods of comparing documents and category assignments, taking into consideration Wikipedia linkage structure and HTML markup.

 
contact me
blog
research
about