Application of K-Nearest Neighbor Algorithm to Classification of Wikipedia ArticlesThis was a project for a class in data mining, which I took in Spring 2007 at Arizona State University. In this project I experimented with applying the K-nearest neighbor algorithm to a text classification problem. The text corpus used for classification consisted of 10.000 articles from the English set of Wikipedia. The categories assigned to these documents were part of a set of 19.043 categories manually assigned to these documents by Wikipedia users. The entire set of 10.000 documents was used as the training set, after which the k-nn algorithm was applied to a much smaller set of Wikipedia documents to automatically assign the categories. The purpose of this experiment was to find a way to automatically classify Wikipedia articles using the Wikipedia ontology, which, potentially, might be used to simplify categorizing new Wikipedia documents. Here's all the code, including the data (zip, 16.1MB). MotivationWikipedia has been growing at a very fast rate: in 2006 the English Wikipedia consisted of about 745.000 articles, a year later it had 1.618.00 articles! Wikipedia Categories is a manually-maintained ontology of categories dealing with “everything and everyone.” The purpose of this ontology is to better organize Wikipedia and provide its users with better browsing options. Why bother with auto-classification?
In this project, I experimented with automatically assigning categories to articles, which, ideally, might solve (or reduce) these problems. Learning How to Do Research: Change of ScaleIn the course of the semester, I made two significant changes to the concept of the project. Scaling down the text corpus
Had it been a real-world application, meant for real use instead of research purposes, the size of the text corpus would matter. I decided to scale down the text corpus size to make the project manageable and to be able to run much more test cases with different weights and implementation details. A different set of categories
Implementation stepsAlgorithm design
KNN is a lazy learner, which means that it leaves most of the work until the last moment: instead of building a classification model, it organizes the training set and runs a comparison function on an item to be classified only when the item is submitted for classification. I used the cosine similarity metric. For each document in the corpus I pre-computed the document TF x IDF vector. Then, for each item to be classified, the algorithm computed its similarity to each document in the corpus and returned the top k similar documents. It then selected a set of categories from this set of k results and assigned the categories which appeared at least c times. So, for example if we use 10-nearest neighbors and c is 9, then we'll assign all categories which appear at least in 9 documents. Data transformation
<page> <id>42</id> <title>Interesting Document</title> <text>Fascinating text………</text> </page> After that, I generated the following source files for each set:
Data cleaning and feature selection
Indexing
I did not modify the vector space model to include term weights based on Wikipedia markup due to time constraints. I plan to do that in this project ResultsThe newly assigned categories were compared to the manually assigned categories in terms of precision, recall and their F-measure (harmonic mean). For example, if the system assigned category A to 5 of the 10 documents in the testing set (3 correct, 2 incorrect, and 4 - missed), that would give recall = 3/(3+4)=3/7 and precision = 3/(3+2)=3/5. The weights of k and c were systematically reassigned and the algorithm rerun on multiple combinations of testing documents until the optimal combination was identified. Based on the evaluation results, the optimal weights were: k=50, c=5. However, the overall results were dissapointing:
For future work, I’ll consider more elaborate methods of comparing documents and category assignments, taking into consideration Wikipedia linkage structure and HTML markup. |