Thesis Title:

A STUDY ON DOCUMENT RELATION DISCOVERY USING FREQUENT ITEMSET MINING

Author:

Kritsada Sriphaew

School:

School of Information and Computer Technology

Committees:

Assoc. Prof. Dr. Thanaruk Theeramunkong Advisor
Assoc. Prof. Dr. Stanislav S. Makhanov Co-Advisor
Assoc. Prof. Dr. Ekawit Nantajeewarawat Member
Asst. Prof. Dr. Junalux Chalidabhongse Member

Abstract:

Scientific publications in digital libraries are potentially the world's largest knowledge source, yet there have been very few attempts to take advantage of this kind of document. The relations among a collection of documents are a traditional form of knowledge that is useful for retrieving desired information, understanding the nature of document contents, and revealing hidden information shared by a set of documents. Although relations among technical documents are distinctly useful, there is no trustworthy automatic approach to evaluating the quality of discovered relations. Extending the relationship between a document pair, a document relation can involve more than two documents, in which case the scope of the related contents becomes more general, depending on the co-occurring contents. Applications of document relation discovery include a system that automatically discovers related articles for literature review, an assistant system for article authoring, and a novel search engine that takes a set of documents as a query, instead of the set of keywords or the single document used in conventional methods. To discover good document relations, this thesis presents an extension of frequent itemset mining that discovers document relations on an attribute-value database in which the values are real-valued weights rather than the boolean values of the conventional method. The goals of the thesis are: (1) to study how well the word-based approach performs in finding relations among documents using frequent itemset mining techniques, (2) to propose a method to automatically evaluate the discovered document relations using a citation graph, and (3) to devise a measure for automatically evaluating the quality of the discovered document relations. The approach is applied to discover word-based relations among scientific publications.
The proposed method is evaluated on a set of scientific publications in a digital library, judging the quality of discovered document relations based on their references (citations). Using the concept of transitivity over direct and indirect citations, the thesis introduces a series of evaluation criteria, called order accumulative citation matrices, to define the validity (quality) of discovered relations. Two kinds of validity, called soft validity and hard validity, are presented to express the quality of the discovered relations. For impartial comparison, the expected validity is statistically estimated based on the generative probability of each document relation pattern. The experimental results show that document relations discovered using a bigram model as the term definition are more valid than those discovered using a unigram model. Stopword removal plays a significant role in filtering unnecessary terms when representing document content. The results also show that the proposed method successfully discovers a set of document relations whose quality is significantly better than the expected validity. Human evaluation of sampled document relations confirms that the proposed automatic evaluation method based on citation information is a promising approach to evaluating the quality of document relations. Moreover, extending the term weighting scheme can enhance the quality of discovered document relations: inverse document frequency performs well in discovering highly valid relations from the collection. Furthermore, augmented normalized term frequency helps to discover good-quality relations at higher ranks, while bigram term frequency performs well at any rank of discovered document relations.
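
As an informal illustration of the citation-based evaluation idea, the sketch below treats two documents as linked at order k when one can be reached from the other within k citation steps, so that indirect citations count through transitivity. It is only a sketch under stated assumptions, not the thesis's actual definitions: the abstract does not spell out the order accumulative citation matrices or the soft and hard validity measures, so the names accumulative_links and linked_pair_fraction, the choice to treat citations as undirected, and the pairwise score are all hypothetical.

```python
from itertools import combinations


def accumulative_links(citations, order):
    """Map each document to the documents reachable within `order` citation steps.

    `citations` maps a document id to the set of ids it cites.
    Assumption: a citation links both documents, so direction is ignored.
    """
    # Build a symmetric neighbour map from the direct citations.
    neighbours = {}
    for doc, cited in citations.items():
        neighbours.setdefault(doc, set())
        for other in cited:
            neighbours[doc].add(other)
            neighbours.setdefault(other, set()).add(doc)

    linked = {}
    for doc in neighbours:
        seen = {doc}
        frontier = {doc}
        # Expand one citation step at a time, up to `order` steps.
        for _ in range(order):
            frontier = {n for f in frontier for n in neighbours[f]} - seen
            seen |= frontier
        linked[doc] = seen - {doc}
    return linked


def linked_pair_fraction(docs, linked):
    """Fraction of document pairs in `docs` that are linked (hypothetical score)."""
    pairs = list(combinations(sorted(docs), 2))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if b in linked.get(a, set()))
    return hits / len(pairs)


# Toy citation data (made up for illustration): d1 cites d2, d2 cites d3.
citations = {"d1": {"d2"}, "d2": {"d3"}, "d4": set()}
relation = {"d1", "d2", "d3"}

for k in (1, 2):
    score = linked_pair_fraction(relation, accumulative_links(citations, k))
    print(f"order {k}: linked pair fraction = {score:.2f}")
```

In this toy data, d1 and d3 are not linked at order 1 but become linked at order 2 through d2, which is the kind of indirect citation that a higher order is meant to capture. A discovered relation such as {d1, d2, d3} would therefore score higher under the more permissive order.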