Thesis Title:
A STUDY ON DOCUMENT RELATION DISCOVERY USING FREQUENT ITEMSET MINING
Author:
Kritsada Sriphaew
School:
School of Information and Computer Technology
Committees:
Assoc. Prof. Dr. Thanaruk Theeramunkong (Advisor)
Assoc. Prof. Dr. Stanislav S. Makhanov (Co-Advisor)
Assoc. Prof. Dr. Ekawit Nantajeewarawat (Member)
Asst. Prof. Dr. Junalux Chalidabhongse (Member)
Abstract:
Scientific publications available in digital libraries are potentially the
world's largest knowledge source, but there have been very few attempts
to take advantage of this kind of document. One traditional kind of knowledge
that is useful for retrieving desired information, understanding the nature of
document contents, and revealing hidden information within a set of documents
is the set of relations among the documents in the collection. Although relations among
technical documents are distinctively useful, there is no trustworthy automatic
approach to evaluating the quality of discovered relations. Extending the
relationship between a pair of documents, a document relation can involve more
than two documents, in which case the scope of the related contents becomes
more general, depending on the co-occurring contents. Applications of document relation
discovery include a system that automatically discovers related articles for
literature review, an assistant system for article authoring, and a novel search
engine that takes a set of documents as a query instead of a set of keywords
or a single document, as in conventional methods. To discover good document
relations, this thesis extends frequent itemset mining to discover document
relations on an attribute-value database in which the values are real-valued
weights rather than the boolean values used in the conventional
method. The goals of the thesis are: (1) to study how well the word-based approach
performs in finding relations among documents using frequent itemset mining
techniques, (2) to propose a method to automatically evaluate the discovered
document relations using a citation graph, and (3) to invent a measure for
automatically evaluating the quality of the discovered document relations. The
approach is applied to discover word-based relations among scientific
publications. The proposed method is evaluated using a set of scientific
publications in a digital library to judge the quality of discovered document
relations based on their references (citations). Using the concept of
transitivity over direct and indirect citations, the thesis introduces a series of
evaluation criteria, called order accumulative citation matrices, to define the
validity (quality) of discovered relations. Two kinds of validity, called soft
validity and hard validity, are presented to express the quality of the
discovered relations. For the purpose of impartial comparison, the expected
validity is statistically estimated based on the generative probability of each
document relation pattern. The experimental results show that document
relations discovered using a bigram model as the term definition are more valid
than those discovered using a unigram model. Stopword removal is an important
scheme for filtering unnecessary terms in the process of representing document content.
The results also show that the proposed method successfully discovers a set of
document relations whose quality is significantly better than expected. A human
evaluation of sampled document relations confirms that the proposed automatic
evaluation method based on citation information is a promising approach to
evaluating the quality of document
relations. Moreover, extending the term weighting scheme can enhance the
quality of discovered document relations: inverse document frequency
performs well at discovering highly valid relations in the collection.
Furthermore, augmented normalized term frequency helps to discover
good-quality relations at higher ranks, while the bigram term frequency
performs well at any rank of discovered document relations.
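
The idea of mining frequent itemsets over real-valued weights, as described in the abstract, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: documents are treated as items and terms as transactions, the support of a document set is taken as the sum over terms of the minimum weight among its documents, normalized by the sum over terms of the maximum weight in the collection, and the names `support`, `frequent_docsets`, and `minsup` are hypothetical. Under this definition the support never increases when a document is added to a set, so Apriori-style pruning remains applicable.

```python
from itertools import combinations


def support(docset, weights, terms):
    """Illustrative weighted support of a document set (an assumed definition).

    weights maps each document id to a dict of {term: real-valued weight}.
    """
    # Weight shared by all documents in the set, summed over terms.
    shared = sum(min(weights[d].get(t, 0.0) for d in docset) for t in terms)
    # Normalizer: the largest weight of each term anywhere in the collection.
    total = sum(max(w.get(t, 0.0) for w in weights.values()) for t in terms)
    return shared / total if total else 0.0


def frequent_docsets(weights, minsup):
    """Level-wise (Apriori-style) search for document sets with support >= minsup."""
    terms = sorted({t for w in weights.values() for t in w})
    result = {}
    level = [frozenset([d]) for d in sorted(weights)]
    while level:
        kept = []
        for docset in level:
            s = support(docset, weights, terms)
            if s >= minsup:
                result[docset] = s
                kept.append(docset)
        # Join frequent k-sets into (k+1)-sets; keep a candidate only if
        # every one of its k-subsets is frequent.
        kept_set = set(kept)
        candidates = set()
        for a, b in combinations(kept, 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in kept_set
                for sub in combinations(union, len(a))
            ):
                candidates.add(union)
        level = sorted(candidates, key=sorted)
    return result


# Hypothetical toy collection: three documents with made-up term weights.
example = {
    "d1": {"mining": 1.0, "itemset": 0.5},
    "d2": {"mining": 0.5, "itemset": 0.5, "citation": 1.0},
    "d3": {"citation": 0.5},
}
found = frequent_docsets(example, minsup=0.3)
for docset, s in found.items():
    print(sorted(docset), round(s, 2))
```

In this toy collection, `d1` and `d2` share weight on two terms and therefore form a relation above the threshold, while `d3` falls below it on its own and is pruned before any larger set containing it is considered. With boolean weights (every weight 0 or 1) the same code reduces to ordinary frequent itemset counting, up to the normalizing constant.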