Research
Rather than reading a long document, we would sometimes prefer to read
a concise summary of it.
In our research group, we work on text summarization, developing
methods for generating such summaries.
We are specifically interested in mathematical models that emulate how
summaries are generated from input documents.
Maximum Coverage Summarization Model
We formulate text summarization as a maximum coverage problem: we represent each sentence as a set of conceptual units (words, in our setting) and generate a summary that covers as many conceptual units as possible.
[EACL2009]
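The idea can be illustrated with a simple greedy heuristic. This is only a minimal sketch, not the formulation or the solver used in the paper: the word-count budget `max_words`, the whitespace tokenization, and the function name `greedy_max_coverage` are all assumptions made for this example.

```python
def greedy_max_coverage(sentences, max_words):
    """Greedily pick sentences that add the most not-yet-covered words.

    Assumptions for this sketch: conceptual units are lowercased
    whitespace tokens, and the summary length limit is a word count.
    """
    units = [set(s.lower().split()) for s in sentences]
    lengths = [len(s.split()) for s in sentences]
    covered, chosen, used = set(), [], 0
    while True:
        best, best_gain = None, 0
        for i, unit_set in enumerate(units):
            # Skip sentences already chosen or too long for the remaining budget.
            if i in chosen or used + lengths[i] > max_words:
                continue
            gain = len(unit_set - covered)  # newly covered conceptual units
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # nothing fits or nothing adds coverage
            break
        chosen.append(best)
        covered |= units[best]
        used += lengths[best]
    # Return the chosen sentences in document order, plus the covered units.
    return [sentences[i] for i in sorted(chosen)], covered


docs = ["the cat sat on the mat", "the cat sat", "a dog barked loudly"]
summary, covered = greedy_max_coverage(docs, max_words=10)
print(summary)
```

In this toy input the second sentence is never selected, because every word in it is already covered by the first one.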
Summarization Model based on Facility Location Problem
We proposed a text summarization model based on the facility location problem, in which a summary is generated so that the input documents as a whole are entailed by the summary. This method makes good use of entailment relations between sentences.
[CIKM2010]
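A facility-location objective of this kind can be sketched as follows. This is an illustration under strong assumptions, not the model from the paper: `entail_score` is a hypothetical word-overlap stand-in for a real entailment measure, the summary size is a fixed number of sentences `k`, and the greedy search is a swapped-in heuristic.

```python
def entail_score(summary_sent, doc_sent):
    """Hypothetical stand-in for entailment: the share of doc_sent's
    words that also appear in summary_sent (asymmetric by design)."""
    a = set(summary_sent.lower().split())
    b = set(doc_sent.lower().split())
    return len(a & b) / len(b) if b else 0.0


def greedy_facility_location(sentences, k):
    """Pick up to k sentences so that every input sentence is covered
    as well as possible by its best-matching summary sentence."""
    n = len(sentences)
    # score[i][j]: how well candidate sentence j covers input sentence i.
    score = [[entail_score(sentences[j], sentences[i]) for j in range(n)]
             for i in range(n)]
    best_cover = [0.0] * n  # current coverage of each input sentence
    chosen = []
    for _ in range(min(k, n)):
        best_j, best_gain = None, 0.0
        for j in range(n):
            if j in chosen:
                continue
            # Improvement in the objective if sentence j is added.
            gain = sum(max(score[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break
        chosen.append(best_j)
        best_cover = [max(best_cover[i], score[i][best_j]) for i in range(n)]
    return sorted(chosen), sum(best_cover)


docs = ["the team won the final match", "the team won",
        "rain delayed the match", "rain fell"]
print(greedy_facility_location(docs, k=2))
```

The objective value is the sum, over all input sentences, of the score of the best summary sentence for each one, which is what makes the summary "cover" the whole input rather than only its most frequent words.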
Twitter Summarization
We also work on Twitter summarization, where tweets on a certain topic are collected and a summary of those tweets is generated. In particular, we are developing a method for automatically generating live sports updates (e.g., for soccer) from the numerous tweets posted about a match.
[ECIR2011]
Summarization Model based on Sentence Compression and Sentence Selection
One limitation of sentence selection approaches to text summarization is that a selected sentence can contain unimportant parts. We proposed a method that generates a summary by performing sentence compression and sentence selection simultaneously. Specifically, we formulated the summarization problem as the extraction of dependency subtrees. We also proposed a new summarization method that makes use of the rhetorical structure of the document.
[ACL2013, ACL2014]
The nature of reputation risk has been changing with the advent of the internet. Rumors used to be passed from person to person individually; groundless stories tended not to spread because each person judged their reliability. Nowadays, however, it is surprisingly easy to spread a rumor to the public by means of email, web bulletin boards, or social networking services (SNS). If somebody writes on Twitter "a computer of *** corporation easily gets out of order", it can reduce the sales of that computer. There can also be offensive posts about other organizations such as schools and universities, or about individuals. Conversely, there can also be favorable posts. We need to protect ourselves against bad, groundless reputations, and at the same time make good use of the opinions on the internet.
In our research group, we work on sentiment analysis, also known as opinion analysis. We are developing methods for collecting, classifying (as "positive" or "negative"), and summarizing opinions on the internet.
Sentiment Polarity of Words
A fundamental resource for sentiment analysis is a list of the sentiment polarities of words. To construct such a resource, we took an approach based on statistical mechanics. In our method, we first collect word pairs that are likely to have the same polarity from a dictionary, a thesaurus, and a corpus. We then construct a large lexical network of word nodes by connecting those pairs. By regarding this network as an Ising spin model, where the polarity of each word corresponds to the spin of an electron, we estimate the state of the network using the mean-field approximation and extract the sentiment polarities of words.
[SIGNL-166, NLP2005, ACL2005, NLP2011]
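The mean-field update on such a network can be sketched in a few lines. This is a minimal illustration under assumptions, not the procedure from the papers: seed words are simply clamped to +1 or -1, the inverse temperature `beta` and the iteration count are arbitrary, and the tiny network below is invented for the example.

```python
import math


def mean_field_polarity(edges, seeds, beta=1.0, iterations=50):
    """Estimate average spins (polarities in [-1, 1]) on a word network.

    edges: {(word_a, word_b): weight} for pairs likely to share polarity.
    seeds: {word: +1 or -1}; clamping them is an assumption of this sketch.
    """
    neighbors = {}
    for (a, b), w in edges.items():
        neighbors.setdefault(a, []).append((b, w))
        neighbors.setdefault(b, []).append((a, w))
    # Start non-seed words at 0 (no preference for either polarity).
    m = {word: float(seeds.get(word, 0.0)) for word in neighbors}
    for _ in range(iterations):
        for word in sorted(neighbors):
            if word in seeds:
                continue
            # Mean-field update: each spin feels the average spin of its neighbors.
            field = sum(w * m[other] for other, w in neighbors[word])
            m[word] = math.tanh(beta * field)
    return m


edges = {("good", "nice"): 1.0, ("nice", "pleasant"): 1.0, ("bad", "awful"): 1.0}
polarity = mean_field_polarity(edges, seeds={"good": 1, "bad": -1})
print({w: round(v, 2) for w, v in polarity.items()})
```

Words connected to a positive seed, even indirectly as "pleasant" is here, end up with a positive average spin, and the sign of that value is read off as the word's polarity.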
Sentiment Polarity of Phrases
In this work, we deal with the sentiment polarities of phrases consisting of a noun and an adjective. This is a little more complex than the case of words, because the sentiment polarity of such a phrase is not the mere sum of the polarities of the two words. Consider the example "risk is low". Its polarity is positive, although the polarity of "risk" is negative. We use a latent variable model to represent the polarity of phrases, which serves as a computational framework for phrase polarity.
[SIGNL-168, EACL2006, NAACL2007]
Sentiment Polarity of Documents
The task of finding the sentiment polarity of a document is also called the sentiment classification of documents. We proposed using word subsequences and dependency subtrees extracted from the input documents as features for supervised classifiers. We also proposed a model that represents the polarity shift of words, the phenomenon in which the polarity of a word changes from negative to positive (or vice versa) depending on the context.
[PAKDD'05, IJCNLP2008]
Use of Sound Symbolism for Sentiment Classification
We present a method for estimating the sentiment polarity of Japanese sentences that contain onomatopoeic words. We use the vocal sound features of onomatopoeic words as features for supervised sentiment classification.
[PRICAI 2012]
We are also working on a number of other tasks in sentiment analysis, such as the extraction of evaluative objects and evaluative attributes, and the agreement/disagreement classification of opinions.
Many kinds of relations hold between words, clauses, sentences, and documents in natural language texts. To understand the meaning of such texts, these relations have to be recognized. In our research group, we work on recognizing such relations.
Anaphora Resolution
In linguistics, anaphora is a phenomenon in which the meaning of one expression, called an anaphor, depends on another expression in the context. The correct interpretation of anaphora is vital for natural language understanding. We have tackled anaphora resolution in Japanese texts, especially zero anaphora and associative anaphora, on the basis of knowledge acquired from a large corpus.
[EMNLP2009, IJCNLP2011]
Coherence Model
To understand text, we have to recognize not only relations between words or sentences but also topic coherence. In addition, techniques for evaluating local coherence are useful for text correction and proofreading. We therefore proposed a local coherence model for Japanese text that leverages tendencies in how the syntactic roles of textual entities change from one sentence to the next.
[CICLing 2010]
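Role-transition statistics of this kind can be computed as in the following sketch. It illustrates the general entity-grid idea only and is not the proposed model for Japanese: the role labels ("S" for subject, "O" for object, "-" for absent) and the toy grid are assumptions made for this example.

```python
from collections import Counter


def transition_distribution(grid):
    """Relative frequency of each syntactic role transition.

    grid: {entity: [role in sentence 1, role in sentence 2, ...]}.
    The role inventory is an assumption of this sketch.
    """
    counts = Counter()
    for roles in grid.values():
        # Count how each entity's role changes between adjacent sentences.
        for prev, cur in zip(roles, roles[1:]):
            counts[(prev, cur)] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


# Toy three-sentence text with two entities.
grid = {"Taro": ["S", "S", "-"], "book": ["O", "-", "S"]}
for transition, prob in sorted(transition_distribution(grid).items()):
    print(transition, prob)
```

A coherence model can then compare the distribution observed in a text with the distribution typical of well-formed texts; for instance, an entity that stays in subject position across sentences is usually a sign of a locally coherent passage.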
Cross-Document Relations between Sentences
A pair of sentences in different newspaper articles on the same event can stand in one of several relations. Two examples are equivalence, where the two sentences convey the same information on the event, and transition, where they convey the same information except for the values of numeric attributes. We focused on these two relations and proposed methods for identifying them.
[IJCNLP2008]
Knowledge Acquisition for Case Alternation
Predicate-argument structure analysis is one of the fundamental techniques for many natural language applications. In Japanese, the relationship between a predicate and its argument is usually expressed by case particles. However, since case particles vary depending on voice, we have to take case alternation into account to represent predicate-argument structure. We therefore work on automatic knowledge acquisition for case alternation between the passive/causative and active voices, leveraging large lexical case frames obtained from a large Web corpus together with several alternation patterns.
[EMNLP2013]
Reputations now spread quickly on the WWW, because anyone can easily send a message to the world through social media such as blogs and Twitter. We are therefore developing methods for finding out what information attracts people's attention and what opinions they hold. We have developed a system built on the following technologies: automatic blog collection and monitoring, trend analysis and sentiment analysis of blogs, and attribute identification of bloggers.
Generating Live Sports Updates from Twitter
Many Twitter users post their opinions, impressions, and statuses about televised events such as sports events. However, since the volume of such posts is extremely large, it takes a lot of time and effort to understand what is happening during an event. We propose a method of generating live sports updates from Twitter posts on an event. Our method selects descriptive and prompt tweets that are posted shortly after important subevents by exploiting users called good reporters, who promptly explain what is happening at each moment throughout the event.
[WI2013]
Attribute Identification of Bloggers
Blog classification (e.g., identifying bloggers' gender or age) is one of the most interesting current problems in blog analysis. Although this problem is usually solved by applying supervised learning techniques, the large labeled dataset required for training is not always available. In contrast, unlabeled blogs can easily be collected from the web. We therefore proposed a semi-supervised learning method for blog classification that makes effective use of unlabeled data. In this method, entries from the same blog are assumed to have the same characteristics. With this assumption, the proposed method captures the characteristics of each blog, such as writing style and topic, and uses these characteristics to improve the classification accuracy.
[Proceedings of the 23rd National Conference on Artificial Intelligence, Volume 2, pp. 1156-1161, 2008]
Detecting Bursty Words from Blogs
We proposed a method for extracting the "burst" of a word that is related to a popular topic in a document stream. A document stream is defined as a sequence of documents that arrive in temporal order, and we regard blogs and BBSs as document streams in order to apply the method originally proposed by Kleinberg. However, since Kleinberg's algorithm cannot be applied to document streams in which the distribution of documents is not uniform, we extended the method so that it can be applied to blogs and BBSs.
[First International Workshop on Knowledge Discovery on Data Streams, 2004]
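For readers unfamiliar with burst detection, a two-state automaton in the style of Kleinberg's batched formulation can be sketched as below. This is not our extension for non-uniform streams, whose details are not described here; the function name `detect_bursts`, the parameter defaults `s=2.0` and `gamma=1.0`, and the toy counts are assumptions made for this example.

```python
import math


def detect_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Label each batch 0 (normal) or 1 (burst) by minimizing total cost.

    relevant[t]: documents in batch t containing the word.
    totals[t]:   all documents in batch t.
    """
    n = len(relevant)
    total = sum(totals)
    if total == 0:
        return [0] * n
    p0 = sum(relevant) / total            # base rate of the word
    if not 0.0 < p0 < 1.0:
        return [0] * n                    # degenerate stream: no burst
    rates = (p0, min(s * p0, 0.9999))     # the burst state has a raised rate

    def fit_cost(p, r, d):
        # Negative log binomial likelihood of r relevant documents out of d.
        log_choose = (math.lgamma(d + 1) - math.lgamma(r + 1)
                      - math.lgamma(d - r + 1))
        return -(log_choose + r * math.log(p) + (d - r) * math.log(1.0 - p))

    def move_cost(a, b):
        # Entering the burst state is penalized; leaving it is free.
        return gamma * math.log(n) if b > a else 0.0

    best = [0.0, math.inf]                # the automaton starts in state 0
    back = []
    for r, d in zip(relevant, totals):    # Viterbi over the two states
        step, pointers = [], []
        for b in (0, 1):
            prev = min((0, 1), key=lambda a: best[a] + move_cost(a, b))
            step.append(best[prev] + move_cost(prev, b)
                        + fit_cost(rates[b], r, d))
            pointers.append(prev)
        best = step
        back.append(pointers)
    state = 0 if best[0] <= best[1] else 1
    states = []
    for pointers in reversed(back):       # trace the cheapest path backwards
        states.append(state)
        state = pointers[state]
    return states[::-1]


counts = [1, 1, 1, 10, 12, 11, 1, 1]      # toy daily counts of one word
print(detect_bursts(counts, [100] * 8))
```

The penalty for entering the burst state keeps a single noisy batch from being reported as a burst, while a sustained rise in the word's rate pays for the transition.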
Furthermore, we have worked on several tasks targeting community-based question-answering services and microblogs.
We are working on the following themes in addition to the above.
Automatic Generation of Distinctive Explanation for Kanji
The phonetic alphabet enables people to dictate letters of the alphabet accurately by using representative words, e.g., A for Alpha. Japanese kanji (ideographic Chinese characters) vastly outnumber the letters of the Roman alphabet, and thus Japanese requires an explanatory reading, i.e., a distinctive explanation, that plays the role of a phonetic alphabet. We propose a corpus-based method for automatically generating distinctive explanations for a kanji, in which information about the familiarity and homophones of kanji is taken into consideration.
[COLING2012]
Morphological Analysis for Noisy Text
In recent years, Consumer Generated Media (CGM) such as blogs and social networking services (SNS) have become prevalent, and we thus have to deal with texts written by a wide variety of authors. Since these texts contain many types of non-standard tokens, such as abbreviations and phonetic substitutions, conventional text analysis tools do not perform well on them. To alleviate this problem, we proposed a simple but effective approach to unknown word processing in Japanese morphological analysis, which handles unknown words that are derived from words in a pre-defined lexicon as well as unknown onomatopoeias.
[IJCNLP2013]