Additional explanation for Formal run evaluation
(Last updated on Dec. 28, 2000)
Evaluation methods for each subtask have been described in the Task
Description; here we add the following supplementary explanation regarding
the Formal run evaluation. We used almost the same evaluation methods as in
the Dryrun evaluation.
Subtask A-1
No additional explanation.
Subtask A-2 (content-based evaluation)
Both the human-produced and the system-produced summaries are
morphologically analyzed with Juman, and only the content words are
extracted. The distance between the word-frequency vector of the human
summary and that of the system summary is then computed, and we use it to
see how close the two summaries are in terms of their content words (a code
sketch follows the conditions below).
<Conditions>
- Juman version 3.61 is used.
- Content words are words whose part of speech is one of the following:
  noun, verb, adjective, unknown.
- Each element of the word-frequency vector is the tf*idf value of a
  content word.
- To compute df, we use the results of morphological analysis of all the
  articles on the Mainichi newspaper CD-ROMs (1994 or 1998 versions).
- Cosine distance is used for computing the distance.
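To make the computation concrete, here is a minimal Python sketch of the
tf*idf cosine computation, assuming the content words have already been
extracted with Juman. The word lists, the df table, and the collection size
N below are hypothetical stand-ins for the actual Mainichi-derived
statistics, and the idf formula log(N/df) is one common choice; the exact
formula used in the evaluation is not specified here. The sketch computes
cosine similarity (higher = closer); one minus this value gives a distance.

  import math

  def tfidf_vector(words, df, n_docs):
      """Build a sparse tf*idf vector (dict) from a list of content words."""
      tf = {}
      for w in words:
          tf[w] = tf.get(w, 0) + 1
      # idf = log(N / df); an assumed formula, df defaults to 1 if unseen
      return {w: c * math.log(n_docs / df.get(w, 1)) for w, c in tf.items()}

  def cosine(u, v):
      """Cosine measure between two sparse vectors."""
      dot = sum(x * v.get(w, 0.0) for w, x in u.items())
      norm_u = math.sqrt(sum(x * x for x in u.values()))
      norm_v = math.sqrt(sum(x * x for x in v.values()))
      return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

  # Hypothetical content words (as extracted by Juman) and df counts
  human_words = ["首相", "訪問", "会談", "訪問"]
  system_words = ["首相", "会談", "発表"]
  df = {"首相": 5000, "訪問": 8000, "会談": 6000, "発表": 9000}
  N = 100000  # hypothetical number of newspaper articles used for df

  sim = cosine(tfidf_vector(human_words, df, N),
               tfidf_vector(system_words, df, N))
  print(f"cosine similarity = {sim:.3f}")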
We have two kinds of human-produced summaries for Subtask A-2.
- Freely summarized texts
- Summaries produced by selecting important parts of the sentences in
the text
Both kinds were used for the Formal run.
Subtask A-2 (subjective evaluation)
The following four kinds of summaries as well as the original texts are
prepared.
- Summaries produced by selecting important parts of the sentences in the
text
- Freely summarized texts
- Summaries produced by a system
- Summaries produced by a tf-based method
First, the evaluator (one person) reads the original text and its summaries
(four kinds). Then he or she evaluates and scores them in terms of how
readable they are and how well the content of the original text is conveyed
by the summary. Each score is 1, 2, 3, or 4, where 1 is the best and 4 is
the worst; that is, the lower the score, the better the evaluation.
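As a purely illustrative sketch of how such scores could be summarized, the
snippet below averages the 1-to-4 scores per summary kind over a set of
texts. The document does not specify the aggregation actually used, and all
names and numbers here are hypothetical.

  from statistics import mean

  # Hypothetical 1-4 scores given by the evaluator, one per original text
  readability = {
      "part-selection summary": [1, 2, 1, 2],
      "free summary":           [1, 1, 2, 1],
      "system summary":         [3, 2, 4, 3],
      "tf-based summary":       [2, 3, 3, 4],
  }
  for kind, scores in readability.items():
      # Lower is better: 1 = best, 4 = worst
      print(f"{kind}: mean score = {mean(scores):.2f}")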
Subtask B
- First, we provide the subjects (36 students) with the queries and the
  texts retrieved for each query.
- The subjects judge whether each text is relevant to the query by reading
  its summary.
- Evaluation measures:
- Experiment data
  - number of topics: 12
  - 50 texts per topic (600 texts in total)
  - The 36 subjects are divided into 12 groups of three.
  - A subject evaluates a given topic only once.
  - A subject evaluates a given system only once.
  - The same combination of topic and system is evaluated by only one
    group.
  - The combinations are arranged to be as fair as possible with respect to
    the order of systems and topics (see the sketch after this list).
  - The texts are presented to the subjects in random order.
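The constraints above (each group meets each topic once and each system
once, and each topic-system combination is judged by exactly one group) are
those of a Latin square. The number of systems is not stated here, so the
sketch below hypothetically assumes it equals the number of groups and
topics (12); a cyclic Latin square then satisfies all three constraints.

  def latin_square(n):
      """Entry [g][t] is the system that group g evaluates on topic t.
      Each row and each column contains every system exactly once."""
      return [[(g + t) % n for t in range(n)] for g in range(n)]

  assignment = latin_square(12)
  # Every (topic, system) combination is covered by exactly one group:
  pairs = {(t, row[t]) for row in assignment for t in range(12)}
  assert len(pairs) == 12 * 12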
For each topic, the relevance judgments have three levels (A, B, C).
We produced two kinds of evaluation results: one in which only level A is
regarded as relevant, and one in which levels A and B are regarded as
relevant (see the sketch below).
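The following sketch shows the two binarization settings, together with the
averaging over the three subjects of a group mentioned at the end of this
section. Since the evaluation measures are not restated here, per-subject
agreement with the binarized gold judgment is used as a purely illustrative
stand-in, and all data are hypothetical.

  GOLD = {"text01": "A", "text02": "B", "text03": "C"}  # gold levels (hypothetical)

  def is_relevant(level, include_b):
      # Binarize a three-level judgment: level A only, or levels A and B
      return level == "A" or (include_b and level == "B")

  # Relevant/irrelevant judgments by the three subjects of one group
  subjects = [
      {"text01": True, "text02": True,  "text03": False},
      {"text01": True, "text02": False, "text03": False},
      {"text01": True, "text02": True,  "text03": True},
  ]

  def agreement(judged, include_b):
      # Fraction of texts where the subject matches the binarized gold level
      hits = sum(judged[t] == is_relevant(g, include_b) for t, g in GOLD.items())
      return hits / len(GOLD)

  for include_b, label in ((False, "level A only"), (True, "levels A and B")):
      scores = [agreement(j, include_b) for j in subjects]
      print(f"{label}: {sum(scores) / len(scores):.3f}")  # average of three subjects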
We created new evaluation data from the Mainichi newspaper CD-ROM (1998
version) and used it for the experiment.
The topics of the evaluation data were selected from the IREX IR Text
Collection (http://www.csl.sony.co.jp/person/sekine/IREX).
Evaluation results in all cases are averages over the three subjects.
Co-chairs of the Text Summarization Task
Manabu Okumura : oku@pi.titech.ac.jp
Takahiro Fukusima : fukusima@res.otemon.ac.jp
Complaints and advice to tsc-admin@recall.jaist.ac.jp