Additional explanation for Dryrun evaluation
(Last updated on Oct. 30, 2000)
Evaluation methods for each subtask have been described in the Task
Description; however, we add here some additional explanation regarding
the Dryrun evaluation conducted in September. Almost the same evaluation
methods will be used for the formal run evaluation.
Subtask A-1
No additional explanation.
Subtask A-2 (content-based evaluation)
Both the human-produced and the system-produced summaries are
morphologically analyzed with Juman, and only the content words are
extracted. The distance between the word-frequency vector of the human
summary and that of the system summary is then computed, and we use it to
see how close the summaries are in terms of their content words (a sketch
of this computation follows the conditions below).
<Conditions>
- Juman version 3.61 is used
- Content words are words whose part-of-speech is one of the following:
noun, verb, adjective, unknown
- Each element of the word-frequency vector is the tf*idf value of the
  corresponding content word
- To compute df, we use the results of morphological analysis of all the
  articles in the Mainichi newspaper CD-ROMs (1994 and 1995 editions).
- Cosine distance is used as the distance measure.
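
As a rough illustration only, the following Python sketch shows how such a
score could be computed. The content words, the document-frequency table,
and the corpus size are placeholders; the actual evaluation uses the content
words from Juman output and document frequencies from the Mainichi
1994/1995 articles.

  import math
  from collections import Counter

  def tfidf_vector(content_words, df, num_docs):
      # Build a tf*idf-weighted vector (word -> weight) from a list of
      # content words.  `df` maps a word to its document frequency in the
      # reference corpus (here a placeholder for the Mainichi counts).
      tf = Counter(content_words)
      return {w: tf[w] * math.log(num_docs / df.get(w, 1)) for w in tf}

  def cosine(v1, v2):
      # Cosine measure between two sparse vectors (dicts of word -> weight).
      dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
      norm1 = math.sqrt(sum(x * x for x in v1.values()))
      norm2 = math.sqrt(sum(x * x for x in v2.values()))
      return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

  # Hypothetical content words (in practice, the nouns, verbs, adjectives,
  # and unknowns extracted from Juman output for each summary).
  human_words  = ["government", "economy", "recovery", "announce"]
  system_words = ["government", "economy", "policy", "announce"]
  df = {"government": 5000, "economy": 1200, "recovery": 800,
        "policy": 1500, "announce": 4000}   # placeholder df counts
  num_docs = 200000                         # placeholder corpus size

  print(cosine(tfidf_vector(human_words, df, num_docs),
               tfidf_vector(system_words, df, num_docs)))
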
We have two kinds of human-produced summaries for Subtask A-2.
- Freely summarized texts
- Summaries produced by selecting important parts of the sentences in
the text
The content-based evaluation for the Dryrun is based on comparison with the
latter. Both kinds will be used for the formal run.
Subtask A-2 (subjective evaluation)
The following four kinds of summaries as well as the original texts are
prepared.
- Summaries produced by selecting important parts of the sentences in the
text
- Freely summarized texts
- Summaries produced by a system
- Summaries produced by the lead method
First, the evaluator (one person) reads the original text and its four kinds
of summaries. The evaluator then scores each summary in terms of how
readable it is and how well it describes the content of the original text.
Each score is 1, 2, 3, or 4, where 1 is the best and 4 is the worst; i.e.,
the lower the score, the better the evaluation.
Subtask B
- First, we provide the subjects (30 students) with queries and the texts
  retrieved for those queries.
- The subjects judge whether the texts are relevant to the query by reading
  their summaries.
- Evaluation measures:
- Time: how long it took to finish the task
 - Measures of how well the task is performed:
   Recall, Precision, and F-measure are used (see the sketch after this list).
   Recall = the number of texts that the subjects correctly judged as relevant /
            the total number of relevant texts
   Precision = the number of texts that the subjects correctly judged as relevant /
               the total number of texts that the subjects judged as relevant
   F-measure = 2 * Recall * Precision / (Recall + Precision)
- Experiment data
 - number of topics: 10
 - 30 texts per topic (300 texts in total)
- 30 subjects are divided into 10 groups of three (i.e. each group has
three subjects)
- One subject evaluates one topic once
- One subject evaluates one system once
- The same combination (of topic and system) is evaluated by only one
group
 - The combinations are designed to be as fair as possible with respect to
   the order of systems and topics.
- The texts are given to the subjects at random.
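
For reference, the following Python sketch shows the Recall/Precision/
F-measure computation described above for a single subject and a single
topic; the text IDs and judgment sets are made up for illustration.

  def prf(judged_relevant, gold_relevant):
      # Recall, Precision, and F-measure for one subject's judgments.
      # judged_relevant: set of text IDs the subject marked as relevant
      # gold_relevant:   set of text IDs that are actually relevant
      correct = len(judged_relevant & gold_relevant)
      recall = correct / len(gold_relevant) if gold_relevant else 0.0
      precision = correct / len(judged_relevant) if judged_relevant else 0.0
      f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
      return recall, precision, f

  # Hypothetical example for one topic with 30 texts (IDs 1..30).
  gold_relevant   = {1, 3, 5, 8, 13}   # texts actually relevant (placeholder)
  judged_relevant = {1, 3, 7, 8}       # texts one subject marked as relevant
  print(prf(judged_relevant, gold_relevant))   # -> (0.6, 0.75, 0.666...)
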
There are three levels (A, B, C) of relevance judgment for each topic.
We have produced two kinds of evaluation results: one in which only level A
is regarded as relevant, and one in which levels A and B are regarded as
relevant.
We used the IREX IR Text Collection for the experiment data
(http://www.csl.sony.co.jp/person/sekine/IREX).
Evaluation results in all cases are the average values over the three subjects.
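
To make the last two points concrete, the following sketch scores one
(topic, system) combination under the two relevance treatments (level A
only, and levels A and B) and averages the F-measure over the three subjects
of a group. The level assignments and judgment sets are placeholders, and
the prf function repeats the definitions from the previous sketch.

  import statistics

  def prf(judged, gold):
      # Same definitions as in the previous sketch.
      correct = len(judged & gold)
      r = correct / len(gold) if gold else 0.0
      p = correct / len(judged) if judged else 0.0
      return r, p, (2 * r * p / (r + p) if r + p else 0.0)

  # Placeholder relevance levels (A, B, C) for the texts of one topic.
  levels = {1: "A", 2: "C", 3: "B", 4: "A", 5: "C", 6: "B"}
  # One judgment set per subject in the group of three (placeholder data).
  judgments = [{1, 3, 4}, {1, 4}, {1, 3, 4, 6}]

  for accepted in ({"A"}, {"A", "B"}):   # A only vs. A and B regarded as relevant
      gold = {t for t, lv in levels.items() if lv in accepted}
      avg_f = statistics.mean(prf(judged, gold)[2] for judged in judgments)
      print(sorted(accepted), "average F over the three subjects:", round(avg_f, 3))
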
Co-chairs of the Text Summarization Task
Manabu Okumura : oku@pi.titech.ac.jp
Takahiro Fukusima : fukusima@res.otemon.ac.jp
Please send complaints and advice to tsc-admin@recall.jaist.ac.jp