Additional explanation for Dryrun evaluation
(Last updated on Oct. 30, 2000)
Evaluation methods for each subtask have been described in the Task
Description; however, we add here some additional explanation regarding
the Dryrun evaluation conducted in September. Almost the same evaluation
methods will be used for the formal run evaluation.
Subtask A-1
No additional explanation.
Subtask A-2 (content-based evaluation)
Both the human-produced and the system-produced summaries are
morphologically analyzed with Juman, and only the content words are
extracted. The distance between the word-frequency vector of the human
summary and that of the system summary is then computed, and we use it to
see how close the summaries are in terms of their content words (a sketch
of this computation follows the conditions below).
<Conditions>
- Juman version 3.61 is used
- Content words are words whose part-of-speech is one of the following:
noun, verb, adjective, unknown
- Each element of the word-frequency vector is the tf*idf value of the
  corresponding content word
- To compute df, we use the results of morphological analysis of all the
  articles in the Mainichi newspaper CD-ROMs (1994 and 1995 editions).
- Cosine distance is used as the distance measure.
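
As a rough illustration only, the following Python sketch shows how such a
score could be computed. The content words, the document-frequency table,
and the corpus size are placeholders; the actual evaluation uses the content
words from Juman output and document frequencies from the Mainichi
1994/1995 articles.

  import math
  from collections import Counter

  def tfidf_vector(content_words, df, num_docs):
      # Build a tf*idf-weighted vector (word -> weight) from a list of
      # content words.  `df` maps a word to its document frequency in the
      # reference corpus (here a placeholder for the Mainichi counts).
      tf = Counter(content_words)
      return {w: tf[w] * math.log(num_docs / df.get(w, 1)) for w in tf}

  def cosine(v1, v2):
      # Cosine measure between two sparse vectors (dicts of word -> weight).
      dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
      norm1 = math.sqrt(sum(x * x for x in v1.values()))
      norm2 = math.sqrt(sum(x * x for x in v2.values()))
      return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

  # Hypothetical content words (in practice, the nouns, verbs, adjectives,
  # and unknowns extracted from Juman output for each summary).
  human_words  = ["government", "economy", "recovery", "announce"]
  system_words = ["government", "economy", "policy", "announce"]
  df = {"government": 5000, "economy": 1200, "recovery": 800,
        "policy": 1500, "announce": 4000}   # placeholder df counts
  num_docs = 200000                         # placeholder corpus size

  print(cosine(tfidf_vector(human_words, df, num_docs),
               tfidf_vector(system_words, df, num_docs)))
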
We have two kinds of human-produced summaries for Subtask A-2.
- Freely summarized texts
- Summaries produced by selecting important parts of the sentences in
the text
The content-based evaluation for the Dryrun is based on comparison with the
latter. Both kinds will be used for the formal run.
Subtask A-2 (subjective evaluation)
The following four kinds of summaries as well as the original texts are
prepared.
- Summaries produced by selecting important parts of the sentences in the
text
- Freely summarized texts
- Summaries produced by a system
- Summaries produced by the lead method
First, the evaluator (one person) reads the original text and its four kinds
of summaries. The evaluator then scores each summary in terms of how
readable it is and how well it describes the content of the original text.
Each score is 1, 2, 3, or 4, where 1 is the best and 4 is the worst; i.e.,
the lower the score, the better the evaluation.
Subtask B
- First, we provide the subjects (30 students) with queries and the texts
  retrieved for those queries.
- The subjects judge whether the texts are relevant to the query by reading
  their summaries.
- Evaluation measures:
- Time: how long it took to finish the task
 - Measures of how well the task is performed:
   Recall, Precision, and F-measure are used (see the sketch after this list).
   Recall = the number of texts that the subjects correctly judged as relevant /
            the total number of relevant texts
   Precision = the number of texts that the subjects correctly judged as relevant /
               the total number of texts that the subjects judged as relevant
   F-measure = 2 * Recall * Precision / (Recall + Precision)
- Experiment data
 - number of topics: 10
 - 30 texts per topic (300 texts in total)
- 30 subjects are divided into 10 groups of three (i.e. each group has
three subjects)
- One subject evaluates one topic once
- One subject evaluates one system once
- The same combination (of topic and system) is evaluated by only one
group
 - The combinations are designed to be as fair as possible with respect to
   the order of systems and topics.
- The texts are given to the subjects at random.
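
For reference, the following Python sketch shows the Recall/Precision/
F-measure computation described above for a single subject and a single
topic; the text IDs and judgment sets are made up for illustration.

  def prf(judged_relevant, gold_relevant):
      # Recall, Precision, and F-measure for one subject's judgments.
      # judged_relevant: set of text IDs the subject marked as relevant
      # gold_relevant:   set of text IDs that are actually relevant
      correct = len(judged_relevant & gold_relevant)
      recall = correct / len(gold_relevant) if gold_relevant else 0.0
      precision = correct / len(judged_relevant) if judged_relevant else 0.0
      f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
      return recall, precision, f

  # Hypothetical example for one topic with 30 texts (IDs 1..30).
  gold_relevant   = {1, 3, 5, 8, 13}   # texts actually relevant (placeholder)
  judged_relevant = {1, 3, 7, 8}       # texts one subject marked as relevant
  print(prf(judged_relevant, gold_relevant))   # -> (0.6, 0.75, 0.666...)
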
There are three levels (A, B, C) of relevance judgment for each topic.
We have produced two kinds of evaluation results: one in which only level A
is regarded as relevant, and one in which levels A and B are regarded as
relevant.
We used the IREX IR Text Collection for the experiment data
(http://www.csl.sony.co.jp/person/sekine/IREX).
Evaluation results in all cases are the average values over the three subjects.
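
To make the last two points concrete, the following sketch scores one
(topic, system) combination under the two relevance treatments (level A
only, and levels A and B) and averages the F-measure over the three subjects
of a group. The level assignments and judgment sets are placeholders, and
the prf function repeats the definitions from the previous sketch.

  import statistics

  def prf(judged, gold):
      # Same definitions as in the previous sketch.
      correct = len(judged & gold)
      r = correct / len(gold) if gold else 0.0
      p = correct / len(judged) if judged else 0.0
      return r, p, (2 * r * p / (r + p) if r + p else 0.0)

  # Placeholder relevance levels (A, B, C) for the texts of one topic.
  levels = {1: "A", 2: "C", 3: "B", 4: "A", 5: "C", 6: "B"}
  # One judgment set per subject in the group of three (placeholder data).
  judgments = [{1, 3, 4}, {1, 4}, {1, 3, 4, 6}]

  for accepted in ({"A"}, {"A", "B"}):   # A only vs. A and B regarded as relevant
      gold = {t for t, lv in levels.items() if lv in accepted}
      avg_f = statistics.mean(prf(judged, gold)[2] for judged in judgments)
      print(sorted(accepted), "average F over the three subjects:", round(avg_f, 3))
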
Co-chairs of the Text Summarization Task
Manabu Okumura : oku@pi.titech.ac.jp
Takahiro Fukusima : fukusima@res.otemon.ac.jp
Please send complaints and advice to tsc-admin@recall.jaist.ac.jp