Additional explanation for Formal run evaluation
(Last updated on Dec. 28, 2000)
Evaluation methods for each subtask have been described in the Task
Description; here we add the following supplementary explanation regarding
the Formal run evaluation. We used almost the same evaluation methods as in
the Dryrun evaluation.
Subtask A-1
No additional explanation.
Subtask A-2 (content-based evaluation)
Both the human-produced and the system-produced summaries are
morphologically analyzed with Juman, and only the content words are
extracted. The distance between the word-frequency vector of the human
summary and that of the system summary is then computed, and we use it to
see how close the two summaries are in terms of their content words (a code
sketch follows the conditions below).
<Conditions>
- Juman version 3.61 is used.
- Content words are words whose part of speech is one of the following:
  noun, verb, adjective, unknown.
- Each element of the word-frequency vector is the tf*idf value of a
  content word.
- To compute df, we use the results of morphological analysis of all the
  articles on the Mainichi newspaper CD-ROMs (1994 or 1998 versions).
- Cosine distance is used for computing the distance.
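To make the computation concrete, here is a minimal Python sketch of the
tf*idf cosine computation, assuming the content words have already been
extracted with Juman. The word lists, the df table, and the collection size
N below are hypothetical stand-ins for the actual Mainichi-derived
statistics, and the idf formula log(N/df) is one common choice; the exact
formula used in the evaluation is not specified here. The sketch computes
cosine similarity (higher = closer); one minus this value gives a distance.

  import math

  def tfidf_vector(words, df, n_docs):
      """Build a sparse tf*idf vector (dict) from a list of content words."""
      tf = {}
      for w in words:
          tf[w] = tf.get(w, 0) + 1
      # idf = log(N / df); an assumed formula, df defaults to 1 if unseen
      return {w: c * math.log(n_docs / df.get(w, 1)) for w, c in tf.items()}

  def cosine(u, v):
      """Cosine measure between two sparse vectors."""
      dot = sum(x * v.get(w, 0.0) for w, x in u.items())
      norm_u = math.sqrt(sum(x * x for x in u.values()))
      norm_v = math.sqrt(sum(x * x for x in v.values()))
      return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

  # Hypothetical content words (as extracted by Juman) and df counts
  human_words = ["首相", "訪問", "会談", "訪問"]
  system_words = ["首相", "会談", "発表"]
  df = {"首相": 5000, "訪問": 8000, "会談": 6000, "発表": 9000}
  N = 100000  # hypothetical number of newspaper articles used for df

  sim = cosine(tfidf_vector(human_words, df, N),
               tfidf_vector(system_words, df, N))
  print(f"cosine similarity = {sim:.3f}")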
We have two kinds of human-produced summaries for Subtask A-2.
- Freely summarized texts
- Summaries produced by selecting important parts of the sentences in
the text
Both kinds were used for the Formal run.
Subtask A-2 (subjective evaluation)
The following four kinds of summaries as well as the original texts are
prepared.
- Summaries produced by selecting important parts of the sentences in the
text
- Freely summarized texts
- Summaries produced by a system
- Summaries produced by a tf-based method
First, the evaluator (one person) reads the original text and its summaries
(four kinds). Then he or she evaluates and scores them in terms of how
readable they are and how well the content of the original text is conveyed
by the summary. Each score is 1, 2, 3, or 4, where 1 is the best and 4 is
the worst; that is, the lower the score, the better the evaluation.
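As a purely illustrative sketch of how such scores could be summarized, the
snippet below averages the 1-to-4 scores per summary kind over a set of
texts. The document does not specify the aggregation actually used, and all
names and numbers here are hypothetical.

  from statistics import mean

  # Hypothetical 1-4 scores given by the evaluator, one per original text
  readability = {
      "part-selection summary": [1, 2, 1, 2],
      "free summary":           [1, 1, 2, 1],
      "system summary":         [3, 2, 4, 3],
      "tf-based summary":       [2, 3, 3, 4],
  }
  for kind, scores in readability.items():
      # Lower is better: 1 = best, 4 = worst
      print(f"{kind}: mean score = {mean(scores):.2f}")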
Subtask B
- First, we provide the subjects (36 students) with the queries and the
  texts retrieved for each query.
- The subjects judge whether each text is relevant to the query by reading
  its summary.
- Evaluation measures:
- Experiment data
  - number of topics: 12
  - 50 texts per topic (600 texts in total)
  - The 36 subjects are divided into 12 groups of three.
  - A subject evaluates a given topic only once.
  - A subject evaluates a given system only once.
  - The same combination of topic and system is evaluated by only one
    group.
  - The combinations are arranged to be as fair as possible with respect to
    the order of systems and topics (see the sketch after this list).
  - The texts are presented to the subjects in random order.
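The constraints above (each group meets each topic once and each system
once, and each topic-system combination is judged by exactly one group) are
those of a Latin square. The number of systems is not stated here, so the
sketch below hypothetically assumes it equals the number of groups and
topics (12); a cyclic Latin square then satisfies all three constraints.

  def latin_square(n):
      """Entry [g][t] is the system that group g evaluates on topic t.
      Each row and each column contains every system exactly once."""
      return [[(g + t) % n for t in range(n)] for g in range(n)]

  assignment = latin_square(12)
  # Every (topic, system) combination is covered by exactly one group:
  pairs = {(t, row[t]) for row in assignment for t in range(12)}
  assert len(pairs) == 12 * 12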
For each topic, the relevance judgments have three levels (A, B, C).
We produced two kinds of evaluation results: one in which only level A is
regarded as relevant, and one in which levels A and B are regarded as
relevant (see the sketch below).
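The following sketch shows the two binarization settings, together with the
averaging over the three subjects of a group mentioned at the end of this
section. Since the evaluation measures are not restated here, per-subject
agreement with the binarized gold judgment is used as a purely illustrative
stand-in, and all data are hypothetical.

  GOLD = {"text01": "A", "text02": "B", "text03": "C"}  # gold levels (hypothetical)

  def is_relevant(level, include_b):
      # Binarize a three-level judgment: level A only, or levels A and B
      return level == "A" or (include_b and level == "B")

  # Relevant/irrelevant judgments by the three subjects of one group
  subjects = [
      {"text01": True, "text02": True,  "text03": False},
      {"text01": True, "text02": False, "text03": False},
      {"text01": True, "text02": True,  "text03": True},
  ]

  def agreement(judged, include_b):
      # Fraction of texts where the subject matches the binarized gold level
      hits = sum(judged[t] == is_relevant(g, include_b) for t, g in GOLD.items())
      return hits / len(GOLD)

  for include_b, label in ((False, "level A only"), (True, "levels A and B")):
      scores = [agreement(j, include_b) for j in subjects]
      print(f"{label}: {sum(scores) / len(scores):.3f}")  # average of three subjects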
We created new evaluation data from the Mainichi newspaper CD-ROM (1998
version) and used it for the experiment.
The topics of the evaluation data were selected from the IREX IR Text
Collection (http://www.csl.sony.co.jp/person/sekine/IREX).
Evaluation results in all cases are averages over the three subjects.
Co-chairs of the Text Summarization Task
Manabu Okumura : oku@pi.titech.ac.jp
Takahiro Fukusima : fukusima@res.otemon.ac.jp
Complaints and advice to tsc-admin@recall.jaist.ac.jp