TSC3 Task Definition

Task: Multiple Document Summarization
Participants produce two kinds of summaries from sets of documents
considered relevant to given queries.

This task is essentially a continuation of TSC2 Task B.

What is given to the participants:
- document sets
- the queries that were used to produce the document sets
- a set of questions about important information in the document sets
- summary lengths (two kinds)

What the participants submit:
- two kinds of summaries
Note: Participants may use any information they like, whether provided
by the organizers or obtained elsewhere; we will ask them later what
information they used.

Subtask (optional): Sentence extraction from the document sets
Participants select sentences relevant to the queries from the sets of
query-relevant documents and remove redundant information from them.
(cf. the TREC Novelty Track's relevant/new sentence extraction)

What is given to the participants:
- in addition to the above, the number of sentences to be extracted
  (two kinds)

What the participants submit:
- the results of the evaluation described below under Evaluation 1

We think a multiple-document summarization system needs
at least the following:
1. a technique for extracting important sentences
2. a technique for measuring the closeness (or redundancy) of the
   extracted sentences
3. a technique for shortening sentences after the redundant
   information has been deleted

The subtask aims at evaluating techniques 1 and 2.
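
As an illustration of techniques 1 and 2, the following Python sketch
ranks sentences by similarity to the query and skips sentences that are
too close to ones already selected.  It assumes a simple TF-IDF /
cosine-similarity representation (using scikit-learn); it is not the
baseline system, and the function and parameter names are ours.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def extract_sentences(query, sentences, n_sentences, redundancy_threshold=0.5):
    # Build TF-IDF vectors for the query and all candidate sentences.
    matrix = TfidfVectorizer().fit_transform([query] + sentences)
    query_vec, sent_vecs = matrix[0], matrix[1:]

    # Technique 1: rank sentences by their similarity to the query.
    relevance = cosine_similarity(query_vec, sent_vecs)[0]
    ranked = relevance.argsort()[::-1]

    # Technique 2: skip sentences that are too similar (redundant)
    # with respect to the sentences already selected.
    selected = []
    for i in ranked:
        if len(selected) >= n_sentences:
            break
        if all(cosine_similarity(sent_vecs[i], sent_vecs[j])[0, 0]
               < redundancy_threshold for j in selected):
            selected.append(i)

    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(selected)]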

Evaluation

1. Recall, Precision, and F-measure
This applies to systems that produce extracts when making summaries.
We will provide the scoring tool and the human-produced extracts
(this is the evaluation of the subtask).
Participants should submit their evaluation results by the specified
date.  We gather the participants' evaluation results together with
the results of a baseline system.

We use two kinds of recall and precision measures: ordinary recall and
precision scores, and recall and precision scores that take redundancy
into account.
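
As a rough sketch of these two kinds of scores, the Python below
computes ordinary recall/precision/F over extracted sentence IDs and a
redundancy-aware variant in which a group of mutually redundant gold
sentences is credited at most once.  The grouping-based penalty is our
assumption; the official scoring tool may define the redundancy-aware
scores differently.

def prf(n_correct, n_system, n_gold):
    # Standard recall, precision, and F-measure from raw counts.
    recall = n_correct / n_gold if n_gold else 0.0
    precision = n_correct / n_system if n_system else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f


def ordinary_scores(system_ids, gold_ids):
    # Ordinary scores: count system sentences that appear in the gold extract.
    correct = len(set(system_ids) & set(gold_ids))
    return prf(correct, len(set(system_ids)), len(set(gold_ids)))


def redundancy_aware_scores(system_ids, gold_groups):
    # Redundancy-aware scores: gold_groups lists groups of mutually
    # redundant gold sentence IDs.  Each group is credited at most once,
    # so extracting several redundant sentences lowers precision and
    # does not raise recall.
    system = set(system_ids)
    credited = sum(1 for group in gold_groups if system & set(group))
    return prf(credited, len(system), len(gold_groups))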

2. Intrinsic evaluation

a. Content evaluation
   Human judges match the summaries they produced against the system
   results at the sentence level and evaluate the results by how well
   they match.  The sentences in the human-produced summaries carry
   values indicating their degree of importance, and these values are
   taken into account in the final evaluation.
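
   A minimal sketch of how such importance values might enter a content
   score is given below, assuming each human summary sentence carries a
   numeric importance value and that judges record a 0-1 degree of
   match per sentence; the actual judging protocol and aggregation
   formula are the organizers' decision.

def content_score(human_sentences, match_degree):
    # human_sentences: list of (sentence_id, importance_value) pairs
    #                  from the human-produced summary.
    # match_degree:    mapping sentence_id -> judged degree of match with
    #                  the system summary (0.0 = no match, 1.0 = full match).
    total = sum(weight for _, weight in human_sentences)
    if total == 0:
        return 0.0
    covered = sum(weight * match_degree.get(sid, 0.0)
                  for sid, weight in human_sentences)
    return covered / total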

b. Readability evaluation
   Subjective evaluation based on quality questions.
   (Cf. DUC 2002)

If budget allows,

c. Evaluation by editing (a continuation of the TSC2 editing evaluation)

3. Extrinsic evaluation
Participants measure how well the system summaries cover the passages
that answer the given sets of questions.

We will make available the human-selected passages that answer the sets
of questions and will provide the scorer.  Participants submit their
evaluation results by the specified date.  We gather the participants'
evaluation results together with the results of a baseline system.

(If possible, we would like to use the scorer from the SUMMAC Q&A task.)
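
As an illustration of evaluation 3, the sketch below counts a question
as answered when at least one of its human-selected answer passages
occurs (after whitespace normalization) in the system summary.  The
matching criterion and scoring formula of the actual scorer may differ.

def answer_coverage(summary, answer_passages_per_question):
    # answer_passages_per_question: one list of human-selected answer
    # passages per question in the question set.
    def normalize(text):
        return " ".join(text.split())

    if not answer_passages_per_question:
        return 0.0
    summary_norm = normalize(summary)
    answered = sum(
        1
        for passages in answer_passages_per_question
        if any(normalize(p) in summary_norm for p in passages)
    )
    # Fraction of questions with at least one answer passage covered.
    return answered / len(answer_passages_per_question)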

Schedule (tentative)

Note: there will be no dry run this time.

Formal run
  start of the run   : 2003.11.17
  system results due : 2003.11.24
  evaluation results returned to participants : 2004.2.1
  participants' evaluation results (evaluations 1 and 3) due : 2004.2.1

Workshop
  workshop paper due : 2004.3.19
  NTCIR-4 Workshop: 2004.5

TSC3 features:
- Evaluation of core techniques for multiple-document summarization
  (the optional subtask)
- Documents come from multiple genres (newspaper articles from two
  sources; newspaper articles and web pages)
- Automatic evaluation
  We adopt automatic (offline) evaluation methods for evaluations 1 and 3.