TSC3 Task Definition

Task: Multiple Document Summarization

The participants produce two kinds of summaries from sets of documents that are considered relevant to queries. This task is basically a continuation of TSC2 Task B.

What is given to the participants:
- document sets
- the queries that were used to produce the document sets
- a set of questions about important information in the document sets
- summary lengths (two kinds)

What the participants submit:
- two kinds of summaries

Note: It is up to the participants what information they use (including what is given to them or other information). We will ask them later what they used.

Subtask (optional task): Sentence extraction from the document sets

The participants select sentences relevant to the queries from the sets of query-relevant documents and delete redundant information from them (cf. relevant or new sentence extraction in the TREC Novelty Track).

What is given to the participants:
- in addition to the above, the number of sentences to be extracted (two kinds)

What the participants submit:
- the results of the evaluation described below under evaluation 1

We think a multiple-document summarization system needs at least the following:
1. a technique for important sentence extraction
2. a technique to measure the degree of closeness (or redundancy) of the extracted sentences
3. a technique for shortening the sentences after deleting the redundant information

The subtask aims at evaluating 1 and 2 (an illustrative extraction sketch is given after the Evaluation section).

Evaluation

1. Recall, precision, and F-measure for systems that produce extracts when making summaries. We will provide the scoring tool and the human-produced extracts (this is the evaluation of the subtask). The participants should submit the evaluation results by the specified date, and we will gather them together with the results of a baseline system. We use two kinds of recall and precision measures: normal recall and precision scores, and recall and precision scores that take redundancy into account (a sketch of such measures is given after this section).

2. Intrinsic evaluation
a. Content evaluation: Human judges match the summaries they produced against the system results at the sentence level and evaluate the results by how well they match. The sentences in the human-produced summaries carry values indicating their degree of importance, and these values are taken into account in the final evaluation.
b. Readability evaluation: Subjective evaluation based on quality questions (cf. DUC 2002).
If the budget allows,
c. Evaluation by editing (a continuation of the TSC2 editing evaluation).

3. Extrinsic evaluation: The participants measure how much of the passages that answer the sets of questions is included in the system summaries. We will make the human-selected answer passages available and provide the scorer. The participants submit the evaluation results by the specified date, and we gather them together with the results of a baseline system. (If possible, we would like to use the scorer of the SUMMAC Q&A task.)
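For illustration only, the following Python sketch shows one way techniques 1 and 2 might be combined in a greedy extractor: sentences are ranked by a crude query-relevance score, and a candidate is skipped when it is too similar to an already selected sentence. The bag-of-words cosine similarity, the redundancy threshold, and all function names are assumptions made for the sketch; they are not part of the task definition or of any tool we provide.

# Illustrative only: a minimal greedy extractor combining
# (1) query-relevance scoring and (2) a redundancy check.
# The bag-of-words overlap measures below are placeholders.

from collections import Counter
import math


def bow(text):
    """Very rough bag-of-words representation (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def extract(sentences, query, n, redundancy_threshold=0.7):
    """Pick up to n sentences relevant to the query, skipping
    sentences too similar to those already selected."""
    q = bow(query)
    ranked = sorted(sentences, key=lambda s: cosine(bow(s), q), reverse=True)
    selected = []
    for sent in ranked:
        if len(selected) >= n:
            break
        if all(cosine(bow(sent), bow(p)) < redundancy_threshold for p in selected):
            selected.append(sent)
    return selected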
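The exact definitions of the normal and redundancy-aware scores are fixed by the scoring tool we distribute. As a rough, hypothetical sketch only, the measures could look like the following: the redundancy-aware variant credits each human-extract sentence at most once, so duplicate extractions lower precision. The ID-based matching is a simplification assumed for the sketch.

# Hypothetical sketch of the subtask scoring (evaluation 1).
# The official TSC3 scoring tool defines the real measures; the
# redundancy-aware variant below (each human-extract sentence is
# credited at most once) is an assumption for illustration.

def prf(system_ids, human_ids):
    """Plain recall / precision / F-measure over extracted sentence IDs."""
    sys_set, hum_set = set(system_ids), set(human_ids)
    hits = len(sys_set & hum_set)
    recall = hits / len(hum_set) if hum_set else 0.0
    precision = hits / len(sys_set) if sys_set else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f


def prf_redundancy_aware(system_ids, human_ids):
    """Variant that penalizes redundancy: extracting the same human
    sentence more than once earns credit only the first time."""
    hum_set = set(human_ids)
    credited = set()
    hits = 0
    for sid in system_ids:  # system output kept as a list, duplicates included
        if sid in hum_set and sid not in credited:
            credited.add(sid)
            hits += 1
    recall = hits / len(hum_set) if hum_set else 0.0
    precision = hits / len(system_ids) if system_ids else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f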
Schedule (tentative)

Note: we have no dry runs this time.

Formal run
start of the run: 2003.11.17
system results due: 2003.11.24
evaluation results returned: 2004.2.1
evaluation results (evaluations 1 and 3) due: 2004.2.1

Workshop
workshop paper due: 2004.3.19
NTCIR-4 Workshop: 2004.5

TSC3 features:
- Evaluation of core techniques for multiple-document summarization (the optional subtask)
- Documents from multiple genres (newspaper articles from two sources; newspaper articles and web pages)
- Automatic evaluation: we adopt automatic (offline) evaluation methods in evaluations 1 and 3 (a coverage-scoring sketch for evaluation 3 follows below).
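As a rough illustration of the automatic extrinsic evaluation (evaluation 3), the sketch below scores a summary by the fraction of human-selected answer passages whose tokens largely appear in it. The token-overlap criterion and the 0.8 threshold are assumptions made for the sketch; the actual scorer (possibly the SUMMAC Q&A scorer) defines the real measure.

# Rough sketch of an answer-passage coverage score for evaluation 3.
# The real scorer is supplied by the organizers; the matching rule
# below (a passage counts as covered when most of its tokens occur
# in the summary) is only an assumption.

def passage_covered(passage, summary, token_overlap=0.8):
    """A passage counts as covered if at least `token_overlap` of its
    tokens occur in the summary (a deliberately crude criterion)."""
    p_tokens = passage.lower().split()
    s_tokens = set(summary.lower().split())
    if not p_tokens:
        return False
    return sum(t in s_tokens for t in p_tokens) / len(p_tokens) >= token_overlap


def coverage(summary, answer_passages):
    """Fraction of the human-selected answer passages covered by the summary."""
    if not answer_passages:
        return 0.0
    covered = sum(passage_covered(p, summary) for p in answer_passages)
    return covered / len(answer_passages)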