TOOLS for TSC
Last updated on August 29, 2000.
NOTE:
We use newspaper data (the Mainichi newspaper database) from 1994, 1995, and 1998.
If you have not obtained the data yet, please contact the Mainichi and
purchase a license to use the database yourself. (If you have any
questions about how to obtain the data, please contact the chairs.)
Only the 1994 and 1995 data are used for the dry run.
The following tools are available.
- mai2sgml.pl by IREX
(A tool that converts the Mainichi newspaper data into SGML form)
Participants should use mai2sgml.pl provided by IREX and use the
data as transformed for the IREX IR task. Information added to the
data by the transformation may be used in any way by the
participants. However, information such as keywords that exists
only in the original data should not be used.
[For details, please see
section 3 in the task description]
- usage:
- nkf -e /cdrom/mai94.txt | jperl mai2sgml.pl > mai94.sgml
- nkf -e /cdrom/mai95.txt | jperl mai2sgml.pl > mai95.sgml
- nkf -e /cdrom/mai98.txt | jperl mai2sgml.pl > mai98.sgml
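The three commands above can be wrapped in a loop. The following is only a sketch: the function name convert_mainichi and its two arguments are hypothetical, and it assumes that nkf and jperl are on the PATH and that mai2sgml.pl is in the current directory, as in the usage lines.

```shell
# Hypothetical wrapper (not part of the IREX or TSC tools).
# $1: directory holding mai94.txt, mai95.txt, mai98.txt (default: /cdrom)
# $2: directory for the .sgml output (default: current directory)
convert_mainichi() {
    src=${1:-/cdrom}
    out=${2:-.}
    for yy in 94 95 98; do
        # Same pipeline as the usage lines: convert to EUC, then to SGML.
        nkf -e "$src/mai$yy.txt" | jperl mai2sgml.pl > "$out/mai$yy.sgml" || return 1
    done
}
```

Calling convert_mainichi with no arguments reproduces the three usage lines exactly.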
- tscsgml.pl by TSC
(A tool that attaches <SENTENCE> and <PARAGRAPH> tags to the
output of mai2sgml.pl (mai[94|95|98].sgml).)
Participants in A-1 (extraction of important sentences) use
the output of tscsgml.pl (which in turn takes the output of
mai2sgml.pl as input). Participants in the other subtasks may
decide for themselves whether to use tscsgml.pl.
[For details, please see
section 3 in the task description]
There are three ways to use this program.
- tscsgml.pl < mai94.sgml > mai94.tsc
- All articles in the data are extended to TSC format.
- tscsgml.pl -L DOCIDS_FILE < mai94.sgml
- Only the specified articles listed in DOCIDS_FILE
are extended to TSC format, and each article is written to its
own file automatically.
- tscsgml.pl -D DOCIDS_FILE < mai94.sgml > DOCIDS_FILE.94.tsc
- Only the specified articles listed in DOCIDS_FILE
are extended to TSC format and written to standard output.
DOCIDS_FILEs, which list the IDs of the articles to be summarized,
are distributed by TSC to the participants in A-1.
Please use those files with the -L or -D option.
- ex.
- tscsgml.pl -L docids_file.94 < mai94.sgml
- tscsgml.pl -L docids_file.95 < mai95.sgml
- A DOCNO.tsc file is created for each article listed in
docids_file.[94|95].
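After a run with -L, it can be useful to confirm that every listed article produced a file. The sketch below assumes that each output file is named exactly <number>.tsc, following the DOCNO.tsc naming mentioned above; the function name missing_tsc is hypothetical.

```shell
# Hypothetical check (not part of the TSC tools): print every DOCNO from
# a DOCIDS_FILE that has no matching <number>.tsc file.
# $1: DOCIDS_FILE
# $2: directory to look in (default: current directory)
missing_tsc() {
    dir=${2:-.}
    # Pull the number out of each <DOCNO>number</DOCNO> line.
    sed -n 's/^[[:space:]]*<DOCNO>\([0-9][0-9]*\)<\/DOCNO>.*$/\1/p' "$1" |
    while read -r id; do
        [ -f "$dir/$id.tsc" ] || echo "$id"
    done
}
```

For example, `missing_tsc docids_file.94` prints nothing when all files are present.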
- Format of DOCIDS_FILE
==BNF==
file := doc-id*
doc-id := <DOCNO>number</DOCNO>
Example:
<DOCNO>94070805</DOCNO>
<DOCNO>94080104</DOCNO>
<DOCNO>94090203</DOCNO>
...
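A DOCIDS_FILE can be checked against the grammar above before it is passed to tscsgml.pl. This is a minimal sketch, assuming one doc-id per line as in the example and tolerating blank lines; the function name check_docids is hypothetical.

```shell
# Hypothetical helper (not part of the TSC tools): succeeds only if every
# non-empty line of the file has the form <DOCNO>number</DOCNO>.
# $1: DOCIDS_FILE
check_docids() {
    [ -r "$1" ] || return 2
    # grep -v selects lines that do NOT match the pattern; if any exist,
    # they are printed to standard error and the check fails.
    if grep -v -E '^[[:space:]]*(<DOCNO>[0-9]+</DOCNO>)?[[:space:]]*$' "$1" >&2; then
        return 1
    fi
    return 0
}
```

A typical use would be `check_docids docids_file.94 && tscsgml.pl -L docids_file.94 < mai94.sgml`.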
Requirement(s)
- nkf ; Network Kanji Filter
- jperl5
Send complaints and advice to
tsc-request@recall.jaist.ac.jp