Programs/Corpus for Thai Linguistic Development
Segmentation
-
Concordancing program for word frequency text analysis
[ME, NT/2000, XP and Vista](Pay)
Also supports most European languages, Chinese, Russian and Japanese
-
SWATH (Smart Word Analysis for THai)
[link, SWATH for Windows, SWATH for Linux]
Description: SWATH (Smart Word Analysis for THai) is a word segmentation program for Thai. SWATH offers three algorithms: Longest Matching, Maximal Matching and Part-of-Speech Bigram. The program supports various input file formats, such as HTML, RTF and LaTeX, as well as plain text.
Published since: 1999
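Illustration: the sketch below shows one common reading of Maximal Matching, namely choosing the segmentation that uses the fewest dictionary words, found here by dynamic programming. It is not SWATH's own code. The function name, the toy dictionary and the fallback of treating an unknown character as a one-character token are all assumptions made for this example.

```python
def maximal_matching(text, dictionary):
    """Segment text into the fewest tokens (illustrative sketch, not SWATH's code)."""
    n = len(text)
    max_len = max(map(len, dictionary))
    best = [float("inf")] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [0] * (n + 1)              # back[i] = start index of the last token
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Assumption: an unknown single character counts as one token.
            if piece in dictionary or end - start == 1:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"ไป", "หา", "หาม", "มเหสี", "เห", "สี"}
result = maximal_matching("ไปหามเหสี", toy_dictionary)
print(result)  # ['ไป', 'หา', 'มเหสี']
```

With this toy dictionary, a greedy left-to-right scan would pick หาม after ไป and end up with four tokens, whereas the fewest-words criterion yields three.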
-
LongLexTo
[download]
Description: Tokenizes Thai text using the longest-matching approach.
Published since: 2006
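Illustration: a minimal sketch of greedy longest matching, assuming a plain set of dictionary words. It is not LongLexTo's code: the function name and the toy dictionary are hypothetical, there is no backtracking, and an unknown character is simply emitted as a one-character token.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right tokenizer (illustrative sketch, not LongLexTo's code)."""
    max_len = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # Assumption: no dictionary word starts here, so emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"ไป", "หา", "หาม", "มเหสี", "เห", "สี"}
greedy_result = longest_matching("ไปหามเหสี", toy_dictionary)
print(greedy_result)  # ['ไป', 'หาม', 'เห', 'สี']
```

The example input is deliberately ambiguous: taking the longest word at each position (หาม) blocks the reading ไป / หา / มเหสี, which is the usual motivation for adding backtracking or a global criterion on top of this approach.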
Corpus
-
Thai Part-of-Speech
[link]
Description: Definitions of the Thai part-of-speech tagsets, with examples.
Published since: 1997
-
Orchid Corpus
[link]
Description: The Part-of-Speech (POS) Tagged Corpus, Orchid, aims to build a Thai text corpus with syntactic word-class annotation. Although there is no consensus on many issues in Thai syntax (such as word or sentence construction and word or sentence classification), the authors proposed an initial standard for constructing Orchid. The word classification and the word and sentence breaking used in Orchid have been verified to some extent in a machine translation system. They are not yet close to a complete account of Thai syntax, but are expected to be verified together with the corpus and improved through thorough use on general text. The corpus is available in TIS-620 and UTF-8 formats.
Published since: 1997
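Illustration: if only the TIS-620 version of a corpus file is at hand, it can be re-encoded to UTF-8 with Python's built-in `tis_620` codec. This is a minimal sketch; the function name is hypothetical and nothing here is specific to Orchid's file layout.

```python
def tis620_to_utf8(data: bytes) -> bytes:
    """Re-encode TIS-620 bytes as UTF-8 (illustrative sketch)."""
    # Decode the legacy single-byte Thai encoding, then encode as UTF-8.
    return data.decode("tis_620").encode("utf-8")


# In TIS-620, byte 0xA1 is the Thai letter ko kai (U+0E01).
sample = tis620_to_utf8(b"\xa1")
print(sample.decode("utf-8"))  # ก
```

For a whole file, the same conversion can be done by opening the input with `encoding="tis_620"` and writing the output with `encoding="utf-8"`.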
-
BEST
[link, local]
Description: Data from the Thai Word Segmentation Software Contest.
Published since: 2009
Dictionary