TR95-10
Approximating Annotated Corpora with Finite-State Transductions: A Case Study in Part of Speech Tagging
- "Approximating Annotated Corpora with Finite-State Transductions: A Case Study in Part of Speech Tagging", Tech. Rep. TR95-10, Mitsubishi Electric Research Laboratories, Cambridge, MA, January 1995.
@techreport{MERL_TR95-10,
  author = {Emmanuel Roche and Yves Schabes},
  title = {Approximating Annotated Corpora with Finite-State Transductions: A Case Study in Part of Speech Tagging},
  institution = {MERL - Mitsubishi Electric Research Laboratories},
  address = {Cambridge, MA 02139},
  number = {TR95-10},
  month = jan,
  year = 1995,
  url = {https://www.merl.com/publications/TR95-10/}
}
Abstract:
There is a natural correspondence between annotated corpora and functions: a corpus can be seen as a collection of points and their images under a function that maps the raw input to the annotated output. We illustrate this point by considering a corpus of sentences annotated with their parts of speech. We then show that the construction of a part-of-speech disambiguator from a training corpus is equivalent to approximating the function corresponding to the corpus. A good approximation can be computed within the space of finite-state functions. The inferred function is capable of generalizing the disambiguation process to unknown text with state-of-the-art accuracy. Moreover, the resulting function lends itself to a linear-time implementation. In a companion paper, the method has also been successfully applied to letter-to-sound conversion.
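To make the "corpus as a function" view concrete, here is a minimal Python sketch. It is not the method of the report: it assumes a tiny invented corpus and approximates the tagging function with the simplest possible context-free mapping (each word gets its most frequent tag), which can be thought of as a trivial one-state transducer. The names `CORPUS`, `train_lexicon`, `tag`, and `accuracy` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Toy annotated corpus (invented for illustration). Each sentence is a list of
# (word, tag) pairs: the words are a point, the tags are its image under the
# tagging function that the corpus samples.
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("peels", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]


def train_lexicon(corpus):
    """Approximate the corpus function by a word -> most frequent tag map."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, lexicon, default="NOUN"):
    """Apply the approximated function; one dictionary lookup per word."""
    return [lexicon.get(word, default) for word in words]


def accuracy(corpus, lexicon):
    """Fraction of corpus tokens on which the approximation matches the corpus."""
    correct = total = 0
    for sentence in corpus:
        words = [word for word, _ in sentence]
        gold = [tag_label for _, tag_label in sentence]
        predicted = tag(words, lexicon)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total


lexicon = train_lexicon(CORPUS)
print(tag(["the", "dog", "bark"], lexicon))
print(accuracy(CORPUS, lexicon))
```

On this toy data the approximation mislabels "bark" in "the bark peels", because it ignores context. That gap is what a richer function space, such as the finite-state functions discussed in the abstract, is meant to close while keeping tagging time linear in the sentence length.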