TR95-10

Approximating Annotated Corpora with Finite-State Transductions: A Case Study in Part of Speech Tagging


    •  Emmanuel Roche, Yves Schabes, "Approximating Annotated Corpora with Finite-State Transductions: A Case Study in Part of Speech Tagging", Tech. Rep. TR95-10, Mitsubishi Electric Research Laboratories, Cambridge, MA, January 1995.
      @techreport{MERL_TR95-10,
        author = {Emmanuel Roche and Yves Schabes},
        title = {Approximating Annotated Corpora with Finite-State Transductions: A Case Study in Part of Speech Tagging},
        institution = {MERL - Mitsubishi Electric Research Laboratories},
        address = {Cambridge, MA 02139},
        number = {TR95-10},
        month = jan,
        year = 1995,
        url = {https://www.merl.com/publications/TR95-10/}
      }
Abstract:

There is a natural correspondence between annotated corpora and functions: a corpus can be seen as a collection of points and their images under a function that maps the raw input to the annotated output. We illustrate this point by considering a corpus of sentences annotated with parts of speech. We then show that constructing a part-of-speech disambiguator from a training corpus is equivalent to approximating the function corresponding to the corpus. A good approximation can be computed within the space of finite-state functions. The inferred function generalizes the disambiguation process to unseen text with state-of-the-art accuracy. Moreover, the resulting function admits a linear-time implementation. In a companion paper, the method has also been successfully applied to letter-to-sound conversion.
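
As a rough illustration of the corpus-as-function view, the sketch below treats a tiny annotated corpus as sample points of a tagging function and approximates it with a per-word lookup table. This is not the method of the report: a lookup table is only the simplest kind of finite-state function (in effect a one-state transducer). The corpus, the tag names, the helper name `learn_lexical_tagger`, and the default tag are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy annotated corpus: each sentence is a list of (word, tag) pairs.
# The sentences and tag names are invented for illustration only.
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]


def learn_lexical_tagger(corpus, default_tag="NOUN"):
    """Approximate the corpus function with a most-frequent-tag table.

    Hypothetical helper: it ignores context entirely, so it is a much
    cruder approximation than a general finite-state transduction.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1

    # Keep the most frequent tag observed for each word.
    table = {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def tag(words):
        # One table lookup per word, so time is linear in sentence length.
        # Words not seen in training fall back to the assumed default tag.
        return [table.get(word, default_tag) for word in words]

    return tag


tagger = learn_lexical_tagger(CORPUS)
print(tagger(["the", "cat", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

The sentence "the cat barks" does not occur in the toy corpus, yet the learned function still tags it, which is the sense in which an approximation generalizes beyond the sample points.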