Building a Large Annotated Corpus of English: The Penn Treebank

Loading...
Thumbnail Image

Embargo Date

Related Collections

Degree type

Discipline

Subject

Funder

Grant number

License

Copyright date

Distributor

Related resources

Author

Santorini, Beatrice
Marcinkiewicz, Mary Ann

Contributor

Abstract

In this paper, we review our experience with constructing one such large annotated corpus--the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989-1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure.

Advisor

Date Range for Data Collection (Start Date)

Date Range for Data Collection (End Date)

Digital Object Identifier

Series name and number

Publication date

1993-10-01

Volume number

Issue number

Publisher

Publisher DOI

Journal Issues

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-93-87.

Recommended citation

Collection