Introduction to the American National Corpus (ANC)
Author: admin  Source: original to this site  Updated: 2011-11-16  Entered by: admin  Editor: admin




ANC = The American National Corpus

http://www.anc.org/ 

 

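The American National Corpus (ANC) is a large corpus documenting contemporary American English usage. It comprises written texts of many kinds and transcripts of spoken material, all produced from 1990 onward. Two releases of the ANC have been published so far: the first contains over 10 million words of spoken and written American English, and the second contains over 22 million words.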

The First Release of the ANC

The First Release of the ANC is a beta version. It contains over 10,000,000 words of written and spoken American English, annotated for lemma and part of speech. It is available for research and education for a nominal licensing fee from the Linguistic Data Consortium. Commercial users can obtain the corpus and gain rights to use it in commercial products by joining the ANC Consortium.

The texts included in the first 10 million words of the ANC are those that were first received. Therefore the corpus is not balanced. There has been no hand-validation of the XML tagging or the part-of-speech annotation. Headers are minimal, although they contain fairly complete information concerning domain, subdomain, subject, audience, and medium. Check the list of known bugs and caveats for a description of the limitations we are currently aware of.

One of the aims of releasing this first 10 million words is to get feedback from the community about its structure and annotation, so that modifications can be made, if necessary, for the final release of the full 100 million words. We therefore invite comments and bug reports from the community of ANC users. Please contact anc@cs.vassar.edu.

The Second Release of the ANC

The Second Release of the American National Corpus contains over 22,000,000 words of written and spoken American English, annotated for lemma, part of speech, noun chunks, and verb chunks. Part-of-speech tags using the Penn tagset are included for all data in the Second Release, and many documents are also PoS-tagged using the Biber tagset.

The ANC Second Release is available for research and education for a nominal licensing fee from the Linguistic Data Consortium. Commercial users can obtain the corpus and gain rights to use it in commercial products by joining the ANC Consortium. Please consult the LDC Catalog entry for the ANC Second Release.

The First and Second Releases of the ANC include materials which have been acquired to date, and therefore the current release of the ANC is not balanced. There has been no hand-validation of the XML tagging or the annotation. Headers are typically minimal, although most contain complete information concerning domain, subdomain, subject, audience, and medium. Check the list of known bugs and caveats for a description of the limitations we are currently aware of.

One of the aims of the Second Release is to get feedback from the community about its structure and annotation, so that modifications can be made, if necessary, for the final release of the full 100 million words. We therefore invite comments and bug reports from the community of ANC users. Please contact anc@cs.vassar.edu.

ANC address:

http://www.anc.org/


 
