Changes

12 bytes removed , 19:45, 29 April 2008

no edit summary

Line 1: Line 1:

[[Image:lightwave.jpg]]

−

[[Image:Copora2.jpg|right|]]

+

[[Image:Copora2.jpg|right|frame]]

−

~~== Text Corpus ==~~

In [[linguistics]], a ''corpus'' (plural '''corpora''') or ''text corpus'' is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific universe of texts.

Line 7: Line 6:

To make corpora more useful for linguistic research, they are often subjected to a process known as [[annotation]]. One example of annotating a corpus is [[part-of-speech]] tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the [[lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
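As a rough illustration of annotation, the sketch below attaches a part-of-speech tag and a lemma to each word of a tiny text. The lexicon, the tag names (<code>DET</code>, <code>NOUN</code>, ...), and the function names are assumptions made up for this example. Real taggers are usually statistical and use far richer tag sets.

```python
# Toy lexicon: word form -> (POS tag, lemma).
# Hypothetical entries for illustration only, not a real tag set or dictionary.
LEXICON = {
    "the": ("DET", "the"),
    "cats": ("NOUN", "cat"),
    "sat": ("VERB", "sit"),
    "on": ("ADP", "on"),
    "mat": ("NOUN", "mat"),
}


def annotate(text):
    """Return (token, pos, lemma) triples for a whitespace-tokenized text."""
    annotated = []
    for token in text.lower().split():
        # Unknown words get the placeholder tag "UNK" and are their own lemma.
        pos, lemma = LEXICON.get(token, ("UNK", token))
        annotated.append((token, pos, lemma))
    return annotated


def to_tagged_string(annotated):
    """Render annotations in the common word/TAG inline format."""
    return " ".join(f"{token}/{pos}" for token, pos, _ in annotated)


if __name__ == "__main__":
    print(to_tagged_string(annotate("The cats sat on the mat")))
```

Storing the lemma next to the tag lets a later search for ''sit'' also find ''sat''.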

+

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in [[computational linguistics]], [[speech recognition]] and [[machine translation]], where they are often used to create hidden [[Markov]] models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
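A frequency list of the kind mentioned above can be derived from a corpus with very little code. The following is a minimal sketch, assuming the corpus is a list of plain-text strings and that a simple regular expression is an acceptable tokenizer. The function name <code>frequency_list</code> is hypothetical.

```python
import re
from collections import Counter


def frequency_list(texts, top_n=None):
    """Count word forms across all texts and return (word, count) pairs,
    most frequent first. Tokenization here is a deliberately crude
    assumption: lowercase runs of letters and apostrophes."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


if __name__ == "__main__":
    corpus = ["The cat sat.", "The dog sat on the mat."]
    for word, count in frequency_list(corpus, top_n=2):
        print(word, count)
```

For language teaching, such a list could be used to order vocabulary by how often learners are likely to meet each word.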

Rdavis


Corpora (view source)

Revision as of 19:45, 29 April 2008
