Changes

From Nordan Symposia
2,725 bytes added, 22:31, 20 November 2009
no edit summary
Line 1: Line 1: −
[[Image:lighterstill.jpg]]
+
[[Image:lighterstill.jpg]] [[Image:Copora2.jpg|right|frame]]
 +
 
 +
In [[linguistics]], a ''corpus'' (plural '''corpora''') or [[text]] corpus is a large and [[structure]]d set of texts (now usually electronically stored and processed). Corpora are used for [[statistical]] [[analysis]], for checking occurrences, and for validating linguistic rules within a specific [[universe]].
 +
 
 +
A corpus may contain texts in a single [[language]] (monolingual corpus) or text [[data]] in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
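As a minimal sketch of what "aligned" means here, the Python below pairs sentences from two languages by position. The function name `align` and the sample sentences are hypothetical, and the one-to-one sentence correspondence is an assumption; real parallel corpora often need more elaborate alignment.

```python
def align(source_sentences, target_sentences):
    """Pair sentences by position, assuming a one-to-one correspondence."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("both sides must contain the same number of sentences")
    # Each tuple is one aligned unit: (source sentence, target sentence).
    return list(zip(source_sentences, target_sentences))


# Hypothetical sample data for a side-by-side comparison.
english = ["Hello.", "Thank you."]
french = ["Bonjour.", "Merci."]

for en, fr in align(english, french):
    print(f"{en:<12} | {fr}")
```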
 +
 
 +
In order to make the corpora more useful for doing linguistic [[research]], they are often subjected to a [[process]] known as [[annotation]]. An example of annotating a corpus is [[part-of-speech]] tagging, or POS-tagging, in which [[information]] about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the [[lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
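The annotation described above can be illustrated with a small Python sketch. The `Token` class, the tiny hand-written `LEXICON`, and the `form/POS/lemma` output layout are all assumptions made for illustration; an actual corpus would be tagged by a trained tagger and lemmatizer and would use its own tag set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text (lowercased here)
    pos: str    # part-of-speech tag
    lemma: str  # base form of the word


# Hypothetical lexicon: word form -> (part of speech, lemma).
LEXICON = {
    "the": ("DET", "the"),
    "cats": ("NOUN", "cat"),
    "were": ("VERB", "be"),
    "sleeping": ("VERB", "sleep"),
}


def annotate(text):
    """Attach a POS tag and a lemma to every whitespace-separated word."""
    tokens = []
    for form in text.lower().split():
        # Words missing from the lexicon get the tag UNK and keep their form.
        pos, lemma = LEXICON.get(form, ("UNK", form))
        tokens.append(Token(form, pos, lemma))
    return tokens


def to_tagged_text(tokens):
    """Render the annotation inline, one form/POS/lemma triple per word."""
    return " ".join(f"{t.form}/{t.pos}/{t.lemma}" for t in tokens)


print(to_tagged_text(annotate("The cats were sleeping")))
```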
 +
 
 +
Corpora are the main [[knowledge]] base in corpus linguistics. The [[analysis]] and processing of various types of corpora are also the subject of much work in [[computational linguistics]], speech recognition and [[machine]] [[translation]], where they are often used to create hidden [[Markov]] models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
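A frequency list of the kind mentioned above can be derived from a corpus with a few lines of Python. This is a sketch under stated assumptions: the function name `frequency_list` is hypothetical, and the tokenizer is a crude regular expression that keeps only lowercase ASCII letters and apostrophes, which would not suit most real corpora.

```python
import re
from collections import Counter


def frequency_list(texts, top_n=None):
    """Count word occurrences across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # Crude tokenization: lowercase, then keep runs of letters/apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


# Hypothetical two-text corpus.
corpus = ["The cat sat.", "The dog sat on the mat."]

for word, count in frequency_list(corpus, top_n=2):
    print(word, count)
```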
 +
 
 +
== Archaeological corpora ==
 +
Text corpora are also used in the study of [[historical]] documents, for example in attempts to decipher ancient scripts or in Biblical scholarship. Some archaeological corpora cover such a short period that they provide a snapshot in [[time]]. One of the shortest may be the [[Amarna]] letters (1350 BC), which span 15-30 years. The corpus of an ancient city (for example, the "[[Kültepe]] Texts" of Turkey) may consist of a series of corpora, distinguished by the dates of their find sites.
 +
 
 +
== Some notable text corpora ==
 +
 
 +
English language:
 +
 
 +
* [[American National Corpus]]
* [[British National Corpus]]
* Brown Corpus
* Helsinki Corpus
* Longman-Lancaster Corpus
* North American News Text corpus
* [[Oxford English Corpus]]
* Scottish Corpus of Texts & Speech
 +
 
 +
Historical languages:
 +
 
 +
* [http://nordan.daynal.org/wiki/index.php?title=Thesaurus_Linguae_Graecae Thesaurus Linguae Graecae] (Ancient Greek)
* Electronic Text Corpus of [[Sumerian]] Literature
* Neo-Assyrian Text Corpus Project
* Amarna letters (for [[Akkadian]], Egyptian, Sumerograms, etc.)
 +
 
 +
[[Category: General Reference]]
 +
[[Category: Linguistics]]
   −
Singular form of the plural term [[Corpora]].
      
[[Category: General Reference]]
 
