Line 1: |
Line 1: |
− | [[Image:lighterstill.jpg]] | + | [[Image:lighterstill.jpg]] [[Image:Copora2.jpg|right|frame]] |
| + | |
| + | In [[linguistics]], a ''corpus'' (plural '''corpora'''), or [[text]] corpus, is a large and [[structure]]d set of texts (now usually electronically stored and processed). Corpora are used for [[statistical]] [[analysis]], for checking occurrences, and for validating linguistic rules within a specific [[universe]]. |
| + | |
| + | A corpus may contain texts in a single [[language]] (monolingual corpus) or text [[data]] in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. |
| + | |
| + | In order to make the corpora more useful for doing linguistic [[research]], they are often subjected to a [[process]] known as [[annotation]]. An example of annotating a corpus is [[part-of-speech]] tagging, or POS-tagging, in which [[information]] about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the [[lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. |
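As a rough illustration of the annotation described above, the following Python sketch attaches a part-of-speech tag and a lemma to each token. The lexicon, the tag names and the function names are hypothetical and chosen only for this example; real taggers rely on much larger resources and on statistical models rather than a fixed lookup table.

```python
# Hypothetical toy lexicon (an assumption for illustration only):
# word form -> (part-of-speech tag, lemma).
LEXICON = {
    "the": ("DET", "the"),
    "cats": ("NOUN", "cat"),
    "were": ("VERB", "be"),
    "sleeping": ("VERB", "sleep"),
}


def annotate(text):
    """Return (token, pos, lemma) triples for whitespace-separated text."""
    annotated = []
    for token in text.split():
        # Unknown words get the placeholder tag "UNK" and their lowercase form as lemma.
        pos, lemma = LEXICON.get(token.lower(), ("UNK", token.lower()))
        annotated.append((token, pos, lemma))
    return annotated


def to_tagged_string(annotated):
    """Render the annotation in an inline word/TAG format."""
    return " ".join(f"{token}/{pos}" for token, pos, _ in annotated)
```

Calling `to_tagged_string(annotate("The cats were sleeping"))` on this toy lexicon yields one `word/TAG` pair per token, which is one common way of storing tags alongside the text.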
| + | |
| + | Corpora are the main [[knowledge]] base in corpus linguistics. The [[analysis]] and processing of various types of corpora are also the subject of much work in [[computational linguistics]], speech recognition and [[machine]] [[translation]], where they are often used to create hidden [[Markov]] models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. |
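A frequency list of the kind mentioned above can be derived from a small corpus along the following lines. This is a minimal sketch: the function name `frequency_list` and the simple tokenization rule (lowercase runs of letters and apostrophes) are assumptions made for the example, not a standard procedure.

```python
import re
from collections import Counter


def frequency_list(texts, top_n=None):
    """Count word forms across all texts and return (word, count) pairs,
    most frequent first. Tokenization here is deliberately simplistic."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of letters/apostrophes as tokens.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)
```

For the two-sentence corpus `["The cat sat.", "The dog sat on the mat."]`, `frequency_list(corpus, top_n=2)` returns `[("the", 3), ("sat", 2)]`.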
| + | |
| + | == Archaeological corpora == |
| + | Text corpora are also used in the study of [[historical]] documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in [[time]]. One of the shortest in time span may be the [[Amarna]] letters (1350 BC), which cover only 15 to 30 years. The texts of an ancient city (for example the "[[Kültepe]] Texts" of Turkey) may fall into a series of corpora, determined by the dates of their find sites. |
| + | |
| + | == Some notable text corpora == |
| + | |
| + | English language: |
| + | |
| + | * [[American National Corpus]] |
| + | * [[British National Corpus]] |
| + | * Brown Corpus |
| + | * Helsinki Corpus |
| + | * Longman-Lancaster Corpus |
| + | * North American News Text corpus |
| + | * [[Oxford English Corpus]] |
| + | * Scottish Corpus of Texts & Speech |
| + | |
| + | Historical languages: |
| + | |
| + | * [https://nordan.daynal.org/wiki/index.php?title=Thesaurus_Linguae_Graecae Thesaurus Linguae Graecae] (Ancient Greek) |
| + | * Electronic Text Corpus of [[Sumerian]] Literature |
| + | * Neo-Assyrian Text Corpus Project |
| + | * Amarna letters (for [[Akkadian]], Egyptian, Sumerograms, etc.) |
| + | |
| + | [[Category: General Reference]] |
| + | [[Category: Linguistics]] |
| | | |
− | Singular form of the plural term [[Corpora]].
| |
| | | |
| [[Category: General Reference]] | | [[Category: General Reference]] |