Line 1: |
Line 1: |
| [[Image:lightwave.jpg]] | | [[Image:lightwave.jpg]] |
− | [[Image:Copora.jpg|center|]] | + | [[Image:Copora2.jpg|right|frame]] |
− | == Text Corpus ==
| + | In [[linguistics]], a ''corpus'' (plural '''corpora'''), or text corpus, is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific universe. |
− | | |
− | In [[linguistics]], a corpus (plural corpora), or text corpus, is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific universe. | |
| | | |
| A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. | | A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. |
| | | |
| To make corpora more useful for linguistic research, they are often subjected to a process known as [[annotation]]. An example of annotating a corpus is [[part-of-speech]] tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the [[lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. | | To make corpora more useful for linguistic research, they are often subjected to a process known as [[annotation]]. An example of annotating a corpus is [[part-of-speech]] tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the [[lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. |
| + | |
| | | |
| Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in [[computational linguistics]], [[speech recognition]] and [[machine translation]], where they are often used to create hidden [[Markov]] models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. | | Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in [[computational linguistics]], [[speech recognition]] and [[machine translation]], where they are often used to create hidden [[Markov]] models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. |
− |
| |
− | [[Category: General Reference]]
| |
− | [[Category: Linguistics]]
| |
| | | |
| == Archaeological corpora == | | == Archaeological corpora == |
Line 40: |
Line 36: |
| | | |
| [[Category: General Reference]] | | [[Category: General Reference]] |
| + | [[Category: Linguistics]] |
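
The annotation and frequency-list ideas described in the article text can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three sentences in `TEXTS` are invented, the `LEXICON` lookup table and its tag names (`DET`, `NOUN`, `VERB`, `UNK`) are hypothetical, and a real tagger would typically be statistical (for example a hidden Markov model) rather than a hand-written dictionary.

```python
import re
from collections import Counter

# Toy corpus: three invented sentences, not taken from any real corpus.
TEXTS = [
    "The cat sees the dogs.",
    "The dogs see the cat.",
    "A cat sleeps.",
]

# Hypothetical hand-made lexicon: word form -> (part-of-speech tag, lemma).
# The tag names are illustrative only.
LEXICON = {
    "the": ("DET", "the"),
    "a": ("DET", "a"),
    "cat": ("NOUN", "cat"),
    "dogs": ("NOUN", "dog"),
    "sees": ("VERB", "see"),
    "see": ("VERB", "see"),
    "sleeps": ("VERB", "sleep"),
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def annotate(tokens):
    """Attach a (tag, lemma) pair to each token; unknown words get 'UNK'."""
    return [(tok, *LEXICON.get(tok, ("UNK", tok))) for tok in tokens]


def frequency_list(texts):
    """Count word forms over all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common()


if __name__ == "__main__":
    for text in TEXTS:
        # Print each sentence in a simple word/TAG/lemma format.
        print(" ".join(f"{w}/{tag}/{lemma}" for w, tag, lemma in annotate(tokenize(text))))
    print(frequency_list(TEXTS)[:3])
```

Storing the tag and lemma next to each token keeps the original word forms intact, so the same annotated data can feed both a frequency list of surface forms and one of lemmas.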