2,726 bytes added, 22:11, 12 December 2020

Text replacement - "http://nordan.daynal.org" to "https://nordan.daynal.org"

[[Image:lighterstill.jpg]] [[Image:Copora2.jpg|right|frame]]

In [[linguistics]], a ''corpus'' (plural '''corpora''') or [[text]] corpus is a large and [[structure]]d set of texts (now usually electronically stored and processed). Corpora are used for [[statistical]] [[analysis]], for checking occurrences, and for validating linguistic rules within a specific [[universe]].

A corpus may contain texts in a single [[language]] (monolingual corpus) or text [[data]] in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.

In order to make the corpora more useful for doing linguistic [[research]], they are often subjected to a [[process]] known as [[annotation]]. An example of annotating a corpus is [[part-of-speech]] tagging, or POS-tagging, in which [[information]] about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the [[lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
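
The annotation described above can be pictured as extra fields attached to each word. The following is a minimal sketch, not the format of any particular corpus: the `Token` class, the tag names (`DET`, `NOUN`, `VERB`), and the `word/TAG/lemma` rendering are all assumptions made for illustration, and the tags are written by hand rather than produced by a tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word (hypothetical structure, for illustration only)."""
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (tag names here are assumed)
    lemma: str  # lemma (base) form of the word


# A hand-annotated toy sentence; a real corpus would hold many such sentences.
SENTENCE = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "VERB", "be"),
    Token("sleeping", "VERB", "sleep"),
]


def to_tagged_text(tokens):
    """Render tokens inline as word/TAG/lemma, separated by spaces."""
    return " ".join(f"{t.form}/{t.pos}/{t.lemma}" for t in tokens)


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given POS tag."""
    return [t.lemma for t in tokens if t.pos == pos]


print(to_tagged_text(SENTENCE))
print(lemmas_with_pos(SENTENCE, "VERB"))
```

Storing the lemma next to the surface form lets a query for "sleep" also match "sleeping", which is one reason lemma annotation is added alongside POS tags.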

Corpora are the main [[knowledge]] base in corpus linguistics. The [[analysis]] and processing of various types of corpora are also the subject of much work in [[computational linguistics]], speech recognition and [[machine]] [[translation]], where they are often used to create hidden [[Markov]] models for POS-tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
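
A frequency list of the kind mentioned above can be derived by counting word forms across the texts of a corpus. The sketch below is only illustrative: the function name `frequency_list`, the two-sentence toy corpus, and the very crude tokenizer (lowercase runs of letters and apostrophes) are assumptions, and real corpus tools tokenize far more carefully.

```python
import re
from collections import Counter


def frequency_list(texts, top_n=None):
    """Count word forms across all texts, most frequent first.

    Tokenization is deliberately naive: text is lowercased and split
    into runs of ASCII letters and apostrophes.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


# Toy corpus (assumed data, for illustration only).
TOY_CORPUS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

for word, count in frequency_list(TOY_CORPUS, top_n=3):
    print(word, count)
```

Counting lemmas instead of surface forms would merge inflected variants into one entry, which is often what a teaching-oriented frequency list wants.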

== Archaeological corpora ==

Text corpora are also used in the study of [[historical]] documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora cover such a short period that they provide a snapshot in [[time]]. One of the shortest corpora in time may be the [[Amarna]] letters (1350 BC), which span 15 to 30 years. The texts of an ancient city (for example the "[[Kültepe]] Texts" of Turkey) may form a series of corpora, determined by the dates of their find sites.

== Some notable text corpora ==

English language:

* [[American National Corpus]]
* [[British National Corpus]]
* Brown Corpus
* Helsinki Corpus
* Longman-Lancaster Corpus
* North American News Text corpus
* [[Oxford English Corpus]]
* Scottish Corpus of Texts & Speech

Historical languages:

* [https://nordan.daynal.org/wiki/index.php?title=Thesaurus_Linguae_Graecae Thesaurus Linguae Graecae] (Ancient Greek)
* Electronic Text Corpus of [[Sumerian]] Literature
* Neo-Assyrian Text Corpus Project
* Amarna letters (for [[Akkadian]], Egyptian, Sumerograms, etc.)

[[Category: General Reference]]

[[Category: Linguistics]]


Mywikis


Corpus (view source)