In linguistics, a corpus is a collection of linguistic data (usually stored in a computer database) used for research, scholarship, and teaching. Plural: corpora.
The first systematically organized computer corpus was the Brown University Standard Corpus of Present-Day American English (commonly known as the Brown Corpus), compiled in the 1960s by linguists Henry Kučera and W. Nelson Francis.
Notable English language corpora include the following:
- The American National Corpus (ANC)
- The British National Corpus (BNC)
- The Corpus of Contemporary American English (COCA)
- The International Corpus of English (ICE)
Etymology:
From the Latin corpus, "body"
Examples and Observations:
- "The 'authentic materials' movement in language teaching that emerged in the 1980s [advocated] a greater use of real-world or 'authentic' materials--materials not specially designed for classroom use--since it was argued that such material would expose learners to examples of natural language use taken from real-world contexts. More recently the emergence of corpus linguistics and the establishment of large-scale databases or corpora of different genres of authentic language have offered a further approach to providing learners with teaching materials that reflect authentic language use."
(Jack C. Richards, Series Editor's Preface. Using Corpora in the Language Classroom, by Randi Reppen. Cambridge University Press, 2010)
- Modes of Communication: Writing and Speech
"Corpora may encode language produced in any mode--for example, there are corpora of spoken language and there are corpora of written language. In addition, some video corpora record paralinguistic features such as gesture . . ., and corpora of sign language have been constructed . . ..
"Corpora representing the written form of a language usually present the smallest technical challenge to construct. . . . Unicode allows computers to reliably store, exchange and display textual material in nearly all of the writing systems of the world, both current and extinct. . . .
"Material for a spoken corpus, however, is time-consuming to gather and transcribe. Some material may be gathered from sources like the World Wide Web . . .. However, transcripts such as these have not been designed as reliable materials for linguistic exploration of spoken language. . . . [S]poken corpus data is more often produced by recording interactions and then transcribing them. Orthographic and/or phonemic transcriptions of spoken materials can be compiled into a corpus of speech which is searchable by computer."
(Tony McEnery and Andrew Hardie, Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, 2012)
- Concordancing
"Concordancing is a core tool in corpus linguistics and it simply means using corpus software to find every occurrence of a particular word or phrase. . . . With a computer, we can now search millions of words in seconds. The search word or phrase is often referred to as the 'node' and concordance lines are usually presented with the node word/phrase in the centre of the line with seven or eight words presented at either side. These are known as Key-Word-in-Context displays (or KWIC concordances)."
(Anne O'Keeffe, Michael McCarthy, and Ronald Carter, "Introduction." From Corpus to Classroom: Language Use and Language Teaching. Cambridge University Press, 2007)
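The Key-Word-in-Context display described above can be illustrated with a minimal Python sketch. It is not the method of any particular concordancing package: the function names `kwic` and `format_kwic`, the `width` parameter, the naive tokenizer, and the sample text are all assumptions made for illustration.

```python
import re


def kwic(text, node, width=7):
    """Return (left context, node, right context) for every occurrence of `node`.

    `width` is the number of words kept on each side of the node word.
    """
    # Naive tokenization (an assumption): runs of word characters and apostrophes.
    # Real corpus software uses more careful tokenizers.
    tokens = re.findall(r"[\w']+", text)
    node = node.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so the node words line up in one column."""
    if not lines:
        return ""
    pad = max(len(left) for left, _, _ in lines)
    return "\n".join(
        f"{left:>{pad}}  {node}  {right}" for left, node, right in lines
    )


# Hypothetical sample text, used only to demonstrate the search.
sample = (
    "A corpus is a collection of texts. Each corpus is built for a purpose, "
    "and a good corpus is documented."
)
result = kwic(sample, "corpus", width=3)
print(format_kwic(result))
```

The sketch matches single words only and ignores case. Searching for a phrase, as the quotation mentions, would mean comparing a sequence of tokens rather than one token.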
- Advantages of Corpus Linguistics
"In 1992 [Jan Svartvik] presented the advantages of corpus linguistics in a preface to an influential collection of papers. His arguments are given here in abbreviated form:
- Corpus data are more objective than data based on introspection.
- Corpus data can easily be verified by other researchers and researchers can share the same data instead of always compiling their own.
- Corpus data are needed for studies of variation between dialects, registers and styles.
- Corpus data provide the frequency of occurrence of linguistic items.
- Corpus data do not only provide illustrative examples, but are a theoretical resource.
- Corpus data give essential information for a number of applied areas, like language teaching and language technology (machine translation, speech synthesis etc.).
- Corpora provide the possibility of total accountability of linguistic features--the analyst should account for everything in the data, not just selected features.
- Computerised corpora give researchers all over the world access to the data.
- Corpus data are ideal for non-native speakers of the language.
However, Svartvik also points out that it is crucial that the corpus linguist engages in careful manual analysis as well: mere figures are rarely enough. He stresses too that the quality of the corpus is important."
(Hans Lindquist, Corpus Linguistics and the Description of English. Edinburgh University Press, 2009)
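The point that corpus data provide the frequency of occurrence of linguistic items can be shown with a short Python sketch of a word frequency list. The function name `frequency_list`, the lowercasing step, the simple tokenizer, and the sample sentence are assumptions for illustration, not a description of how any specific corpus tool counts items.

```python
import re
from collections import Counter


def frequency_list(text, top_n=None):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Lowercase so that "The" and "the" are counted together (an assumption;
    # some studies keep case distinctions).
    tokens = re.findall(r"[\w']+", text.lower())
    return Counter(tokens).most_common(top_n)


# Hypothetical sample sentence standing in for a corpus.
sample = "The cat sat on the mat and the dog sat too"
top_two = frequency_list(sample, top_n=2)
print(top_two)
```

As the passage notes, figures like these are a starting point: the counts still need careful manual analysis, and they are only as good as the corpus behind them.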
- Additional Applications of Corpus-Based Research
"Apart from the applications in linguistic research per se, the following practical applications may be mentioned.
Lexicography
Corpus-derived frequency lists and, more especially, concordances are establishing themselves as basic tools for the lexicographer. . . .
. . . The use of concordances as language-learning tools is currently a major interest in computer-assisted language learning (CALL; see Johns 1986). . . .
Machine translation is one example of the application of corpora for what computer scientists call natural language processing. In addition to machine translation, a major research goal for NLP is speech processing, that is, the development of computer systems capable of outputting automatically produced speech from written input (speech synthesis), or converting speech input into written form (speech recognition)."
(Geoffrey N. Leech, "Corpora." The Linguistics Encyclopedia, ed. by Kirsten Malmkjaer. Routledge, 1995)