English Corpora A – E (materials from corpus-linguistics.com, 24.04.2004)

Corpora: English Corpora: A – E

 

 

Corpora on this page: ACE, Bank of English, BNC, BROWN, CEECS, CHRISTINE, CIC, CLC, COLT, CPSA, ECI

ACE – Australian Corpus of English

Org: Macquarie University, Sydney, Australia
Time: 1986
Size: 1 million words (500 samples of 2,000 words each)
Contents: written and spoken, multigeneric (15 different genres)
Access: available on the ICAME CD-ROM
Notes: modelled on BROWN and LOB for linguistic research

Bank of English – Cobuild

Org: Cobuild and the University of Birmingham, UK (John Sinclair)
Time: majority of the material originates after 1990
Size: 415 million words in Oct 2000 (still growing)
Contents: written and spoken; multigeneric
Access: CobuildDirect allows restricted access to the corpus through a Java or telnet interface; restricted concordance and collocation queries are possible
Notes:

BNC – British National Corpus

Org: an industrial/academic consortium led by Oxford University Press
Time: completed in 1994; first release in 1995; second release in 2001
Size: over 100 million words (4,125 texts)
Contents: multigeneric; 90% written and 10% spoken materials
Access: licensed; a guest account is available via the SARA client at the BNC Online Service, or a simple search can be run at the BNC
Notes: SGML markup according to the TEI guidelines; POS tagging carried out with CLAWS

BROWN University Corpus

Org: Brown University, Rhode Island, U.S.
Time: 1960s
Size: ca. 1 million words
Contents: American written English; 500 text samples of approximately 2,000 words distributed over 15 text categories
Access: available on the ICAME CD-ROM
Notes:

CEECS – Corpus of Early English Correspondence Sampler

Org: University of Helsinki, Finland
Time: 1418-1680
Size: approx. 450,000 words
Contents: list of included texts (linked on the original page)
Access: available on the ICAME CD-ROM
Notes: represents the non-copyrighted materials included in the Corpus of Early English Correspondence

CHRISTINE Corpus

Org: Geoffrey Sampson, University of Sussex, UK
Time: first distributed in August 2000
Size:
Contents: spoken English, particularly spontaneous, informal speech
Access: freely available for download
Notes: see also SUSANNE

CIC – Cambridge International Corpus

Org: Cambridge University Press
Time: ongoing
Size: 300 million words and expanding
Contents: multigeneric; written and spoken British and American materials, learners’ English
Access: “Currently, it can only be used by authors and writers working for Cambridge University Press and by members of staff at UCLES.”
Notes: “Authors, editors and lexicographers use the CIC […] when they are working on books for Cambridge University Press.”

CLC – Cambridge Learner Corpus

Org: Cambridge University Press and UCLES
Time: ongoing
Size: 10 million words and expanding
Contents: anonymised exam scripts written by students taking UCLES English exams around the world
Access: “Currently, it can only be used by authors and writers working for Cambridge University Press and by members of staff at UCLES.”
Notes: it forms part of the Cambridge International Corpus

COLT – Bergen Corpus of London Teenage Language

Org: University of Bergen, Norway
Time: material collected in 1993
Size: 500,000 words; the pilot version consists of 151 texts
Contents: transcripts of spoken ‘London Teenage Language’
Access: the pilot version can be searched online; registered users can search the entire corpus online; COLT is also available on the ICAME CD-ROM
Notes: COLT is part of the BNC; it is tagged for word classes

CPSA – Corpus of Spoken Professional American English

Org: contact Michael Barlow
Time: 1994-1998
Size: 2 main sub-corpora, 1 million words each
Contents: short interchanges by 400 speakers; professional activities broadly tied to academics and politics
Access: registered users only ($79 for an individual using the tagged version)
Notes: available both tagged and untagged; the tagging was performed by Tony McEnery and Paul Baker using the CLAWS programme at UCREL, Lancaster University

ECI Corpus

Org: ELSNET
Time: materials collected between 1984 and 1993
Size: four different corpora ranging from 4 to 34 million words
Contents: German, French and Dutch newspaper texts; parallel texts in English, Spanish and French
Access: available on CD-ROM for € 50, for research purposes only
Notes:
