1 Introduction to Corpus Linguistics

1.1 What is a corpus?

In the language sciences, a corpus is a body of written text or transcribed speech that can serve as a basis for linguistic analysis and description. In many respects it is the use to which the body of textual material is put, rather than its design features, that defines what a
corpus is. A corpus constitutes an empirical basis not only for identifying the elements and structural patterns that make up the systems we use in a language, but also for mapping out our use of these systems. A corpus can be analyzed and compared with other corpora or parts of corpora to study variation. Most importantly, it can be analyzed distributionally to show how often particular phonological, lexical, grammatical, discoursal or pragmatic features occur, and also where they occur.

By the 1990s there were many corpus-making projects in various parts of the world. Lancashire (1991) shows the huge range of corpora, archives and other electronic databases available or being compiled for a wide variety of purposes. Some of the largest corpus projects have been undertaken for commercial purposes by dictionary publishers. Other projects in corpus compilation or analysis are on a smaller scale and do not necessarily become well known. Undertaken as part of graduate theses or undergraduate projects, they have enabled students to gain original insights into the structure and use of language.

1.2 Categorization of corpora

Computerized corpora consist of:

Raw corpora: corpora in which real-world spoken and written language is collected in written form, then classified and compiled according to certain principles (register, style, diachronic, synchronic and so on).

Annotated corpora: corpora in which the raw material has been tagged for part of speech, grammar, phonetics, semantics, discourse or even pragmatics.

Parallel corpora: interactive corpora in which two or more languages are aligned in translation at the level of the sentence, or even of the word and phrase. Examples include CRATER, a parallel corpus covering English, French, German, Spanish and other languages (McEnery & Oakes 1996), and the English-Chinese bilingual parallel corpus (China Foreign Language Education Research Center 2000).

Learner corpora: corpora of the spoken and written language of non-native learners, including corpora annotated with tags for learners' spelling and grammatical errors and with suggested corrections. Examples include ICLE (the International Corpus of Learner English, a written corpus), LINDSEI (the Louvain International Database of Spoken English Interlanguage, a spoken corpus) (Granger 2000) and CLEC (the Chinese Learner English Corpus, a written corpus) (Gui Shichun 2001).

Lattice corpora: corpora generated by applying automatic speech recognition and handwriting recognition to natural language, both spoken and written (Atwell 1996).

Broadly speaking, corpora divide into raw corpora and annotated corpora.

1.3 What a corpus can do

Strictly speaking, a corpus by itself can do nothing at all, being nothing other than a store of used language. Corpus access software, however, can rearrange that store so that observations of various kinds can be made. If a corpus represents, very roughly and partially, a speaker's experience of language, the access software re-orders that experience so that it can be re-examined in ways that are usually impossible. A corpus does not contain new information about language, but the software packages process data from a corpus in three ways: showing frequency, phraseology and collocation.

2. What is corpus linguistics?

2.1 The definition of corpus linguistics

Over the last three decades the compilation and analysis of corpora stored in computerized databases has led to a new scholarly enterprise known as corpus linguistics. It brings together some of the findings of corpus-based studies of English, the language which has so far received the most attention from
corpus linguists, and shows how quantitative analysis can contribute to linguistic description.

2.2 The history of corpus linguistics

The use of corpora for linguistic study dates back to the end of the nineteenth century, when only cards and manual retrieval were available as research tools.

As we have seen, corpus linguistics goes beyond the use of corpora as a source of evidence in linguistic description. It also revives and carries on a concern of some linguists with the statistical distribution of linguistic items in the context of use. From the 1920s there was, especially in the United
States and the United Kingdom, a tradition of word counting in texts in order to discover the most frequent, and arguably therefore the most pedagogically useful, words and grammatical structures for language teaching purposes. From the 1930s, Prague School linguists undertook quantitative studies
(mainly of Czech, English and Russian) of different parts of speech, the location and distribution of information in the sentence, and the statistical distribution of syllable types and structures. Different varieties of English have been studied.

The earliest computerized corpora compiled for linguistic research, from the 1960s, required the use of mainframe computers, and researchers frequently had to design their own software for analysis. Initial interest was often in lexis, including word counts, but it was quickly apparent that a computer facilitated the study of permissible or likely word sequences or collocations (are we more likely to write different from, different to or different than?) and of the grammatical and stylistic characteristics of particular authors and genres. There was a particular interest in what characterized scientific style, newspaper style and literary or imaginative style.

The renowned British scholars R. Quirk and S. Greenbaum cooperated to establish a corpus, the Survey of English Usage (SEU), in the 1950s and 1960s, first on paper and then computerized at the beginning of the 1980s, which marks the transition from the traditional corpus to the computerized
corpus. The Brown University Standard Corpus of Present-Day American English (BROWN) was established in the 1960s and 1970s. The London-Lund Corpus of Spoken English (LLC), completed in the 1980s, was the first corpus of its kind, including formal and informal speeches, commentaries, dialogues, discussions, interviews and so on. These three classic corpora laid a solid foundation for present-day corpus linguistics, for they are systematic, comprehensive, authentic and reliable, and easy to store and retrieve.

2.3 The scope of corpus linguistics

Corpus linguistics is based on bodies of text as the domain of study and as the source of evidence for linguistic description and argumentation. It has also come to embody methodologies for linguistic description in which quantification of the linguistic items is part of the research activity. As Leech (1992:
107) has noted, the focus of study is on performance rather than on competence, and on observation of language in use leading to theory rather than vice versa.

Corpus linguists are typically concerned not only with what words, structures or uses are possible in a language but also with what is probable, that is, what is likely to occur in language use. The use of a corpus as a source of evidence, however, is not necessarily incompatible with any linguistic theory, and progress in the language sciences as a whole is likely to benefit from a judicious use of evidence from various sources: texts, introspection,
elicitation or other types of experimentation as appropriate. Any scientific enterprise must be empirical in the sense that it has to be supported or falsified on evidence and, in the final analysis, statements made about language have to stand up to the evidence of language use. The evidence can be based on the introspective judgment of speakers of the language or on a corpus of text. The difference lies in the richness of the evidence and the confidence we can have in the generalizability of that evidence, and in its validity and reliability.

2.4 Applications of corpus linguistics

Corpus linguistics can be widely exploited in a variety of domains, most centrally in the design of syllabi and materials for language teaching, but also in dictionary work, the study of ideology and culture, translation, stylistics, forensic linguistics, and the provision of on-line assistance for writers in well-defined technical domains.

3. Types of corpus researchers

Work in corpus linguistics is currently associated with several quite different activities. Scholars working in the field tend to be identified with one or more of them.

The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.

A second group of researchers has been concerned with developing tools for the analysis of corpora. This is the main task of researchers in computational
linguistics.

A third group of researchers consists of descriptive linguists whose main concern has been to make use of computerized corpora to describe reliably the lexicon and grammar of languages, both the linguistic systems we use and our likely use of those systems. It is the probabilistic aspect of corpus-based descriptive linguistic studies which especially distinguishes them from conventional descriptive fieldwork in linguistics or lexicography.

A fourth area of activity, which has been among the most innovative outcomes of the corpus revolution, has been the exploitation of corpus-based linguistic description for use in a variety of applications such as language learning and teaching, and natural language processing by machine, including speech recognition and translation.

4. The objective of offering this course

It is my hope that this course will whet the appetites of the growing body of teachers and students with access to corpora to discover more for themselves about how languages work in all their variety.

There is no doubt that corpus linguistics is not an end in itself but is one source of evidence for improving descriptions of the structure and use of languages, and
for various applications, including the processing of natural language by machine and understanding how to learn and teach a language.

It should be made clear that corpus linguistics is not a mindless process of automatic language description. Linguists use corpora to answer questions and solve problems. Some of the most revealing insights on language and language use have come from a blend of manual and computer analysis. It is now possible for researchers with access to a personal computer and off-the-shelf software
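
Section 1.3 says that corpus access software processes a corpus in three ways: showing frequency, phraseology and collocation. A minimal Python sketch of those three operations follows, using the "different from / to / than" question from section 2.2 as the query. The sample sentences, the function names and the context width are all assumptions made for illustration; they do not come from any particular software package, and a real study would use a far larger corpus.

```python
import re
from collections import Counter

# Toy corpus invented for illustration only (an assumption, not real data).
TEXT = (
    "Her answer was different from mine. His view is different from hers. "
    "This edition is different to the last one. "
    "The result was no different than before."
)

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Frequency: count how often each word form occurs."""
    return Counter(tokens)

def concordance(tokens, node, width=3):
    """Phraseology: key-word-in-context lines for every hit of `node`.

    Context is taken from the flat token list, so it may cross
    sentence boundaries in this simplified version.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def right_collocates(tokens, node):
    """Collocation: count the word immediately following `node`."""
    return Counter(
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == node
    )

tokens = tokenize(TEXT)
print(frequency_list(tokens).most_common(3))
for line in concordance(tokens, "different"):
    print(line)
print(right_collocates(tokens, "different"))
```

Counting only the word immediately to the right is the simplest possible notion of collocation. Established tools typically look at a wider window and apply an association measure, which this sketch leaves out.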
