1 Introduction to Corpus Linguistics
Chapter 1: The Aims and Methods of Corpus Linguistics

1.1 What is a corpus?
In the language sciences a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description. In many respects it is the use to which the body of textual material is put, rather than its design features, that defines what a corpus is. A corpus constitutes an empirical basis not only for identifying the elements and structural patterns which make up the systems we use in a language, but also for mapping out our use of these systems. A corpus can be analyzed and compared with other corpora or parts of corpora to study variation. Most importantly, it can be analyzed distributionally to show how often particular phonological, lexical, grammatical, discoursal or pragmatic features occur, and also where they occur.

By the 1990s there were many corpus-making projects in various parts of the world. Lancashire (1991) shows the huge range of corpora, archives and other electronic databases available or being compiled for a wide variety of purposes. Some of the largest corpus projects have been undertaken for commercial purposes, by
dictionary publishers. Other projects in corpus compilation or analysis are on a smaller scale, and do not necessarily become well known. Undertaken as part of graduate theses or undergraduate projects, they enable students to gain original insights into the structure and use of language.

1.2 Categorization of corpora
Computerized corpora consist of:
Raw corpora: corpora in which real-life spoken and written language is collected in written form, then classified and compiled according to certain principles (register, style, diachronic or synchronic coverage, and so on).
Annotated corpora: corpora in which the raw material has been tagged with part-of-speech, grammatical, phonetic, semantic, discourse or even pragmatic annotation.
Parallel corpora: corpora in which two or more languages are aligned as translations of one another at the level of the sentence, or even of the word and phrase. Examples are CRATER, a parallel corpus of languages such as English, French, German and Spanish (McEnery & Oakes 1996), and the English-Chinese bilingual parallel corpus (中国外语教学研究中心基地 2000).
Learner corpora: corpora of the spoken and written language of non-native learners, including corpora annotated with learners' spelling and grammatical errors and with suggested corrections. Examples are ICLE (an international corpus of written learner English), LINDSEI (an international corpus of spoken learner English) (Granger 2000) and CLEC (a corpus of written English by Chinese learners) (Gui Shichun 2001).
Lattice corpora: corpora generated by applying automatic speech and handwriting recognition to natural language, both spoken and written (Atwell 1996).
Broadly speaking, corpora divide into raw corpora and annotated corpora.

1.3 What a corpus can do
Strictly speaking, a corpus by itself can do nothing at all, being nothing other than a store of used language. Corpus access software, however, can rearrange that store so that observations of various kinds can be made. If a corpus represents, very roughly and partially, a speaker's experience of language, the access software re-orders that experience so that it can be re-examined in ways that are usually impossible. A corpus does not contain new information about language, but software packages can process the data in a corpus in three ways: showing frequency, phraseology and collocation.

2. What is corpus linguistics?

2.1 The definition of corpus linguistics
Over the last three decades the compilation and analysis of corpora stored in computerized databases has led to a new scholarly enterprise known as corpus linguistics. It brings together some of the findings of corpus-based studies of English, the language which has so far received the most attention from corpus linguists, and shows how quantitative analysis can contribute to linguistic description.

2.2 The history of corpus linguistics
The use of corpora for linguistic study dates back to the end of the nineteenth century, when cards and manual retrieval were the only means of research.

As we have seen, corpus linguistics goes beyond the use of corpora as a source of evidence in linguistic description. It also revives and carries on a concern of some linguists with the statistical distribution of linguistic items in the context of use. From the 1920s there was, especially in the United States and the United Kingdom, a tradition of word counting in texts in order to discover the most frequent, and arguably therefore the most pedagogically useful, words and grammatical structures for language teaching purposes. From the 1930s, Prague School linguists undertook quantitative studies (mainly of Czech, English and Russian) of different parts of speech, the location and distribution of information in the sentence, and the statistical distribution of syllable types and structures. Different varieties of English have been studied.

The earliest computerized corpora compiled for linguistic research from the 1960s required the use of mainframe computers, and researchers frequently had to design their own software for analysis. Initial interest was often in lexis, including word counts, but it was quickly apparent that a computer facilitated the study of permissible or likely word sequences or collocations (are we more likely to write "different from", "different to" or "different than"?) and of the grammatical and stylistic characteristics of particular authors and genres. There was a particular interest in what characterized scientific style, newspaper style and literary or imaginative style.

Work on the Survey of English Usage (SEU) corpus, founded by the renowned British scholar R. Quirk and later directed by S. Greenbaum, began in the 1950s and 1960s, first on paper and then computerized at the beginning of
the 1980s, which marks the transition from the traditional corpus to the computerized corpus. The Brown University Standard Corpus of Present-Day American English (BROWN) was established in the 1960s and 1970s. The London-Lund Corpus of Spoken English (LLC), completed in the 1980s, was the first corpus of its kind, covering formal and informal speeches, commentaries, dialogues, discussions, interviews and so on. These three classic corpora laid a solid foundation for present-day corpus linguistics, for they are built from systematic, comprehensive, authentic and reliable material and are easy to store and retrieve.

2.3 The scope of corpus linguistics
Corpus linguistics is based on bodies of text as the domain of study and as the source of evidence for linguistic description and argumentation. It has also come to embody methodologies for linguistic description in which quantification of linguistic items is part of the research activity. As Leech (1992:107) has noted, the focus of study is on performance rather than on competence, and on observation of language in use leading to theory rather than vice versa.

Corpus linguists are typically concerned not only with what words, structures or uses are possible in a language but also with what is probable, that is, what is likely to occur in language use. The use of a corpus as a source of evidence, however, is not necessarily incompatible with any linguistic theory, and progress in the language sciences as a whole is likely to
benefit from a judicious use of evidence from various sources: texts, introspection, elicitation or other types of experimentation as appropriate. Any scientific enterprise must be empirical in the sense that it has to be supported or falsified on the basis of evidence and, in the final analysis, statements made about language have to stand up to the evidence of language use. The evidence can be based on the introspective judgment of speakers of the language or on a corpus of text. The difference lies in the richness of the evidence and the confidence we can have in the generalizability of that evidence, and in its validity and reliability.

2.4 Applications of corpus linguistics
Corpus linguistics can be widely exploited in a variety of domains, most centrally in the design of syllabi and materials for language teaching, but also in dictionary work, the study of ideology and culture, translation, stylistics, forensic linguistics, and the provision of on-line assistance for writers in well-defined technical domains.

3. Types of corpus researchers
Work in corpus linguistics is currently associated with several quite different activities. Scholars working in the field tend to be identified with one or more of them.

The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.

A second group of researchers has been concerned with developing tools for the analysis of corpora. This is the main task of researchers in computational linguistics.

A third group of researchers consists of descriptive linguists whose main concern has been to make use of computerized corpora to describe reliably the lexicon and grammar of languages, both the linguistic systems we use and our likely use of those systems. It is the probabilistic aspect of corpus-based descriptive linguistic studies which especially distinguishes them from conventional descriptive fieldwork in linguistics or lexicography.

A fourth area of activity, which has been among the most
innovative outcomes of the corpus revolution, has been the exploitation of corpus-based linguistic description for use in a variety of applications such as language learning and teaching, and natural language processing by machine, including speech recognition and translation.

4. The objective of offering this course
It is my hope that this course will whet the appetites of the growing body of teachers and students with access to corpora to discover more for themselves about how languages work in all their variety.

There is no doubt that corpus linguistics is not an end in itself but is one source of evidence for improving descriptions of the structure and use of languages, and for various applications, including the processing of natural language by machine and understanding how to learn and teach a language.

It should be made clear that corpus linguistics is not a mindless process of automatic language description. Linguists use corpora to answer questions and solve problems. Some of the most revealing insights on language and language use have come from a blend of manual and computer analysis. It is now possible for researchers with access to a personal computer and off-the-shelf software
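
The three kinds of processing named in section 1.3 (frequency, phraseology and collocation) can be illustrated with a minimal Python sketch. It does not reproduce how any particular corpus package works: the toy text, the function names and the window sizes are all assumptions made for this illustration. The test case is the "different from / to / than" question raised in section 2.2.

```python
import re
from collections import Counter

# Toy text invented for this illustration; it is not drawn from any real corpus.
TEXT = (
    "This result is different from the earlier one. "
    "Her view is different from mine. "
    "The outcome was different to what we expected. "
    "Nothing here is different than before."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency(tokens):
    """Frequency: how often each word form occurs."""
    return Counter(tokens)


def concordance(tokens, node, width=3):
    """Phraseology: one key-word-in-context line per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def collocates(tokens, node, span=1):
    """Collocation: count the words found within `span` tokens to the right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


tokens = tokenize(TEXT)
print(frequency(tokens).most_common(3))
for line in concordance(tokens, "different"):
    print(line)
print(collocates(tokens, "different"))
```

On this toy text the collocate counts describe the four sample sentences only. A real study would run the same kind of count over a large corpus and would usually add a statistical measure of association instead of relying on raw counts.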