1 Introduction to Corpus Linguistics
Chapter 1: The Aims and Methods of Corpus Linguistics
1.1 What is a corpus?
In the language sciences a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description. In many respects it is the use to which the body of textual material is put, rather than its design features, that defines what a corpus is.
A corpus constitutes an empirical basis not only for identifying the elements and structural patterns which make up the systems we use in a language, but also for mapping out our use of these systems. A corpus can be analyzed and compared with other corpora or parts of corpora to study variation. Most importantly, it can be analyzed distributionally to show how often particular phonological, lexical, grammatical, discoursal or pragmatic features occur, and also where they occur.
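As a rough illustration of distributional analysis, the sketch below counts how often one word occurs in each part of a corpus and normalizes the count per 1,000 tokens so that parts of different sizes can be compared. The two-part corpus, its section labels and the helper names (tokenize, distribution) are all invented for this example; a real study would use a far larger corpus and a more careful tokenizer.

```python
import re

# Hypothetical two-part corpus; labels and texts are invented for illustration.
corpus = {
    "news": "The minister said that the report was different from the earlier one.",
    "fiction": "She said nothing. He said the house looked different, and she said it was not.",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def distribution(corpus, word):
    """Return {section: (raw count, count per 1,000 tokens)} for one word."""
    result = {}
    for section, text in corpus.items():
        tokens = tokenize(text)
        count = tokens.count(word)
        # Normalizing by section size makes sections of unequal length comparable.
        per_thousand = 1000 * count / len(tokens) if tokens else 0.0
        result[section] = (count, round(per_thousand, 1))
    return result

# Shows both how often the word occurs and where it occurs.
print(distribution(corpus, "said"))
```

The raw count answers "how often", while the per-section breakdown answers "where"; the normalized figure is what allows comparison between sections of different sizes.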
By the 1990s there were many corpus-making projects in various parts of the world. Lancashire (1991) shows the huge range of corpora, archives and other electronic databases available or being compiled for a wide variety of purposes. Some of the largest corpus projects have been undertaken for commercial purposes, by dictionary publishers. Other projects in corpus compilation or analysis are on a smaller scale, and do not necessarily become well known. Undertaken as part of graduate theses or undergraduate projects, they have enabled students to gain original insights into the structure and use of language.
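1.2 Categorization of corpora
Computerized corpora consist of:
Raw corpora: collections of real spoken and written language recorded in text form and compiled according to certain principles (register, genre, diachronic or synchronic coverage, and so on).
Annotated corpora: corpora in which the raw material has been tagged with codes for part of speech, grammar, phonetics, semantics, discourse or even pragmatics.
Parallel corpora: interactive corpora in which two or more languages are aligned as translations of one another at the level of the sentence, or even of the word and phrase. Examples are CRATER, a parallel corpus covering English, French, German, Spanish and other languages (McEnery & Oakes 1996), and the English-Chinese bilingual parallel corpus (China Foreign Language Education Research Center 2000).
Learner corpora: corpora of the spoken and written language of non-native learners, including corpora annotated with learners' spelling and grammatical errors and with hints for correction. Examples are ICLE (an international written corpus of learner English), LINDSEI (an international spoken corpus of learner English) (Granger 2000) and CLEC (a written corpus of Chinese learners' English) (Gui Shichun 2001).
Lattice corpora: corpora generated by applying automatic speech recognition and handwriting recognition to natural language, both spoken and written (Atwell 1996).
Broadly speaking, corpora fall into two classes: raw corpora and annotated corpora.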
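The difference between a raw and an annotated corpus can be made concrete with a small sketch. It assumes one common plain-text convention, in which each token is written as word_TAG; the tag names used here (DET, NOUN, VERB) and the helper parse_annotated are invented for the example and do not correspond to the tagset of any particular corpus.

```python
from collections import Counter

# A raw corpus is plain text with no added codes.
raw = "The corpus shows the patterns"

# Hypothetical annotated version: each token carries a part-of-speech code,
# written as word_TAG. The tagset is an invented, minimal one.
annotated = "The_DET corpus_NOUN shows_VERB the_DET patterns_NOUN"

def parse_annotated(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the last underscore, so words that
        # themselves contain an underscore are still handled.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_annotated(annotated)

# With annotation we can ask questions the raw text cannot answer directly,
# such as how many tokens of each word class occur, or which words are nouns.
tag_counts = Counter(tag for _, tag in pairs)
nouns = [word for word, tag in pairs if tag == "NOUN"]

print(tag_counts)
print(nouns)
```

Stripping the tags from the annotated version gives back the raw text, which is why the raw corpus can be regarded as the base layer to which annotation is added.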
1.3 What a corpus can do
Strictly speaking, a corpus by itself can do nothing at all, being nothing other than a store of used language. Corpus access software, however, can rearrange that store so that observations of various kinds can be made. If a corpus represents, very roughly and partially, a speaker’s experience of language, the access software re-orders that experience so that it can be re-examined in ways that are not usually possible. A corpus does not contain new information about language, but the software packages process data from a corpus in three ways: showing frequency, phraseology and collocation.
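The three kinds of processing named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any actual corpus access package: the sample text is invented, the function names (frequency_list, concordance, right_collocates) are hypothetical, phraseology is approximated by key-word-in-context (KWIC) lines, and collocation is approximated by counting the word immediately to the right of the node word.

```python
from collections import Counter
import re

# Invented sample text; a real corpus would be far larger.
text = (
    "This result is different from the last one. "
    "Her view is different from mine. "
    "The new design is different to the old design. "
    "That is a different matter."
)

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens, n=5):
    """Frequency: the n most common word forms with their counts."""
    return Counter(tokens).most_common(n)

def concordance(tokens, node, width=3):
    """Phraseology: KWIC lines with `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{node}] {right}")
    return lines

def right_collocates(tokens, node):
    """Collocation: count the word immediately following each hit."""
    return Counter(
        tokens[i + 1] for i, tok in enumerate(tokens[:-1]) if tok == node
    )

tokens = tokenize(text)
print(frequency_list(tokens))
for line in concordance(tokens, "different"):
    print(line)
print(right_collocates(tokens, "different"))
```

None of these steps adds information to the text; each simply re-orders what is already there so that patterns, such as which preposition tends to follow "different", become visible.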
2. What is corpus linguistics?
2.1 The definition of corpus linguistics
Over the last three decades the compilation and analysis of corpora stored in computerized databases have led to a new scholarly enterprise known as corpus linguistics. It brings together some of the findings of corpus-based studies of English, the language which has so far received the most attention from corpus linguists, and shows how quantitative analysis can contribute to linguistic description.
2.2 The history of corpus linguistics
The use of corpora for linguistic study dates back to the end of the nineteenth century, when only index cards and manual retrieval were available as research tools.
As we have seen, corpus linguistics goes beyond the use of corpora as a source of evidence in linguistic description. It also revives and carries on a concern of some linguists with the statistical distribution of linguistic items in the context of use. From the 1920s there was, especially in the United States and the United Kingdom, a tradition of word counting in texts in order to discover the most frequent, and arguably therefore the most pedagogically useful, words and grammatical structures for language teaching purposes.
From the 1930s, Prague School linguists undertook quantitative studies (mainly of Czech, English and Russian) of different parts of speech, the location and distribution of information in the sentence, and the statistical distribution of syllable types and structures. Different varieties of English have been studied.
The earliest computerized corpora compiled for linguistic research from the 1960s required the use of mainframe computers, and researchers frequently had to design their own software for analysis. Initial interest was often in lexis, including word counts, but it was quickly apparent that a computer facilitated the study of permissible or likely word sequences or collocations (are we more likely to write different from, different to or different than?) and of the grammatical and stylistic characteristics of particular authors and genres. There was a particular interest in what characterized ‘scientific style’, ‘newspaper style’ and ‘literary or imaginative style’.
The renowned British scholars R. Quirk and S. Greenbaum cooperated to establish the Survey of English Usage (SEU) corpus in the 1950s and 1960s, first on paper and then, at the beginning of the 1980s, in computerized form, which marks the transition from the traditional corpus to the computerized corpus. The Brown University Standard Corpus of Present-Day American English (Brown Corpus) was established in the 1960s and 1970s. The London-Lund Corpus of Spoken English (LLC), completed in the 1980s, was the first corpus of its kind, including formal and informal speeches, commentaries, dialogues, discussions, interviews and so on. These three classic corpora laid a solid foundation for present-day corpus linguistics, for they are systematic, comprehensive, authentic and reliable, and easy to store and retrieve.
2.3 The scope of corpus linguistics
Corpus linguistics is based on bodies of text as the domain of study and as the source of evidence for linguistic description and argumentation. It has also come to embody methodologies for linguistic description in which quantification of the linguistic items is part of the research activity. As Leech (1992: 107) has noted, the focus of study is on performance rather than on competence, and on observation of language in use leading to theory rather than vice versa.
Corpus linguists are typically concerned not only with what words, structures or uses are possible in a language but also with what is probable, that is, what is likely to occur in language use. The use of a corpus as a source of evidence, however, is not necessarily incompatible with any linguistic theory, and progress in the language sciences as a whole is likely to benefit from a judicious use of evidence from various sources: texts, introspection, elicitation or other types of experimentation as appropriate. Any scientific enterprise must be empirical in the sense that it has to be supported or falsified by evidence and, in the final analysis, statements made about language have to stand up to the evidence of language use. The evidence can be based on the introspective judgment of speakers of the language or on a corpus of text. The difference lies in the richness of the evidence and the confidence we can have in the generalizability of that evidence, and in its validity and reliability.
2.4 Applications of corpus linguistics
Corpus linguistics can be exploited in a wide variety of domains, most centrally in the design of syllabi and materials for language teaching, but also in dictionary work, the study of ideology and culture, translation, stylistics, forensic linguistics, and the provision of on-line assistance for writers in well-defined technical domains.
3. Types of corpus researchers
Work in corpus linguistics is currently associated with several quite different activities. Scholars working in the field tend to be identified with one or more of them.
The first group of researchers consists of corpus makers or compilers. These scholars are concerned with the design and compilation of corpora, the collection of texts and their preparation and storage for later analysis.
A second group of researchers has been concerned with developing tools for the analysis of corpora. This is the main task of researchers in computational linguistics.
A third group of researchers consists of descriptive linguists whose main concern has been to make use of computerized corpora to describe reliably the lexicon and grammar of languages, both of the linguistic systems we use and our likely use of those systems. It is the probabilistic aspect of corpus-based descriptive linguistic studies which especially distinguishes them from conventional descriptive fieldwork in linguistics or lexicography.
A fourth area of activity, which has been among the most innovative outcomes of the corpus revolution, has been the exploitation of corpus-based linguistic description for use in a variety of applications such as language learning and teaching, and natural language processing by machine, including speech recognition and translation.
4. The objective of offering this course
It is my hope that this course will whet the appetites of the growing body of teachers and students with access to corpora to discover more for themselves about how language works in all its variety.
There is no doubt that corpus linguistics is not an end in itself but is one source of evidence for improving descriptions of the structure and use of languages, and for various applications, including the processing of natural language by machine and understanding how to learn and teach a language.
It should be made clear that corpus linguistics is not a mindless process of automatic language description. Linguists use corpora to answer questions and solve problems. Some of the most revealing insights on language and language use have come from a blend of manual and computer analysis. It is now possible for researchers with access to a personal computer and off-the-shelf software