REVIEW

Advances in natural language processing

Julia Hirschberg1* and Christopher D. Manning2,3

Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today's researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.

Over the past 20 years, computational linguistics has grown into both an exciting area of scientific research and a practical technology that is increasingly being incorporated into consumer products (for example, in applications such as Apple's Siri and Skype Translator). Four key factors enabled these developments:
(i) a vast increase in computing power, (ii) the availability of very large amounts of linguistic data, (iii) the development of highly successful machine learning (ML) methods, and (iv) a much richer understanding of the structure of human language and its deployment in social contexts.

In this Review, we describe some current application areas of interest in language research. These efforts illustrate computational approaches to big data, based on current cutting-edge methodologies that combine statistical analysis and ML with knowledge of language.

Computational linguistics, also known as natural language processing (NLP), is the subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content. Computational linguistic systems can have multiple purposes:
The goal can be aiding human-human communication, such as in machine translation (MT); aiding human-machine communication, such as with conversational agents; or benefiting both humans and machines by analyzing and learning from the enormous quantity of human language content that is now available online.

During the first several decades of work in computational linguistics, scientists attempted to write down for computers the vocabularies and rules of human languages. This proved a difficult task, owing to the variability, ambiguity, and context-dependent interpretation of human languages. For instance, a star can be either an astronomical object or a person, and "star" can be a noun or a verb. In another example, two interpretations are possible for the headline "Teacher strikes idle kids," depending on the noun, verb, and adjective assignments of the words in the sentence, as well as grammatical structure.

Beginning in the 1980s, but more widely in the 1990s, NLP was transformed by researchers starting to build models over large quantities of empirical language data. Statistical or corpus ("body of words")-based NLP was one of the first notable successes of the use of big data, long before the power of ML was more generally recognized or the term "big data" even introduced.

A central finding of this statistical approach to NLP has been that simple methods using words, part-of-speech (POS) sequences (such as whether a word is a noun, verb, or preposition), or simple templates can often achieve notable results when trained on large quantities of data. Many text and sentiment classifiers are still based solely on the different sets of words ("bag of words") that documents contain, without regard to sentence and discourse structure or meaning. Achieving improvements over these simple baselines can be quite difficult. Nevertheless, the best-performing systems now use sophisticated ML approaches and a rich understanding of linguistic structure. High-performance tools that identify syntactic and semantic information as well as information about discourse context are now available. One example is Stanford CoreNLP (1).
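The bag-of-words classifiers mentioned above can be sketched in a few lines: each document is reduced to word counts, ignoring word order and structure, and a simple probabilistic model is fit to those counts. Below is a minimal Naive Bayes version in pure Python; the tiny training corpus is invented for illustration and is not from the article.

```python
# A minimal bag-of-words Naive Bayes sentiment classifier: documents are
# reduced to word counts, discarding word order, sentence structure, and
# meaning, exactly as in the "bag of words" baselines described above.
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def train(docs):
    """docs: list of (text, label). Returns per-class word counts and doc counts."""
    counts = {}               # label -> Counter of word frequencies
    class_totals = Counter()  # label -> number of documents
    for text, label in docs:
        counts.setdefault(label, Counter()).update(tokenize(text))
        class_totals[label] += 1
    return counts, class_totals

def predict(counts, class_totals, text):
    vocab = {w for c in counts.values() for w in c}
    n_docs = sum(class_totals.values())
    best_label, best_score = None, float("-inf")
    for label, word_counts in counts.items():
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(class_totals[label] / n_docs)
        total = sum(word_counts.values())
        for w in tokenize(text):
            score += math.log((word_counts[w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy corpus for illustration only.
docs = [("a wonderful delightful film", "pos"),
        ("great acting and a wonderful story", "pos"),
        ("a dull boring mess", "neg"),
        ("boring plot and terrible acting", "neg")]
counts, class_totals = train(docs)
print(predict(counts, class_totals, "a wonderful story"))  # → pos
```

Note that the classifier never inspects syntax or context; this is both why such baselines are easy to train on large corpora and why improving on them requires the richer linguistic representations discussed next.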
CoreNLP provides a standard NLP preprocessing pipeline that includes POS tagging (with tags such as noun, verb, and preposition); identification of named entities, such as people, places, and organizations; parsing of sentences into their grammatical structures; and identifying co-references between noun phrase mentions (Fig. 1).

Historically, two developments enabled the initial transformation of NLP into a big data field. The first was the early availability to researchers of linguistic data in digital form, particularly through the Linguistic Data Consortium (LDC) (2), established in 1992.
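The annotation pipeline described above can be caricatured as a chain of functions, each consuming the previous stage's output. The sketch below is a deliberately crude toy, not CoreNLP: the POS lexicon and heuristics are invented for illustration, whereas real pipelines use trained statistical models for every stage.

```python
# Toy annotation pipeline (NOT CoreNLP): tokenization, then dictionary-based
# POS tagging, then a crude capitalization heuristic for named entities.
# The lexicon below is invented for illustration.
POS_LEXICON = {"flies": "VERB", "like": "ADP", "time": "NOUN",
               "arrow": "NOUN", "an": "DET", "in": "ADP"}

def tokenize(sentence):
    return sentence.replace(".", " .").split()

def pos_tag(tokens):
    # Look each word up in a tiny lexicon; default to NOUN as a fallback.
    return [(t, POS_LEXICON.get(t.lower(), "NOUN")) for t in tokens]

def find_named_entities(tokens):
    # Crude heuristic: capitalized tokens not at sentence start are entities.
    return [t for i, t in enumerate(tokens) if i > 0 and t[:1].isupper()]

def pipeline(sentence):
    tokens = tokenize(sentence)
    return {"tokens": tokens,
            "pos": pos_tag(tokens),
            "entities": find_named_entities(tokens)}

ann = pipeline("Time flies like an arrow in Paris .")
print(ann["pos"][1])    # ('flies', 'VERB')
print(ann["entities"])  # ['Paris']
```

Even this toy shows why the stages are ordered as they are: POS tagging and entity recognition both operate over the token sequence, and downstream stages such as parsing and co-reference would in turn consume the tags and entity spans.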
Today, large amounts of digital text can easily be downloaded from the Web. Available as linguistically annotated data are large speech and text corpora annotated with POS tags, syntactic parses, semantic labels, annotations of named entities (persons, places, organizations), dialogue acts (statement, question, request), emotions and positive or negative sentiment, and discourse structure (topic or rhetorical structure). Second, performance improvements in NLP were spurred on by shared task competitions. Originally, these competitions were largely funded and organized by the U.S. Department of Defense, but they were later organized by the research community itself, such as the CoNLL Shared Tasks (3). These tasks were a precursor of modern ML predictive modeling and analytics competitions, such as on Kaggle (4), in which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models.

A major limitation of NLP today is the fact that most NLP resources and systems are available only for high-resource languages (HRLs), such as English, French, Spanish, German, and Chinese. In contrast, many low-resource languages (LRLs), such as Bengali, Indonesian, Punjabi, Cebuano, and Swahili, spoken and written by millions of people, have no such resources or systems available. A future challenge for the language community is how to develop resources and tools for hundreds or thousands of languages, not just a few.

Machine translation

Proficiency in languages was traditionally a hallmark of a learned person. Although the social standing of this human skill has declined in the modern age of science and machines, translation between human languages remains crucially important, and MT is perhaps the most substantial way in which computers could aid human-human communication. Moreover, the ability of computers to translate between human languages remains a consummate test of machine intelligence:
Correct translation requires not only the ability to analyze and generate sentences in human languages but also a humanlike understanding of world knowledge and context, despite the ambiguities of languages. For example, the French word "bordel" straightforwardly means "brothel"; but if someone says "My room is un bordel," then a translating machine has to know enough to suspect that this person is probably not running a brothel in his or her room but rather is saying "My room is a complete mess."

Machine translation was one of the first non-numeric applications of computers and was studied intensively starting in the late 1950s. However, the hand-built grammar-based systems of early decades achieved very limited success. The field was transformed in the early 1990s when researchers at IBM acquired a large quantity of English and French sentences that were translations of each other (known as parallel text), produced as the proceedings of the bilingual Canadian Parliament. These data allowed them to collect statistics of word translations and word sequences and to build a probabilistic model of MT (5).

Following a quiet period in the late 1990s, the new millennium brought the potent combination of ample online text, including considerable quantities of parallel text, much more abundant and inexpensive computing, and a new idea for building statistical phrase-based MT systems (6).

SCIENCE sciencemag.org, 17 July 2015, Vol. 349, Issue 6245, p. 261

1Department of Computer Science, Columbia University, New York, NY 10027, USA. 2Department of Linguistics, Stanford University, Stanford, CA 94305-2150, USA. 3Department of Computer Science, Stanford University, Stanford, CA 94305-9020, USA.
*Corresponding author. E-mail: julia@cs.columbia.edu
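The idea of collecting word-translation statistics from parallel text, as the IBM researchers did, can be sketched with IBM Model 1, the simplest of that family of probabilistic models: an expectation-maximization (EM) loop that estimates, for each English word e and foreign word f, the translation probability t(f|e) from nothing but aligned sentence pairs. The three-sentence "parallel corpus" below is invented for illustration.

```python
# Sketch of IBM Model 1: EM over a tiny parallel corpus estimates t(f|e),
# the probability that English word e translates to foreign word f, using
# only sentence-level alignment (no word alignments are given).
from collections import defaultdict

corpus = [(["the", "house"], ["das", "haus"]),
          (["the", "book"],  ["das", "buch"]),
          (["a", "book"],    ["ein", "buch"])]

# Initialize t(f|e) uniformly; any positive constant works, since the
# E-step normalizes per sentence and the M-step renormalizes per e.
t = defaultdict(lambda: 0.25)

for _ in range(20):                  # EM iterations
    count = defaultdict(float)       # expected counts for (e, f) pairs
    total = defaultdict(float)       # expected counts for e
    for en, fr in corpus:
        for f in fr:
            # E-step: distribute each foreign word's probability mass
            # over the English words in the same sentence.
            norm = sum(t[(e, f)] for e in en)
            for e in en:
                frac = t[(e, f)] / norm
                count[(e, f)] += frac
                total[e] += frac
    # M-step: renormalize to get new translation probabilities.
    for (e, f), c in count.items():
        t[(e, f)] = c / total[e]

# Which foreign word does the model think "house" translates to?
print(max(["das", "haus"], key=lambda f: t[("house", f)]))  # → haus
```

The interesting behavior is that "das" co-occurs with "house" just as often as "haus" does, yet EM correctly assigns "das" to "the" (which appears in two sentences with it) and leaves "haus" for "house" — the statistical effect that made training on parallel text work at scale.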
Rather than translating word by word, the key advance is to notice that small word groups often have distinctive translations. The Japanese "mizuiro" is literally the sequence of two words ("water color"), but this is not the correct meaning (nor does it mean a type of painting); rather, it indicates a light, sky-blue color. Such phrase-based MT was used by Franz Och in the development of Google Translate. This technology enabled the services we have today, which allow free and instant translation between many language pairs, but it still produces translations that are only just serviceable for determining the gist of a passage. However, very promising work continues to push MT forward. Much subsequent research has aimed to better exploit the structure of human language sentences (i.e., their syntax) in translation systems (7, 8), and researchers are actively building deeper meaning representations of language (9) to enable a new level of semantic MT.

Finally, just in the past year, we have seen the development of an extremely promising approach to MT through the use of deep learning-based sequence models. The central idea of deep learning is that if we can train a model with several representational levels to optimize a final objective, such as translation quality, then the model can itself learn intermediate representations that are useful for the task at hand. This idea has been explored particularly for neural network models in which information is stored in real-valued vectors, with the mapping between vectors consisting of a matrix multiplication followed by a nonlinearity, such as a sigmoid function that maps the output values of the matrix multiplication onto (−1, 1). Building large models of this form is much more practical with