REVIEW

Advances in natural language processing

Julia Hirschberg1* and Christopher D. Manning2,3

Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today's researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines
, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.

Over the past 20 years, computational linguistics has grown into both an exciting area of scientific research and a practical technology that is increasingly being incorporated into consumer products (for example, in applications such as Apple's Siri and Skype Translator). Four key factors enabled these developments: (i) a vast increase in computing power, (ii) the availability of very large amounts of linguistic data, (iii) the development of highly successful machine learning (ML) methods, and (iv) a much richer understanding of the structure of human language and its deployment in social contexts. In this Review, we describe some current application areas of interest in language research. These efforts illustrate computational approaches to big data, based on current cutting-edge methodologies that combine statistical analysis and ML with knowledge of language.

Computational linguistics, also known as natural language processing (NLP), is the subfield of computer science concerned with using computational techniques to learn, understand, and produce human language content. Computational linguistic systems can have multiple purposes: The goal can be aiding human-human communication, such as in machine translation (MT); aiding human-machine communication, such as with conversational agents; or benefiting both humans and machines by analyzing and learning from the enormous quantity of human language content that is now available online.

During the first several decades of work in computational linguistics, scientists attempted to write down for computers the vocabularies and rules of human languages. This proved a difficult task, owing to the variability, ambiguity, and context-dependent interpretation of human languages. For instance, a star can be either an astronomical object or a person, and "star" can be a noun or a verb. In another example, two interpretations are possible for the headline "Teacher strikes idle kids," depending on the noun, verb, and adjective assignments of the words in the sentence, as well as grammatical structure. Beginning in the 1980s, but more widely in the 1990s, NLP was transformed by researchers starting to build models over large quantities of empirical language data. Statistical or corpus ("body of words")-based NLP was one of the first notable successes of the
use of big data, long before the power of ML was more generally recognized or the term "big data" even introduced.

A central finding of this statistical approach to NLP has been that simple methods using words, part-of-speech (POS) sequences (such as whether a word is a noun, verb, or preposition), or simple templates can often achieve notable results when trained on large quantities of data. Many text and sentiment classifiers are still based solely on the different sets of words ("bag of words") that documents contain, without regard to sentence and discourse structure or meaning. Achieving improvements over these simple baselines can be quite difficult. Nevertheless, the best-performing systems now use sophisticated ML approaches and a rich understanding of linguistic structure. High-performance tools that identify syntactic and semantic information as well as information about discourse context are now available. One example is Stanford CoreNLP (1), which provides a standard NLP preprocessing pipeline that includes POS tagging (with tags such as noun, verb, and preposition); identification of named entities, such as people, places, and organizations; parsing of sentences into their grammatical structures; and identifying co-references between noun phrase mentions (Fig. 1).

Historically, two developments enabled the initial transformation of NLP into a big data field. The first was the early availability to researchers of linguistic data in digital form, particularly through the Linguistic Data Consortium (LDC) (2), established in 1992. Today, large amounts of digital text can easily be downloaded from the Web. Available as linguistically annotated data are large speech and text corpora annotated with POS tags, syntactic parses, semantic labels, annotations of named entities (persons, places, organizations), dialogue acts (statement, question, request), emotions and positive or negative sentiment, and discourse structure (topic or rhetorical structure).
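To make the bag-of-words classifiers described above concrete, here is a minimal sketch of a word-count sentiment classifier with add-one smoothing; the tiny training corpus and the test phrases are invented for illustration and are not drawn from any real system:

```python
import math
from collections import Counter

# Toy labeled corpus (invented for illustration).
train = [
    ("a wonderful and moving film", "pos"),
    ("absolutely wonderful acting", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot and terrible acting", "neg"),
]

# Count words per class, ignoring word order ("bag of words").
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())

vocab = {w for c in counts.values() for w in c}

def classify(text):
    # Score each class by summed log-probabilities with add-one smoothing,
    # then pick the class with the highest score.
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)
        scores[label] = sum(math.log((c[w] + 1) / total) for w in text.split())
    return max(scores, key=scores.get)

print(classify("wonderful film"))    # -> pos
print(classify("terrible and dull")) # -> neg
```

Note that the classifier never looks at word order or sentence structure, which is exactly why such baselines are hard to beat yet ultimately limited.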
Second, performance improvements in NLP were spurred on by shared task competitions. Originally, these competitions were largely funded and organized by the U.S. Department of Defense, but they were later organized by the research community itself, such as the CoNLL Shared Tasks (3). These tasks were a precursor of modern ML predictive modeling and analytics competitions, such as on Kaggle (4), in which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models.

A major limitation of NLP today is the fact that most NLP resources and systems are available only for high-resource languages (HRLs), such as English, French, Spanish, German, and Chinese. In contrast, many low-resource languages (LRLs), such as Bengali, Indonesian, Punjabi, Cebuano, and Swahili, spoken and written by millions of people, have no such resources or systems available. A future challenge for the language community is how to develop resources and tools for hundreds or thousands of languages, not just a few.

Machine translation

Proficiency in languages was traditionally a hallmark of a learned person. Although the social standing
of this human skill has declined in the modern age of science and machines, translation between human languages remains crucially important, and MT is perhaps the most substantial way in which computers could aid human-human communication. Moreover, the ability of computers to translate between human languages remains a consummate test of machine intelligence: Correct translation requires not only the ability to analyze and generate sentences in human languages but also a humanlike understanding of world knowledge and context, despite the ambiguities of languages. For example, the French word "bordel" straightforwardly means "brothel"; but if someone says "My room is un bordel," then a translating machine has to know enough to suspect that this person is probably not running a brothel in his or her room but rather is saying "My room is a complete mess."

Machine translation was one of the first nonnumeric applications of computers and was studied intensively starting in the late 1950s. However, the hand-built grammar-based systems of early decades achieved very limited success. The field was transformed in the early 1990s when researchers at IBM acquired a large quantity of English and French sentences that were translations of each other (known as parallel text), produced as the proceedings of the bilingual Canadian Parliament. These data allowed them to collect statistics of word translations and word sequences and to build a probabilistic model of MT (5).

1Department of Computer Science, Columbia University, New York, NY 10027, USA. 2Department of Linguistics, Stanford University, Stanford, CA 94305-2150, USA. 3Department of Computer Science, Stanford University, Stanford, CA 94305-9020, USA. *Corresponding author. E-mail: julia@cs.columbia.edu

SCIENCE sciencemag.org 17 JULY 2015 VOL 349 ISSUE 6245 261

Downloaded from www.sciencemag.org on July 16, 2015

Following a quiet period in the late 1990s, the new millennium brought the potent combination of ample online text, including considerable quantities of parallel text, much more abundant and inexpensive computing, and a new idea for building statistical phrase-based MT systems (6). Rather than translating word by word, the key advance is to notice that small word groups often have distinctive translations. The Japanese "mizu iro" is literally the sequence of two words ("water color"), but this is not the correct meaning (nor does it mean a type of painting); rather, it indicates a light, sky-blue color. Such phrase-based MT was used by Franz Och in the development of Google Translate.

This technology enabled the services we have today, which allow free and instant translation between many language pairs, but it still produces translations that are only just serviceable for determining the gist of a passage. However, very promising work continues to push MT forward. Much subsequent research has aimed to better exploit the structure of human language sentences (i.e., their syntax) in translation systems (7, 8), and researchers are actively building deeper meaning representations of language (9) to enable a new level of semantic MT.

Finally, just in the past year, we have seen the development of an extremely promising approach to MT through the use of deep-learning-based sequence models. The central idea of deep learning is that if we can train a model with several representational levels to optimize a final objective, such as translation quality, then the model can itself learn intermediate representations that
are useful for the task at hand. This idea has been explored particularly for neural network models in which information is stored in real-valued vectors, with the mapping between vectors consisting of a matrix multiplication followed by a nonlinearity, such as a sigmoid function that maps the output values of the matrix multiplication onto (−1, 1). Building large models of this form is much more practical with
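The vector-to-vector mapping just described (a matrix multiplication followed by a squashing nonlinearity) can be sketched in a few lines of plain Python; the weight matrix and input vector below are invented for illustration, and tanh serves as the sigmoid-shaped function that maps each output onto (−1, 1):

```python
import math

def layer(W, x):
    """One neural-network layer: matrix-vector product, then tanh.

    tanh squashes each output into (-1, 1), playing the role of the
    sigmoid-shaped nonlinearity described in the text.
    """
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

# Invented 2x3 weight matrix and 3-dimensional input vector.
W = [[0.5, -1.0, 0.25],
     [1.5,  0.0, -0.5]]
x = [1.0, 2.0, -1.0]

h = layer(W, x)
print(h)  # two values, each strictly between -1 and 1
```

Stacking several such layers, with the weights learned to optimize an objective such as translation quality, is the "several representational levels" idea of deep learning.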