Springer.docx

上传人:b****5 文档编号:11537309 上传时间:2023-03-19 格式:DOCX 页数:5 大小:19.70KB
下载 相关 举报
Springer.docx_第1页
第1页 / 共5页
Springer.docx_第2页
第2页 / 共5页
Springer.docx_第3页
第3页 / 共5页
Springer.docx_第4页
第4页 / 共5页
Springer.docx_第5页
第5页 / 共5页
亲,该文档总共5页,全部预览完了,如果喜欢就下载吧!
下载资源
资源描述

Springer.docx

《Springer.docx》由会员分享,可在线阅读,更多相关《Springer.docx(5页珍藏版)》请在冰豆网上搜索。

Springer.docx

Springer

Springer

NearestNeighbourSearchForXMLTreesLszlKovcs,TiborRpsi,ErikaBaksa-Vargakovacs@iit.uni-miskolc.hu,repasi@iit.uni-miskolc.hu,iitev@uni-miskolc.huAbstract:

Thereisasignificantresearcheffortonefficientcomputingofsimilaritiesbetweenobjectsofnontraditionaldatatypesasstrings,documents,soundtracksorpictures.ItisreasonabletousetheresultsoftheseeffortsintheproblemofXMLtreematching,too.AsanXMLdocumenthasatreestructureandthetreescanbetransformedintoalinearstructure,atreecanberegardedasaspecialkindofstring.Inourapproach,theresultsofstringcomparisonwillbeextendedtomeasurethesimilaritybetweenXMLtrees.Keywords:

Treeeditdistance,Nearestneighboursearching1IntroductiontoXMLandXPathTheExtensibleMarkupLanguage(XML)isasimple,veryflexibletextformatderivedfromSGML(StandardGeneralizedMarkupLanguage)[1].XMLprovidesawayofdescribingdatainarich,flexibleandefficientwaybymarkingupdatawithdescriptivetags.However,XMLdoesnotprovideawaytolocatespecificpiecesofstructureddatawithinagivendocument.Instead,theXMLPathLanguage(XPath)providesasyntaxforlocatingspecificpartsofanXMLdocumenteffectivelyandefficiently.ItgetsitsnamefromitsuseofapathnotationasinURLsfornavigatingthroughthehierarchicalstructureofanXMLdocument.InXPath,anXMLdocumentisviewedconceptuallyasatreeinwhicheachpartofthedocumentisrepresentedasanode[2].ThestandardXPathmodelassumesrandomaccesstoXMLdataandalsotheabilitytoqueryXMLdataatanytime.XPathisnotastructurallanguagelikeXML;rather,itisastring-basedlanguageofexpressions.Anexpressionisevaluatedtoyieldanobject,whichhasoneofthefollowingfourbasictypes:

node-set(anunorderedcollectionofnodeswithoutduplicates);boolean;numberorstring.XPathexpressionsoftenoccurinXMLattributes.Expressionevaluationoccurswithrespecttoacontext.Thecontextconsistsof:

anode(thecontextnode),apairofnon-zeropositiveintegers(thecontextpositionandthecontextsize),asetofvariablebindings,afunctionlibraryandthesetofnamespacedeclarationsinscopefortheexpression[3].Animportantkindofexpressionisalocationpath.Therearetwokindsoflocationpaths:

relativelocationpathsandabsolutelocationpaths.

(1)Arelativelocationpathconsistsofasequenceofoneormorelocationstepsseparatedby/,andeachstepinturnselectsasetofnodesrelativetothecontextnode.Alocationstephasthreeparts:

anaxis,whichspecifiesthetreerelationshipbetweenthenodesselectedbythelocationstepandthecontextnode;anodetest,whichspecifiesthenodetypeandexpanded-nameofthenodesselectedbythelocationstep;andzeroormorepredicates,whichusearbitraryexpressionstofurtherrefinethesetofnodesselectedbythelocationstep.Thenode-setselectedbythelocationstepisthenode-setthatresultsfromgeneratinganinitialnode-setfromtheaxisandnode-test,andthenfilteringthatnode-setbyeachofthepredicatesinturn.

(2)Anabsolutelocationpathconsistsof/optionallyfollowedbyarelativelocationpath.A/byitselfselectstherootnodeofthedocumentcontainingthecontextnode.Ifitisfollowedbyarelativelocationpath,thenthelocationpathselectsthesetofnodesthatwouldbeselectedbytherelativelocationpathrelativetotherootnodeofthedocumentcontainingthecontextnode[3].Thusalocationpathselectsasetofnodes.Theresultofevaluatinganexpressionthatisalocationpathisthenode-setcontainingthenodesselectedbythelocationpath.Theinitialnode-setconsistsofthenodeshavingtherelationshiptothecontextnodespecifiedbytheaxis,andhavingthenodetypeandexpanded-namespecifiedbythenodetest.Theinitialnode-setisfilteredbythefirstpredicatetogenerateanewnode-set;thisnewnode-setisthenfilteredusingthesecondpredicate,andsoon.Thefinalnode-setisthenode-setselectedbythelocationstep.Theaxisaffectshowtheexpressionineachpredicateisevaluatedandsothesemanticsofapredicateisdefinedwithrespecttoanaxis.Everyaxishasaprincipalnodetype.Fortheattributeaxis,theprincipalnodetypeisattribute;forthenamespaceaxis,theprincipalnodetypeisnamespace;forotheraxes,theprincipalnodetypeiselement[3].ThedisadvantageofXPathisthatitisbasedonpartialmatching.Itmeansthatsomepartsofthetreemustmatchthegivenpatternexactly,whileothersaretotallyignored.Theresultedmeasuredoesnottakeintoaccountthedeviationfromthepattern.However,thisdeviationvaluewouldcarryusefulinformationforusers.Ouraimistofindanefficientalgorithmthatmeasuresthedifferencebetweenthepatternandtheselectedsubtrees.ThereforeallowingsearchingfornearestneighbourapproximationsofthepatternXMLdocument.2ThedistancebetweentwoXMLtreesTheproblemofobjectmatchingisaninterdisciplinarareaasitcollectsmethodsfromthreecurrentdisciplines:

computerscience,statisticsandoperationsresearch.Thereisasignificantresearcheffortonefficientcomputingofsimilaritiesbetweenobjectsofnontraditionaldatatypesasstrings,documents,soundtracksorpictures.ItisreasonabletousetheresultsoftheseeffortsintheproblemofXMLtreematching,too.AsanXMLdocumenthasatreestructureandthetreescanbetransformedintoalinearstructure,atreecanberegardedasaspecialkindofstring.Inourapproach,theresultsofstringcomparisonwillbeextendedtomeasurethesimilaritybetweenXMLtrees.ThestringsandXMLtreescanbeconsideredasobjectslocatedinmetricspace.AspaceXismetricifthereisarealnon-negativefunctionoftwoobjectsd(A,B)defined.Thefunctionisknownasthedistancebetweenthetwopoints.Itischaracterizedbythefollowingproperties.ForAX,BX,andCX1.d(A,B)=0ifandonlyifA=B(thedistanceis0ifandonlyifthepointscoincide);2.d(A,B)=d(B,A)(thedistancefromAtoBisthesameasthedistancefromBtoA);3.d(A,B)+d(B,C)d(A,C)(thesumoftwosidesofatriangleisneverlessthanthethirdside).ThebestknownstringmatchingalgorithmsincomputersciencearetheRusselSoundexcodingandtheLevenshteineditdistancemethods.TheSoundexmethodisusedinrecordlinkagemethodsmainlyforblockingthemasterrecordset.TheLevenshteineditdistanceisdefinedasthesmallestnumberofinsertions,deletions,andsubstitutionsrequiredtochangeonestringortreeintoanother.TheefficiencyofthealgorithmtocomputethedistancebetweenstringsisO(mn),wheremandnarethelengthsofthestrings.TheLevenshteindistanceisdefinedforstringsofarbitrarylength.Itcountsthedifferencesbetweentwostrings,wherewewouldcountadifferencenotonlywhenstringshavedifferentcharactersbutalsowhenonehasacharacterwhereastheotherdoesnot.Ingeneral,thetransformationofstringscanbeperformedbetweenstringsofdifferentcharactersets.Thus,theinputparameterforstringdistanceproblemisatriple(A,B,c),whereAisthecharactersetoftheinputstring,Bisthecharactersetoftheoutputstring,andcisthecostfunction.Thetransformationmayusuallyconsistsofthefollowingelementaryoperations:

insertion,deletionandsubstitution.Thereareseveralvariantsforelementarycostfunctions,amongwhichthesimplestoneistheunit-costfunction.Inthiscasetheinsertion,thedeletionofasinglecharacterhasthevalueof1,andthesubstitutionoftwodifferentcharactersisequalto1,too.Forastrings,lets(i)standforitsithcharacter.Fortwocharactersaandb,definec(0,a)=1(insertion)c(a,0)=1(deletion)c(a,b)=0ifa=b;1,otherwise(substitution)Assumewearegiventwostringssandtoflengthnandm,respectively.Thelengthvalueisdenotedbyanupperindex:

sn,tm..Thedistance,i.e.theminimalcostoftransformationfromstotisusuallycalculatedwithusingadynamicprogrammingalgorithm.Thedistancevalueisdefinedwitharecursiveformula:

d(sn,tm)=min{c(sn,tm)+d(sn-1,tm-1)c(sn,0)+d(sn-1,tm)c(0,tm)+d(sn,tm-1)wheresidenotesthecharateratthepositioniinstrings.Basically,editdistancecanbecomputedinO(nm)orO(nm/log(n))timeifthecostweightsarerational[7].Inalotofapplications,amoreprecisedistancevaluecanbeachievedwiththenormalizededitdistancemeasure.Thismeasuretakesnotonlythenumberofelementarytransformationsintoaccount,butalsothelengthoftheinputandoutputstrings.Thenormalizededitdistanceistheratioofthetotalweighttothelengthofthesequence.ThecomplexityofthenormalizededitdistanceisO(nmlog(n))[7].Thedistancevaluefortreescanbedefinedsimilarlytothedistanceofstrings.Animportantassumptioninourinvestigationisthatthelabelsofthetreesareordered.ThisassumptioncorrespondstotheXMLstandard.Usually,likein[4],thefollowingelementaryerationsaredefinedfortreeobjects:

-relabel:

assignsanewnodenametotherootofthetree-insert:

insertinganewnodeintothechildrenoftherootnode-delete:

deletinganodefromthechildrenoftherootnode-inserttree:

insertingatreeundertherootnode-deletingtree:

deletingatreefromthechildrenofthenodeInourapproach,thetreeinsertionortreedeletionoperatorsarereplacedwithalistofsinglenodeoperations.Accordingto[7],thetreedistancevaluecanbecalculatedusingthefollowingrecursiveformula:

d(0,0)=0d(F,0)=d(F-v,0)+c(v,0)d(0,F)=d(0,F-v)+c(0,v)d(F1,F2)=min{d(F1-v,F2)+c(T(v),0)d(F1,F2-v)+c(0,T(v))d(F1-T

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 医药卫生 > 基础医学

copyright@ 2008-2022 冰豆网网站版权所有

经营许可证编号:鄂ICP备2022015515号-1