Data Types Generalization for Data Mining Algorithms
Abstract
With the growth of database applications, mining interesting information from huge databases has become a major concern, and a variety of mining algorithms have been proposed in recent years. The data processed in data mining may be obtained from many sources in which different data types are used. However, no algorithm can be applied to all applications because of the difficulty of fitting the data types the algorithm requires, so the selection of an appropriate mining algorithm depends not only on the goal of the application but also on data fittability. Transforming a non-fitting data type into the target one is therefore an important task in data mining, but it is often tedious or complex because many data types exist in the real world. Merging the similar data types of a selected mining algorithm into a generalized data type appears to be a good way to reduce the transformation complexity. In this work, the data types fittability problem for six kinds of widely used data mining techniques is discussed, and a data type generalization process comprising merging and transforming phases is proposed. In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm based solely on the goal of the application, without considering the data types.
1. Introduction
In recent years, the amount of data of all kinds has grown rapidly. Widely available, low-cost computer technology now makes it possible both to collect historical data and to institute on-line analysis of newly arriving data. Automated data generation and gathering lead to tremendous amounts of data stored in databases. Although we are flooded with data, we lack knowledge. Data mining is the automated discovery of non-trivial, previously unknown, and potentially useful knowledge embedded in databases. Different kinds of data mining methods and algorithms have been proposed, each of which has its own advantages and suitable application domains. However, it is difficult for users to choose an appropriate one by themselves, because the data provided cannot be used directly by data mining algorithms. Since most data mining algorithms can only be applied to specific data types, the types of data stored in databases restrict the choice of data mining methods. If certain kinds of knowledge are to be obtained using a particular data mining algorithm, data types transformation must be done first; this is what we call “the data types fittability problem” for data mining. At present, there is no tool that helps users perform this kind of data types transformation. In this paper, we survey and analyze the data types fittability problem for data mining algorithms, and then propose a “data types generalization process” to solve it for the attributes in relational databases.
The “data types generalization process”, comprising merging and transforming phases, is a procedure for transforming the data types of the attributes contained in relations (tables). In the merging phase, the original data types of the data sources to be mined are first merged into generalized ones. The transforming phase then converts the generalized data types into the target ones for the selected mining algorithm. Using the data type generalization process, the user can select an appropriate mining algorithm based solely on the goal of the application, without considering the data types.
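The two phases can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the source type names, the two generalized types ("numeric" and "categorical"), and the conversion rules are hypothetical and are not taken from the paper, which does not list its generalized types in this section.

```python
from typing import Dict, List

# Hypothetical mapping from source column types to generalized types.
# These names are illustrative assumptions, not the paper's own type list.
MERGE_MAP = {
    "int": "numeric", "float": "numeric", "decimal": "numeric",
    "char": "categorical", "varchar": "categorical", "bool": "categorical",
}


def merge_phase(schema: Dict[str, str]) -> Dict[str, str]:
    """Merging phase: map each attribute's original type to a generalized type."""
    return {attr: MERGE_MAP[src_type] for attr, src_type in schema.items()}


def transform_phase(values: List, generalized: str, target: str) -> List:
    """Transforming phase: convert one column from its generalized type to the target type."""
    if generalized == target:
        return list(values)
    if generalized == "categorical" and target == "numeric":
        # Assign an integer code to each distinct value (codes follow sorted order).
        codes = {v: i for i, v in enumerate(sorted(set(values), key=str))}
        return [codes[v] for v in values]
    if generalized == "numeric" and target == "categorical":
        # Split the value range at its midpoint into two labels.
        midpoint = (min(values) + max(values)) / 2
        return ["low" if v <= midpoint else "high" for v in values]
    raise ValueError(f"unsupported conversion: {generalized} -> {target}")


def generalize(table: Dict[str, List], schema: Dict[str, str], target: str) -> Dict[str, List]:
    """Run both phases so that every column ends up in the target type."""
    merged = merge_phase(schema)
    return {attr: transform_phase(col, merged[attr], target) for attr, col in table.items()}


table = {"age": [20, 40, 60], "city": ["Rome", "Oslo", "Rome"]}
schema = {"age": "int", "city": "varchar"}
numeric_table = generalize(table, schema, "numeric")
```

The point of the intermediate step is that conversion rules are written once per generalized type rather than once per source type, which is the reduction in transformation complexity the text describes.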
2. Related work
As mentioned above, because many data mining algorithms can only be applied to a restricted range of data types, users may need to perform data types transformation before the selected algorithm is executed. In this paper, we propose a general concept called the “data types generalization process”, which provides a procedure for this kind of data types transformation. Data types generalization can be seen as a pre-processing step of data mining. Of course, other pre-processing steps such as data selection, data cleaning, dimension (attribute) reduction, and missing data handling may also need to be performed before running the selected data mining algorithm. Taken together, the whole process of data mining is the so-called KDD (knowledge discovery in databases) process, as shown in Figure 1.
Figure 1: The KDD process and the role of data types generalization.
There is a major difference between the data types generalization process and other data mining pre-processes. Other pre-processes (such as missing value handling) are independent of the selected data mining method; that is, they can be done without knowing which data mining algorithm will be used. The data types generalization process, by contrast, clearly depends on the desired mining method: the purpose of the transformation is to make the specified data set suitable for the mining algorithm. To achieve this goal, we must therefore survey both the data types in databases and their relations with various data mining methods. The flow of solving a data mining problem with data transformation is illustrated in Figure 2.
Figure 2: Solving data mining problems with data types transformation.
Some researchers have proposed generalizing the data contained in attributes using "attribute-oriented induction", which offers two major advantages for the mining of large databases. First, it allows the raw data to be handled at higher conceptual levels. Generalization is performed with the use of "attribute concept hierarchies", where the leaves of a given attribute's concept hierarchy correspond to the attribute's values in the data (referred to as primitive-level data). Generalization of the training data is achieved by replacing primitive-level data with higher-level concepts.
In fact, data generalization using attribute concept hierarchies is a kind of data type transformation that reduces the number of distinct values contained in attributes. We first provide a typical description of the data types fittability problem and a data types generalization process to define and solve the data types transformation problem for attributes. Hence, data generalization using concept hierarchies is included in the process as one way of performing a specified data types transformation.
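Generalization through a concept hierarchy can be sketched in a few lines of Python. The hierarchy below (city, country, continent) and the function names are hypothetical examples chosen for illustration; they do not come from the paper.

```python
# Hypothetical concept hierarchy: each value maps to its parent concept.
# Leaves (cities) are the primitive-level data; higher levels are broader concepts.
PARENT = {
    "Taipei": "Taiwan", "Hsinchu": "Taiwan",
    "Tokyo": "Japan", "Osaka": "Japan",
    "Taiwan": "Asia", "Japan": "Asia",
}


def generalize_value(value: str, levels: int = 1) -> str:
    """Climb the hierarchy by the given number of levels, stopping at the root."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value


def generalize_column(values, levels: int = 1):
    """Replace every primitive-level value by a higher-level concept."""
    return [generalize_value(v, levels) for v in values]


cities = ["Taipei", "Hsinchu", "Tokyo", "Osaka", "Taipei"]
countries = generalize_column(cities)       # one level up
continents = generalize_column(cities, 2)   # two levels up
```

In this example the column starts with four distinct values, has two after one level of generalization, and one after two levels, which is the reduction in distinct values described above.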
Other related work surveys how to transform data into numerical values. Almost all data-driven algorithms use numeric inputs, and from a computer processing point of view, computations with numbers are easier and more efficient. Therefore, if the input values are non-numeric (e.g., text strings), in many cases they should be intelligently converted to meaningful numerical values. Numerical values can be seen as a data type, and transforming data into numerical values is a kind of data types transformation. These strategies are included in the data types generalization process for performing data types transformation.
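Two commonly used conversions from text labels to numbers are shown below as a small Python sketch. They are offered as general examples of such strategies, not as the specific ones surveyed in the related work; the function names and sample labels are assumptions made for illustration.

```python
def ordinal_encode(values, order):
    """Map labels to integers using a caller-supplied meaningful order."""
    rank = {label: i for i, label in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Map unordered labels to 0/1 indicator vectors, one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Ordered labels keep their order as numbers.
sizes = ordinal_encode(["low", "high", "medium"], order=["low", "medium", "high"])

# Unordered labels get indicator columns, so no artificial order is implied.
categories, rows = one_hot_encode(["red", "blue", "red"])
```

Ordinal codes suit labels with a natural order, whereas indicator vectors avoid suggesting that one unordered category is "greater" than another, which is one reading of what a "meaningful" numerical value requires.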
3. Analysis of the data types fittability problem
In recent years, owing to the explosion of information and the rapid growth of database applications, data mining techniques have become more and more important. For this reason, different kinds of data mining methods and algorithms have been proposed. However, it is difficult for users without prior knowledge of data mining to choose a suitable one by themselves. In fact, which data mining method should be applied depends on both the characteristics of the data to be mined and the kind of knowledge to be found through the data mining process. Hence, the types of data stored in databases play an important role in the data mining process and restrict the data mining methods that users can choose. Each kind of data mining method can only be applied to the particular databases suitable for it; this is what we call "the data types fittability problem" for data mining. To solve this problem, we need to investigate the relationships between the characteristics of the data to be mined and the various kinds of data mining techniques. With these relationships, we can clearly analyze the data types fittability problem and determine whether a data types transformation can be performed. Analyzing these relationships is therefore preparatory work for our data types generalization process, and it explains why the data types generalization process can solve the data fittability problem. We now present the analysis.
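As a rough aid to intuition, and separate from the paper's own analysis, a fittability check could look like the Python sketch below. The two type labels, the type inference rule, and the idea of passing an algorithm's accepted types as a set are all assumptions made for illustration; the sketch makes no claim about which types any real algorithm accepts.

```python
def infer_type(values):
    """Classify a column as 'numeric' or 'categorical' (an illustrative rule only)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


def non_fitting_attributes(table, accepted_types):
    """Return the attributes whose type the chosen algorithm does not accept."""
    found = {attr: infer_type(col) for attr, col in table.items()}
    return {attr: t for attr, t in found.items() if t not in accepted_types}


table = {"age": [20, 40], "city": ["Rome", "Oslo"]}

# A hypothetical algorithm that accepts only numeric inputs.
to_transform = non_fitting_attributes(table, accepted_types={"numeric"})
```

Attributes returned by such a check are the ones that would need a data types transformation before the selected algorithm could run; an empty result would mean the data already fits.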
3.1 Four kinds of data forms for data mining
Data mining techniques can usually be applied to four kinds of data forms: textual, temporal, transactional, and relational. Different data forms are used to store different kinds of data types. We describe each data form below:
(1) Textual data forms:
Textual data forms are used to represent texts or documents. Basically, this kind of data form can be seen as a very large collection of characters.
(2) Temporal data forms:
Time-series data is stored in temporal data forms. Data that varies with time (such as historical data) can be stored in the form of numerical time series.
(3) Transactional data forms:
For example, the past transactions of a market can be stored in transactional data forms. Each transaction records a list of the items bought in that transaction.
(4) Relational data forms:
Relational data forms are the most widely used data forms and can store different kinds of data. The basic units of relational data forms are relations (