Big Data Mining: Translated Foreign-Language Literature


V. H. Shastri, V. Sreeprada

Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103

Word count: English 2,291 words, 12,196 characters; Chinese 3,868 characters

Original foreign-language text:

A Study of Data Mining with Big Data

Abstract: Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify data sets whose size is typically larger than that of a typical database. Big data introduces unique computational and statistical challenges. Big Data is at present expanding in most domains of engineering and science. Data mining helps to extract useful data from these huge data sets, which are characterized by their volume, variability and velocity. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.

I. Introduction

Big Data refers to the enormous amount of structured and unstructured data that overflows an organization. If this data is properly used, it can lead to meaningful information. Big data includes a large volume of data that requires a lot of processing in real time. It provides room to discover new values, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data that can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.

Big Data has three V's as its characteristics: volume, velocity and variety. Volume means the amount of data generated every second; here the data is considered in a state of rest, and this is also known as the scale characteristic. Velocity is the speed with which the data is generated; data generated from social media is an example of such high-speed data. Variety means that different types of data can be included, such as audio, video or documents; the data can be numerals, images, time series, arrays, etc.

Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.

Big Data is expanding in all domains, including science and engineering fields such as the physical, biological and biomedical sciences.

II. BIG DATA with DATA MINING

Generally, big data refers to a collection of large volumes of data, and these data are generated from various sources such as the Internet, social media, business organizations, sensors, etc. We can extract useful information from them with the help of data mining. It is a technique for discovering patterns, as well as descriptive, understandable models, from large-scale data.

Volume is the size of the data, which can exceed terabytes and petabytes. The scale and growth in size make the data difficult to store and analyse using traditional tools. Big Data should be used to mine large amounts of data within a predefined period of time. Traditional database systems were designed to handle small amounts of data that were structured and consistent, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on.

Big Data mining refers to the activity of going through big data sets to look for relevant information. To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted when a node fails. It runs MapReduce for distributed data processing and works with both structured and unstructured data.
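The MapReduce idea mentioned above can be illustrated with a small single-process sketch. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It assumes the classic word-count task: the map step emits (word, 1) pairs, the shuffle step groups pairs by key, and the reduce step sums each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "data mining"]))
```

In a real cluster the map and reduce steps would run on different nodes and the shuffle would move data over the network; here everything runs locally so that the data flow is easy to follow.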

III. BIG DATA characteristics - HACE THEOREM

We have a large volume of heterogeneous data. There exist complex relationships among the data. We need to discover useful information from this voluminous data.

Let us imagine a scenario in which blind people are asked to draw an elephant. Based on the information each one collects, a blind person may think of the trunk as a wall, the leg as a tree, the body as a wall and the tail as a rope. The blind men can exchange information with each other.

Figure 1: Blind men and the giant elephant

Some of the characteristics include:

i. Vast data with heterogeneous and diverse sources:

One of the fundamental characteristics of big data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world, a single human being is represented by name, age, gender, family history, etc., while for X-ray and CT scans, images and videos are used. Heterogeneity refers to the different types of representations of the same individual, and diversity refers to the variety of features used to represent a single piece of information.

ii. Autonomous with distributed and de-centralized control:

The sources are autonomous, i.e., automatically generated; they generate information without any centralized control. We can compare this with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.

iii. Complex and evolving relationships:

As the size of the data becomes infinitely large, the relationships that exist within it also grow. In the early stages, when the data is small, there is no complexity in the relationships among the data. Data generated from social media and other sources have complex relationships.

IV. TOOLS: OPEN SOURCE REVOLUTION

Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to open source projects. In Big Data mining, there are many open source initiatives. The most popular of them are:

Apache Mahout:

Scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R:

Open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

MOA:

Stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.
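To make the idea of mining a stream in real time more concrete, the following is a minimal sketch of a test-then-train loop, in which each arriving record is first used to test the current model and then to update it. It does not use MOA or its API. The class `MajorityClassLearner` and the function `prequential_accuracy` are hypothetical names, and the learner is deliberately trivial (it only predicts the most frequent label seen so far) so that the one-pass structure stays visible.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Tuple


class MajorityClassLearner:
    """Incremental baseline: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self) -> Optional[Hashable]:
        """Return the current majority label, or None before any training."""
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, label: Hashable) -> None:
        """Update the model with one labelled record."""
        self.counts[label] += 1


def prequential_accuracy(stream: Iterable[Tuple[dict, Hashable]]) -> float:
    """Process each record once: test on it first, then train on it."""
    learner = MajorityClassLearner()
    correct = 0
    total = 0
    for _features, label in stream:
        if learner.predict() == label:  # test step
            correct += 1
        learner.learn(label)            # train step
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    records = [({}, "a"), ({}, "a"), ({}, "b"), ({}, "a")]
    print(prequential_accuracy(records))
```

Because every record is seen exactly once and then discarded, memory use stays constant no matter how long the stream runs, which is the property that distinguishes stream mining from batch mining.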

SAMOA:

A new, upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.

Vowpal Wabbit:

Open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature data sets. It can exceed the throughput of any single machine network interface when doing linear learning, via parallel learning.
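As a rough illustration of online linear learning over a very large feature space, the sketch below updates a linear model one example at a time and maps feature names into a fixed-size weight table by hashing. This is not Vowpal Wabbit code and makes no claim about how VW is implemented; the class `OnlineLinearLearner`, the helper `hashed_index`, the table size `NUM_BUCKETS` and the learning rate are all assumptions made for this example.

```python
import zlib
from typing import Dict, List

NUM_BUCKETS = 2 ** 10  # assumed size of the weight table (illustrative only)


def hashed_index(name: str) -> int:
    """Map a feature name to a slot in the fixed-size weight table."""
    return zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS


class OnlineLinearLearner:
    """Linear model trained by stochastic gradient descent on squared error."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * NUM_BUCKETS
        self.learning_rate = learning_rate

    def predict(self, features: Dict[str, float]) -> float:
        """Weighted sum of the feature values."""
        return sum(self.weights[hashed_index(n)] * v for n, v in features.items())

    def learn(self, features: Dict[str, float], target: float) -> float:
        """Take one gradient step on one example; return its squared error."""
        error = self.predict(features) - target
        for name, value in features.items():
            self.weights[hashed_index(name)] -= self.learning_rate * error * value
        return error * error


if __name__ == "__main__":
    learner = OnlineLinearLearner()
    for _ in range(200):
        learner.learn({"x": 1.0}, 2.0)
    print(round(learner.predict({"x": 1.0}), 6))
```

The weight table has a fixed size regardless of how many distinct feature names appear, so memory stays bounded; the price is that two different features may occasionally share a slot.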

V. DATA MINING for BIG DATA

Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining contains several algorithms, which fall into four categories. They are:

1. Association Rule

2. Clustering

3. Classification

4. Regression

Association is used to search for relationships between variables. It is applied in searching for frequently visited items. In short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure with a known structure. Regression finds a function to model the data.
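The association category can be made concrete with the two standard measures used to score a rule "antecedent implies consequent": support (the fraction of transactions containing all the items) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). The sketch below is a minimal illustration under these standard definitions; the function names and the shopping-basket data are hypothetical and not taken from the paper.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for transaction in transactions if items <= transaction)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Support of antecedent and consequent together, divided by antecedent support."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


if __name__ == "__main__":
    # Illustrative data only.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule is usually kept only if both measures pass chosen thresholds; algorithms for association rule mining differ mainly in how efficiently they enumerate the candidate itemsets to score.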

The different data mining algorithms are: