Big data - From Wikipedia

In information technology, big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."[5][6][7]

As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.[8][9] Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics,[10] connectomics, complex physics simulations,[11] and biological and environmental research.[12] The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.[13][14] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[15] as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created.[16]
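
The doubling figure above can be turned into a yearly growth rate with a little arithmetic. The following is a minimal sketch, not part of the original article: the function names are made up for illustration, and the exabyte is assumed to be the decimal SI unit (10^18 bytes).

```python
import math

EXABYTE = 10 ** 18  # decimal (SI) exabyte, in bytes


def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months, given its doubling time."""
    return 2 ** (12 / doubling_months)


def years_to_multiply(doubling_months: float, multiple: float) -> float:
    """Years needed to grow by `multiple`, given the doubling time in months."""
    return math.log2(multiple) * doubling_months / 12


# Figures quoted in the text: capacity doubles every 40 months,
# and 2.5 quintillion bytes were created per day as of 2012.
storage_factor = annual_growth_factor(40)
daily_bytes = 2.5e18

print(f"annual growth in storage capacity: {storage_factor - 1:.1%}")
print(f"years for a 1000-fold increase: {years_to_multiply(40, 1000):.1f}")
print(f"daily data volume: {daily_bytes / EXABYTE} EB")
```

Doubling every 40 months works out to roughly 23 percent growth per year, and 2.5 quintillion bytes per day is 2.5 exabytes per day.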

Big data is difficult to work with using relational databases and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers".[17] What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."[18]

Definition

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, a new platform of "big data" tools has arisen to handle sensemaking over large quantities of data, as in the Apache Hadoop Big Data Platform.

MIKE2.0, an open approach to Information Management, defines big data in terms of useful permutations, complexity, and difficulty to delete individual records.

In a 2001 research report[19] and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data.[20] In 2012, Gartner updated its definition as follows:

"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."[21]

Examples

Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Science and research

• When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016 it is anticipated to acquire that amount of data every five days.[5]

• In total, the four main detectors at the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010 (13,000 terabytes).[22]

• Decoding the human genome originally took 10 years to process; now it can be achieved in one week.[5]

• Computational social science: Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behavior and real-world economic indicators.[23][24][25] The authors of the study examined Google query logs made by Internet users in 45 different countries in 2010 and calculated the ratio of the volume of searches for the coming year ('2011') to the volume of searches for the previous year ('2009'), which they call the 'future orientation index'.[26] They compared the future orientation index to the per capita GDP of each country and found a strong tendency for countries in which Google users enquire more about the future to exhibit a higher GDP. The results hint that there may potentially be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.
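
The 'future orientation index' described above is a simple ratio, so it can be sketched in a few lines. This is an illustrative sketch only: the country names and numbers below are invented placeholders, not data from the study, and the use of a Pearson correlation is an assumption, since the text does not say which statistic the authors used.

```python
import math


def future_orientation_index(next_year_searches: float, prev_year_searches: float) -> float:
    """Ratio of searches for the coming year to searches for the previous year."""
    if prev_year_searches <= 0:
        raise ValueError("previous-year search volume must be positive")
    return next_year_searches / prev_year_searches


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rows: (country, searches for "2011", searches for "2009",
# per capita GDP in USD). All values are made up for illustration.
SAMPLE = [
    ("Country A", 1200, 1000, 42000),
    ("Country B", 900, 1000, 15000),
    ("Country C", 1100, 1000, 30000),
    ("Country D", 700, 1000, 5000),
]

indices = [future_orientation_index(row[1], row[2]) for row in SAMPLE]
gdps = [row[3] for row in SAMPLE]
r = pearson(indices, gdps)

for row, idx in zip(SAMPLE, indices):
    print(f"{row[0]}: future orientation index = {idx:.2f}")
print(f"correlation between index and per capita GDP: {r:.3f}")
```

An index above 1 means users in that country searched more for the coming year than for the previous one. The study's finding corresponds to a positive correlation between the index and per capita GDP.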

Government

• In 2012, the Obama administration announced the Big Data Research and Development Initiative, which explored how big data could be used to address important problems facing the government.[27] The initiative was composed of 84 different big data programs spread across six departments.[28]

• The United States Federal Government owns six of the ten most powerful supercomputers in the world.[29]

Private sector

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.[5]

• Facebook handles 40 billion photos from its user base.

• FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide.[30]

• The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.[31]

Market

"Big data" has increased the demand for information management specialists: Software AG, Oracle Corporation, IBM, Microsoft, SAP, and HP have spent more than $15 billion on software firms specializing only in data management and analytics. This industry on its own is worth more than $100 billion and growing at almost 10 percent a year, about twice as fast as the software business as a whole.[5]
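
To give a feel for what those growth rates imply, the following sketch compounds the quoted figures. It is illustrative only: it treats "more than $100 billion" as exactly 100, "almost 10 percent" as exactly 10 percent, and reads "about twice as fast" as the software business growing at half that rate. All three simplifications are assumptions, and the function names are hypothetical.

```python
import math


def project(value: float, annual_rate: float, years: int) -> float:
    """Value after compounding `annual_rate` once per year for `years` years."""
    return value * (1 + annual_rate) ** years


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


MARKET_USD_BN = 100.0           # assumption: lower bound quoted in the text
DATA_RATE = 0.10                # assumption: "almost 10 percent a year"
SOFTWARE_RATE = DATA_RATE / 2   # assumption: "about twice as fast"

for years in (1, 5, 10):
    size = project(MARKET_USD_BN, DATA_RATE, years)
    print(f"after {years:2d} years: about ${size:.0f} billion")

print(f"doubling time at {DATA_RATE:.0%}: {doubling_time_years(DATA_RATE):.1f} years")
print(f"doubling time at {SOFTWARE_RATE:.0%}: {doubling_time_years(SOFTWARE_RATE):.1f} years")
```

At a steady 10 percent a year, the market would double in a little over seven years. A real projection would need actual yearly figures rather than a constant rate.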

Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet.[5] Between 1990 and 2005, more than 1 billion people worldwide entered the mid
