I. Data Intensive Computing

Rob Allan

Computational Science and Engineering Department,

STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD

***** DRAFT *****

II. Abstract:

This report looks at a variety of requirements for and approaches to data intensive computing. It also considers novel ways to extract information from large datasets, particularly if this can be carried out "in situ", avoiding the need for lengthy file transfers.

© Science and Technology Facilities Council 2011. Neither the Council nor the Laboratory accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.

III. Introduction

An introduction to data intensive computing is given in Wikipedia, http://en.wikipedia.org/wiki/Data_Intensive_Computing. There, it is defined as a class of parallel computing applications which use a data parallel approach to processing large volumes of data, typically terabytes or petabytes in size. They devote most of their processing time to I/O and manipulation of data rather than computation. This is orthogonal to what we are familiar with from modelling and simulation on high performance computing systems.

Data intensive computing refers to capturing, managing, analysing and understanding data at volumes and rates that push the frontiers of current technologies. The challenge is to provide the hardware architectures and related software systems and techniques which are capable of transforming ultra-large data into valuable information and ultimately knowledge.

The National Science Foundation has noted that data intensive computing requires a "fundamentally different set of principles" to other computing approaches. They funded a programme to seek "increased understanding of the capabilities and limitations of data intensive computing". The key focus areas are:

∙ Enhancements to parallel programming to address the processing of data;

∙ Programming abstractions including models, languages, and algorithms which allow a natural expression of parallel data processing;

∙ Design of data intensive computing platforms to provide high levels of reliability, efficiency, availability and scalability;

∙ Identifying applications that can exploit this computing paradigm and determining how research should evolve to make best use of such applications.

Pacific Northwest National Labs, responding to the NSF call, defined data intensive computing as "capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies". They believe that addressing the rapidly growing data volumes and complexity requires "epochal advances in software, hardware and algorithm development" which can scale readily with the size of the data and provide effective and timely analysis and processing results.

There are several important common characteristics of data intensive computing systems that distinguish them from other forms of computing.

1. The principle of co-locating the data and applications or algorithms. To achieve high performance in data intensive computing, it is important to minimise movement of data. For this reason it is useful to execute applications on the nodes where the data resides. High bandwidth and low latency networking using technologies such as InfiniBand, which enables RDMA, allow data to be stored in a separate nearby repository and provide performance comparable to data on local disk.

2. The programming model used. Typical data intensive computing applications are expressed in terms of high level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications and movement of computation and data across the distributed computing cluster. The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations, incorporating new programming languages and shared libraries of common data manipulation algorithms such as sorting. A database is often used, as the optimisations are well known.

3. A focus on reliability and availability. Data intensive computing systems must be designed to be fault tolerant. This typically involves redundant copies of data on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures and selective rollback or re-computation of results. Database technologies are also used for this purpose.

4. The inherent scalability of the underlying hardware and software architecture. Data intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time critical performance requirements, by simply adding additional processing and storage nodes. The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications and distributed file system architecture.

Note that these criteria implicitly preclude in-memory solutions (but see more below).

A variety of system architectures have been implemented for data intensive computing and large scale data analysis applications, including parallel and distributed relational database management systems, which have been available to run on clusters for more than two decades. This assumes the data is structured and can be mapped onto a set of connected tables for manipulation. However, most data growth is in unstructured or semi-structured form, and new processing paradigms with more flexible models are needed. Emerging solutions include the MapReduce architecture pioneered by Google [8] and now available in an open source implementation called Apache Hadoop, see [42,2] (note Dryad/LINQ has been used on Windows platforms).
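To make the MapReduce idea mentioned above more concrete, the following is a minimal single-process sketch in Python. It is not the Hadoop or Google API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the word-count task are illustrative assumptions. In a real system the runtime would distribute the map and reduce calls across the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (word, 1) pairs for one input record (here, a line of text)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a runtime would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: two short records standing in for a large dataset.
records = ["big data big compute", "data moves slowly"]
counts = map_reduce(records)
print(counts)
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key, both phases could in principle be run in parallel close to the data.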

IV. Amdahl's Laws

Amdahl's Laws are as follows.

1. Amdahl's parallelism law:

If a computation has a serial part which takes time S to execute and a parallel component which takes time P/N to execute on N processors, then the speedup on N processors is (S+P)/(S+P/N). The maximum speedup is therefore (S+P)/S.

2. Amdahl's balanced system law:

A system needs one bit of I/O per second per instruction per second.

3. Amdahl's memory law:

alpha = 1: that is, the MB/MIPS ratio in a balanced system is 1. This corresponds to one byte of memory per instruction per second.

4. Amdahl's I/O law:

Programs do one I/O per 50,000 instructions.
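The parallelism law above can be written as a short Python sketch. The function names and the example values S = 1 and P = 9 are assumptions chosen for illustration only.

```python
def amdahl_speedup(serial_time, parallel_time, n_processors):
    """Speedup (S + P) / (S + P/N) for serial time S and parallel time P."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    total = serial_time + parallel_time
    return total / (serial_time + parallel_time / n_processors)


def amdahl_max_speedup(serial_time, parallel_time):
    """Limit of the speedup as N grows without bound: (S + P) / S."""
    return (serial_time + parallel_time) / serial_time


# Hypothetical workload: 1 time unit serial, 9 time units parallelisable.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(1.0, 9.0, n), 3))
print("limit:", amdahl_max_speedup(1.0, 9.0))
```

With these assumed values the speedup can never exceed 10, however many processors are added, because the serial part takes one tenth of the single-processor run time.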

Looking at Amdahl's balanced system law, we typically see the following.

∙ The Amdahl number for super-computers running high performance computing applications ranges around 10;

∙ Computation heavy data intensive simulations tend to use hardware with Amdahl number around 10;

∙ Data intensive analytical applications are in the range of 10 or even 1.

V. The GrayWulf

We should be aware that with multi-level caching, a much lower I/O to MIPS ratio coupled with a large enough memory can still provide satisfactory performance [35]. This is however only true if the problem fits in memory.

Typically, cache based systems with branch prediction and speculation hardware in the CPU are not useful for applications which require streaming data. Here the order of magnitude mismatch between disk, memory and CPU bandwidth hampers the processing. A more balanced system is required. Both FPGAs and GPUs have been investigated for this purpose. SSD devices might also be used to relieve the bandwidth bottleneck to disk, and they significantly reduce the seek time for unstructured data. Some data intensive benchmarks were created by Gokhale et al. [11] to test different architectures.

Special purpose computers based on Amdahl's balanced system law are built to process large volumes of data that will not fit entirely in memory. The original machine designed by Jim Gray of Microsoft is now known as the GrayWulf [35]. This is typically a good solution for university groups who host their own large datasets (it is difficult to move over 1 TB of data over the internet) but who could maintain a reasonably compact commodity cluster solution.

The goals of the GrayWulf project were as follows.

∙ support analysis of petabyte datasets;

∙ provide very high bandwidth sequential access to data;

∙ support a variety of access patterns (see below);

∙ support a set of simple tools for database design;

∙ support a set of tools for fast ingest of data.

It is found, for instance by Gokhale et al. [11], that using special purpose hardware can remove the CPU bottleneck, so that ultimately the system becomes limited by bandwidth to disk.

Some examples of machines based on this concept are given below.

VI. JHU-1

The original GrayWulf at Johns Hopkins University has 416 CPUs capable of 1,107 GI/s with a total memory of 1,152 GB. Disk I/O performance was a total of 70 GB/s from just over 1 PB of storage. This results in an Amdahl memory number of 1.04 and I/O number of 0.506. The modular structure of this solution is shown in the following table.

Wide area interconnect: 10 Gb/s FC

| Tier | No. servers | CPU (each) | Memory (each) | Disk (each) | Interconnect |
|---|---|---|---|---|---|
| Tier 1 | 2 | 16 core, Dell R900 | 128 GB | 11.25 TB, 1x MD1000 (as below) | InfiniBand, QLogic 20 Gb/s |
| Tier 2 | 4 | 16 core, Dell R900 | 64 GB | 33.75 TB, 3x MD1000 (as below) | InfiniBand, QLogic 20 Gb/s |
| Tier 3 | 40 | 8 core, Dell 2950 2.66 GHz | 16 GB | 22.5 TB, 2x MD1000 SAS, total 30 x 750 GB 7,200 RPM SATA discs | InfiniBand, QLogic 20 Gb/s |

Further details are given in [35].

This is basically a distributed SQL Server cloud. Applications are run on the Tier-1 servers, which are also used for remote access and data transfer.
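As a cross-check, the totals and Amdahl numbers quoted for JHU-1 can be recomputed from the per-tier figures with a short Python sketch. Two assumptions here are not stated in the text: that every core runs at the 2.66 GHz clock listed for Tier 3, and that each core retires one instruction per cycle. Under those assumptions the sketch gives values consistent with the quoted 1,107 GI/s, 1.04 and 0.506.

```python
# (servers, cores each, memory GB each, disk TB each), from the JHU-1 table.
TIERS = {
    "Tier 1": (2, 16, 128, 11.25),
    "Tier 2": (4, 16, 64, 33.75),
    "Tier 3": (40, 8, 16, 22.5),
}
CLOCK_GHZ = 2.66        # assumption: all cores run at the Tier 3 clock rate
IO_GBYTES_PER_S = 70.0  # aggregate disk I/O quoted in the text

cores = sum(n * c for n, c, _, _ in TIERS.values())
memory_gb = sum(n * m for n, _, m, _ in TIERS.values())
disk_tb = sum(n * d for n, _, _, d in TIERS.values())

# Assumption: one instruction per core per clock cycle.
giga_instr_per_s = cores * CLOCK_GHZ

# Memory law: bytes of memory per instruction per second.
memory_number = memory_gb / giga_instr_per_s
# Balanced system law: bits of I/O per second per instruction per second.
io_number = IO_GBYTES_PER_S * 8 / giga_instr_per_s

print(cores, memory_gb, disk_tb)
print(round(giga_instr_per_s), round(memory_number, 2), round(io_number, 3))
```

The disk total comes to 1,057.5 TB, which matches the "just over 1 PB" of storage stated above.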

VII. JHU-2

A second example of a more energy efficient solution has been built using Atom/Ion technology.

| Tier | No. servers | CPU (each) | Memory (each) | Disk (each) | Interconnect |
|---|---|---|---|---|---|
| Tier 1 | 36 | 2x Atom + 16x GPU, Zotac N330 Intel Atom/nVidia Ion 1.6 GHz | 4 GB | 1x 120 GB SSD | |
