I. Data Intensive Computing
Rob Allan
Computational Science and Engineering Department,
STFC Daresbury Laboratory, Daresbury, Warrington WA4 4AD
***** DRAFT *****
II. Abstract:
This report looks at a variety of requirements for and approaches to data intensive computing. It also considers novel ways to extract information from large data sets, particularly if this can be carried out "in situ", avoiding the need for lengthy file transfers.
© Science and Technology Facilities Council 2011. Neither the Council nor the Laboratory accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.
III. Introduction
An introduction to data intensive computing is given in Wikipedia http://en.wikipedia.org/wiki/Data_Intensive_Computing. There, it is defined as a class of parallel computing applications which use a data parallel approach to processing large volumes of data, typically terabytes or petabytes in size. They devote most of their processing time to I/O and manipulation of data rather than computation. This is orthogonal to what we are familiar with from modelling and simulation on high performance computing systems.
Data intensive computing refers to capturing, managing, analysing and understanding data at volumes and rates that push the frontiers of current technologies. The challenge is to provide the hardware architectures and related software systems and techniques which are capable of transforming ultra-large data into valuable information and ultimately knowledge.
The National Science Foundation has noted that data intensive computing requires a "fundamentally different set of principles" to other computing approaches. They funded a programme to seek "increased understanding of the capabilities and limitations of data intensive computing". The key focus areas are:
∙ Enhancements to parallel programming to address the processing of data;
∙ Programming abstractions including models, languages, and algorithms which allow a natural expression of parallel data processing;
∙ Design of data intensive computing platforms to provide high levels of reliability, efficiency, availability and scalability;
∙ Identifying applications that can exploit this computing paradigm and determining how research should evolve to make best use of such applications.
Pacific Northwest National Labs, responding to the NSF call, defined data intensive computing as "capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies". They believe that addressing the rapidly growing data volumes and complexity requires "epochal advances in software, hardware and algorithm development" which can scale readily with the size of the data and provide effective and timely analysis and processing results.
There are several important common characteristics of data intensive computing systems that distinguish them from other forms of computing.
1. The principle of co-locating the data and applications or algorithms. To achieve high performance in data intensive computing, it is important to minimise movement of data. For this reason it is useful to execute applications on the nodes where the data resides. High bandwidth, low latency networking technologies such as InfiniBand, which enables RDMA, allow data to be stored in a separate nearby repository while providing performance comparable to data on local disk.
2. The programming model used. Typical data intensive computing applications are expressed in terms of high level operations on data, and the runtime system transparently controls the scheduling, execution, load balancing, communications and movement of computation and data across the distributed computing cluster. The programming abstraction and language tools allow the processing to be expressed in terms of data flows and transformations, incorporating new programming languages and shared libraries of common data manipulation algorithms such as sorting. A database is often used because its optimisations are well known.
3. A focus on reliability and availability. Data intensive computing systems must be designed to be fault tolerant. This typically involves redundant copies of data on disk, storage of intermediate processing results on disk, automatic detection of node or processing failures and selective rollback or re-computation of results. Database technologies are also used for this purpose.
4. The inherent scalability of the underlying hardware and software architecture. Data intensive computing systems can typically be scaled in a linear fashion to accommodate virtually any amount of data, or to meet time critical performance requirements, simply by adding processing and storage nodes. The number of nodes and processing tasks assigned for a specific application can be variable or fixed depending on the hardware, software, communications and distributed file system architecture.
Note that these criteria implicitly preclude in-memory solutions (but see more below).
A variety of system architectures have been implemented for data intensive computing and large scale data analysis applications, including parallel and distributed relational database management systems which have been available to run on clusters for more than two decades. This assumes the data is structured and can be mapped onto a set of connected tables for manipulation. However, most data growth is in unstructured or semi-structured form, and new processing paradigms with more flexible models are needed. Emerging solutions include the MapReduce architecture pioneered by Google [8] and now available in an open source implementation called Apache Hadoop, see [42,2] (note Dryad/LINQ has been used on Windows platforms).
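To make the MapReduce idea concrete, the following is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the example task. It is an illustration only and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical. In a real deployment the runtime would distribute the map and reduce calls across the nodes holding the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit (key, value) pairs for one input record (one line of text)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as a runtime would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence on an in-memory list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    lines = ["big data needs big storage", "data moves slowly"]
    print(word_count(lines))
```

Because each map call depends only on its own record and each reduce call only on its own key, both phases can in principle run in parallel without shared state, which is what allows the pattern to scale by adding nodes.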
IV. Amdahl's Laws
Amdahl's Laws are as follows.
1. Amdahl's parallelism law:
If a computation has a serial part which takes time S to execute and a parallel component which takes time P/N to execute on N processors, then the speedup on N processors is (S+P)/(S+P/N). The maximum speedup is therefore (S+P)/S.
2. Amdahl's balanced system law:
A system needs one bit of I/O per second per instruction per second.
3. Amdahl's memory law:
alpha = 1: that is, the MB/MIPS ratio in a balanced system is 1, i.e. one byte of memory per instruction per second.
4. Amdahl's I/O law:
Programs do one I/O per 50,000 instructions.
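The parallelism law can be evaluated directly. The sketch below implements the speedup formula (S+P)/(S+P/N) and its limit (S+P)/S exactly as stated above; the function names and the example values S = 1 and P = 9 are chosen here for illustration only.

```python
def amdahl_speedup(serial_time, parallel_time, n_processors):
    """Speedup on N processors: (S + P) / (S + P / N)."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    total = serial_time + parallel_time
    return total / (serial_time + parallel_time / n_processors)


def max_speedup(serial_time, parallel_time):
    """Limit of the speedup as N grows without bound: (S + P) / S."""
    if serial_time == 0:
        return float("inf")
    return (serial_time + parallel_time) / serial_time


if __name__ == "__main__":
    # Illustrative values: 10% of the single-processor run time is serial.
    S, P = 1.0, 9.0
    for n in (1, 10, 100, 1000):
        print(f"N = {n:5d}  speedup = {amdahl_speedup(S, P, n):.3f}")
    print(f"maximum speedup = {max_speedup(S, P):.1f}")
```

With these values the speedup can never exceed 10, however many processors are added, because the serial part S always remains.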
Looking at Amdahl's balanced system law, we typically see the following.
∙ The Amdahl number for super-computers running high performance computing applications ranges around 10;
∙ Computation heavy data intensive simulations tend to use hardware with Amdahl number around 10;
∙ Data intensive analytical applications are in the range of 10 or even 1.
V. The GrayWulf
We should be aware that with multi-level caching, a much lower I/O to MIPS ratio coupled with a large enough memory can still provide satisfactory performance [35]. This is however only true if the problem fits in memory.
Typically, cache based systems with branch prediction and speculation hardware in the CPU are not useful for applications which require streaming data. Here the order of magnitude mismatch between disk, memory and CPU bandwidth hampers the processing. A more balanced system is required. Both FPGAs and GPUs have been investigated for this purpose. SSD devices might also be used to relieve the disk bandwidth bottleneck, and they significantly reduce the seek time for unstructured data. Some data intensive benchmarks were created to test different architectures by Gokhale et al. [11].
Special purpose computers based on Amdahl's balanced system law are built to process large volumes of data that will not fit entirely in memory. The original machine designed by Jim Gray of Microsoft is now known as the GrayWulf [35]. This is typically a good solution for university groups who host their own large data sets (it is difficult to move over 1 TB of data over the internet) but who could maintain a reasonably compact commodity cluster solution.
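The remark about moving 1 TB over the internet can be illustrated with a little arithmetic. The sketch below estimates the ideal transfer time for a given data size and link speed. The link speeds used are assumptions chosen for illustration, not measurements from this report, and real transfers would be slower because of protocol overhead and contention.

```python
def transfer_time_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Ideal time in hours to move size_bytes over a link.

    efficiency is the assumed fraction of the nominal link speed actually
    achieved (1.0 means the full nominal rate).
    """
    if link_bits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and 0 < efficiency <= 1")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    one_terabyte = 10**12  # bytes, using decimal units
    # Hypothetical link speeds, for illustration only.
    for label, rate in [("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)]:
        hours = transfer_time_hours(one_terabyte, rate)
        print(f"1 TB over {label}: {hours:.2f} hours")
```

Even at the full nominal rate, 1 TB takes about 22 hours over a 100 Mb/s link, which is why hosting the analysis next to the data is attractive.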
The goals of the GrayWulf project were as follows.
∙ support analysis of petabyte data sets;
∙ provide very high bandwidth sequential access to data;
∙ support a variety of access patterns (see below);
∙ support a set of simple tools for database design;
∙ support a set of tools for fast ingest of data.
It is found, for instance by Gokhale et al. [11], that using special purpose hardware can remove the CPU bottleneck, and ultimately the system becomes limited by bandwidth to disk.
Some examples of machines based on this concept are given below.
VI. JHU-1
The original GrayWulf at Johns Hopkins University has 416 CPUs capable of 1,107 GI/s with a total memory of 1,152 GB. Disk I/O performance was a total of 70 GB/s from just over 1 PB of storage. This results in an Amdahl memory number of 1.04 and I/O number of 0.506. The modular structure of this solution is shown in the following table.
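Tier   | No. servers | CPU (each)                   | Memory (each) | Disk (each)                                                              | Interconnect
-------|-------------|------------------------------|---------------|--------------------------------------------------------------------------|----------------------------
Tier 1 | 2           | 16 core (Dell R900)          | 128 GB        | 11.25 TB (1x MD1000, as below)                                           | InfiniBand (QLogic 20 Gb/s)
Tier 2 | 4           | 16 core (Dell R900)          | 64 GB         | 33.75 TB (3x MD1000, as below)                                           | InfiniBand (QLogic 20 Gb/s)
Tier 3 | 40          | 8 core (Dell 2950, 2.66 GHz) | 16 GB         | 22.5 TB (2x MD1000 SAS, total 30x 750 GB 7,200 RPM SATA discs)            | InfiniBand (QLogic 20 Gb/s)

Wide area interconnect: 10 Gb/s FC.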
Further details are given in [35].
This is basically a distributed SQL server cloud. Applications are run on the Tier-1 servers, which are also used for remote access and data transfer.
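The totals and Amdahl numbers quoted for JHU-1 can be checked from the per-tier figures. The sketch below sums the tiers and computes the memory number as GB of memory per GI/s and the I/O number as bits of I/O per second per instruction per second. The variable names are chosen here for illustration; the 1,107 GI/s and 70 GB/s figures are taken from the text rather than derived.

```python
# (servers, cores each, memory in GB each, disk in TB each), from the JHU-1 tiers
TIERS = [
    (2, 16, 128, 11.25),   # Tier 1
    (4, 16, 64, 33.75),    # Tier 2
    (40, 8, 16, 22.5),     # Tier 3
]

total_cores = sum(n * cores for n, cores, _, _ in TIERS)
total_memory_gb = sum(n * mem for n, _, mem, _ in TIERS)
total_disk_tb = sum(n * disk for n, _, _, disk in TIERS)

# Figures quoted in the text (not derived here).
GIGA_INSTRUCTIONS_PER_SECOND = 1107.0
DISK_IO_GB_PER_SECOND = 70.0

# Memory law: bytes of memory per instruction per second (GB per GI/s).
amdahl_memory_number = total_memory_gb / GIGA_INSTRUCTIONS_PER_SECOND

# Balanced system law: bits of I/O per second per instruction per second.
amdahl_io_number = DISK_IO_GB_PER_SECOND * 8 / GIGA_INSTRUCTIONS_PER_SECOND

if __name__ == "__main__":
    print(f"cores: {total_cores}, memory: {total_memory_gb} GB, "
          f"disk: {total_disk_tb} TB")
    print(f"Amdahl memory number: {amdahl_memory_number:.2f}")
    print(f"Amdahl I/O number:    {amdahl_io_number:.3f}")
```

The sums reproduce the 416 CPUs, 1,152 GB of memory and just over 1 PB of disk quoted above, and the two ratios round to 1.04 and 0.506.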
VII. JHU-2
A second example of a more energy efficient solution has been built using Atom/Ion technology.
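Tier   | No. servers | CPU (each)                                                         | Memory (each) | Disk (each)   | Interconnect
-------|-------------|--------------------------------------------------------------------|---------------|---------------|-------------
Tier 1 | 36          | 2x Atom + 16x GPU (Zotac N330, Intel Atom/nVidia Ion, 1.6 GHz)       | 4 GB          | 1x 120 GB SSD |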