scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach using hybrid CUDA and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster which consists of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run on the processor cores in the same computational node.
Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming
I. INTRODUCTION
Nowadays, NVIDIA's CUDA [1, 16] is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions – a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA [1, 16] to achieve dramatic speedups on production and research codes.
NVIDIA builds its CUDA-capable chips with hundreds of cores, and here we try to use NVIDIA's hardware as computing equipment for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications, but also keeps the application code portable. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP and MPI [3] programming, which partitions loop iterations according to the performance weighting of multi-core [4] nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.
In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes, only slave nodes do computation work. Empirical results show that in heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.
The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application program are presented. Concluding remarks and future work are given in Section 5.
II. BACKGROUND REVIEW
A. History of GPU and CUDA
In the past, we had to use more than one computer, with multiple CPUs, for parallel computing. As chip history shows, early display chips did not need much computation; gradually, the demands of games and graphics, and eventually of 3D rendering, led first to 3D accelerator cards, then to dedicated display chips, and finally to a chip similar in role to the CPU: the GPU. We know that GPU computing can be used to get the answers we want, but why do we choose to use the GPU?
This slide shows the current CPU and GPU comparison. First, we can see that current CPUs have at most eight cores, while the GPU has grown to 260 cores. From the core count alone, we know that many parallel programs suit GPU computing: although each GPU core runs at a relatively low frequency, we believe a large amount of parallel computing power can outweigh a single faster core. Next, comparing the GPU's access to its on-board memory with the CPU's access to main memory, we find that GPU memory access is about 10 times faster than the CPU's, a difference of roughly 90 GB/s. This is quite an alarming gap, and it means that computations that must access large amounts of data can benefit greatly from the GPU.
The CPU uses advanced flow control, such as branch prediction and delayed branches, together with a large cache, to reduce memory access latency, whereas the GPU's cache is relatively small and its flow control is simple. The GPU's method is instead to use a large number of threads to cover up the memory latency problem: suppose a GPU memory access takes 5 seconds; if 100 threads access memory simultaneously, the total time is still 5 seconds. By contrast, suppose a CPU memory access takes 0.1 seconds; if 100 threads access memory one after another, the time is 10 seconds. Therefore, GPU parallel processing can be used to hide latency even when a single memory access is slower than on the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.
Therefore, given the GPU's advantage in arithmetic logic, we try to use NVIDIA's many available cores to help us with large amounts of computation, writing programs for those cores with the parallel programming API that NVIDIA Corporation provides.
Must we use the programming interface provided by NVIDIA Corporation to run GPU computing?
Not really. We can use NVIDIA's CUDA, ATI's CTM, or Apple's OpenCL (Open Computing Language). CUDA was developed earliest and has the most users at this stage, but NVIDIA's CUDA only supports its own graphics cards; at this stage, almost all graphics cards used for GPU computation are from NVIDIA. ATI has also developed its own language, CTM, and Apple has proposed OpenCL (Open Computing Language), which has been supported by both NVIDIA and ATI; ATI has since given up CTM in favor of OpenCL. Because of their graphics heritage, GPUs usually supported only single-precision floating-point operations, and in science, precision is a very important indicator; therefore, the computing graphics cards introduced this year have come to support double-precision floating-point operations.
B. CUDA Programming
CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing [2] architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units, or GPUs, that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT [17] and CUBLAS [18]. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces, including OpenGL [9] and DirectCompute. CUDA's parallel programming model is designed to overcome the challenge of scalable parallel programming while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization – that are simply exposed to the programmer as a minimal set of language extensions.
These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores: a compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.
C. CUDA Processing Flow
In the following illustration, the CUDA processing flow is described as in Figure 3 [16]. The first step: copy data from main memory to GPU memory; second: the CPU instructs the GPU to process; third: the GPU executes in parallel in each core; finally: copy the result from GPU memory to main memory.
III. SYSTEM HARDWARE
A. Tesla C1060 GPU Computing Processor
The NVIDIA® Tesla™ C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk-side that is much faster and more energy-efficient than a shared cluster in the data center. The NVIDIA® Tesla™ C1060 computing processor board, which consists of 240 cores, is a PCI Express 2.0 form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 [15] is capable of 933 GFLOPs/s [13] of processing performance and comes standard with 4 GB of GDDR3 memory at 102 GB/s bandwidth.
A computer system with an available PCI Express x16 slot is required for the Tesla C1060. For the best system bandwidth between the host processor and the Tesla C1060, it is recommended (but no