
Translated Foreign Article: Hybrid Parallel Programming on GPU Clusters

1,700 words; 9,100 English characters; 3,100 Chinese characters

Source:

Yang C. T., Huang C. L., Lin C. F., et al. Hybrid Parallel Programming on GPU Clusters[C]// International Symposium on Parallel and Distributed Processing with Applications. IEEE Computer Society, 2010: 142-147.

Appendix 1

Hybrid Parallel Programming on GPU Clusters

Abstract—Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach using hybrid CUDA and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.

Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming

I. INTRODUCTION

Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes.

NVIDIA builds its CUDA-capable chips from hundreds of cores, and here we will try to use the computing devices NVIDIA provides for parallel computing. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications, but also keeps the application code portable. We propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the performance weighting of the multi-core nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node.

In this paper, we propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes, only slave nodes do computation work. Empirical results show that in both heterogeneous and homogeneous cluster environments, the proposed approach improved performance over all previous schemes.

The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application program are presented. Concluding remarks and future work are given in Section 5.

II. BACKGROUND REVIEW

A. History of GPU and CUDA

In the past, we had to use more than one computer, with multiple CPUs, for parallel computing. As the history of chips shows, early workloads did not need much computation; then games and graphics gradually created a demand for 3D, the 3D accelerator card appeared, display processing moved onto separate chips, and those chips eventually grew into something similar to a CPU in its own right: the GPU. We know that GPU computing can be used to get the answers we want, but why do we choose to use the GPU?

A comparison of current CPUs and GPUs shows the following. First, a CPU today has at most eight cores, but a GPU has grown to 240 cores; from the core count alone, we can see that many parallel programs suit GPU computing, and despite the relatively low frequency of each core, we believe a large amount of parallel computing power can outweigh a single fast core. Next, comparing the GPU's access to its on-board memory with the CPU's access to main memory, we find that the GPU accesses its memory roughly ten times faster than the CPU, a gap of a whole 90 GB/s. This is quite an alarming gap, and it means that computations that must access large amounts of data can be improved greatly by the GPU.

CPUs use advanced flow control, such as branch prediction and delayed branching, and a large cache to reduce memory access latency, while the GPU's cache is relatively small and its flow control simple. The GPU's method is instead to use a large number of threads to cover up the problem of memory latency. That is, suppose a GPU memory access takes 5 seconds: if 100 threads access memory simultaneously, the total time is still 5 seconds. But suppose a CPU memory access takes 0.1 seconds: if 100 accesses happen one after another, the time is 10 seconds. Therefore, GPU parallel processing can be used to hide latency even when a single access is slower than on the CPU. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.

Therefore, given the GPU's advantage in arithmetic logic, we try to use the many cores NVIDIA makes available to help us with large amounts of computation: we supply programs to those many cores, and NVIDIA Corporation provides the API for parallel programming to carry out the large number of operations.

Must we use the form of GPU computing provided by NVIDIA Corporation to run it?

Not really. We can use NVIDIA's CUDA, ATI's CTM, or OpenCL (Open Computing Language), proposed by Apple. CUDA was one of the earliest to be developed and has the most users at this stage, but NVIDIA's CUDA supports only its own graphics cards; from this we can see that, at this stage, almost all graphics cards used for GPU computing are from NVIDIA. ATI also developed its own language, CTM, and Apple proposed OpenCL (Open Computing Language), which has been supported by both NVIDIA and ATI; ATI has since given up CTM in favor of the other language. Because of their graphics heritage, earlier GPUs usually supported only single-precision floating-point operations, and in science, precision is a very important indicator; therefore, the computing graphics cards introduced this year have come to support double-precision floating-point operations.

B. CUDA Programming

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units, or GPUs, that is accessible to software developers through industry-standard programming languages. The CUDA software stack is composed of several layers, as illustrated in Figure 2: a hardware driver, an application programming interface (API) and its runtime, and two higher-level mathematical libraries of common usage, CUFFT and CUBLAS. The hardware has been designed to support lightweight driver and runtime layers, resulting in high performance. The CUDA architecture supports a range of computational interfaces, including OpenGL and DirectCompute. CUDA's parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C. At its core are three key abstractions — a hierarchy of thread groups, shared memories, and barrier synchronization — that are simply exposed to the programmer as a minimal set of language extensions.

C. CUDA Processing Flow

In the following illustration, the CUDA processing flow is described as in Figure 3. The first step: copy data from main memory to GPU memory; second: the CPU instructs the GPU to process; third: the GPU executes in parallel in each core; finally: copy the result from GPU memory back to main memory.
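The four steps above can be sketched in CUDA C. The kernel, the array size, and the launch configuration here are our own illustrative choices, not details from the paper; the sketch assumes a CUDA-capable device and the CUDA toolkit:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* A trivial kernel: each thread doubles one element. */
__global__ void double_elements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; i++)
        host[i] = (float)i;

    /* Step 1: copy data from main memory to GPU memory. */
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Steps 2-3: the CPU launches the kernel; the GPU runs it in
     * parallel, one thread per element across the cores. */
    double_elements<<<(n + 255) / 256, 256>>>(dev, n);

    /* Step 4: copy the result from GPU memory back to main memory. */
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[2] = %.1f\n", host[2]);
    return 0;
}
```

Note that the two `cudaMemcpy` calls correspond exactly to the first and last steps of Figure 3's flow; the kernel launch covers the middle two.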

III. SYSTEM HARDWARE

A. Tesla C1060 GPU Computing Processor

The NVIDIA® Tesla™ C1060 transforms a workstation into a high-performance computer that outperforms a small cluster. This gives technical professionals a dedicated computing resource at their desk side that is much faster and more energy-efficient than a shared cluster in the data center. The NVIDIA® Tesla™ C1060 computing processor board, which consists of 240 cores, is a PCI Express 2.0 form factor computing add-in card based on the NVIDIA Tesla T10 graphics processing unit (GPU). This board is targeted as a high-performance computing (HPC) solution for PCI Express systems. The Tesla C1060 is capable of 933 GFLOPs/s of processing performance and comes standard with 4 GB of GDDR3 memory at 102 GB/s bandwidth.

A computer system with an available PCI Express ×16 slot is required for the Tesla C1060. For the best system bandwidth between the host processor and the Tesla C1060, it is recommended (but not required) that the Tesla C1060 be installed in a PCI Express ×16 Gen2 slot. The Tesla C1060 is based on the massively parallel, many-core Tesla processor, which is coupled with the standard CUDA C programming [14] environment to simplify many-core programming.

B. Tesla S1070 GPU Computing System

The NVIDIA® Tesla™ S1070 computing system speeds the transition to energy-efficient parallel computing. With 960 processor cores and a standard programming environment that simplifies application development, Tesla solves the world's most important computing challenges more quickly and accurately. The NVIDIA Computing System is a rac
