Branch target buffer design for embedded processors
Nadav Levison, Shlomo Weiss*
Dept. of Electrical Engineering – Systems, Tel Aviv University, Tel Aviv 69978, Israel
The demand for embedded application processors that support multi-tasking operating systems and can execute complex applications brings them closer to general-purpose processors. These strong processors have a limited power source because they are usually found in portable devices such as smartphones and other PDAs, and are powered by batteries. The Branch Target Buffer (BTB), which is commonly used in general-purpose processors, is becoming prevalent in high-end embedded processors in order to support long pipelines and mitigate high miss penalties. However, the BTB is a major power consumer because it is a large SRAM structure that is accessed almost every cycle.
We propose two BTB designs that fit the tight power budgets of embedded processors. In the first design, the power consumption of a single BTB access is reduced by reading only the lower part of the predicted target address bits. This design saves up to 25% of dynamic power, with effectively no performance degradation. In the second design, we avoid redundant BTB accesses to the same set by using a small buffer that holds the most recently accessed set. This design results in 75% dynamic power savings at the cost of up to 0.64% system slowdown in a 2-way BTB, and 80% dynamic power savings at the cost of up to 0.58% system slowdown in a 4-way BTB.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
In 1994 IBM and BellSouth launched a new mobile phone called the Simon Personal Communicator. Apart from the common mobile phone capabilities, the Simon had additional features such as a calendar, sending and receiving E-mails and faxes, games, and an address book. Although it was big, heavy, and costly, the Simon is considered to be the first smartphone. Since then smartphones have become powerful, easy to use, popular devices that support a wide range of functions and replace several special-purpose gadgets with a single highly integrated device.
A smartphone is usually defined as a common mobile phone that can also function as a personal digital assistant (PDA). The features [8] that might be expected in a modern smartphone include the ability to run a multi-tasking operating system, a large display, internet access, E-mail, SMS, personal information management, voice communication, WiFi, still and video camera, music player, GPS and more. In order to support this large variety of tasks, today's smartphones are usually built with an application processor, along with the ubiquitous digital signal processor (DSP) and other ad-hoc hardware accelerators.
1.1. ARM Cortex-A8 integrated in the TI-OMAP3
The Texas Instruments OMAP3 family, found in the newest Palm Pre and Samsung smartphones, is an example of an integrated circuit that incorporates an application processor and other hardware accelerators. The TI-OMAP3 architecture (Fig. 1) has four basic blocks: the ARM Cortex-A8 application processor, a 2D/3D Graphic Accelerator, an Image Video Audio Accelerator (IVA2+) and an Image Signal Processor (ISP). The ARM Cortex-A8 processor [2] runs the operating system and a variety of applications. It is a dual-issue superscalar processor with a 13-stage pipeline, integrated L2 cache, and advanced dynamic branch prediction. A powerful and low-power processor, it is produced in a 65 nm fabrication process and can run at a maximum speed of 1.1 GHz.
The evolution of the ARM Cortex-A8 predecessors demonstrates the increasing demand for stronger embedded processors. ARM processors are widely used in cellular phones and PDAs, and it is estimated that 99 percent of the world's smartphones employ ARM technology [31]. In Table 1, four selected ARM embedded processors from the last 15 years are shown.
As Table 1 illustrates, embedded processors are becoming stronger: wider issue, longer pipelines, larger execution window, and bigger on-chip cache memories. Looking a few years ahead, the next generation of embedded processors will likely be multicore processors. One of the latest examples is the ARM Cortex-A9 MPCore, a dual-core SMP processor, integrated in the Texas Instruments OMAP4.
1.2. Reducing power in embedded processors
Any portable electronic device such as a cellular phone, and especially a smartphone, must manage power consumption wisely because of the limited capacity of the battery. Battery capacity does not improve as fast as microelectronics technology, and the system energy budget is very limited [29]. Therefore a major effort is required to reduce the power consumption of every element in portable devices. In this work we focus on reducing power in the application processor component.
One of the major disadvantages of a long pipeline in superscalar processors, such as the Cortex-A8, is the high branch misprediction penalty. When a branch misprediction is detected, the pipeline must be flushed and all the instructions that follow the mispredicted branch must be canceled. The misprediction penalty is higher in longer pipelines. To minimize the misprediction penalty, a powerful branch prediction mechanism is usually used [2,32].
Most processors that use dynamic branch prediction implement two kinds of mechanisms: direction prediction, used to predict whether a branch is taken or not, and target address prediction, which predicts the target address of taken branches. The address prediction is usually implemented using a Branch Target Buffer, or BTB, a structure that holds branch target addresses of branches that were recently executed. The ARM Cortex-A8 processor, which has a 13-cycle branch misprediction penalty, uses a 512-entry, 2-way BTB, and a 4096-entry global history buffer [2]. However, these structures contribute to the total processor power consumption because they are SRAM structures that are accessed in almost every cycle. Hence a low-power BTB is essential in high-end embedded processors.
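To make the role of the BTB concrete, the following is a minimal sketch of a set-associative target buffer sized like the 512-entry, 2-way configuration quoted above. The class name `SimpleBTB`, the index/tag split, the FIFO replacement policy, and the assumption of 4-byte aligned instructions are illustrative choices, not details taken from the Cortex-A8 or from this paper.

```python
from dataclasses import dataclass

NUM_SETS = 256  # 512 entries / 2 ways (sizes quoted in the text)
NUM_WAYS = 2


@dataclass
class Entry:
    tag: int     # identifies which branch owns this entry
    target: int  # predicted target address


class SimpleBTB:
    """Toy BTB model; layout and FIFO replacement are assumptions."""

    def __init__(self, num_sets=NUM_SETS, num_ways=NUM_WAYS):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets = [[] for _ in range(num_sets)]

    def _split(self, pc):
        # Drop bits 0-1, assuming 4-byte aligned instructions.
        word = pc >> 2
        return word % self.num_sets, word // self.num_sets

    def lookup(self, pc):
        """Return the predicted target for pc, or None on a BTB miss."""
        index, tag = self._split(pc)
        for entry in self.sets[index]:
            if entry.tag == tag:
                return entry.target
        return None

    def update(self, pc, target):
        """Record the target of a taken branch at pc."""
        index, tag = self._split(pc)
        ways = self.sets[index]
        for entry in ways:
            if entry.tag == tag:
                entry.target = target
                return
        if len(ways) == self.num_ways:
            ways.pop(0)  # evict the oldest entry (FIFO, an assumption)
        ways.append(Entry(tag, target))
```

In this model every `lookup` reads a full set, which is the per-cycle SRAM access that the two proposed designs aim to make cheaper or less frequent.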
The research presented in this paper targets the BTB power consumption problem. We propose two different mechanisms. The first one, the Split Data Array (SDA) BTB, is based on the observation that most branches span a short distance, and therefore dynamic power can be saved by not accessing all of the predicted target address bits. The BTB data array in this design is divided into two arrays: a low data array that holds the lower part of the predicted target address, and a high data array that holds the remaining bits. The low data array is accessed in every BTB access while the high data array is accessed only when needed. The second mechanism, the BTB with a set-buffer, is based on the locality of reference property of branch addresses. In this design, the index field in the branch address is shifted left when accessing the BTB. This shift increases the probability that two successive BTB accesses are to the same BTB set, and therefore it is worthwhile to buffer the entire set when accessed. A set-buffer is provided for this purpose. If, as expected, the next BTB reference is to the same set, the prediction can be read from the set-buffer, saving an access to the BTB.
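The effect of shifting the index field can be illustrated with a small counting model. This is only a sketch under stated assumptions: one BTB lookup per 4-byte instruction address, a set-buffer that holds exactly one set, and hypothetical names (`set_index`, `count_array_accesses`) and parameter values (`num_sets=256`, `shift=3`) that are not taken from the paper.

```python
def set_index(pc, num_sets, shift):
    """Select the BTB set; the index field starts `shift` bits above bit 2."""
    return (pc >> (2 + shift)) % num_sets


def count_array_accesses(pcs, num_sets=256, shift=0):
    """Count lookups that must read the BTB array rather than the set-buffer."""
    buffered_index = None  # index of the set currently held in the set-buffer
    accesses = 0
    for pc in pcs:
        index = set_index(pc, num_sets, shift)
        if index != buffered_index:
            accesses += 1           # read the whole set from the BTB array
            buffered_index = index  # and keep it in the set-buffer
        # otherwise the prediction is served from the set-buffer
    return accesses


# 64 sequential instruction addresses starting at 0x1000
sequential = range(0x1000, 0x1000 + 64 * 4, 4)
unshifted = count_array_accesses(sequential, shift=0)
shifted = count_array_accesses(sequential, shift=3)
```

For this purely sequential stream, the unshifted index changes on every instruction, so the set-buffer never helps, while a shift of three bits maps each group of eight consecutive instructions to one set. Real instruction streams contain taken branches, so the savings reported in the paper should not be inferred from this toy model.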
1.3. Paper outline
In Section 2 we present the first BTB mechanism, the Split Data Array (SDA) BTB, including motivation, design, and results. Design and motivation of the second mechanism, the BTB with a set-buffer, are presented in Section 3. The simulation setup for both designs is described in Section 2.4. Related work is described in Section 4 and the paper ends with a summary and conclusions in Section 5.
2. SDA BTB
In this section we describe the Split Data Array (SDA) BTB. In Section 2.1 we discuss the design motivation, and in Section 2.2 we present the general structure and the way the BTB is accessed. In Section 2.3 we discuss power and timing issues related to this design. Results are presented and discussed in Section 2.5.
2.1. Motivation
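In order to determine the relation between the branch instruction address (BA) and the branch target address (TA) we define the Highest Relevant Bit, or HRB, of the branch target address using the two following equations:

$$S = \{\, i \mid BA_i \neq TA_i,\ 0 \le i \le 31 \,\}$$

$$\mathrm{HRB} = \max(S)$$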
where S is the set of all the bit positions in which the branch address differs from the target address, assuming a 32-bit address space (bit 0 is the LSB). Bits 0 and 1 are not stored in the BTB because we assume all instructions are 4 bytes long and are aligned in memory. HRB is defined as the maximum of S, i.e. the leftmost bit in the target address that differs from the branch address. Note that HRB does not indicate the distance of the branch. For example, if the branch address is 0x0000FFFC and the target address is 0x00010000, the HRB is 16 although the branch distance is only one instruction forward. In the simulations we ran on the SPEC2000 programs [1] we discovered that the average HRB is very low. In at least 47% of the BTB accesses HRB is less than or equal to eight, and in at least 75% of the BTB accesses HRB is less than or equal to 12. The reasons for this behavior are listed below.
Text size: Program text size is usually small. According to our simulations, the SPEC2000 [1] average text size is only 194k instructions.

The proposed BTB mechanism is based on the observation that when accessing the BTB a significant part of the predicted target address is already known. The higher bits in the branch address are identical to the higher bits of the target address, and therefore these bits can be bypassed directly from the address of the branch instruction (that is, from the program counter) instead of reading them from the BTB. Dynamic power is reduced by accessing only the lower part of the target address.
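The HRB definition and the bypass idea can be checked with a few lines of code. The sketch below computes HRB as the highest set bit of BA XOR TA and rebuilds a target from the program counter's upper bits plus stored lower bits. The function names and the cutoff parameter `k` (the highest bit position kept in the low part) are hypothetical; how the hardware decides that the upper bits are needed is not modelled here.

```python
ADDR_MASK = 0xFFFFFFFF  # 32-bit address space, bit 0 is the LSB


def hrb(ba, ta):
    """Highest bit position in which branch address and target address differ."""
    diff = (ba ^ ta) & ADDR_MASK
    if diff == 0:
        raise ValueError("BA equals TA: the set S is empty, HRB is undefined")
    return diff.bit_length() - 1


def split_target(ta, k):
    """Split TA into a low part (bits 0..k) and a high part (bits k+1..31)."""
    low_mask = (1 << (k + 1)) - 1
    return ta & low_mask, (ta & ADDR_MASK) >> (k + 1)


def rebuild_from_pc(ba, low, k):
    """Bypass the upper bits from the branch address and append the low part."""
    low_mask = (1 << (k + 1)) - 1
    return (ba & ADDR_MASK & ~low_mask) | low


# The example from the text: one instruction forward, yet HRB is 16.
example_hrb = hrb(0x0000FFFC, 0x00010000)

# A branch whose HRB is at most k: the bypass reproduces the full target.
ba, ta, k = 0x00401234, 0x00401F00, 12
low, high = split_target(ta, k)
rebuilt = rebuild_from_pc(ba, low, k)
```

The bypass is exact whenever HRB is at most `k`. For the 0x0000FFFC to 0x00010000 example with `k = 12` it is not, so in that case the remaining target bits would have to be read as well.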
2.2. SDA-BTB design
In a traditional BTB design each line is composed of a few fields:
The tag field which holds the branch ident