Branch target buffer design for embedded processors.docx

Branch target buffer design for embedded processors

Nadav Levison, Shlomo Weiss*

Dept. of Electrical Engineering–Systems, Tel Aviv University, Tel Aviv 69978, Israel

The demand for embedded application processors that support multi-tasking operating systems and can execute complex applications brings them closer to general purpose processors. These strong processors have a limited power source because they are usually found in portable devices such as smartphones and other PDAs, and are powered by batteries. The Branch Target Buffer (BTB), which is commonly used in general purpose processors, is becoming prevalent in high-end embedded processors in order to support long pipelines and mitigate high miss penalties. However, the BTB is a major power consumer because it is a large SRAM structure that is accessed almost every cycle.

We propose two BTB designs that fit the tight power budgets of embedded processors. In the first design, the power consumption of a single BTB access is reduced by reading only the lower part of the predicted target address bits. This design saves up to 25% of dynamic power, with effectively no performance degradation. In the second design, we avoid redundant BTB accesses to the same set by using a small buffer that holds the most recently accessed set. This design results in 75% dynamic power savings at the cost of up to 0.64% system slowdown in a 2-way BTB, and 80% dynamic power savings at the cost of up to 0.58% system slowdown in a 4-way BTB. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

In 1994 IBM and BellSouth launched a new mobile phone called Simon Personal Communicator. Apart from the common mobile phone capabilities, the Simon had additional features such as a calendar, sending and receiving E-mails and faxes, games, and an address book. Although it was big, heavy, and costly, the Simon is considered to be the first smartphone. Since then smartphones have become powerful, easy to use, popular devices that support a wide range of functions and replace several special-purpose gadgets with a single highly integrated device.

A smartphone is usually defined as a common mobile phone that can also function as a personal digital assistant (PDA). The features [8] that might be expected in a modern smartphone include the ability to run a multi-tasking operating system, a large display, internet access, E-mail, SMS, personal information management, voice communication, WiFi, still and video camera, music player, GPS and more. In order to support this large variety of tasks, today’s smartphones are usually built with an application processor, along with the ubiquitous digital signal processor (DSP) and other ad-hoc hardware accelerators.

1.1. ARM Cortex-A8 integrated in the TI-OMAP3

The Texas Instruments OMAP3 family, found in the newest Palm Pre and Samsung smartphones, is an example of an integrated circuit that incorporates an application processor and other hardware accelerators. The TI-OMAP3 architecture (Fig. 1) has four basic blocks: the ARM Cortex-A8 application processor, a 2D/3D Graphic Accelerator, an Image Video Audio Accelerator (IVA2+) and an Image Signal Processor (ISP). The ARM Cortex-A8 processor [2] runs the operating system and a variety of applications. It is a dual-issue superscalar processor with a 13-stage pipeline, an integrated L2 cache, and advanced dynamic branch prediction. A powerful and low power processor, it is produced in a 65 nm fabrication process and can run at a maximum speed of 1.1 GHz.

The evolution of the ARM Cortex-A8 predecessors demonstrates the increasing demand for stronger embedded processors. ARM processors are widely used in cellular phones and PDAs, and it is estimated that 99 percent of the world’s smartphones employ ARM technology [31]. In Table 1, four selected ARM embedded processors from the last 15 years are shown.

As Table 1 illustrates, embedded processors are becoming stronger: wider issue, longer pipelines, larger execution window, and bigger on-chip cache memories. Looking a few years ahead, the next generation of embedded processors will likely be multicore processors. One of the latest examples is the ARM Cortex-A9 MPCore, a dual-core SMP processor, integrated in the Texas Instruments OMAP4.

1.2. Reducing power in embedded processors

Any portable electronic device such as a cellular phone, and especially a smartphone, must manage power consumption wisely because of the limited capacity of the battery. Battery capacity does not improve as fast as microelectronics technology, and the system energy budget is very limited [29]. Therefore a major effort is required to reduce the power consumption of every element in portable devices. In this work we focus on reducing power in the application processor component.

One of the major disadvantages of a long pipeline in superscalar processors, such as the Cortex-A8, is the high branch misprediction penalty. When a branch misprediction is detected, the pipeline must be flushed and all the instructions that follow the mispredicted branch must be canceled. The misprediction penalty is higher in longer pipelines. To minimize the misprediction penalty a powerful branch prediction mechanism is usually used [2,32].

Most processors that use dynamic branch prediction implement two kinds of mechanisms: direction prediction, used to predict whether a branch is taken or not, and target address prediction, which predicts the target address of taken branches. The address prediction is usually implemented using a Branch Target Buffer, or BTB, a structure that holds branch target addresses of branches that were recently executed. The ARM Cortex-A8 processor, which has a 13-cycle branch misprediction penalty, uses a 512-entry, 2-way BTB, and a 4096-entry global history buffer [2]. However, these structures contribute to the total processor power consumption because they are SRAM structures that are accessed in almost every cycle. Hence a low-power BTB is essential in high-end embedded processors.
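To make the structure concrete, the following is a minimal sketch of a set-associative BTB with the 512-entry, 2-way geometry mentioned above. It is an illustration only: the class name `BTB`, the index and tag split, and the LRU replacement are assumptions made for this sketch, not details of the Cortex-A8 or of the designs proposed in this paper.

```python
class BTB:
    """Toy set-associative branch target buffer (illustrative only)."""

    def __init__(self, entries=512, ways=2):
        self.ways = ways
        self.num_sets = entries // ways              # 256 sets for 512 entries, 2-way
        self.index_bits = self.num_sets.bit_length() - 1
        # Each set is a list of (tag, target) pairs, most recently used first.
        self.sets = [[] for _ in range(self.num_sets)]

    def _split(self, pc):
        word = pc >> 2                               # drop the two alignment bits
        return word & (self.num_sets - 1), word >> self.index_bits

    def lookup(self, pc):
        index, tag = self._split(pc)
        for entry_tag, target in self.sets[index]:
            if entry_tag == tag:
                return target                        # hit: predicted target address
        return None                                  # miss: no target prediction

    def update(self, pc, target):
        index, tag = self._split(pc)
        ways = [e for e in self.sets[index] if e[0] != tag]
        ways.insert(0, (tag, target))                # newest entry becomes MRU
        self.sets[index] = ways[:self.ways]          # assumed LRU eviction when full
```

In this model every call to `lookup` reads a whole set, which mirrors why a structure accessed in almost every cycle is a significant consumer of dynamic power.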

The research presented in this paper targets the BTB power consumption problem. We propose two different mechanisms. The first one, Split Data Array (SDA) BTB, is based on the observation that most branch instructions are short distance and therefore dynamic power can be saved by not accessing all of the predicted target address bits. The BTB data array in this design is divided into two arrays: a low data array to hold the lower part of the predicted target address, and a high data array to hold the remaining bits. The low data array is accessed in every BTB access while the high data array is accessed only when needed. The second mechanism, BTB with a set-buffer, is based on the locality of reference property of branch addresses. In this design, the index field in the branch address is shifted left when accessing the BTB. This shift increases the probability that two successive BTB accesses are to the same BTB set, and therefore it is worthwhile to buffer the entire set when accessed. A set-buffer is provided for this purpose. If, as expected, the next BTB reference is to the same set, the prediction can be read from the set-buffer, saving an access to the BTB.
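The effect of shifting the index field can be illustrated with a small counting model. The sketch below is an assumption-laden simplification: the function name, the `shift` parameter, and the 256-set default are hypothetical, and the model ignores BTB updates and set-buffer invalidation. It only counts how many lookups would have to read the BTB arrays rather than the set-buffer.

```python
def count_btb_array_accesses(pcs, num_sets=256, shift=0, use_set_buffer=True):
    """Count how many lookups must read the BTB arrays.

    pcs            -- sequence of fetch addresses that access the BTB
    shift          -- how far the index field is moved to the left (hypothetical knob)
    use_set_buffer -- keep the most recently read set in a small buffer
    """
    buffered_set = None
    array_accesses = 0
    for pc in pcs:
        index = (pc >> (2 + shift)) & (num_sets - 1)
        if use_set_buffer and index == buffered_set:
            continue                     # served from the set-buffer
        array_accesses += 1              # full BTB set read
        buffered_set = index             # the set just read is now buffered
    return array_accesses


# Sixteen sequential 4-byte instruction addresses.
sequential = [0x1000 + 4 * i for i in range(16)]
unshifted = count_btb_array_accesses(sequential, shift=0)
shifted = count_btb_array_accesses(sequential, shift=3)
```

With no shift, neighbouring instructions fall into different sets, so the set-buffer never hits. With a shift of three bits, eight consecutive instructions share one set, and only the first access in each group reads the arrays.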

1.3. Paper outline

In Section 2 we present the first BTB mechanism, the Split Data Array (SDA) BTB, including motivation, design, and results. Design and motivation of the second mechanism, the BTB with a set-buffer, are presented in Section 3. The simulation setup for both designs is described in Section 2.4. Related work is described in Section 4 and the paper ends with a summary and conclusions in Section 5.

2. SDA BTB

In this section we give a description of the Split Data Array (SDA) BTB. In Section 2.1 we discuss the design motivation, and in Section 2.2 we present the general structure and the way the BTB is accessed. In Section 2.3 we discuss power and timing issues related to this design. Results are presented and discussed in Section 2.5.

2.1. Motivation

In order to determine the relation between the branch instruction address (BA) and the branch target address (TA) we define the Highest Relevant Bit, or HRB, of the branch target address using the following two equations:

\[ S = \{\, i \mid 0 \le i \le 31,\ BA_i \ne TA_i \,\} \]

\[ HRB = \max(S) \]

where S is the set of all the bit positions in which the branch address is different from the target address, assuming a 32-bit address space (bit 0 is the LSB). The bits in positions 0 and 1 are not stored in the BTB because we assume all instructions are 4 bytes long and are aligned in memory. HRB is defined as the maximum of S, i.e. the leftmost bit position in which the target address differs from the branch address. Note that HRB does not indicate the distance of the branch. For example, if the branch address is 0x0000FFFC and the target address is 0x00010000, the HRB is 16 although the branch distance is only one instruction forward. In the simulations we ran on the SPEC2000 programs [1] we discovered that the average HRB is very low. In at least 47% of the BTB accesses HRB is less than or equal to eight, and in at least 75% of the BTB accesses HRB is less than or equal to 12. The reasons for this behavior are listed below.
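The HRB definition can be checked with a few lines of code. The sketch below is illustrative; the function name is an assumption, and it returns `None` for the degenerate case of identical addresses, where the set S is empty and the text defines no value.

```python
def highest_relevant_bit(branch_addr, target_addr):
    """Return the HRB: the most significant bit position where the addresses differ.

    Returns None when the two addresses are identical (the set S is empty).
    """
    diff = (branch_addr ^ target_addr) & 0xFFFFFFFF   # 32-bit address space
    if diff == 0:
        return None
    return diff.bit_length() - 1                      # bit 0 is the LSB


# The example from the text: one instruction forward, yet the HRB is 16.
example_hrb = highest_relevant_bit(0x0000FFFC, 0x00010000)
```

The XOR marks every bit position in S, and the position of its highest set bit is the maximum of S.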

Text size: Program text size is usually small. According to our simulations, the SPEC2000 [1] average text size is only 194k instructions.

The proposed BTB mechanism is based on the observation that when accessing the BTB a significant part of the predicted target address is already known. The higher bits in the branch address are identical to the higher bits of the target address and therefore these bits can be bypassed directly from the address of the branch instruction (that is, from the program counter) instead of reading them from the BTB. Dynamic power is reduced by accessing only the lower part of the target address.
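The bypass idea can be sketched as follows. This is a simplified model under stated assumptions, not the design described in Section 2.2: the split point `LOW_TOP`, the per-entry `needs_high` flag, and the function names are all hypothetical, and bits 0 and 1 are kept in the low part for simplicity even though the text says they are not stored.

```python
LOW_TOP = 12  # hypothetical: highest target bit kept in the low data array

LOW_MASK = (1 << (LOW_TOP + 1)) - 1          # bits 0..LOW_TOP
HIGH_MASK = 0xFFFFFFFF & ~LOW_MASK           # bits LOW_TOP+1..31


def make_entry(branch_addr, target_addr):
    """Split a target into the parts held by the low and high data arrays."""
    needs_high = (branch_addr & HIGH_MASK) != (target_addr & HIGH_MASK)
    return {
        "low": target_addr & LOW_MASK,
        "high": target_addr & HIGH_MASK,
        "needs_high": needs_high,            # assumed per-entry flag
    }


def predict_target(pc, entry):
    """Rebuild the predicted target; report whether the high array was read."""
    if entry["needs_high"]:
        return entry["high"] | entry["low"], True
    # Upper bits are bypassed from the program counter.
    return (pc & HIGH_MASK) | entry["low"], False


near = make_entry(0x00401000, 0x00401F00)    # upper bits match the branch address
far = make_entry(0x0000FFFC, 0x00010000)     # HRB is 16, above LOW_TOP
```

For the `near` entry the target is rebuilt from the program counter and the low part alone, so the high array is not read. The `far` entry corresponds to the HRB example given earlier and still requires the high part.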

2.2. SDA-BTB design

In a traditional BTB design each line is composed of a few fields: the tag field, which holds the branch ident
