Branch target buffer design for embedded processors


Nadav Levison, Shlomo Weiss*

Dept. of Electrical Engineering–Systems, Tel Aviv University, Tel Aviv 69978, Israel

The demand for embedded application processors that support multi-tasking operating systems and can execute complex applications brings them closer to general-purpose processors. These strong processors have a limited power source because they are usually found in portable devices such as smartphones and other PDAs, and are powered by batteries. The Branch Target Buffer (BTB), which is commonly used in general-purpose processors, is becoming prevalent in high-end embedded processors in order to support long pipelines and mitigate high miss penalties. However, the BTB is a major power consumer because it is a large SRAM structure that is accessed almost every cycle.

We propose two BTB designs that fit the tight power budgets of embedded processors. In the first design, the power consumption of a single BTB access is reduced by reading only the lower part of the predicted target address bits. This design saves up to 25% of dynamic power, with effectively no performance degradation. In the second design, we avoid redundant BTB accesses to the same set by using a small buffer that holds the most recently accessed set. This design results in 75% dynamic power savings at the cost of up to 0.64% system slowdown in a 2-way BTB, and 80% dynamic power savings at the cost of up to 0.58% system slowdown in a 4-way BTB. © 2010 Elsevier B.V. All rights reserved.

1. Introduction

In 1994 IBM and BellSouth launched a new mobile phone called Simon Personal Communicator. Apart from the common mobile phone capabilities, the Simon had additional features such as a calendar, sending and receiving E-mails and faxes, games, and an address book. Although it was big, heavy, and costly, the Simon is considered to be the first smartphone. Since then smartphones have become powerful, easy to use, popular devices that support a wide range of functions and replace several special-purpose gadgets with a single highly integrated device.

A smartphone is usually defined as a common mobile phone that can also function as a personal digital assistant (PDA). The features [8] that might be expected in a modern smartphone include the ability to run a multi-tasking operating system, a large display, internet access, E-mail, SMS, personal information management, voice communication, WiFi, still and video camera, music player, GPS and more. In order to support this large variety of tasks, today’s smartphones are usually built with an application processor, along with the ubiquitous digital signal processor (DSP) and other ad-hoc hardware accelerators.

1.1. ARM Cortex-A8 integrated in the TI-OMAP3

The Texas Instruments OMAP3 family, found in the newest Palm Pre and Samsung smartphones, is an example of an integrated circuit that incorporates an application processor and other hardware accelerators. The TI-OMAP3 architecture (Fig. 1) has four basic blocks: the ARM Cortex-A8 application processor, the 2D/3D Graphic Accelerator, the Image Video Audio Accelerator (IVA2+) and the Image Signal Processor (ISP). The ARM Cortex-A8 processor [2] runs the operating system and a variety of applications. It is a dual-issue superscalar processor with a 13-stage pipeline, integrated L2 cache, and advanced dynamic branch prediction. A powerful and low-power processor, it is produced in a 65 nm fabrication process and can run at a maximum speed of 1.1 GHz.

The evolution of the ARM Cortex-A8 predecessors demonstrates the increasing demand for stronger embedded processors. ARM processors are widely used in cellular phones and PDAs, and it is estimated that 99 percent of the world’s smartphones employ ARM technology [31]. Table 1 shows four selected ARM embedded processors from the last 15 years.

As Table 1 illustrates, embedded processors are becoming stronger: wider issue, longer pipelines, larger execution window, and bigger on-chip cache memories. Looking a few years ahead, the next generation of embedded processors will likely be multicore processors. One of the latest examples is the ARM Cortex-A9 MPCore, a dual-core SMP processor, integrated in the Texas Instruments OMAP4.

1.2. Reducing power in embedded processors

Any portable electronic device such as a cellular phone, and especially smartphones, must manage power consumption wisely because of the limited capacity of the battery. Battery capacity does not improve as fast as microelectronics technology, and the system energy budget is very limited [29]. Therefore a major effort is required to reduce the power consumption of every element in portable devices. In this work we focus on reducing power in the application processor component.

One of the major disadvantages of a long pipeline in superscalar processors, such as the Cortex-A8, is the high branch misprediction penalty. When a branch misprediction is detected, the pipeline must be flushed and all the instructions that follow the mispredicted branch must be canceled. The misprediction penalty is higher in longer pipelines. To minimize the misprediction penalty a powerful branch prediction mechanism is usually used [2,32].

Most processors that use dynamic branch prediction implement two kinds of mechanisms: direction prediction, used to predict whether a branch is taken or not, and target address prediction, which predicts the target address of taken branches. The address prediction is usually implemented using a Branch Target Buffer, or BTB, a structure that holds the target addresses of branches that were recently executed. The ARM Cortex-A8 processor, which has a 13-cycle branch misprediction penalty, uses a 512-entry, 2-way BTB, and a 4096-entry global history buffer [2]. However, these structures contribute to the total processor power consumption because they are SRAM structures that are accessed in almost every cycle. Hence a low-power BTB is essential in high-end embedded processors.
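To make the structure concrete, the following is a minimal software sketch of a 512-entry, 2-way set-associative BTB. Only the entry count and associativity come from the text; the 4-byte instruction alignment, the index/tag split, the replacement policy (evict the entry that was written least recently) and all names (`BTB`, `lookup`, `update`) are assumptions made for illustration, not a description of the Cortex-A8 hardware.

```python
NUM_WAYS = 2
NUM_SETS = 512 // NUM_WAYS   # 256 sets for a 512-entry, 2-way BTB
OFFSET_BITS = 2              # assumption: 4-byte aligned instructions


class BTB:
    """Toy set-associative branch target buffer (illustrative only)."""

    def __init__(self):
        # Each set holds up to NUM_WAYS (tag, target) pairs,
        # ordered from least to most recently written.
        self.sets = [[] for _ in range(NUM_SETS)]

    @staticmethod
    def _split(pc):
        word = pc >> OFFSET_BITS
        return word % NUM_SETS, word // NUM_SETS   # (set index, tag)

    def lookup(self, pc):
        """Return the predicted target for pc, or None on a BTB miss."""
        index, tag = self._split(pc)
        for entry_tag, target in self.sets[index]:
            if entry_tag == tag:
                return target
        return None

    def update(self, pc, target):
        """Record a taken branch, evicting the least recently written way."""
        index, tag = self._split(pc)
        ways = [e for e in self.sets[index] if e[0] != tag]
        ways.append((tag, target))
        self.sets[index] = ways[-NUM_WAYS:]


btb = BTB()
btb.update(0x1000, 0x2000)
print(btb.lookup(0x1000) == 0x2000)   # True
print(btb.lookup(0x1004))             # None
```

In hardware, every `lookup` corresponds to a read of the tag and data arrays, which is why a structure consulted on almost every fetch cycle weighs on the power budget.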

The research presented in this paper targets the BTB power consumption problem. We propose two different mechanisms. The first one, Split Data Array (SDA) BTB, is based on the observation that most branch instructions are short distance and therefore dynamic power can be saved by not accessing all of the predicted target address bits. The BTB data array in this design is divided into two arrays: a low data array to hold the lower part of the predicted target address, and a high data array to hold the remaining bits. The low data array is accessed in every BTB access while the high data array is accessed only when needed. The second mechanism, BTB with a set-buffer, is based on the locality of reference property of branch addresses. In this design, the index field in the branch address is shifted left when accessing the BTB. This shift increases the probability that two successive BTB accesses are to the same BTB set, and therefore it is worthwhile to buffer the entire set when accessed. A set-buffer is provided for this purpose. If, as expected, the next BTB reference is to the same set, the prediction can be read from the set-buffer, saving an access to the BTB.
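The effect of shifting the index field can be illustrated with a small sketch that counts how many lookups must go to the BTB array when the most recently accessed set is buffered. This is a simplified model under stated assumptions: 256 sets, 4-byte aligned instructions, a hypothetical shift amount of two bits, and a synthetic trace of sequential addresses. BTB updates and set-buffer invalidation are ignored, and the names `set_index` and `count_array_accesses` are illustrative.

```python
NUM_SETS = 256    # assumption: 512 entries, 2-way
OFFSET_BITS = 2   # assumption: 4-byte aligned instructions


def set_index(pc, shift):
    """Set index taken from the address bits starting at OFFSET_BITS + shift."""
    return (pc >> (OFFSET_BITS + shift)) % NUM_SETS


def count_array_accesses(pcs, shift):
    """Count lookups that miss the set-buffer and must read the BTB array."""
    buffered_set = None
    array_accesses = 0
    for pc in pcs:
        idx = set_index(pc, shift)
        if idx != buffered_set:      # set-buffer miss: read the whole set
            array_accesses += 1
            buffered_set = idx
    return array_accesses


# Synthetic trace: 16 sequential instruction addresses.
trace = [0x1000 + 4 * i for i in range(16)]
print(count_array_accesses(trace, shift=0))   # 16
print(count_array_accesses(trace, shift=2))   # 4
```

With no shift, consecutive addresses map to consecutive sets, so the buffer never helps. With the index field moved two bits to the left, four neighbouring addresses share a set and three out of four array reads are avoided in this trace.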

1.3. Paper outline

In Section 2 we present the first BTB mechanism, Split Data Array (SDA) BTB, including motivation, design, and results. Design and motivation of the second mechanism, BTB with a set-buffer, are presented in Section 3. The simulation setup for both designs is described in Section 2.4. Related work is described in Section 4 and the paper ends with a summary and conclusions in Section 5.

2. SDA BTB

In this section we give a description of the Split Data Array (SDA) BTB. In Section 2.1 we discuss the design motivation, and in Section 2.2 we present the general structure and the way the BTB is accessed. In Section 2.3 we discuss power and timing issues related to this design. Results are presented and discussed in Section 2.5.

2.1. Motivation

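In order to determine the relation between the branch instruction address (BA) and the branch target address (TA) we define the Highest Relevant Bit, or HRB, of the branch target address using the following two equations:

$$S = \{\, i \mid BA_i \neq TA_i,\ 0 \le i \le 31 \,\}$$

$$HRB = \max(S)$$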

where S is the set of all the bit positions in which the branch address differs from the target address, assuming a 32-bit address space (bit 0 is the LSB). The bits in positions 0 and 1 are not stored in the BTB because we assume all instructions are 4 bytes long and are aligned in memory. HRB is defined as the maximum of S, i.e. the leftmost bit in the target address that differs from the corresponding bit in the branch address. Note that HRB does not indicate the distance of the branch. For example, if the branch address is 0x0000FFFC and the target address is 0x00010000, the HRB is 16 although the branch distance is only one instruction forward. In the simulations we ran on the SPEC2000 programs [1] we discovered that the average HRB is very low. In at least 47% of the BTB accesses HRB is less than or equal to eight, and in at least 75% of the BTB accesses HRB is less than or equal to 12. The reasons for this behavior are listed below.
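The definition can be checked with a short sketch: the set S corresponds to the set bits of the XOR of the two addresses, so HRB is the position of the highest set bit of that XOR. The function name `hrb` and the choice to return `None` when the two addresses are identical (S empty) are assumptions made for this illustration.

```python
def hrb(branch_addr, target_addr):
    """Highest bit position (0 = LSB) where the two 32-bit addresses differ."""
    diff = (branch_addr ^ target_addr) & 0xFFFFFFFF
    if diff == 0:
        return None                  # S is empty; convention assumed here
    return diff.bit_length() - 1     # index of the highest set bit


# The example from the text: one instruction forward, yet HRB is 16.
print(hrb(0x0000FFFC, 0x00010000))   # 16
# A short backward branch within the same aligned region.
print(hrb(0x00001040, 0x00001000))   # 6
```

The first call reproduces the case discussed above, where a carry propagates through the low bits and raises the HRB even though the branch is short.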

Text size: Program text size is usually small. According to our simulations, the SPEC2000 [1] average text size is only 194k instructions.

The proposed BTB mechanism is based on the observation that when accessing the BTB a significant part of the predicted target address is already known. The higher bits in the branch address are identical to the higher bits of the target address and therefore these bits can be bypassed directly from the address of the branch instruction (that is, from the program counter) instead of reading them from the BTB. Dynamic power is reduced by accessing only the lower part of the target address.
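A minimal sketch of the bypass idea follows, assuming a hypothetical 12-bit low part and a per-entry flag that says whether the high part must be read. The text does not specify the split point or how the need for the high array is detected, so `LOW_BITS`, `needs_high` and `predict_target` are illustrative assumptions; for simplicity the low part here also includes bits 0 and 1.

```python
LOW_BITS = 12                  # assumption: width of the low data array
LOW_MASK = (1 << LOW_BITS) - 1


def split_target(target_addr):
    """Split a target address into (low part, high part)."""
    return target_addr & LOW_MASK, target_addr >> LOW_BITS


def needs_high(pc, target_addr):
    """True when the upper bits of the target differ from those of the PC."""
    return (pc >> LOW_BITS) != (target_addr >> LOW_BITS)


def predict_target(pc, low, high, use_high):
    """Rebuild the target: upper bits from the high array or bypassed from pc."""
    upper = high if use_high else (pc >> LOW_BITS)
    return (upper << LOW_BITS) | low


# Short branch: the high array is not read, upper bits come from the PC.
pc, target = 0x00012340, 0x00012FF0
low, high = split_target(target)
print(hex(predict_target(pc, low, high, needs_high(pc, target))))   # 0x12ff0

# Branch crossing the low-part boundary: the high array is needed.
pc, target = 0x0000FFFC, 0x00010000
low, high = split_target(target)
print(hex(predict_target(pc, low, high, needs_high(pc, target))))   # 0x10000
```

In this model a branch needs the high array exactly when its HRB is at least `LOW_BITS`, which ties the split point to the HRB statistics reported above.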

2.2. SDA-BTB design

In a traditional BTB design each line is composed of a few fields:

The tag field which holds the branch ident
