Software Fault Tolerance

Abstract: This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term “recovery blocks,” “conversations,” and “fault-tolerant interfaces.” The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

Index Terms: Acceptance test, alternate block, checkpoint, conversation, error detection, error recovery, recovery block, recursive cache.

The concept of “fault-tolerant computing” has existed for a long time. The first book on the subject [10] was published no less than ten years ago, but the notion of fault tolerance has remained almost exclusively the preserve of the hardware designer. Hardware structures have been developed which can “tolerate” faults, i.e., continue to provide the required facilities despite occasional failures, either transient or permanent, of internal components and modules. However, hardware component failures are only one source of unreliability in computing systems, decreasing in significance as component reliability improves, while software faults have become increasingly prevalent with the steadily increasing size and complexity of software systems.

In general, fault-tolerant hardware designs are expected to be correct, i.e., the tolerance applies to component failures rather than design inadequacies, although the dividing line between the two may on occasion be difficult to define. But all software faults result from design errors. The relative frequency of such errors reflects the much greater logical complexity of the typical software design compared to that of a typical hardware design. The difference in complexity arises from the fact that the “machines” that hardware designers produce have a relatively small number of distinctive internal states, whereas the designer of even a small software system has, by comparison, an enormous number of different states to consider; thus one can usually afford to treat hardware designs as being “correct,” but often cannot do the same with software even after extensive validation efforts. (The difference in scale is evidenced by the fact that a software simulator of a computer, written at the level of detail required by the hardware designers to analyze and validate their logical design, is usually one or more orders of magnitude smaller than the operating system supplied with that computer.)

If all design inadequacies could be avoided or removed this would suffice to achieve software reliability. (We here use the term “design” to include “implementation,” which is actually merely low-level design, concerning itself with detailed design decisions whose correctness nevertheless can be as vital to the correct functioning of the software as that of any high-level design decision.) Indeed many writers equate the terms “software reliability” and “program correctness.” However, until reliable correctness proofs (relative to some correct and adequately detailed specification), which cover even implementation details, can be given for systems of a realistic size, the only alternative means of increasing software reliability is to incorporate provisions for software fault tolerance.

In fact there exist sophisticated computing systems, designed for environments requiring near-continuous service, which contain ad hoc checks and checkpointing facilities that provide a measure of tolerance against some software errors as well as hardware failures [11]. They incidentally demonstrate the fact that fault tolerance does not necessarily require diagnosing the cause of the fault, or even deciding whether it arises from the hardware or the software. However, there has been comparatively little specific research into techniques for achieving software fault tolerance, and the constraints they impose on computing system design.

It was considerations such as these that led to the establishment at the University of Newcastle upon Tyne of a project on the design of highly reliable computing systems, under the sponsorship of the Science Research Council of the United Kingdom. The aims of the project were and are “to develop, and give a realistic demonstration of the utility of, computer architecture and programming techniques which will enable a system to have a very high probability of continuing to give a trustworthy service in the presence of hardware faults and/or software errors, and during their repair. A major aim will be to develop techniques which are of general utility, rather than limited to specialised environments, and to explore possible tradeoffs between reliability and performance.” A modest number of reports and papers have emanated from the project to date, including a general overview [12], papers concerned with addressing and protection [6], [7], and a preliminary account of our work on error detection and recovery [5]. The present paper endeavors to provide a rather more extensive discussion of our work on system error recovery techniques, and concentrates on techniques for system structuring which facilitate software fault tolerance. A companion paper [1] presents a proof-guided methodology for designing the error detection routines that our method requires.

All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design.

The scheme for facilitating software fault tolerance that we have developed can be regarded as analogous to what hardware designers term “stand-by sparing.” As the system operates, checks are made on the acceptability of the results generated by each component. Should one of these checks fail, a spare component is switched in to take the place of the erroneous component. The spare component is, of course, not merely a copy of the main component. Rather it is of independent design, so that there can be hope that it can cope with the circumstances that caused the main component to fail. (These circumstances will comprise the data the component is provided with and, in the case of errors due to faulty process synchronization, the timing and form of its interactions with other processes.)
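The stand-by sparing idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the paper: the names `run_with_spare`, `main_sort`, `spare_sort`, and `acceptable`, the choice of sorting as the task, and the deliberately planted design fault in the main component (it silently drops duplicates).

```python
def main_sort(xs):
    # Hypothetical main component with a residual design fault:
    # going through a set loses duplicate values.
    return sorted(set(xs))


def spare_sort(xs):
    # Hypothetical spare of independent design: a plain insertion sort.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def acceptable(data, result):
    # Check on the acceptability of a result: same length as the input
    # and in non-decreasing order. It does not diagnose what went wrong.
    same_length = len(result) == len(data)
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and ordered


def run_with_spare(main, spare, check, data):
    # Run the main component; switch in the spare only if the check fails.
    result = main(data)
    if check(data, result):
        return result
    result = spare(data)
    if check(data, result):
        return result
    raise RuntimeError("neither component produced an acceptable result")


print(run_with_spare(main_sort, spare_sort, acceptable, [3, 1, 2]))
print(run_with_spare(main_sort, spare_sort, acceptable, [2, 1, 2]))
```

The first call is served by the main component. In the second call the duplicate `2` exposes the planted fault, the check rejects the shortened result, and the spare supplies the answer. The check here is deliberately weaker than a full specification of sorting, since it only needs to decide whether a result is acceptable.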

In contrast to the normal hardware stand-by sparing scheme, the spare software component is invoked to cope with merely the particular set of circumstances that resulted in the failure of the main component. We assume the failure of this component to be due to residual design inadequacies, and hence that such failures occur only in exceptional circumstances. The number of different sets of circumstances that can arise even with a software component of comparatively modest size is immense. Therefore the system can revert to the use of the main component for subsequent operations; in hardware this would not normally be done until the main component had been repaired. The variety of undetected errors which could have been made in the design of a nontrivial software component is essentially infinite. Due to the complexity of the component, the relationship between any such error and its effect at run time may be very obscure. For these reasons we believe that diagnosis of the original cause of software errors should be left to humans to do, and should be done in comparative leisure. Therefore our scheme for software fault tolerance in no way depends on automated diagnosis of the cause of the error; this would surely result only in greatly increasing the complexity and therefore the error proneness of the system.
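The point about reverting to the main component can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the integer square root task, the function names, and the `history` list are all hypothetical. The main component uses floating-point arithmetic, which is exact for small inputs but loses precision for some very large integers; the spare uses an integer-only Newton iteration. The wrapper records which component served each call and never tries to work out why the main component failed.

```python
import math


def main_isqrt(n):
    # Hypothetical main component: fast, but goes through a float.
    return int(math.sqrt(n))


def spare_isqrt(n):
    # Hypothetical spare of independent design: integer Newton iteration.
    if n < 2:
        return n
    x = 1 << ((n.bit_length() + 1) // 2)  # starting guess, not below sqrt(n)
    while True:
        y = (x + n // x) // 2
        if y >= x:
            return x
        x = y


def acceptable(n, r):
    # r is the integer square root of n exactly when r*r <= n < (r+1)*(r+1).
    return r * r <= n < (r + 1) * (r + 1)


def isqrt_with_spare(n, history):
    # Every call starts with the main component again, whatever happened
    # on earlier calls. n is assumed to be a non-negative integer.
    r = main_isqrt(n)
    if acceptable(n, r):
        history.append("main")
        return r
    r = spare_isqrt(n)
    if acceptable(n, r):
        history.append("spare")
        return r
    raise ArithmeticError("no acceptable result")


history = []
results = [isqrt_with_spare(n, history) for n in (16, 2**106 - 1, 17)]
print(history)  # ['main', 'spare', 'main']
```

Only the middle input falls in the exceptional region where the main component's result is rejected; the call after it is served by the main component again, with no repair step in between.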

The recovery block scheme for achieving software fault tolerance by means of stand-by sparing has two important characteristics.

1) It incorporates a general solution to the problem of switching to the use of the spare component, i.e., of repairing any damage done by the erroneous main component, and of transferring control to the appropriate spare component.

2) It provides a method of explicitly structuring the software system which has the effect of ensuring that the extra software involved in error detection and in the spare components does not add to the complexity of the system to such an extent that it reduces, rather than increases, overall system reliability.
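The first characteristic, repairing damage and then transferring control to a spare, might be sketched in Python as follows. This is a hedged illustration, not the paper's mechanism: `recovery_block`, `RecoveryBlockFailure`, the dictionary used as program state, and the account-transfer example are all assumptions introduced here, and saving the state with `copy.deepcopy` stands in for whatever checkpointing facility a real system would provide.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when every alternate has been tried and rejected."""


def recovery_block(state, acceptance_test, alternates):
    # Save the state on entry so damage done by a rejected alternate
    # can be undone before the next alternate runs.
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        try:
            alternate(state)  # works by assigning to stored variables
            if acceptance_test(checkpoint, state):
                return state
        except Exception:
            pass  # a raised error is treated like a failed test
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # repair the damage
    raise RecoveryBlockFailure("no alternate passed the acceptance test")


AMOUNT = 20  # hypothetical transfer from account "a" to account "b"


def faulty_transfer(state):
    # Hypothetical main component: debits "a" but forgets to credit "b".
    state["a"] -= AMOUNT


def careful_transfer(state):
    # Hypothetical spare of independent design.
    state["a"] -= AMOUNT
    state["b"] += AMOUNT


def money_conserved(before, after):
    # Acceptance test: the total is unchanged and no balance is negative.
    same_total = sum(after.values()) == sum(before.values())
    return same_total and min(after.values()) >= 0


accounts = {"a": 50, "b": 10}
recovery_block(accounts, money_conserved, [faulty_transfer, careful_transfer])
print(accounts)  # {'a': 30, 'b': 30}
```

If the state were not restored after the first alternate was rejected, the spare would start from the damaged balance of 30 in account "a" and its result would also fail the test. Keeping the checkpoint, the test, and the list of alternates inside one small function is one way to keep the extra fault-tolerance code from spreading through the rest of the program.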

Although the basic recovery block scheme has already been described elsewhere [5], it is convenient to include a brief account of it here. We will then describe several extensions to the scheme directed at more complicated situations than the basic scheme was intended for. Thus we start by considering the problems of fault tolerance, i.e., of error detection and recovery, within a single sequential process in which assignments to stored variables provide the only means of making recognizable progress. Consideration of the problems of communication with other processes, either within the computing system (e.g., by a system of passing messages, or the use of shared storage) or beyond the computing system (e.g., by explicit input-output statements), is deferred until a later section.

The progress of a program is by its execution of sequences of the basic operations of the computer. Clearly, error checking for each basic operation is out of the question. Apart from questions of expense, absence of an awareness of the wider scene would make it difficult to formulate the checks. We must aim at achieving a tolerable quantity of checking and exploit our knowledge of the functional structure of the system to distribute these checks to best advantage. It is standard practice to structure the text of a program of any significant complexity into a set of blocks (by which term we include module, procedure, subroutine, paragraph, etc.) in order to simplify the task of understanding and documenting the program. Such a structure allows one to provi
