Chapter 3 Diagnostic and Remedial Measures, Set 1
It is important to examine the aptness of the model for the data before inferences based on that model are undertaken. In this chapter, we discuss some simple graphical methods for studying the appropriateness of a model, as well as some statistical tests for doing so. We also consider some remedial techniques for when model (2.1) is not appropriate for the data.
Lecture 6 Residuals
Departures from model (2.1) (the simple linear regression model with normal errors) to be studied by residuals
1. The regression function is not linear.
2. The error terms do not have constant variance.
3. The error terms are not independent.
4. The model fits all but one or a few outlier observations.
5. The error terms are not normally distributed.
6. One or several important predictor variables have been omitted from the model.
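Residuals
Suppose the data set $(X_1, Y_1), \dots, (X_n, Y_n)$ has been used to fit the least squares regression line $\hat{Y} = b_0 + b_1 X$.
The $i$th residual is $e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$.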
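Lemma 1. Under model (2.1), for $i = 1, \dots, n$, the residual $e_i$ follows a normal distribution with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right]$.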
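Note:
When $n$ is large and the X's are well spaced, $\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \approx 0$ and $\mathrm{Var}(e_i) \approx \sigma^2$.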
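The variance stated in Lemma 1 can be checked with a short calculation. The sketch below assumes the standard representation of the fitted value as a linear combination of the responses, $\hat{Y}_i = \sum_j h_{ij} Y_j$, and uses the independence and common variance $\sigma^2$ of the $Y_j$ under model (2.1). The weights $h_{ij}$ and the shorthand $S_{XX}$ are notation introduced here for illustration only.

```latex
\begin{aligned}
h_{ij} &= \frac{1}{n} + \frac{(X_i - \bar{X})(X_j - \bar{X})}{S_{XX}},
  \qquad S_{XX} = \sum_{k=1}^{n} (X_k - \bar{X})^2, \\
e_i &= Y_i - \hat{Y}_i = (1 - h_{ii})\, Y_i - \sum_{j \ne i} h_{ij} Y_j, \\
\mathrm{Var}(e_i) &= \sigma^2 \Big[ (1 - h_{ii})^2 + \sum_{j \ne i} h_{ij}^2 \Big]
  = \sigma^2 \Big[ 1 - 2 h_{ii} + \sum_{j=1}^{n} h_{ij}^2 \Big], \\
\sum_{j=1}^{n} h_{ij}^2 &= \frac{1}{n} + \frac{(X_i - \bar{X})^2}{S_{XX}} = h_{ii}
  \quad \text{(the cross term vanishes because } \textstyle\sum_j (X_j - \bar{X}) = 0\text{)}, \\
\mathrm{Var}(e_i) &= \sigma^2 (1 - h_{ii})
  = \sigma^2 \Big[ 1 - \frac{1}{n} - \frac{(X_i - \bar{X})^2}{S_{XX}} \Big].
\end{aligned}
```

Since $h_{ii}$ shrinks toward 0 when $n$ is large and the X's are well spaced, the residual variance is then close to $\sigma^2$.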
As we already know, $\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} e_i^2 / (n - 2) = \mathrm{MSE}$.
Remarks:
The residuals $e_1, \dots, e_n$ are in a sense estimators of the unobserved random error terms $\varepsilon_1, \dots, \varepsilon_n$.
Studentized and Semi-studentized Residuals
Let $s\{e_i\}$ denote the estimated standard deviation (i.e., standard error) of the $i$th residual. Then the $i$th Studentized residual is given by $r_i = \frac{e_i}{s\{e_i\}}$.
Such modified residuals are said to be Studentized since they are obtained from the $i$th residual by subtracting the mean $E(e_i) = 0$ and dividing by its standard error. This procedure mimics the procedure used to compute the test statistic for testing a hypothesis about the mean of a normal population when the population variance is unknown. Such a test statistic has a Student t distribution (with $n - 1$ degrees of freedom). The modified residual defined above does not actually have a Student t distribution ($e_i$ and SSE are not independent), but it is obtained by the same type of transformation used to construct random variables having the Student t distribution. Thus the name, Studentized.
Remarks
1. On page 103 of your textbook, the authors discuss semi-studentized residuals. The $i$th semi-studentized residual is simply $e_i^* = \frac{e_i}{\sqrt{\mathrm{MSE}}}$.
When $n$ is large and the X's are well spaced, $s\{e_i\} \approx \sqrt{\mathrm{MSE}}$ so that $r_i \approx e_i^*$. The semi-studentized residuals are certainly easier to compute (assuming that you had to make the computation yourself), but SAS will compute the actual Studentized residuals upon request, so why not go with the “real thing”?
2. SAS can also compute what SAS calls Rstudentized residuals. The difference between an ordinary studentized residual and an Rstudentized residual is that for the Rstudentized residual $\mathrm{MSE}$ is replaced with $\mathrm{MSE}_{(i)}$, where $\mathrm{MSE}_{(i)}$ is the error mean square of the regression fitted with the $i$th observation deleted.
Rstudentized residuals were proposed by Belsley, Kuh and Welsch in their book, Regression Diagnostics (Wiley, 1980). In the context of simple regression, they claim that for each $i$, Rstudent has approximately a Student t distribution with $n - 3$ degrees of freedom. Rstudent residuals are good for detecting outliers since an observation with a large residual tends to inflate the $\mathrm{MSE}$. Deleting this observation in computing $\mathrm{MSE}_{(i)}$ means that $\mathrm{MSE}_{(i)} < \mathrm{MSE}$.
However, the authors of your text do not discuss Rstudentized residuals (they call them deleted studentized residuals) until Chapter 9 or 10, so to keep things simple, we will wait till then to discuss this type of residual.
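As a rough numerical illustration, an Rstudent residual can be obtained from the ordinary studentized residual without refitting the regression, through the algebraic identity $t_i = r_i \sqrt{(n - 3) / (n - 2 - r_i^2)}$ for simple regression. The sketch below applies it to the studentized residuals reported in the SAS output of the vehicle example later in this lecture. Python is used only as a calculator here; the function name `rstudent` and the symbol $t_i$ are chosen for illustration, and this is not necessarily how SAS computes the values internally.

```python
from math import sqrt


def rstudent(r, n, p=2):
    """Convert an ordinary studentized residual r into an Rstudent residual.

    Uses the identity t = r * sqrt((n - p - 1) / (n - p - r**2)), where p is
    the number of regression parameters (2 in simple linear regression).
    Assumes r**2 < n - p, which holds for every value used below.
    """
    return r * sqrt((n - p - 1) / (n - p - r * r))


# Studentized residuals transcribed from the SAS output of the vehicle example.
student = [0.668, -1.243, -0.313, 0.605, -0.0337,
           1.080, 0.899, -1.195, 1.237, -1.641]
n = len(student)

t_values = [rstudent(r, n) for r in student]
for i, (r, t) in enumerate(zip(student, t_values), start=1):
    print(f"obs {i:2d}: studentized = {r:7.3f}   Rstudent = {t:7.3f}")
```

Residuals with $|r_i| > 1$ are pushed further from zero and those with $|r_i| < 1$ are pulled toward it, which is why Rstudent residuals make a single large residual stand out more.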
Residual Plots
In simple linear regression (one predictor variable), residuals are usually plotted against their corresponding $X_i$ value or against their corresponding predicted value $\hat{Y}_i$. (In multiple regression, where there are many different X's, residuals are usually plotted against $\hat{Y}_i$.) Residuals are never plotted against the corresponding actual $Y_i$ because these terms have a positive covariance, which would appear as a positive trend in the plot. On the other hand, $e_i$ and $\hat{Y}_i$ are uncorrelated.
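($\mathrm{Cov}(Y_i, \hat{Y}_i) = \mathrm{Var}(\hat{Y}_i)$. So, $\mathrm{Cov}(e_i, Y_i) = \mathrm{Var}(Y_i) - \mathrm{Cov}(\hat{Y}_i, Y_i) = \mathrm{Var}(e_i) > 0$ and $\mathrm{Cov}(e_i, \hat{Y}_i) = \mathrm{Cov}(Y_i, \hat{Y}_i) - \mathrm{Var}(\hat{Y}_i) = 0$.)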
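The sample analogue of this fact can be checked on the vehicle data. The sketch below takes the observed and predicted MPG values from the Output Statistics table in the example that follows and computes the centered cross products of the residuals with the fitted and with the observed values. The helper name `centered_cross_product` is made up for this illustration.

```python
# Observed and predicted MPG values transcribed from the SAS output table.
y = [18.3, 15.9, 16.4, 17.5, 15.5, 18.8, 16.8, 16.5, 16.5, 17.8]
y_hat = [18.0929, 16.3045, 16.5032, 17.2981, 15.5097,
         18.4903, 16.5032, 16.9006, 16.1058, 18.2916]

# Residuals e_i = Y_i - Yhat_i.
e = [yi - fi for yi, fi in zip(y, y_hat)]


def centered_cross_product(u, v):
    """Return sum((u_i - mean(u)) * (v_i - mean(v))), the numerator of the sample covariance."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


cross_e_fitted = centered_cross_product(e, y_hat)   # essentially zero
cross_e_observed = centered_cross_product(e, y)     # clearly positive

print(f"residuals vs fitted values:   {cross_e_fitted:.4f}")
print(f"residuals vs observed values: {cross_e_observed:.4f}")
```

Up to rounding in the transcribed values, the first quantity is zero, while the second equals the sum of squared residuals, so a plot of residuals against the observed responses would show an upward trend even when the model is adequate.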
If the assumptions of model (2.1) are correct for the data, the residuals (plotted on the ordinate) against either $X_i$ or $\hat{Y}_i$ (on the abscissa) should be randomly distributed about the horizontal axis.
Example:
In the vehicle weight versus MPG example, the residuals and the studentized residuals are given in the output of the following SAS code:
PROC REG DATA=Cars;
   MODEL mpg = weight / R;   /* R requests the residual analysis table */
   OUTPUT out=Carsout P=PredMPG R=Residual Student=Stud_Res Rstudent=Rstud_Res;
RUN;
goptions reset=global gunit=pct border
         ftext=swissb htitle=6 htext=3
         hsize=8in vsize=5in cback=white;
/* graphing symbols, their interpolations and colors */
symbol1 v=circle h=3 c=red;
symbol2 v=square h=4 c=green;
symbol3 v=diamond h=3 c=red;
run;
/* all three kinds of residuals against the predictor, overlaid on one plot */
title color=blue 'Stud_res, Rstud_res and residual vs. weight';
PROC GPLOT DATA=Carsout;
   PLOT Residual*weight=2 Stud_Res*weight=1 Rstud_Res*weight=3 / legend overlay Vref=0;
RUN;
/* studentized residuals against the predicted values */
title color=blue 'Stud_res vs. PredMPG';
PROC GPLOT DATA=Carsout;
   PLOT Stud_Res*Predmpg=3 / Vref=0;
RUN;
Output Statistics

       Dependent   Predicted    Std Error                 Std Error     Student                  Cook's
 Obs    Variable       Value Mean Predict     Residual     Residual    Residual    -2-1 0 1 2         D
   1     18.3000     18.0929       0.1696       0.2071        0.310       0.668    |    |*   |    0.067
   2     15.9000     16.3045       0.1381      -0.4045        0.325      -1.243    |  **|    |    0.139
   3     16.4000     16.5032       0.1259      -0.1032        0.330      -0.313    |    |    |    0.007
   4     17.5000     17.2981       0.1171       0.2019        0.334       0.605    |    |*   |    0.023
   5     15.5000     15.5097       0.2067    -0.009677        0.287     -0.0337    |    |    |    0.000
   6     18.8000     18.4903       0.2067       0.3097        0.287       1.080    |    |**  |    0.303
   7     16.8000     16.5032       0.1259       0.2968        0.330       0.899    |    |*   |    0.059
   8     16.5000     16.9006       0.1124      -0.4006        0.335      -1.195    |  **|    |    0.080
   9     16.5000     16.1058       0.1529       0.3942        0.319       1.237    |    |**  |    0.176
  10     17.8000     18.2916       0.1876      -0.4916        0.300      -1.641    | ***|    |    0.528

Sum of Residuals                      0
Sum of Squared Residuals              0.99961
Predicted Residual SS (PRESS)         1.60514
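The Student Residual column can be reproduced by hand from the two columns to its left. The sketch below, with Python used only as a calculator, divides each residual by its own standard error to get the studentized residual, and by the common $\sqrt{\mathrm{MSE}}$ to get the semi-studentized residual. All numbers are transcribed from the table above, so small discrepancies from rounding are expected; the variable names are illustrative.

```python
from math import sqrt

# Columns transcribed from the Output Statistics table.
residual = [0.2071, -0.4045, -0.1032, 0.2019, -0.009677,
            0.3097, 0.2968, -0.4006, 0.3942, -0.4916]
se_residual = [0.310, 0.325, 0.330, 0.334, 0.287,
               0.287, 0.330, 0.335, 0.319, 0.300]
student_sas = [0.668, -1.243, -0.313, 0.605, -0.0337,
               1.080, 0.899, -1.195, 1.237, -1.641]

n = len(residual)
sse = sum(e * e for e in residual)   # sum of squared residuals
mse = sse / (n - 2)                  # two parameters are estimated
root_mse = sqrt(mse)

# Studentized: each residual divided by its own standard error.
studentized = [e / s for e, s in zip(residual, se_residual)]
# Semi-studentized: every residual divided by the same sqrt(MSE).
semi_studentized = [e / root_mse for e in residual]

for i in range(n):
    print(f"obs {i + 1:2d}: studentized = {studentized[i]:7.3f}  "
          f"(SAS {student_sas[i]:7.3f})  semi-studentized = {semi_studentized[i]:7.3f}")
```

In this small sample every semi-studentized residual is smaller in magnitude than the corresponding studentized residual, because each standard error of a residual is below $\sqrt{\mathrm{MSE}}$.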
Nonlinearity of the Regression Function
If the plot of the residuals (or the studentized residuals) against the predictor variable (or the predicted response variable, $\hat{Y}$) is not randomly distributed about the horizontal axis, it could be an indication that the regression function is nonlinear. Nonlinearity of the regression function can also be ascertained from the scatter plot, but the scatter plot is not always as effective as a residual plot. See Figure 3.3 on page 105. Also see Figure 3.4(b) on page 106.
Note:
The (studentized) residual plots in our vehicle weight vs. MPG example show a more or less random dispersion about the horizontal axis. Thus the linear model in this instance appears to be adequate. This conclusion is supported by the relatively small root mean square error of 0.35348 and the relatively high $R^2$.
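Both summary measures can be recomputed from the observed and predicted values listed in the SAS output. The following is a minimal sketch, assuming the usual definitions $\sqrt{\mathrm{MSE}} = \sqrt{\mathrm{SSE}/(n-2)}$ and $R^2 = 1 - \mathrm{SSE}/\mathrm{SSTO}$; the variable names are illustrative and the inputs are the rounded values printed by SAS.

```python
from math import sqrt

# Observed and predicted MPG values transcribed from the SAS output table.
y = [18.3, 15.9, 16.4, 17.5, 15.5, 18.8, 16.8, 16.5, 16.5, 17.8]
y_hat = [18.0929, 16.3045, 16.5032, 17.2981, 15.5097,
         18.4903, 16.5032, 16.9006, 16.1058, 18.2916]

n = len(y)
y_bar = sum(y) / n

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # error sum of squares
ssto = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

root_mse = sqrt(sse / (n - 2))
r_squared = 1 - sse / ssto

print(f"root MSE = {root_mse:.5f}")
print(f"R^2      = {r_squared:.4f}")
```

With these inputs the root mean square error agrees with the 0.35348 quoted above, and $R^2$ comes out at roughly 0.90.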
Nonconstancy of Error Variance
A plot of the residuals (or the studentized residuals) against the predictor variable $X$ or the predicted response variable $\hat{Y}$ is also useful in assessing whether or not the error variance is constant as assumed in the model. If the magnitude of the residuals tends to increase or (less likely) to decrease as $X$ increases, this is indicative of a situation in which the error variance is changing as the value of the independent variable $X$ changes. Since $\hat{Y}$ is linearly related to the predictor variable $X$, a similar statement can be made regarding a plot of the residuals (or studentized residuals) against $\hat{Y}$. Systematic changes in the magnitude of the residuals violate the assumption that the errors have constant variance. A “wedge shaped” residual plot as in Figure 3.4(c), page 106, would typify this situation. A less likely, but possible, situation is that the error variance decreases as $X$ increases. This situation would result in a “reversed wedge” plot.
Note:
The studentized residual plots in the weight-MPG example do not exhibit any wedge-shaped pattern, which indicates that the variance is more or less constant. However, the residual plot in Figure 3.5, page 107, shows a tendency for the variability of the residuals to increase with $X$.
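An informal numeric companion to looking for a wedge is to compare the typical size of the residuals at low and at high predicted values. The sketch below does this for the vehicle data by sorting the cases on the predicted value, splitting them into two halves and comparing the mean absolute residual in each. This split-half comparison is an assumption made here for illustration, not a formal test and not a procedure taken from the textbook.

```python
# Observed and predicted MPG values transcribed from the SAS output table.
y = [18.3, 15.9, 16.4, 17.5, 15.5, 18.8, 16.8, 16.5, 16.5, 17.8]
y_hat = [18.0929, 16.3045, 16.5032, 17.2981, 15.5097,
         18.4903, 16.5032, 16.9006, 16.1058, 18.2916]

abs_resid = [abs(yi - fi) for yi, fi in zip(y, y_hat)]

# Case indices ordered from the smallest to the largest predicted value.
order = sorted(range(len(y_hat)), key=lambda i: y_hat[i])
half = len(order) // 2
lower, upper = order[:half], order[half:]

mean_abs_lower = sum(abs_resid[i] for i in lower) / len(lower)
mean_abs_upper = sum(abs_resid[i] for i in upper) / len(upper)
ratio = mean_abs_upper / mean_abs_lower

print(f"mean |residual|, lower half of predicted values: {mean_abs_lower:.3f}")
print(f"mean |residual|, upper half of predicted values: {mean_abs_upper:.3f}")
print(f"ratio upper / lower: {ratio:.2f}")
```

A ratio far above 1 would be in line with a wedge and a ratio far below 1 with a reversed wedge. With only ten cases the comparison is crude, and the residual plot remains the primary tool.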
Outliers
Outliers are extreme observations. They can be identified by residual plots against either $X$ or $\hat{Y}$. Studentized plots are particularly helpful in this context. A rough rule of thumb (when $n$ is large) is to consider an observation $i$ whose studentized residual $r_i$ exceeds a chosen cutoff in absolute value to be an outlier. Actually this rule is rather conservative. A more aggressive rule is to declare observation $i$ to be an outlier at a smaller cutoff for $|r_i|$, and some statisticians recommend using 2.5. More refined procedures for identifying outliers will be discussed in Chapter 10.
The big question is “Should outliers, once identified, be discarded?” It is always tempting to discard outliers since they tend to destroy the least squares fit, particularly in small to moderate samples. So, the residual plots may improperly suggest a lack of fit of the linear regression model, in addition to flagging the outlier. Figure 3.7, page 109, clearly illustrates this situation.
However, in most statisticians’ opinions, outliers should be discarded if and only if
1. The observation causing the outlier involves a data input error, or
2. The observation causing the outlier involves an extraneous case.
By an extraneous case we mean that the outlying observation was collected under conditions substantially different from those of the other observations. Unfortunately, the statistician may not always be able to ascertain whether or not situation 1 and/or 2 pertains.
The automatic discarding of outliers can result in overfitting the linear model to the remaining data points. Furthermore, outliers may convey significant information, such as when the outlier is the result of interaction with some other predictor variable that is not included in the model.
Note:
Figure 3.6, page 108, shows a residual plot with an outlier. There are no outliers in our vehicle data. Refer to the residual and studentized residual plots.
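The same conclusion can be reached numerically by screening the studentized residuals from the SAS output against a cutoff. The sketch below uses 2.5, the value that the lecture says some statisticians recommend; the constant name `CUTOFF` and the list name `flagged` are illustrative.

```python
# Studentized residuals transcribed from the SAS output table.
student = [0.668, -1.243, -0.313, 0.605, -0.0337,
           1.080, 0.899, -1.195, 1.237, -1.641]

CUTOFF = 2.5  # one of the rules of thumb mentioned in the lecture

# Observation numbers (1-based) whose studentized residual exceeds the cutoff.
flagged = [i for i, r in enumerate(student, start=1) if abs(r) > CUTOFF]
largest = max(abs(r) for r in student)

print(f"largest |studentized residual| = {largest:.3f}")
print(f"observations flagged at cutoff {CUTOFF}: {flagged}")
```

No observation is flagged: the largest studentized residual in magnitude is 1.641 (observation 10), well below the cutoff.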
Nonindependence of the Error Terms
Although the actual error terms are assumed to be