SAS Data Analysis Examples
Robust Regression
Robust regression is an alternative to least squares regression when data is contaminated with outliers or influential observations, and it can also be used for the purpose of detecting influential observations.
Please note:
The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.
This page was developed using SAS 9.2.
Introduction
Let's begin our discussion on robust regression with some terms in linear regression.
Residual:
The difference between the predicted value (based on the regression equation) and the actual, observed value.
Outlier:
In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
Leverage:
An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimate of regression coefficients.
Influence:
An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.
Cook's distance (or Cook's D):
A measure that combines the information of leverage and residual of the observation.
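For readers who prefer formulas, the terms above can be written out roughly as follows. This is a sketch in common textbook notation, not something taken from the original page: X is assumed to be the design matrix including the intercept column, s the root mean squared error of the OLS fit, p the number of estimated coefficients, and r_i the internally studentized residual. The residual is written with the usual sign convention, observed minus predicted.

```latex
e_i = y_i - \hat{y}_i, \qquad
h_{ii} = \left[ X (X^\top X)^{-1} X^\top \right]_{ii}, \qquad
r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}, \qquad
D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}
```

As a rough check, the model used later on this page has two predictors plus an intercept, so p = 3. Mississippi's values from the listings further down (studentized residual -3.56299, leverage 0.12669) give D of about (12.6949 / 3) x (0.12669 / 0.87331), or roughly 0.614, in line with the Cook's D of 0.61387 printed there.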
Robust regression can be used in any situation in which you would use least squares regression. When fitting a least squares regression, we might find some outliers or high leverage data points. We have decided that these data points are not data entry errors, nor are they from a different population than most of our data. So we have no compelling reason to exclude them from the analysis. Robust regression might be a good strategy since it is a compromise between excluding these points entirely from the analysis and including all the data points and treating them all equally in OLS regression. The idea of robust regression is to weigh the observations differently based on how well behaved these observations are. Roughly speaking, it is a form of weighted and reweighted least squares regression.
The SAS procedure proc robustreg implements several versions of robust regression. On this page, we will show M-estimation with Huber and bisquare weighting. These two are very standard and are combined as the default weighting function in Stata's robust regression command. In Huber weighting, observations with small residuals get a weight of 1, and the larger the residual, the smaller the weight. With bisquare weighting, all cases with a non-zero residual get down-weighted at least a little.
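The two weighting schemes can be sketched as weight functions of a scaled residual u (the residual divided by a robust estimate of scale). The notation and the tuning constant k are assumptions added here for illustration; values of about 1.345 for Huber and 4.685 for bisquare are commonly quoted, but check the documentation of your software for its defaults.

```latex
w_{\mathrm{Huber}}(u) =
\begin{cases}
1, & |u| \le k \\[4pt]
\dfrac{k}{|u|}, & |u| > k
\end{cases}
\qquad
w_{\mathrm{bisquare}}(u) =
\begin{cases}
\left( 1 - \left( \dfrac{u}{k} \right)^{2} \right)^{2}, & |u| \le k \\[4pt]
0, & |u| > k
\end{cases}
```

Written this way, the verbal description is easy to see: the Huber weight stays at exactly 1 until |u| exceeds k, while the bisquare weight drops below 1 as soon as u is non-zero and reaches 0 for sufficiently large residuals.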
Description of the example data
For our data analysis below, we will use the data set crime. This dataset appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to use poverty and single to predict crime.
data crime;
  infile "crime.csv" delimiter="," firstobs=2;
  input sid state $ crime murder pctmetro pctwhite pcths poverty single;
run;

proc means data=crime;
  var crime poverty single;
run;
The MEANS Procedure

Variable     N           Mean        Std Dev        Minimum        Maximum
---------------------------------------------------------------------------
crime       51    612.8431373    441.1003229     82.0000000        2922.00
poverty     51     14.2588235      4.5842415      8.0000000     26.4000000
single      51     11.3254902      2.1214941      8.4000000     22.1000000
---------------------------------------------------------------------------
Using robust regression analysis
In most cases, we begin by running an OLS regression and doing some diagnostics, so we will start there. We create a graph showing the leverage versus the squared residuals, labeling the points with the state abbreviations. To do so, we output the studentized residuals and leverage in proc reg (along with Cook's D, which we will use later).
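proc reg data=crime;
  model crime = poverty single;
  /* save studentized residuals, Cook's D and leverage in data set t */
  output out=t student=res cookd=cookd h=lev;
run;
quit;

data t;
  set t;
  resid_sq = res*res;   /* squared residual for the plot */
run;

proc sgplot data=t;
  scatter y=lev x=resid_sq / datalabel=state;
run;
quit;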
As we can see, DC, Florida and Mississippi have either high leverage or large residuals. We can display the observations that have relatively large values of Cook's D. A conventional cut-off point is 4/n, where n is the number of observations in the data set. We will use this criterion to select the values to display.
proc print data=t;
  where cookd > 4/51;
  var state crime poverty single cookd;
run;
Obs    state    crime    poverty    single      cookd
  1     ak        761        9.1      14.3    0.12547
  9     fl       1206       17.8      10.6    0.14259
 25     ms        434       24.7      14.7    0.61387
 51     dc       2922       26.4      22.1    2.63625
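The cutoff 4/51 is hard-coded in the step above. A minimal sketch of computing 4/n from the data instead, so the filter still works if rows are added or dropped, might look like this. The macro variable name n_obs is made up for this illustration, and count(*) counts every row of t, including any with missing values that proc reg would have skipped.

```sas
/* Count the rows of t and store the count in a macro variable.       */
/* n_obs is a name chosen for this sketch, not part of the original.   */
proc sql noprint;
  select count(*) into :n_obs from t;
quit;

/* Same listing as before, with the cutoff 4/n computed from the data. */
proc print data=t;
  where cookd > 4 / &n_obs;
  var state crime poverty single cookd;
run;
```

With the 51 observations used here the cutoff is about 0.078, so the same four rows should be selected.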
We probably should drop DC to begin with since it is not even a state. We include it in the analysis just to show that it has a large Cook's D and to demonstrate how it will be handled by proc robustreg. Now we will look at the residuals. We will generate a new variable called rabs, which is the absolute value of the residuals (because the sign of the residual doesn't matter). We then print the ten observations with the highest absolute residual values.
data t2;
  set t;
  rabs = abs(res);
run;

proc sort data=t2;
  by descending rabs;
run;

proc print data=t2(obs=10);
run;
Obs  sid  state  crime  murder  pctmetro  pctwhite  pcths  poverty  single       res    cookd      lev  resid_sq     rabs
  1   25   ms      434    13.5      30.7      63.3   64.3     24.7    14.7  -3.56299  0.61387  0.12669   12.6949  3.56299
  2    9   fl     1206     8.9      93.0      83.5   74.4     17.8    10.6   2.90266  0.14259  0.04832    8.4255  2.90266
  3   51   dc     2922    78.5     100.0      31.8   73.1     26.4    22.1   2.61645  2.63625  0.53602    6.8458  2.61645
  4   46   vt      114     3.6      27.0      98.4   80.8     10.0    11.0  -1.74241  0.04272  0.04050    3.0360  1.74241
  5   26   mt      178     3.0      24.0      92.6   81.0     14.9    10.8  -1.46088  0.01676  0.02301    2.1342  1.46088
  6   21   me      126     1.6      35.7      98.5   78.8     10.7    10.6  -1.42674  0.02233  0.03186    2.0356  1.42674
  7    1   ak      761     9.0      41.8      75.2   86.6      9.1    14.3  -1.39742  0.12547  0.16161    1.9528  1.39742
  8   31   nj      627     5.3     100.0      80.8   76.7     10.9     9.6   1.35415  0.02229  0.03519    1.8337  1.35415
  9   14   il      960    11.4      84.0      81.0   76.2     13.6    11.5   1.33819  0.01266  0.02076    1.7908  1.33819
 10   20   md      998    12.7      92.8      68.9   78.4      9.7    12.0   1.28709  0.03570  0.06072    1.6566  1.28709
Now let's run our first robust regression. Robust regression is done by iterated re-weighted least squares (IWLS). The procedure for running robust regression is proc robustreg. There are a couple of weighting functions that can be used for IWLS. We are going to first use the Huber weights in this example. We can save the final weights created by the IWLS process, which can be very useful. We will use the data set t2 generated above.
proc robustreg data=t2 method=m(wf=huber);
  model crime = poverty single;
  output out=t3 weight=wgt;
run;
Model Information

Data Set                           WORK.T2
Dependent Variable                 crime
Number of Independent Variables    2
Number of Observations             51
Method                             M Estimation

Number of Observations Read        51
Number of Observations Used        51

Summary Statistics

                                                        Standard
Variable         Q1     Median         Q3       Mean   Deviation        MAD
poverty     10.7000    13.1000    17.4000    14.2588      4.5842     4.2995
single      10.0000    10.9000    12.1000    11.3255      2.1215     1.4826
crime         326.0      515.0      780.0      612.8       441.1      345.4

Parameter Estimates

                              Standard      95% Confidence        Chi-
Parameter   DF    Estimate       Error          Limits           Square   Pr > ChiSq
Intercept    1    -1423.23    167.5099   -1751.54   -1094.91      72.19       <.0001
poverty      1      8.8694      8.0429    -6.8944    24.6331       1.22       0.2701
single       1    169.0012     17.3795   134.9381   203.0644      94.56       <.0001
Scale        1    181.7251

Diagnostics Summary

Observation
Type           Proportion    Cutoff
Outlier            0.0392    3.0000

Goodness-of-Fit

Statistic        Value
R-Square        0.5257
AICR           73.1089
BICR           78.9100
Deviance       2216391
proc sort data=t3;
  by wgt;
run;

proc print data=t3(obs=15);
  var state crime poverty single res cookd lev wgt;
run;
Obs  state  crime  poverty  single       res    cookd      lev      wgt
  1   ms      434     24.7    14.7  -3.56299  0.61387  0.12669  0.28886
  2   fl     1206     17.8    10.6   2.90266  0.14259  0.04832  0.35947
  3   vt      114     10.0    11.0  -1.74241  0.04272  0.04050  0.59545
  4   dc     2922     26.4    22.1   2.61645  2.63625  0.53602  0.64980
  5   mt      178     14.9    10.8  -1.46088  0.01676  0.02301  0.68630
  6   me      126     10.7    10.6  -1.42674  0.02233  0.03186  0.72509
  7   nj      627     10.9     9.6   1.35415  0.02229  0.03519  0.73812
  8   il      960     13.6    11.5   1.33819  0.01266  0.02076  0.76600
  9   ak      761      9.1    14.3  -1.39742  0.12547  0.16161  0.78039
 10   md      998      9.7    12.0   1.28709  0.03570  0.06072  0.79570
 11   ma      805     10.7    10.9   1.19854  0.01640  0.03311  0.83933
 12   la     1062     26.4    14.9  -1.02183  0.06700  0.16143  0.91528
 13   ca     1078     18.2    12.5   1.01521  0.01231  0.03458  1.00000
 14   wy      286     13.3    10.8  -0.96626  0.00667  0.02099  1.00000
 15   sc     1023     18.7    12.3   0.91213  0.01111  0.03853  1.00000
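The weights in this listing can be approximately reproduced by hand, which may help make the Huber scheme concrete. The sketch below assumes a Huber tuning constant of 1.345 and copies the coefficients and scale from the Parameter Estimates table, so its results are only as precise as that printed output. The data set name wcheck and the variables fitted, u and w_huber are invented for this illustration.

```sas
/* Recompute Huber weights from the printed robust estimates.          */
/* Assumptions: tuning constant 1.345; coefficients and scale copied   */
/* from the Parameter Estimates table, so results are rounded.         */
data wcheck;
  set t3;
  fitted = -1423.23 + 8.8694*poverty + 169.0012*single;
  u = (crime - fitted) / 181.7251;   /* residual divided by Scale      */
  if abs(u) <= 1.345 then w_huber = 1;
  else w_huber = 1.345 / abs(u);
run;

/* Compare the saved weights with the recomputed ones. */
proc print data=wcheck(obs=5);
  var state wgt w_huber;
run;
```

Worked by hand for Mississippi, the fitted value is about 1280, the scaled residual u is about -4.66, and 1.345 / 4.66 is about 0.289, which agrees with the printed weight of 0.28886 up to rounding.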
We can see that, roughly, as the absolute residual goes down, the weight goes up. In other words, cases with large residuals tend to be down-weighted. We can also see that the values of Cook's D don't really correspond to the weights. This output shows us that the obser