Peking University Summer Course, Linear Regression Analysis: Lecture Notes PKU8
Class 8: Polynomial Regression and Dummy Variables
I. Polynomial Regression
Polynomial regression is a minor topic because there is little that is new. What is new is that you may want to create a new variable (for example, the square of an existing variable) from the same data set.
This is necessary if you think that the true regression function is not linear but quadratic. In that case you might want to use the quadratic function, that is, both the first-order and the second-order regressors.
For example, we know that earnings increase as a function of age, but the relationship is not linear. Therefore, we regress earnings on age and age². One important point is that in a polynomial regression, the regression line is no longer straight when you plot the dependent variable against the independent variable.
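As a minimal sketch of "creating a new variable from the same data set", the Python below squares a hypothetical age variable and fits earnings on age and age² by solving the normal equations. The data, the coefficient values, and the helper names `solve` and `ols` are illustrative assumptions, not part of the lecture. Exact fractions are used so that the fitted coefficients can be compared without rounding error.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

def ols(columns, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

# Hypothetical data: earnings generated exactly from a quadratic in age.
age = [Fraction(a) for a in (20, 30, 40, 50, 60)]
earnings = [-10 + 2 * a - Fraction(1, 50) * a * a for a in age]

ones = [Fraction(1)] * len(age)
age_sq = [a * a for a in age]  # the new variable created from age
b0, b1, b2 = ols([ones, age, age_sq], earnings)
print(b0, b1, b2)
```

Because the hypothetical earnings contain no error term, the fit returns the generating coefficients; with real data the same two regressors would be used, but the fit would not be exact.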
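Use a hypothetical example:
[equation]
If we obtain
[fitted equation]
then one regressor has a linear effect and the other has a quadratic effect. If you are asked to plot the dependent variable against the first regressor, the line is straight. If you are asked to plot it against the second regressor, the line is quadratic. Say the sample mean of [variable] is 0.5.
[table; values 0, 1, 2, 3, 5, 8]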
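Interpretation of coefficients in quadratic equations
Say $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$.
Important: there is no simple relationship between $x$ and $y$. Sometimes the effect of $x$ on $y$ is positive, and sometimes the effect of $x$ on $y$ is negative. In other words, the effect of $x$ on $y$ depends on the value of $x$.
Suggestion: plot the regression for the data range.
One thing we can tell:
When $\beta_2 > 0$, the effect of $x$ on $y$ increases with $x$;
When $\beta_2 < 0$, the effect of $x$ on $y$ decreases with $x$.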
[figure]
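The dependence of the effect on the value of the regressor can be sketched numerically. The snippet below takes the "effect" to be the slope of the quadratic, b1 + 2*b2*x, and evaluates it over a small range. The coefficient values and the function name `marginal_effect` are hypothetical and chosen only for illustration.

```python
def marginal_effect(b1, b2, x):
    """Slope of y = b0 + b1*x + b2*x**2 at x, i.e. b1 + 2*b2*x."""
    return b1 + 2 * b2 * x

# Hypothetical coefficients with a negative second-order term.
b1, b2 = 4.0, -0.5
effects = [marginal_effect(b1, b2, x) for x in range(0, 9, 2)]
print(effects)  # [4.0, 2.0, 0.0, -2.0, -4.0]
```

With a negative second-order coefficient the effect starts positive, shrinks, and turns negative, which is why plotting the curve over the data range is more informative than reading either coefficient alone.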
II. Interpretation of Coefficients in Polynomial Regression
The relationship between Y and the “polynomial” independent variable is no longer linear.
Recall a special property of the linear function: the relationship between Y and an X (say $X_k$) is constant for all values of this X and of the other X variables:
(1) $\partial Y / \partial X_k = \beta_k$.
In a polynomial regression, this simple relationship no longer holds true. For a quadratic regression, for example,
(2) $Y = \beta_0 + \beta_1 X_k + \beta_2 X_k^2 + \varepsilon$,
we have
(3) $\partial Y / \partial X_k = \beta_1 + 2\beta_2 X_k$,
which depends on the value of $X_k$.
In general, the situation where the simple linear relationship of equation (1) does not hold is called “interaction”, a topic to which we will devote a lecture. For now, let us define interaction as the situation where the “effect” of an independent variable depends on the value of another variable.
In polynomial regressions involving an independent variable of an order higher than 1 (i.e., quadratic or higher), we can interpret this as an implicit interaction of a variable with itself.
Example: earnings as a function of experience. If the quadratic function is true, we can find the value of experience that maximizes earnings (which could be either within the range actually experienced by workers or in a range unlikely to be experienced by workers). Use equation (3) to obtain the year that maximizes earnings: setting $\beta_1 + 2\beta_2 X_k = 0$ gives
$X_k^* = -\beta_1 / (2\beta_2)$.
That is why we would want to see $\beta_1$ and $\beta_2$ to be of different signs.
In Xie and Hannum (1996, Table 1, Model 2), $\beta_1$ = 0.046 and $\beta_2$ = -0.000693. The optimal year of experience is:
$-0.046 / (2 \times (-0.000693)) = 33.2$ years, about retirement age. In the U.S., it is 33.8 years. See Xie and Hannum (1996, p. 955).
Note that before this critical value, the “effect” of $X_k$ on Y is always positive, but the rate of increase declines, up to 33 years.
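The arithmetic behind the 33.2-year figure can be checked with a short sketch. It uses only the two coefficients quoted above; the function name `turning_point` is an illustrative choice.

```python
def turning_point(b1, b2):
    """Value of x at which the slope b1 + 2*b2*x equals zero."""
    return -b1 / (2 * b2)

# Coefficients quoted in the text from Xie and Hannum (1996, Table 1, Model 2).
b1, b2 = 0.046, -0.000693
peak = turning_point(b1, b2)
print(round(peak, 1))  # 33.2

# The slope is still positive, but smaller, as experience approaches the peak.
slope_at_10 = b1 + 2 * b2 * 10
slope_at_30 = b1 + 2 * b2 * 30
```

The two slopes illustrate the last remark in the text: both are positive, and the one at 30 years is much smaller than the one at 10 years.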
III. Defining Dummy Variables
A dummy variable is sometimes called an "indicator variable."
It refers to the following logical coding scheme for a dichotomous variable:
x = 1 if a particular event is true
x = 0 otherwise.
A. Examples of Dummy Variables
Sex (Male):
x1 = 0 if female
x1 = 1 if male
Employment Status:
x2 = 0 if not employed
x2 = 1 if employed
Poverty status:
x3 = 0 if not in poverty, or household income > threshold.
x3 = 1 if in poverty, or household income < threshold.
B. Interpretation of Dummy Coefficients (Intercept?)
1. When a dummy variable is the only independent variable
Interpretation: the intercept is group-specific.
Example:
y = Income,
x1 = 1 if male (x1 = 0 if female)
Regress y on x1:
y = β0 + β1 x1
As we discussed before, a regression should be interpreted in terms of conditional means. Remember from your exercise that if we have 1 as the only regressor, the estimated intercept is identical to the sample mean. If we have a dummy variable in a regression, the estimated coefficient represents the mean difference between the two groups.
Income level of females:
β0
Income level of males:
β0 + β1
β1 is the mean difference in income between males and females.
If we compute the means by sex, we get the same results.
Proof: let us regroup the sample by sex. Divide the sample into two subsamples, females (x1 = 0) and males (x1 = 1), with n1 females and n2 males. Note that n1 + n2 = n.
[equations]
(A sum "over n2" means summation from n1 + 1 to n1 + n2, also denoted by $\sum_{n_2}$.)
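One way to fill in the algebra of this proof is sketched below. It assumes the usual least squares formulas for a single regressor and writes $\bar{y}_f$ and $\bar{y}_m$ for the female and male sample means and $b_0$, $b_1$ for the estimates; these symbols are introduced here for illustration and may differ from the notation used on the blackboard.

```latex
\begin{align*}
\bar{x}_1 &= \frac{n_2}{n}, \qquad
\bar{y} = \frac{n_1\bar{y}_f + n_2\bar{y}_m}{n},\\
\sum_{i=1}^{n}(x_{i1}-\bar{x}_1)^2 &= n_2 - \frac{n_2^2}{n} = \frac{n_1 n_2}{n},\\
\sum_{i=1}^{n}(x_{i1}-\bar{x}_1)\,y_i &= n_2\bar{y}_m - \frac{n_2}{n}\left(n_1\bar{y}_f + n_2\bar{y}_m\right)
  = \frac{n_1 n_2}{n}\left(\bar{y}_m - \bar{y}_f\right),\\
b_1 &= \frac{\sum (x_{i1}-\bar{x}_1)\,y_i}{\sum (x_{i1}-\bar{x}_1)^2} = \bar{y}_m - \bar{y}_f,\\
b_0 &= \bar{y} - b_1\bar{x}_1
  = \frac{n_1\bar{y}_f + n_2\bar{y}_m - n_2\left(\bar{y}_m - \bar{y}_f\right)}{n} = \bar{y}_f.
\end{align*}
```

So the estimated intercept is the female mean and the estimated dummy coefficient is the male mean minus the female mean.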
How about the standard errors?
They can be different (pooled versus group-specific estimator of $\sigma^2$).
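The equivalence between the dummy regression and the group means can also be checked numerically. The sketch below uses a small hypothetical sample (three females, two males) and a helper `simple_ols` written for this illustration. It checks only the coefficients, not the standard errors.

```python
from fractions import Fraction

def simple_ols(x, y):
    """Intercept and slope of the least squares line of y on x."""
    n = len(x)
    x_bar = Fraction(sum(x), n)
    y_bar = Fraction(sum(y), n)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical incomes: three females (x1 = 0) followed by two males (x1 = 1).
male = [0, 0, 0, 1, 1]
income = [10, 12, 14, 20, 24]

b0, b1 = simple_ols(male, income)
female_mean = Fraction(10 + 12 + 14, 3)
male_mean = Fraction(20 + 24, 2)
print(b0, b1)  # 12 10
```

Here the intercept equals the female mean (12) and the slope equals the male mean minus the female mean (22 - 12 = 10).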
2. When a dummy variable is used with other continuous independent variables
Two parallel lines with different intercepts. There exists an overall difference along the entire distribution range of the continuous variables. [blackboard] draw lines.
Assumption: there is no interaction.
Example: income on sex and ability.
IV. When a dummy variable is used with another dummy variable
Four parallel lines with different intercepts.
Assumption: no interaction either between the dummy variables and the continuous variables or between the two dummy variables.
V. Important Difference: Dichotomous Variables Used as Independent Variables and as Dependent Variables
Independent variable: the effect is a shift.
Dependent variable: the linear model cannot be true.
[blackboard] why?
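VI. The Least Squares Estimation with a continuous variable and a dummy variable
The least squares estimation holds up for regressions with dummy variables.
X = [1, x1, x2], where x1 is a continuous variable and x2 is a dummy variable.
$$
X'X =
\begin{bmatrix}
n & \sum x_{i1} & \sum x_{i2} \\
\sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} \\
\sum x_{i2} & \sum x_{i1}x_{i2} & \sum x_{i2}^2
\end{bmatrix}
=
\begin{bmatrix}
n & \sum x_{i1} & n_2 \\
\sum x_{i1} & \sum x_{i1}^2 & \sum_{n_2} x_{i1} \\
n_2 & \sum_{n_2} x_{i1} & n_2
\end{bmatrix}
$$
Here $\sum_{n_2}$ denotes summation over the cases with x2 = 1, and n2 is the total number of cases where x2 = 1 is true.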
All algorithms for the least squares estimation still hold.
Interpretation of β1: the pure effect of x1 net of the overall group difference. Also called the “within-group average effect of x1”.
β1 can be estimated with a 3-step partial regression method:
(1) Regress y on x2 and obtain the residuals, y* (which are the deviations of y from the group means);
(2) Regress x1 on x2 and obtain the residuals, x1* (which are the deviations of x1 from the group means);
(3) Then regress y* on x1*. We obtain β1, which is the pure, partial effect of x1 on y. Remember to adjust for degrees of freedom (by 1) due to x2.
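The three steps above can be sketched in Python and compared with the coefficient from the full regression of y on 1, x1 and x2. The data and the helper names (`solve`, `ols`, `deviations_from_group_mean`) are hypothetical. Exact fractions are used so the two estimates can be compared for strict equality. The degrees-of-freedom adjustment mentioned in step (3) affects standard errors and is not shown.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

def ols(columns, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

def deviations_from_group_mean(values, group):
    """Residuals from regressing values on an intercept and a dummy."""
    out = []
    for v, g in zip(values, group):
        members = [u for u, h in zip(values, group) if h == g]
        out.append(v - Fraction(sum(members), len(members)))
    return out

# Hypothetical data: x1 is continuous, x2 is a dummy.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [0, 0, 0, 1, 1, 1]
y = [3, 4, 8, 9, 13, 12]

ones = [Fraction(1)] * len(y)
full = ols([ones, [Fraction(v) for v in x1], [Fraction(v) for v in x2]],
           [Fraction(v) for v in y])

y_star = deviations_from_group_mean(y, x2)    # step (1)
x1_star = deviations_from_group_mean(x1, x2)  # step (2)
# Step (3): both residual series have mean zero, so no intercept is needed.
partial_slope = (sum(u * v for u, v in zip(x1_star, y_star))
                 / sum(u * u for u in x1_star))
print(full[1], partial_slope)
```

In this sample both routes give a slope of 2 for x1, which is the within-group slope.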
VI. Nominal Variables
Definition: a nominal variable is a classification system. No information about ordering is assumed or utilized. Numerical values for a nominal variable are arbitrary, used only for classification or identification.
For a nominal independent variable with J categories, we use a set of J-1 dummy variables in regression analysis.
Say a variable x has three categories; we need to use 2 dummy variables (in addition to the intercept):
x1 = 1 if x = 2
x1 = 0 otherwise
x2 = 1 if x = 3
x2 = 0 otherwise
For example, for the variable Race:
Race(2) = 1 if Black
Race(3) = 1 if Asian
In this case, White is the excluded category.
Alternatively,
Race(Black) = 1 if Black
Race(Asian) = 1 if Asian
Dummy variables for a nominal variable should appear together in the model (in interactions, for example). They cannot have interactions with each other because they do not overlap.
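A minimal sketch of this coding scheme in Python follows. The function name `make_dummies` and the five-person sample are hypothetical; the point is only that J categories yield J-1 dummy variables and that the dummies never overlap.

```python
def make_dummies(values, reference):
    """Return {category: 0/1 list} for every category except the reference."""
    categories = sorted(set(values))
    return {
        c: [1 if v == c else 0 for v in values]
        for c in categories
        if c != reference
    }

# Hypothetical sample with three categories; White is the excluded category.
race = ["White", "Black", "Asian", "Black", "White"]
dummies = make_dummies(race, reference="White")
print(dummies)  # {'Asian': [0, 0, 1, 0, 0], 'Black': [0, 1, 0, 1, 0]}
```

Since no case can be 1 on both dummies, their product is zero for every case, which is why the dummies for one nominal variable cannot interact with each other.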
Regression: y is income
y = β0 + β1 Race(Black) + β2 Race(Asian) + ε
Say b' = [20, -10, -15].
Mean of Whites: β0 = 20
Mean of Blacks: β0 + β1 = 10
Mean of Asians: β0 + β2 = 5
If we change the coding so that Black is used as the excluded category:
y = β0 + β1 Race(White) + β2 Race(Asian) + ε
β0 = 10 = Black
β1 = White - Black = 10
β2 = Asian - Black = -5
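The change of excluded category can be reproduced from the coefficients alone, since they determine the three group means. The sketch below uses the numbers from the example; the function name `recode_reference` and the dictionary layout are illustrative assumptions.

```python
def recode_reference(coefs, old_ref, new_ref):
    """Recover group means, then re-express them relative to new_ref.

    coefs maps 'intercept' and each non-reference category to its coefficient.
    """
    means = {old_ref: coefs["intercept"]}
    for cat, b in coefs.items():
        if cat != "intercept":
            means[cat] = coefs["intercept"] + b
    new = {"intercept": means[new_ref]}
    for cat, m in means.items():
        if cat != new_ref:
            new[cat] = m - means[new_ref]
    return means, new

means, new = recode_reference(
    {"intercept": 20, "Black": -10, "Asian": -15}, old_ref="White", new_ref="Black"
)
print(means)  # {'White': 20, 'Black': 10, 'Asian': 5}
print(new)    # {'intercept': 10, 'White': 10, 'Asian': -5}
```

The fitted group means are identical under either coding; only the comparison group, and therefore the meaning of each coefficient, changes.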
Interpretation of coefficients in a complex model:
If we have two sets of dummy variables
y = β0 + β1 Race(White) + β2 Race(Asian) + β3 Sex(male) + ε
What is β0?
The mean income level of Black females.
The reason is that the excluded categories are Blacks for Race and females for Sex.
How do we compute averages for other groups:
Asian females?
White males?
If we have two sets of dummy variables and one continuous variable
y = β0 + β1 Race(White) + β2 Race(Asian) + β3 Sex(male) + β4 Ability + ε
β0 is the income level of Black females with a zero ability score. It is an intercept.
[blackboard] Six parallel lines. The two dummy variables for Race cannot overlap. Race and Sex do overlap. Additivity is assumed here. We will discuss interactions on Thursday.
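The six parallel lines can be sketched by computing predicted income for each Race by Sex group under the additive model. All coefficient values below are hypothetical (the Race values reuse the earlier example, and the Male and Ability values are made up), as is the function name `predict`.

```python
# Hypothetical coefficients; Black and female are the excluded categories.
COEF = {"intercept": 10.0, "White": 10.0, "Asian": -5.0, "Male": 4.0, "Ability": 2.0}

def predict(race, sex, ability):
    """Predicted income under the additive (no-interaction) model."""
    y = COEF["intercept"] + COEF["Ability"] * ability
    y += COEF.get(race, 0.0)                     # adds 0 for the excluded race
    y += COEF["Male"] if sex == "Male" else 0.0  # adds 0 for females
    return y

groups = [(r, s) for r in ("Black", "White", "Asian") for s in ("Female", "Male")]
intercepts = {g: predict(*g, ability=0) for g in groups}
slopes = {g: predict(*g, ability=1) - predict(*g, ability=0) for g in groups}
print(intercepts)
print(slopes)
```

Each group has its own intercept, obtained by adding the coefficients of the dummies that equal 1 for that group (for instance β0 + β2 for Asian females and β0 + β1 + β3 for White males), while the Ability slope is the same for all six groups.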
VII. Testing for Collapsibility of Categories
We can use F-tests for nested models to test the collapsibility of categories in a nominal variable.
Consider a GSS question about region of residence at age 16 (REG16):
Original Code   Region                Recode
1               New England           East
2               Middle Atlantic       East
3               East North Central    Midwest
4               West North Central    Midwest
5               South Atlantic        South
6               East South Central    South
7               West South Central    South
8               Mountain              West
9               Pacific               West
In regression analysis (say with occupational prestige as the dependent variable), we can use a set of 8 dummy variables for the original codes of the variable. We can also use a set of 3 dummy variables after we collapse the codes into a smaller set of 4 broader regions.
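The two codings can be sketched as a lookup table in Python. The mapping follows the recode table for REG16; the names `RECODE` and `n_dummies` are illustrative.

```python
# Original REG16 code -> collapsed region.
RECODE = {1: "East", 2: "East", 3: "Midwest", 4: "Midwest",
          5: "South", 6: "South", 7: "South", 8: "West", 9: "West"}

def n_dummies(categories):
    """Number of dummy variables needed for a nominal variable (J - 1)."""
    return len(set(categories)) - 1

full_dummies = n_dummies(RECODE.keys())         # original 9 codes
collapsed_dummies = n_dummies(RECODE.values())  # 4 broader regions
print(full_dummies, collapsed_dummies, full_dummies - collapsed_dummies)  # 8 3 5
```

The difference of 5 is the number of restrictions that collapsing imposes on the model with the original codes.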
The two models are nested. See example.
The F-test between the two nested models tells us whether the collapsing is justified.
F(5, 550) = [(127029.555 - 125261.776)/5] / 227.75
= [1767.779/5] / 227.75 = 353.56/227.75 = 1.55, not significant at the 5% level.
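The same calculation as a short Python sketch. It reads 127029.555 as the residual sum of squares of the collapsed (restricted) model, 125261.776 as that of the model with the original codes, and 227.75 as the mean squared error of the latter. That reading is an assumption, although it agrees with 125261.776/550 being about 227.75. The function name `nested_f` is illustrative.

```python
def nested_f(sse_restricted, sse_full, n_restrictions, mse_full):
    """F statistic for comparing two nested least squares models."""
    return ((sse_restricted - sse_full) / n_restrictions) / mse_full

f_stat = nested_f(127029.555, 125261.776, n_restrictions=5, mse_full=227.75)
print(round(f_stat, 2))  # 1.55

# Consistency check on the assumed reading of the denominator.
mse_check = 125261.776 / 550
# The 5% critical value of F(5, 550) is roughly 2.2 (from a table, not computed here).
```

Since 1.55 is below that critical value, the sketch reaches the same conclusion as the text: the collapsing is not rejected at the 5% level.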
. recode reg16x (2=1) (4=3) (6=5) (7=5) (9=8)
(reg16x: 302 changes made)
.
. table reg16, c(mean prestige)
----------+---------------
    reg16 | mean(prestige)
----------+---------------
        1 |       39.59259
        2 |       44.01123
        3 |        42.0087
        4 |        44.5161