An Application Example of the Linear Regression Model in SAS EM
Chapter 3 Predictive Modeling Using Regression
3.1 Introduction to Regression
The Regression node in Enterprise Miner performs either linear or logistic regression, depending on the measurement level of the target variable.
Linear regression is done if the target variable is an interval variable. In linear regression the model predicts the mean of the target variable at the given values of the input variables.
Logistic regression is done if the target variable is a discrete variable. In logistic regression the model predicts the probability of a particular level (or levels) of the target variable at the given values of the input variables. Because the predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space, the probabilities must be transformed in order to be adequately modeled. The most common transformation for a binary target is the logit transformation. Probit and complementary log-log transformations are also available in the Regression node.
Recall that one assumption of logistic regression is that the logit transformation of the probabilities of the target variable results in a linear relationship with the input variables.
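As an illustration of the three transformations named above, the following minimal Python sketch (not Enterprise Miner code; the function names are chosen here for illustration) maps a probability in (0, 1) onto the whole real line, which is the scale on which the model is assumed to be linear in the inputs.

```python
import math
from statistics import NormalDist


def logit(p: float) -> float:
    """Log-odds: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def inverse_logit(x: float) -> float:
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def probit(p: float) -> float:
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def cloglog(p: float) -> float:
    """Complementary log-log: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))
```

A probability of 0.5 corresponds to 0 on both the logit and probit scales, and applying `inverse_logit` to a logit value recovers the original probability.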
Regression uses only full cases in the model. This means that any case, or observation, that has a missing value will be excluded from consideration when building the model. As discussed earlier, when there are many potential input variables to be considered, this could result in an unacceptably high loss of data. Therefore, when possible, missing values should be imputed prior to running a regression model.
Other reasons for imputing missing values include the following:
∙ Decision trees handle missing values directly, whereas regression and neural network models ignore all observations with missing values on any of the input variables. It is more appropriate to compare models built on the same set of observations. Therefore, before doing a regression or building a neural network model, you should perform data replacement, particularly if you plan to compare the results to results obtained from a decision tree model.
∙ If the missing values are in some way related to each other or to the target variable, the models created without those observations may be biased.
∙ If missing values are not imputed during the modeling process, observations with missing values cannot be scored with the score code built from the models.
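To see how quickly full-case analysis can discard data, consider a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each input is missing independently with the same probability; the 5% rate below is hypothetical and is not taken from the example data.

```python
def complete_case_fraction(missing_rate: float, n_inputs: int) -> float:
    """Expected share of observations with no missing input, assuming
    each input is missing independently with probability missing_rate."""
    return (1.0 - missing_rate) ** n_inputs


# Hypothetical scenario: 19 inputs, each missing 5% of the time.
fraction = complete_case_fraction(0.05, 19)
```

Under these assumptions `fraction` is roughly 0.38, so a regression on full cases would use only a little over a third of the observations even though each individual input is 95% complete.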
There are three variable selection methods available in the Regression node of Enterprise Miner.
Forward first selects the best one-variable model. Then it selects the best two-variable model among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value.
Backward starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.
Stepwise is a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.
The specified p-values are also known as significance levels.
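The control flow of forward and stepwise selection can be sketched in a few lines of Python. This is only an outline of the logic described above, not the Regression node's implementation: `select_variables` and `toy_p_value` are hypothetical names, and the p-values come from a made-up lookup table rather than from fitted models.

```python
from typing import Callable, List, Sequence

# p_value(variable, others) is a hypothetical callback returning the p-value
# of `variable` in a model that also contains the variables in `others`.
PValueFn = Callable[[str, Sequence[str]], float]


def select_variables(candidates: Sequence[str], p_value: PValueFn,
                     entry: float = 0.05, stay: float = 0.05,
                     allow_removal: bool = True,
                     max_steps: int = 100) -> List[str]:
    """Forward selection (allow_removal=False) or stepwise (True)."""
    selected: List[str] = []
    for _ in range(max_steps):  # guard against cycling
        remaining = [v for v in candidates if v not in selected]
        if not remaining:
            break
        # Entry step: add the most significant candidate, if significant.
        best = min(remaining, key=lambda v: p_value(v, selected))
        if p_value(best, selected) >= entry:
            break
        selected.append(best)
        # Stepwise only: drop variables that are no longer significant.
        while allow_removal and selected:
            def p_in_model(v: str) -> float:
                return p_value(v, [u for u in selected if u != v])
            worst = max(selected, key=p_in_model)
            if p_in_model(worst) < stay:
                break
            selected.remove(worst)
    return selected


# Made-up p-values keyed by (variable, other variables in the model).
TOY_P = {
    ("A", frozenset()): 0.01, ("B", frozenset()): 0.02,
    ("C", frozenset()): 0.50, ("B", frozenset("A")): 0.03,
    ("C", frozenset("A")): 0.60, ("A", frozenset("B")): 0.20,
    ("C", frozenset("B")): 0.70, ("C", frozenset("AB")): 0.80,
}


def toy_p_value(variable: str, others: Sequence[str]) -> float:
    return TOY_P[(variable, frozenset(others))]


forward = select_variables("ABC", toy_p_value, allow_removal=False)
stepwise = select_variables("ABC", toy_p_value, allow_removal=True)
```

With this toy table, forward selection keeps A and B, whereas stepwise ends with B alone: once B has entered, A is no longer significant given B and is removed. Backward elimination would run the removal loop alone, starting from the full model.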
3.2 Regression in Enterprise Miner
Imputation, Transformation, and Regression
The data for this example is from a nonprofit organization that relies on fundraising campaigns to support its efforts. After analyzing the data, a subset of 19 predictor variables was selected to model the response to a mailing. Two response variables were stored in the data set. One response variable related to whether or not someone responded to the mailing (TARGET_B), and the other response variable measured how much the person actually donated in U.S. dollars (TARGET_D).

| Name | Model Role | Measurement Level | Description |
|---|---|---|---|
| AGE | Input | Interval | Donor's age |
| AVGGIFT | Input | Interval | Donor's average gift |
| CARDGIFT | Input | Interval | Donor's gifts to card promotions |
| CARDPROM | Input | Interval | Number of card promotions |
| FEDGOV | Input | Interval | % of household in federal government |
| FIRSTT | Input | Interval | Elapsed time since first donation |
| GENDER | Input | Binary | F = female, M = male |
| HOMEOWNR | Input | Binary | H = homeowner, U = unknown |
| IDCODE | ID | Nominal | ID code, unique for each donor |
| INCOME | Input | Ordinal | Income level (integer values 0-9) |
| LASTT | Input | Interval | Elapsed time since last donation |
| LOCALGOV | Input | Interval | % of household in local government |
| MALEMILI | Input | Interval | % of household males active in the military |
| MALEVET | Input | Interval | % of household male veterans |
| NUMPROM | Input | Interval | Total number of promotions |
| PCOWNERS | Input | Binary | Y = donor owns computer (missing otherwise) |
| PETS | Input | Binary | Y = donor owns pets (missing otherwise) |
| STATEGOV | Input | Interval | % of household in state government |
| TARGET_B | Target | Binary | 1 = donor to campaign, 0 = did not contribute |
| TARGET_D | Target | Interval | Dollar amount of contribution to campaign |
| TIMELAG | Input | Interval | Time between first and second donation |

The variable TARGET_D is not considered in this chapter, so its model role will be set to Rejected.
A card promotion is one where the charitable organization sends potential donors an assortment of greeting cards and requests a donation for them.
The MYRAW data set in the CRSSAMP library contains 6,974 observations for building and comparing competing models. This data set will be split equally into training and validation data sets for analysis.
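Outside Enterprise Miner, an equal split of this kind can be sketched as a simple random partition of row indices. The function name and seed below are illustrative assumptions; this is not a description of how the Data Partition node itself samples.

```python
import random
from typing import List, Tuple


def split_half(n_obs: int, seed: int = 12345) -> Tuple[List[int], List[int]]:
    """Randomly assign row indices to two halves of (near-)equal size."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    half = n_obs // 2
    return sorted(indices[:half]), sorted(indices[half:])


# 6,974 observations, as in the MYRAW data set.
train_idx, valid_idx = split_half(6974)
```

Each half receives 3,487 of the 6,974 observations, and no observation appears in both.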
Building the Initial Flow and Identifying the Input Data
1. Open a new diagram by selecting File > New > Diagram.
2. On the Diagrams subtab, name the new diagram by right-clicking on Untitled and selecting Rename.
3. Name the new diagram Non-Profit.
4. Add an Input Data Source node to the diagram workspace by dragging the node from the toolbar or from the Tools tab.
5. Add a Data Partition node to the diagram and connect it to the Input Data Source node.
6. To specify the input data, double-click on the Input Data Source node.
7. Click on Select… in order to choose the data set.
8. Click on the [icon] and select CRSSAMP from the list of defined libraries.
9. Select the MYRAW data set from the list of data sets in the CRSSAMP library and then select OK.
Observe that this data set has 6,974 observations (rows) and 21 variables (columns). Evaluate (and update, if necessary) the assignments that were made using the metadata sample.
1. Click on the Variables tab to see all of the variables and their respective assignments.
2. Click on the Name column heading to sort the variables by their name. A portion of the table showing the first 10 variables is shown below.
The first several variables (AGE through FIRSTT) have the measurement level interval because they are numeric in the data set and have more than 10 distinct levels in the metadata sample. The model role for all interval variables is set to input by default. The variables GENDER and HOMEOWNR have the measurement level binary because they have only two different nonmissing levels in the metadata sample. The model role for all binary variables is set to input by default.
The variable IDCODE is listed as a nominal variable because it is a character variable with more than two nonmissing levels in the metadata sample. Furthermore, because it is nominal and the number of distinct values is at least 2000 or greater than 90% of the sample size, the IDCODE variable has the model role id. If the ID value had been stored as a number, it would have been assigned an interval measurement level and an input model role.
The variable INCOME is listed as an ordinal variable because it is a numeric variable with more than two but no more than ten distinct levels in the metadata sample. All ordinal variables are set to have the input model role.
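The default assignment rules described in this section, including the unary case (a single nonmissing level, which gets the rejected role), can be summarized in a short Python sketch. It only restates the rules as given in the text; `infer_metadata` is a hypothetical helper, missing values are represented as `None`, and Enterprise Miner's actual logic may differ in details not covered here.

```python
from typing import List, Tuple, Union

Value = Union[str, int, float, None]  # None stands for a missing value


def infer_metadata(sample: List[Value]) -> Tuple[str, str]:
    """Return (measurement level, model role) for one variable,
    based on its values in the metadata sample."""
    nonmissing = [v for v in sample if v is not None]
    levels = len(set(nonmissing))
    is_numeric = all(isinstance(v, (int, float)) for v in nonmissing)

    if levels <= 1:
        level = "unary"       # only one nonmissing level
    elif levels == 2:
        level = "binary"      # character or numeric
    elif is_numeric:
        level = "interval" if levels > 10 else "ordinal"
    else:
        level = "nominal"     # character with more than two levels

    if level == "unary":
        role = "rejected"
    elif level == "nominal" and (levels >= 2000
                                 or levels > 0.9 * len(sample)):
        role = "id"
    else:
        role = "input"
    return level, role
```

For example, a column holding only `"Y"` and missing values comes back as `("unary", "rejected")`, while a numeric column with values 0 through 9 comes back as `("ordinal", "input")`.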
Scroll down to see the rest of the variables.
The variables PCOWNERS and PETS both are identified as unary for their measurement level. This is because there is only one nonmissing level in the metadata sample. It does not matter in this case whether the variable was character or numeric; the measurement level is set to unary and the model role is set to rejected.
These variables do have useful information, however, and it is the way in which they are coded that makes them seem useless. Both variables contain the value Y for a person if the person has that condition (pet owner for PETS, computer owner for PCOWNERS) and a missing value otherwise. Decision trees handle missing values directly, so no data modification needs to be done for fitting a decision tree; however, neural networks and regression models ignore any observation with a missing value, so you will need to recode these variables to get at the desired information. For example, you can recode the missing values as a U, for unknown. You do this later using the Replacement node.
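The effect of this recoding can be illustrated with a tiny Python sketch. The data values are made up, `recode_missing` is a hypothetical helper, and `None` stands in for a missing value; in the course flow the recoding itself is done in the Replacement node.

```python
from typing import List, Optional


def recode_missing(values: List[Optional[str]],
                   replacement: str = "U") -> List[str]:
    """Replace missing entries with an explicit 'unknown' code."""
    return [replacement if v is None else v for v in values]


# Made-up PETS column: Y for pet owners, missing otherwise.
pets = ["Y", None, None, "Y"]
pets_recoded = recode_missing(pets)
```

After recoding, the column has two nonmissing levels (Y and U) instead of one, so it no longer looks unary and no observation is dropped because of it.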
Identifying Target Variables
Note that the variables TARGET_B and TARGET_D are the response variables for this analysis. TARGET_B is binary even though it is a numeric variable, since there are only two nonmissing levels in the metadata sample. TARGET_D has the interval measurement level. Both variables are set to have the input model role (just like any other binary or interval variable). This analysis will focus on TARGET_B, so you need to change the model role for TARGET_B to target and the model role for TARGET_D to rejected, because you should not use a response variable as a predictor.
1.