线性回归模型在SASEM中的应用实例.docx

上传人:b****6 文档编号:5577359 上传时间:2022-12-28 格式:DOCX 页数:43 大小:709.77KB
下载 相关 举报
线性回归模型在SASEM中的应用实例.docx_第1页
第1页 / 共43页
线性回归模型在SASEM中的应用实例.docx_第2页
第2页 / 共43页
线性回归模型在SASEM中的应用实例.docx_第3页
第3页 / 共43页
线性回归模型在SASEM中的应用实例.docx_第4页
第4页 / 共43页
线性回归模型在SASEM中的应用实例.docx_第5页
第5页 / 共43页
点击查看更多>>
下载资源
资源描述

线性回归模型在SASEM中的应用实例.docx

《线性回归模型在SASEM中的应用实例.docx》由会员分享,可在线阅读,更多相关《线性回归模型在SASEM中的应用实例.docx(43页珍藏版)》请在冰豆网上搜索。

线性回归模型在SASEM中的应用实例.docx

线性回归模型在SASEM中的应用实例

Chapter3PredictiveModelingUsingRegression

 

3.1IntroductiontoRegression

TheRegressionnodeinEnterpriseMinerdoeseitherlinearorlogisticregressiondependinguponthemeasurementlevelofthetargetvariable.

Linearregressionisdoneifthetargetvariableisanintervalvariable.Inlinearregressionthemodelpredictsthemeanofthetargetvariableatthegivenvaluesoftheinputvariables.

Logisticregressionisdoneifthetargetvariableisadiscretevariable.Inlogisticregressionthemodelpredictstheprobabilityofaparticularlevel(s)ofthetargetvariableatthegivenvaluesoftheinputvariables.Becausethepredictionsareprobabilities,whichareboundedby0and1andarenotlinearinthisspace,theprobabilitiesmustbetransformedinordertobeadequatelymodeled.Themostcommontransformationforabinarytargetisthelogittransformation.Probitandcomplementarylog-logtransformationsarealsoavailableintheregressionnode.

Recallthatoneassumptionoflogisticregressionisthatthelogittransformationoftheprobabilitiesofthetargetvariableresultsinalinearrelationshipwiththeinputvariables.

Regressionusesonlyfullcasesinthemodel.Thismeansthatanycase,orobservation,thathasamissingvaluewillbeexcludedfromconsiderationwhenbuildingthemodel.Asdiscussedearlier,whentherearemanypotentialinputvariablestobeconsidered,thiscouldresultinanunacceptablyhighlossofdata.Therefore,whenpossible,missingvaluesshouldbeimputedpriortorunningaregressionmodel.

Otherreasonsforimputingmissingvaluesincludethefollowing:

∙Decisiontreeshandlemissingvaluesdirectly,whereasregressionandneuralnetworkmodelsignoreallobservationswithmissingvaluesonanyoftheinputvariables.Itismoreappropriatetocomparemodelsbuiltonthesamesetofobservations.Therefore,beforedoingaregressionorbuildinganeuralnetworkmodel,youshouldperformdatareplacement,particularlyifyouplantocomparetheresultstoresultsobtainedfromadecisiontreemodel.

∙Ifthemissingvaluesareinsomewayrelatedtoeachotherortothetargetvariable,themodelscreatedwithoutthoseobservationsmaybebiased.

∙Ifmissingvaluesarenotimputedduringthemodelingprocess,observationswithmissingvaluescannotbescoredwiththescorecodebuiltfromthemodels.

TherearethreevariableselectionmethodsavailableintheRegressionnodeofEnterpriseMiner.

Forwardfirstselectsthebestone-variablemodel.Thenitselectsthebesttwovariablesamongthosethatcontainthefirstselectedvariable.Thisprocesscontinuesuntilitreachesthepointwherenoadditionalvariableshaveap-valuelessthanthespecifiedentryp-value.

Backwardstartswiththefullmodel.Next,thevariablethatisleastsignificant,giventheothervariables,isremovedfromthemodel.Thisprocesscontinuesuntilalloftheremainingvariableshaveap-valuelessthanthespecifiedstaypvalue.

Stepwiseisamodificationoftheforwardselectionmethod.Thedifferenceisthatvariablesalreadyinthemodeldonotnecessarilystaythere.Aftereachvariableisenteredintothemodel,thismethodlooksatallthevariablesalreadyincludedinthemodelanddeletesanyvariablethatisnotsignificantatthespecifiedlevel.Theprocessendswhennoneofthevariablesoutsidethemodelhasap-valuelessthanthespecifiedentryvalueandeveryvariableinthemodelissignificantatthespecifiedstayvalue.

Thespecifiedp-valuesarealsoknownassignificancelevels.

3.2RegressioninEnterpriseMiner

FIN〉FOUT

Imputation,Transformation,andRegression

Thedataforthisexampleisfromanonprofitorganizationthatreliesonfundraisingcampaignstosupporttheirefforts.Afteranalyzingthedata,asubsetof19predictorvariableswasselectedtomodeltheresponsetoamailing.Tworesponsevariableswerestoredinthedataset.Oneresponsevariablerelatedtowhetherornotsomeonerespondedtothemailing(TARGET_B),andtheotherresponsevariablemeasuredhowmuchthepersonactuallydonatedinU.S.dollars(TARGET_D).

Name

ModelRole

MeasurementLevel

Description

AGE

Input

Interval

Donor'sage

AVGGIFT

Input

Interval

Donor'saveragegift

CARDGIFT

Input

Interval

Donor'sgiftstocardpromotions

CARDPROM

Input

Interval

Numberofcardpromotions

FEDGOV

Input

Interval

%ofhouseholdinfederalgovernment

FIRSTT

Input

Interval

Elapsedtimesincefirstdonation

GENDER

Input

Binary

F=female,M=Male

HOMEOWNR

Input

Binary

H=homeowner,U=unknown

IDCODE

ID

Nominal

IDcode,uniqueforeachdonor

INCOME

Input

Ordinal

Incomelevel(integervalues0-9)

LASTT

Input

Interval

Elapsedtimesincelastdonation

LOCALGOV

Input

Interval

%ofhouseholdinlocalgovernment

MALEMILI

Input

Interval

%ofhouseholdmalesactiveinthemilitary

MALEVET

Input

Interval

%ofhouseholdmaleveterans

NUMPROM

Input

Interval

Totalnumberofpromotions

PCOWNERS

Input

Binary

Y=donorownscomputer(missingotherwise)

PETS

Input

Binary

Y=donorownspets(missingotherwise)

STATEGOV

Input

Interval

%ofhouseholdinstategovernment

TARGET_B

Target

Binary

1=donortocampaign,0=didnotcontribute

TARGET_D

Target

Interval

Dollaramountofcontributiontocampaign

TIMELAG

Input

Interval

Timebetweenfirstandseconddonation

ThevariableTARGET_Disnotconsideredinthischapter,soitsmodelrolewillbesettoRejected.

Acardpromotionisonewherethecharitableorganizationsendspotentialdonorsanassortmentofgreetingcardsandrequestsadonationforthem.

TheMYRAWdatasetintheCRSSAMPlibrarycontains6,974observationsforbuildingandcomparingcompetingmodels.Thisdatasetwillbesplitequallyintotrainingandvalidationdatasetsforanalysis.

BuildingtheInitialFlowandIdentifyingtheInputData

1.OpenanewdiagrambyselectingFileNewDiagram.

2.OntheDiagramssubtab,namethenewdiagrambyright-clickingonUntitledandselectingRename.

3.NamethenewdiagramNon-Profit.

4.AddanInputDataSourcenodetothediagramworkspacebydraggingthenodefromthetoolbarorfromtheToolstab.

5.AddaDataPartitionnodetothediagramandconnectittotheInputDataSourcenode.

6.Tospecifytheinputdata,double-clickontheInputDataSourcenode.

7.ClickonSelect…inordertochoosethedataset.

8.Clickonthe

andselectCRSSAMPfromthelistofdefinedlibraries.

9.SelecttheMYRAWdatasetfromthelistofdatasetsintheCRSSAMPlibraryandthenselectOK.

Observethatthisdatasethas6,974observations(rows)and21variables(columns).Evaluate(andupdate,ifnecessary)theassignmentsthatweremadeusingthemetadatasample.

1.ClickontheVariablestabtoseeallofthevariablesandtheirrespectiveassignments.

2.ClickontheNamecolumnheadingtosortthevariablesbytheirname.Aportionofthetableshowingthefirst10variablesisshownbelow.

Thefirstseveralvariables(AGEthroughFIRSTT)havethemeasurementlevelintervalbecausetheyarenumericinthedatasetandhavemorethan10distinctlevelsinthemetadatasample.Themodelroleforallintervalvariablesissettoinputbydefault.ThevariablesGENDERandHOMEOWNRhavethemeasurementlevelbinarybecausetheyhaveonlytwodifferentnonmissinglevelsinthemetadatasample.Themodelroleforallbinaryvariablesissettoinputbydefault.

ThevariableIDCODEislistedasanominalvariablebecauseitisacharactervariablewithmorethantwononmissinglevelsinthemetadatasample.Furthermore,becauseitisnominalandthenumberofdistinctvaluesisatleast2000orgreaterthan90%ofthesamplesize,theIDCODEvariablehasthemodelroleid.IftheIDvaluehadbeenstoredasanumber,itwouldhavebeenassignedanintervalmeasurementlevelandaninputmodelrole.

ThevariableINCOMEislistedasanordinalvariablebecauseitisanumericvariablewithmorethantwobutnomorethantendistinctlevelsinthemetadatasample.Allordinalvariablesaresettohavetheinputmodelrole.

Scrolldowntoseetherestofthevariables.

ThevariablesPCOWNERSandPETSbothareidentifiedasunaryfortheirmeasurementlevel.Thisisbecausethereisonlyonenonmissinglevelinthemetadatasample.Itdoesnotmatterinthiscasewhetherthevariablewascharacterornumeric,themeasurementlevelissettounaryandthemodelroleissettorejected.

Thesevariablesdohaveusefulinformation,however,anditisthewayinwhichtheyarecodedthatmakesthemseemuseless.BothvariablescontainthevalueYforapersonifthepersonhasthatcondition(petownerforPETS,computerownerforPCOWNERS)andamissingvalueotherwise.Decisiontreeshandlemissingvaluesdirectly,sonodatamodificationneedstobedoneforfittingadecisiontree;however,neuralnetworksandregressionmodelsignoreanyobservationwithamissingvalue,soyouwillneedtorecodethesevariablestogetatthedesiredinformation.Forexample,youcanrecodethemissingvaluesasaU,forunknown.YoudothislaterusingtheReplacementnode.

IdentifyingTargetVariables

NotethatthevariablesTARGET_BandTARGET_Daretheresponsevariablesforthisanalysis.TARGET_Bisbinaryeventhoughitisanumericvariablesincethereareonlytwonon-missinglevelsinthemetadatasample.TARGET_Dhastheintervalmeasurementlevel.Bothvariablesaresettohavetheinputmodelrole(justlikeanyotherbinaryorintervalvariable).ThisanalysiswillfocusonTARGET_B,soyouneedtochangethemodelroleforTARGET_BtotargetandthemodelroleTARGET_Dtorejectedbecauseyoushouldnotusearesponsevariableasapredictor.

1.

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 自然科学 > 物理

copyright@ 2008-2022 冰豆网网站版权所有

经营许可证编号:鄂ICP备2022015515号-1