An Application Example of Linear Regression Models in SAS EM

Chapter 3 Predictive Modeling Using Regression

3.1 Introduction to Regression

The Regression node in Enterprise Miner performs either linear or logistic regression, depending on the measurement level of the target variable.

Linear regression is performed if the target variable is an interval variable. In linear regression, the model predicts the mean of the target variable at the given values of the input variables.

Logistic regression is performed if the target variable is a discrete variable. In logistic regression, the model predicts the probability of a particular level (or levels) of the target variable at the given values of the input variables. Because the predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space, the probabilities must be transformed in order to be adequately modeled. The most common transformation for a binary target is the logit transformation, logit(p) = ln(p / (1 - p)). Probit and complementary log-log transformations are also available in the Regression node.

Recall that one assumption of logistic regression is that the logit transformation of the probabilities of the target variable results in a linear relationship with the input variables.

Regression uses only full cases in the model. This means that any case, or observation, that has a missing value will be excluded from consideration when building the model. As discussed earlier, when there are many potential input variables to be considered, this could result in an unacceptably high loss of data.
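
To make the cost of complete-case modeling concrete, here is a minimal sketch in Python (not SAS code, and not how Enterprise Miner is implemented). It assumes, purely for illustration, that each input is missing independently at the same rate. The function names, the 5% rate, and the four sample rows are hypothetical.

```python
def expected_complete_fraction(missing_rate: float, n_inputs: int) -> float:
    """Expected share of observations with no missing input, assuming each
    input is missing independently with probability `missing_rate`."""
    return (1.0 - missing_rate) ** n_inputs


def complete_cases(rows):
    """Keep only the rows (dicts) in which no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical rates: the share of usable rows shrinks as inputs are added.
for n_inputs in (1, 5, 10, 19):
    share = expected_complete_fraction(0.05, n_inputs)
    print(f"{n_inputs:2d} inputs -> {share:.1%} complete cases expected")

# Made-up rows that reuse variable names from the example data set.
rows = [
    {"AGE": 62, "INCOME": 3, "PETS": "Y"},
    {"AGE": None, "INCOME": 5, "PETS": "Y"},
    {"AGE": 45, "INCOME": None, "PETS": None},
    {"AGE": 38, "INCOME": 2, "PETS": None},
]
print(len(complete_cases(rows)), "of", len(rows), "rows are complete")
```

Under these assumptions, a 5% missing rate on each of 19 inputs leaves only about 38% of the observations as full cases, even though no single input looks badly affected.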

Therefore, when possible, missing values should be imputed prior to running a regression model.

Other reasons for imputing missing values include the following:

- Decision trees handle missing values directly, whereas regression and neural network models ignore all observations with missing values on any of the input variables. It is more appropriate to compare models built on the same set of observations. Therefore, before doing a regression or building a neural network model, you should perform data replacement, particularly if you plan to compare the results to results obtained from a decision tree model.
- If the missing values are in some way related to each other or to the target variable, the models created without those observations may be biased.
- If missing values are not imputed during the modeling process, observations with missing values cannot be scored with the score code built from the models.

There are three variable selection methods available in the Regression node of Enterprise Miner.

Forward first selects the best one-variable model. Then it selects the best two-variable model among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value.

Backward starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.

Stepwise is a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.

The specified p-values are also known as significance levels.

3.2 Regression in Enterprise Miner

Imputation, Transformation, and Regression

The data for this example is from a nonprofit organization that relies on fundraising campaigns to support its efforts. After analyzing the data, a subset of 19 predictor variables was selected to model the response to a mailing. Two response variables were stored in the data set. One response variable indicates whether or not someone responded to the mailing (TARGET_B), and the other measures how much the person actually donated in U.S. dollars (TARGET_D).

| Name | Model Role | Measurement Level | Description |
|---|---|---|---|
| AGE | Input | Interval | Donor's age |
| AVGGIFT | Input | Interval | Donor's average gift |
| CARDGIFT | Input | Interval | Donor's gifts to card promotions |
| CARDPROM | Input | Interval | Number of card promotions |
| FEDGOV | Input | Interval | % of household in federal government |
| FIRSTT | Input | Interval | Elapsed time since first donation |
| GENDER | Input | Binary | F=female, M=male |
| HOMEOWNR | Input | Binary | H=homeowner, U=unknown |
| IDCODE | ID | Nominal | ID code, unique for each donor |
| INCOME | Input | Ordinal | Income level (integer values 0-9) |
| LASTT | Input | Interval | Elapsed time since last donation |
| LOCALGOV | Input | Interval | % of household in local government |
| MALEMILI | Input | Interval | % of household males active in the military |
| MALEVET | Input | Interval | % of household male veterans |
| NUMPROM | Input | Interval | Total number of promotions |
| PCOWNERS | Input | Binary | Y=donor owns computer (missing otherwise) |
| PETS | Input | Binary | Y=donor owns pets (missing otherwise) |
| STATEGOV | Input | Interval | % of household in state government |
| TARGET_B | Target | Binary | 1=donor to campaign, 0=did not contribute |
| TARGET_D | Target | Interval | Dollar amount of contribution to campaign |
| TIMELAG | Input | Interval | Time between first and second donation |

The variable TARGET_D is not considered in this chapter, so its model role will be set to Rejected. A card promotion is one where the charitable organization sends potential donors an assortment of greeting cards and requests a donation for them.

The MYRAW data set in the CRSSAMP library contains 6,974 observations for building and comparing competing models. This data set will be split equally into training and validation data sets for analysis.

Building the Initial Flow and Identifying the Input Data

1. Open a new diagram by selecting File → New → Diagram.
2. On the Diagrams subtab, name the new diagram by right-clicking on Untitled and selecting Rename.
3. Name the new diagram Non-Profit.
4. Add an Input Data Source node to the diagram workspace by dragging the node from the toolbar or from the Tools tab.
5. Add a Data Partition node to the diagram and connect it to the Input Data Source node.
6. To specify the input data, double-click on the Input Data Source node.
7. Click on Select in order to choose the data set.
8. Click on the [icon] and select CRSSAMP from the list of defined libraries.
9. Select the MYRAW data set from the list of data sets in the CRSSAMP library and then select OK.

Observe that this data set has 6,974 observations (rows) and 21 variables (columns). Evaluate (and update, if necessary) the assignments that were made using the metadata sample.

1. Click on the Variables tab to see all of the variables and their respective assignments.
2. Click on the Name column heading to sort the variables by their name.

A portion of the table showing the first 10 variables is shown below.

The first several variables (AGE through FIRSTT) have the measurement level interval because they are numeric in the data set and have more than 10 distinct levels in the metadata sample. The model role for all interval variables is set to input by default.

The variables GENDER and HOMEOWNR have the measurement level binary because they have only two different nonmissing levels in the metadata sample. The model role for all binary variables is set to input by default.

The variable IDCODE is listed as a nominal variable because it is a character variable with more than two nonmissing levels in the metadata sample. Furthermore, because it is nominal and the number of distinct values is at least 2000 or greater than 90% of the sample size, the IDCODE variable has the model role id. If the ID value had been stored as a number, it would have been assigned an interval measurement level and an input model role.

The variable INCOME is listed as an ordinal variable because it is a numeric variable with more than two but no more than ten distinct levels in the metadata sample. All ordinal variables are set to have the input model role.

Scroll down to see the rest of the variables. The variables PCOWNERS and PETS are both identified as unary for their measurement level. This is because there is only one nonmissing level in the metadata sample. It does not matter in this case whether the variable is character or numeric; the measurement level is set to unary and the model role is set to rejected. These variables do have useful information, however, and it is the way in which they are coded that makes them seem useless. Both variables contain the value Y for a person if the person has that condition (pet owner for PETS, computer owner for PCOWNERS) and a missing value otherwise. Decision trees handle missing values directly, so no data modification needs to be done for fitting a decision tree; however, neural networks and regression models ignore any observation with a missing value, so you will need to recode these variables to get at the desired information. For example, you can recode the missing values as a U, for unknown. You do this later using the Replacement node.

Identifying Target Variables

Note that the variables TARGET_B and TARGET_D are the response variables for this analysis. TARGET_B is binary even though it is a numeric variable, because there are only two nonmissing levels in the metadata sample. TARGET_D has the interval measurement level. Both variables are set to have the input model role (just like any other binary or interval variable). This analysis will focus on TARGET_B, so you need to change the model role for TARGET_B to target and the model role for TARGET_D to rejected, because you should not use a response variable as a predictor.
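
The default measurement-level and model-role assignments described in this section can be summarized in a short sketch. This is an illustrative Python paraphrase of the rules as stated in the text, not Enterprise Miner's actual implementation. The function names are hypothetical, and two details are assumptions that the text does not state: a nominal variable that does not meet the id thresholds is given the input role, and a column with no nonmissing values is treated as unary.

```python
def assign_metadata(values, sample_size):
    """Return (measurement_level, model_role) for one column of a metadata
    sample, following the default rules described in the text.
    Missing values are represented as None."""
    nonmissing = [v for v in values if v is not None]
    n_levels = len(set(nonmissing))
    is_numeric = all(isinstance(v, (int, float)) for v in nonmissing)

    if n_levels <= 1:
        # One nonmissing level (or none, by assumption): unary and rejected.
        return ("unary", "rejected")
    if n_levels == 2:
        # Two nonmissing levels, numeric or character: binary.
        return ("binary", "input")
    if is_numeric:
        # Numeric: 3 to 10 levels is ordinal, more than 10 is interval.
        return ("ordinal" if n_levels <= 10 else "interval", "input")
    # Character with more than two levels: nominal.
    if n_levels >= 2000 or n_levels > 0.9 * sample_size:
        return ("nominal", "id")
    return ("nominal", "input")  # assumption: not stated in the text


def recode_missing(values, replacement="U"):
    """Replace missing values with a code such as U (unknown)."""
    return [replacement if v is None else v for v in values]


# Made-up columns that mimic the variables discussed above.
pets = ["Y", None, "Y", None, None]
print(assign_metadata(pets, len(pets)))                  # unary, rejected
print(assign_metadata(recode_missing(pets), len(pets)))  # binary, input
```

In this sketch, recoding the missing values of a column such as PETS as U turns a unary, rejected variable into a binary input, which is the effect the recoding step is meant to achieve.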
