Multiple Linear Regression in Data Mining

Contents:
2.1 A review of linear regression
2.2 An example of the regression process
2.3 Subset selection in linear regression

Perhaps the most popular mathematical model for making predictions is the multiple linear regression model. You have already encountered this model in the course Data, Models, and Decisions. In these notes we build on that knowledge and apply multiple linear regression models in data mining. Multiple linear regression is used in situations where we mine numerical data. Examples include: predicting customers' credit-card usage from demographics and historical activity patterns, predicting equipment failure times from usage and environment, forecasting vacation travel expenditures from past travel records, forecasting staffing needs at service counters from historical data and product-sales information, predicting cross-sales of products from historical sales information, and predicting the impact of discounts on sales in retail outlets.

In these notes we review the process of multiple linear regression. Here we stress the need to divide the data into two categories: the training data set and the validation data set. Using a validation set makes it possible to validate the multiple linear regression model while relaxing the assumption that the errors follow a normal distribution. After this review, we introduce methods for selecting subsets of the independent variables to improve predictions.
2.1 A review of linear regression

In this section we briefly review the multiple linear regression model encountered in Data, Models, and Decisions. A continuous random variable, Y, called the dependent (or response) variable, is to be predicted from a set of independent (predictor) variables x1, x2, ..., xp. Our purpose is to predict the value of the dependent variable with a linear equation in the independent variables. The model is:

    Y = β0 + β1x1 + β2x2 + ... + βpxp + ε        (1)

where ε, the noise variable, is a normally distributed random variable with mean 0 and standard deviation σ, whose value we do not know. We also do not know the values of the coefficients β0, β1, ..., βp. We estimate all of these, (p + 2) unknown parameters in total, from the data.

The data consist of n rows of observations, also called cases or examples, each of the form (yi, xi1, xi2, ..., xip), i = 1, 2, ..., n. The estimates of the β coefficients are computed by minimizing the sum of squared differences between the predicted and observed values, i.e. the sum of squares

    Σi=1..n (yi − β0 − β1xi1 − β2xi2 − ... − βpxip)².

Let β̂0, β̂1, ..., β̂p denote the values of the coefficients that minimize this expression. These are our estimates of the unknown coefficients; in the literature they are called the OLS (ordinary least squares) estimates. Once we have computed them, we can compute an unbiased estimate σ̂² of σ² from the residuals:

    σ̂² = (residual sum of squares) / (n − p − 1)
        = Σi=1..n (yi − β̂0 − β̂1xi1 − ... − β̂pxip)² / (n − p − 1).

Inserting the estimates β̂0, β̂1, ..., β̂p into the linear regression model (1), we can predict the value of the dependent variable from known values of the independent variables x1, x2, ..., xp. The predicted value Ŷ is computed as:

    Ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂pxp.
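The least squares computation described above can be sketched in a few lines of NumPy. This example is not from the text: the data are simulated and the variable names are illustrative; only the formulas mirror the notes (minimize the sum of squared errors, then estimate σ² by dividing the residual sum of squares by n − p − 1).

```python
# A minimal OLS sketch with simulated data (illustrative, not from the text).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                   # n observations, p predictors
beta_true = np.array([2.0, 0.5, -1.0, 3.0])   # beta0..beta3, chosen arbitrarily
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.5, size=n)

# Prepend a column of ones so the intercept beta0 is estimated too.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes the sum of squares

residuals = y - A @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)   # unbiased estimate of sigma^2

print(beta_hat)      # close to beta_true
print(sigma2_hat)    # close to 1.5**2 = 2.25
```

With enough observations the estimates land close to the coefficients used to simulate the data, and σ̂² is close to the true noise variance.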
These predictions are the best possible, in the sense that the estimates are unbiased (their expected values equal the true values) and have minimum variance compared with all other unbiased estimates, provided we make the following assumptions:

1. Linearity: the expected value of the dependent variable is a linear function of the independent variables, E(Y | x1, x2, ..., xp) = β0 + β1x1 + β2x2 + ... + βpxp;
2. Independence: the noise random variables εi are independent across all rows, where εi is the noise random variable observed at observation i, i = 1, 2, ..., n;
3. Unbiasedness: the noise random variable εi has expected value 0, i.e. E(εi) = 0 for i = 1, 2, ..., n;
4. Homoscedasticity: the standard deviation of εi has the same value, σ, for all i = 1, 2, ..., n;
5. Normality: the noise random variables εi are normally distributed.

An important and interesting fact for our purposes is that even if we drop the normality assumption (assumption 5) and allow the noise variables to follow arbitrary distributions, these estimates still give good predictions. We can show that the predictions based on these estimates are the best linear predictors in that they have minimum expected variance: among all linear models of the form defined in equation (1), the model using the least squares estimates β̂0, β̂1, ..., β̂p gives the smallest mean squared error. We describe this idea in more detail in the next section.
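The remark above can be checked with a small simulation, which is not from the text: here the noise is deliberately non-normal (uniform on [−1, 1]), yet the least squares slope estimate still averages out to the true value.

```python
# Monte Carlo check (illustrative): OLS stays unbiased under non-normal noise.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 40, 2000
x = rng.normal(size=n)          # a fixed design of n observations
beta0, beta1 = 4.0, 2.0         # arbitrary true coefficients

slopes = []
for _ in range(trials):
    eps = rng.uniform(-1.0, 1.0, size=n)   # non-normal, mean-zero noise
    y = beta0 + beta1 * x + eps
    A = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    slopes.append(b[1])

print(np.mean(slopes))   # averages out near the true slope beta1 = 2.0
```

Only the confidence-interval machinery, not the quality of the point predictions, depends on normality.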
The normality assumption is used to derive confidence intervals for predictions. In data mining applications we have two distinct data sets: the training data set and the validation data set, both of which contain typical examples of the relationship between the independent and dependent variables. The training data set is used to estimate the regression coefficients. The validation data set forms a holdout sample that takes no part in computing the coefficient estimates. This allows us to estimate the error in our predictions without assuming that the noise variables are normally distributed. We use the training data to fit the model and estimate the coefficients, then use these estimated coefficients to make predictions for each case in the validation data set. Comparing the prediction for each validation case against its actual dependent-variable value, the mean of the squared differences lets us compare different models and estimate the prediction accuracy of a model.
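The training/validation workflow just described can be sketched as follows. The data are simulated and all names and split sizes are illustrative, not from the text: the coefficients are fit on the training rows only, and the held-out validation rows give an honest estimate of prediction error.

```python
# Train/validation split for regression (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 2
X = rng.normal(size=(n, p))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

train, valid = slice(0, 40), slice(40, 60)   # first 40 rows train, last 20 validate
A = np.column_stack([np.ones(n), X])

# Coefficients are estimated from the training rows only.
beta_hat, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

# The holdout rows measure prediction error without any normality assumption.
pred = A[valid] @ beta_hat
errors = pred - y[valid]
print("mean error:", errors.mean())            # near 0: predictions are unbiased
print("validation MSE:", (errors ** 2).mean())
```

The validation mean squared error is the quantity used below to compare candidate models.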
2.2 An example of the regression process

We illustrate the process of multiple linear regression with an example from Chatterjee, Hadi, and Price on evaluating the performance of supervisors in a large financial organization.

The data shown in Table 2.1 come from a survey of clerical employees in a department of a large financial organization. The dependent variable is a measure of the efficiency of the department led by the supervisor. All dependent and independent variables are sums of ratings, on a scale from 1 to 5, given by 25 employees on different aspects of the supervisor's work, so each variable has a minimum possible value of 25 and a maximum of 125. The ratings are the survey answers of 25 employees in each of 30 departments. The purpose of the analysis is to explore the feasibility of using the questionnaire to predict departmental efficiency, thus avoiding the effort of measuring efficiency directly. The variables are the answers to the survey questions, described as follows:

Y   Measure of the efficiency of management;
X1  Handling of employee complaints;
X2  Does not allow special privileges;
X3  Opportunity to learn new things;
X4  Raises based on performance;
X5  Too critical of poor performance;
X6  Rate of advancement to better jobs.

The multiple linear regression estimates were computed with the StatCalc add-in for Excel, and are shown in Table 2.2.

Table 2.2

The equation for predicting efficiency is

    Y = 13.182 + 0.583X1 − 0.044X2 + 0.329X3 − 0.057X4 + 0.112X5 − 0.197X6.

In Table 2.3 we use ten examples as validation data. Applying the equation above to the validation data gives the predictions and errors shown in Table 2.3. The error, shown in the last column, is the difference between the predicted and actual values.
For example, the error for example 21 is 44.46 − 50 = −5.54.

Table 2.3

Example    Y    X1   X2   X3   X4   X5   X6   Predicted   Error
21         50   40   33   34   43   64   33   44.46       −5.54
22         64   61   52   62   66   80   41   63.98       −0.02
23         53   66   52   50   63   80   37   63.91       10.91
24         40   37   42   58   50   57   49   45.87       5.87
25         63   54   42   48   66   75   33   56.75       −6.25
26         66   77   66   63   88   76   72   65.22       −0.78
27         78   75   58   74   80   78   49   73.23       −4.77
28         48   57   44   45   51   83   38   58.19       10.19
29         85   85   71   71   77   74   55   76.05       −8.95
30         82   82   39   59   64   78   39   76.10       −5.90
Mean                                          62.38       −0.52
Std. dev.                                     11.30       7.17

We note that the mean error of the predictions is small (−0.52), so the predictions are unbiased. Further, since the errors are roughly normally distributed, about 95% of the model's prediction errors will fall within ±14.34 (two standard deviations) of the true value.

2.3 Subset selection in linear regression

A problem frequently encountered in data mining is that, when we use a regression equation to predict the value of the dependent variable, many variables are available as candidate independent variables. Given the fast modern algorithms for linear regression, one might be tempted in some situations to take an extremely pragmatic approach: why bother selecting a subset? Just use all the variables in the model. There are several reasons why this is not ideal:
- It may be expensive to collect the full complement of predictor variables;
- We may be able to measure fewer variables more accurately (for example, in surveys);
- Parsimony is an important property of a good model: with fewer variables we gain better insight into the influence of the independent variables in the model;
- Multicollinearity among many variables in the model makes the estimated regression coefficients unstable; with fewer variables we get better insight into the influence of the independent variables, since the coefficients of a simpler model are more stable;
- Independent variables that are unrelated to the dependent variable increase the variance of the predictions;
- Dropping independent variables with small regression coefficients can reduce the average prediction error.

Let us illustrate the last two points with a simple example with two independent variables. The insights remain valid in situations with more than two independent variables.

2.3.1 Dropping irrelevant variables

Suppose the true equation for Y is (Model 1):

    y = β1x1 + ε,

and suppose we estimate Y using the equation (with an additional, actually irrelevant, variable x2):

    y = β1x1 + β2x2 + ε        (Model 2),

fit to the data (yi, xi1, xi2), i = 1, 2, ..., n. One can show that in this situation the least squares estimates β̂1, β̂2 have the following expected values and variances:

    E(β̂1) = β1,    Var(β̂1) = σ² / ((1 − r12²) Σi=1..n xi1²),
    E(β̂2) = 0,     Var(β̂2) = σ² / ((1 − r12²) Σi=1..n xi2²),

where r12 is the correlation coefficient between x1 and x2. Since their expected values are β1 and 0, β̂1 and β̂2 are unbiased estimates of β1 and β2 respectively. If instead we fit Model 1, we get:

    E(β̂1) = β1,    Var(β̂1) = σ² / Σi=1..n xi1².

Note that in this case the variance of β̂1 is lower. For an unbiased estimate, the variance equals the expected squared error. So when we make predictions
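The variance comparison above can be checked with a small simulation, which is not from the text: the true model is y = β1x1 + ε, but Model 2 also fits an irrelevant x2 that is strongly correlated with x1. Fitting Model 2 inflates the variance of the estimate of β1 by roughly a factor of 1 / (1 − r12²), while both fits remain unbiased.

```python
# Simulation of the two-variable illustration (illustrative, not from the text).
import numpy as np

rng = np.random.default_rng(3)
n, trials, beta1 = 30, 4000, 2.0
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # x2 strongly correlated with x1

b1_model1, b1_model2 = [], []
for _ in range(trials):
    y = beta1 * x1 + rng.normal(size=n)     # x2 plays no role in the true model
    # Model 1: y ~ x1 (no intercept, as in the text)
    b1_model1.append((x1 @ y) / (x1 @ x1))
    # Model 2: y ~ x1 + x2 (the irrelevant variable is included)
    A = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    b1_model2.append(b[0])

print(np.var(b1_model1) < np.var(b1_model2))   # True: Model 2 has higher variance
print(np.mean(b1_model1), np.mean(b1_model2))  # both near beta1 = 2.0 (unbiased)
```

With r12 close to 0.95 here, the inflation factor 1 / (1 − r12²) is large, so the extra variance from the irrelevant variable is easy to see.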