Multiple Linear Regression (多元线性回归)

Multiple linear regression in data mining

Contents:
2.1 An overview of linear regression
2.2 An example of the regression process
2.3 Subset selection in linear regression

Perhaps the most popular mathematical model for making predictions is the multiple linear regression model. You have already encountered the multiple linear regression model in the Data, Models, and Decisions course. In this note we build on that knowledge and examine how the model is applied in data mining. Multiple linear regression is used in data mining situations where the quantity to be predicted is numerical. Examples include: predicting customer activity on credit cards from demographics and historical behavior; predicting the time to failure of equipment from usage and environmental conditions; forecasting vacation travel expenditures from past travel records; forecasting staffing requirements at enquiry counters from historical data together with product and sales information; predicting cross-sales of products from historical information; and predicting the impact of discounts on sales in retail outlets.

In this note we review the process of multiple linear regression. We stress the need to divide the data into two sets, a training data set and a validation data set. This division makes it possible to validate the multiple linear regression model while relaxing the assumption that the errors follow a normal distribution. After this review, we introduce methods for selecting subsets of the independent variables to improve prediction.

2.1 An overview of linear regression

Here we briefly review the multiple linear regression model encountered in the Data, Models, and Decisions course. There is a continuous random variable called the dependent variable, Y, and a number of independent variables x_1, x_2, ..., x_p. Our purpose is to predict the value of the dependent variable (also referred to as the response variable) using a linear equation in the independent variables (also referred to as predictor variables). The model is

    Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon,    (1)

where \varepsilon, the "noise" variable, is a normally distributed random variable with mean 0 and standard deviation \sigma, whose value we do not know. We also do not know the values of the coefficients \beta_0, \beta_1, ..., \beta_p. We estimate all of these (p + 2) unknown parameters from the available data.

The data consist of n observations, also known as instances, written as (y_i, x_{i1}, x_{i2}, ..., x_{ip}) for i = 1, 2, ..., n. The estimates of the \beta coefficients are computed so as to minimize the sum of squared differences between the predicted and observed values, that is, to minimize

    \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2.

Let \hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p denote the coefficient values that minimize this expression. These are our estimates of the unknown true values, and this estimator is referred to in the literature as OLS (ordinary least squares). Once we have computed these estimates, we can compute an unbiased estimate \hat{\sigma}^2 of \sigma^2 from the residuals:

    \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2}{n - p - 1}.
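To make the estimation procedure concrete, here is a minimal sketch of computing the OLS coefficients and the unbiased estimate of \sigma^2 with NumPy. It is an illustration only (the note itself uses Excel); the names fit_ols, predict, X and y are assumptions.

import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: return (coefficients, sigma2_hat).

    X is an (n, p) array of independent variables and y an (n,) array of
    observed values of the dependent variable. An intercept column is
    prepended, the coefficients minimize the sum of squared residuals, and
    sigma2_hat is the residual sum of squares divided by (n - p - 1).
    """
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])            # add the intercept column
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p - 1)
    return beta_hat, sigma2_hat

def predict(beta_hat, X_new):
    """Apply the fitted linear equation to new observations."""
    X_new = np.atleast_2d(X_new)
    return beta_hat[0] + X_new @ beta_hat[1:]

Here beta_hat[0] plays the role of \hat{\beta}_0 and beta_hat[1:] the roles of \hat{\beta}_1, ..., \hat{\beta}_p.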

We predict the value of the dependent variable from known values of the independent variables x_1, x_2, ..., x_p by inserting the estimated coefficients into the linear regression model (1). The predicted value is computed as

    \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p.

Predictions based on this formula are the best possible predictions in the sense that they are unbiased (their expected values equal the true values) and have minimum variance compared with all other unbiased estimates, provided we make the following assumptions:

1. Linearity: the expected value of the dependent variable is a linear function of the independent variables,
   E(Y | x_1, x_2, ..., x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.
2. Independence: the noise random variables \varepsilon_i are independent across all observations, where \varepsilon_i is the noise random variable in observation i, for i = 1, 2, ..., n.
3. Unbiasedness: the noise random variables have expected value 0, that is, E(\varepsilon_i) = 0 for i = 1, 2, ..., n.
4. Equal variance: the standard deviation of \varepsilon_i has the same value \sigma for i = 1, 2, ..., n.
5. Normality: the noise random variables \varepsilon_i are normally distributed.

An important and interesting fact for our purposes is that even if we drop the normality assumption (assumption 5) and allow the noise variables to follow arbitrary distributions, these estimates still give very good predictions. It can be shown that predictions based on these estimates are the best linear predictors in the sense that they minimize the expected squared error. In other words, among all linear models as defined in equation (1), the model built with the least squares estimates \hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_p gives the smallest mean squared error. We describe this idea in more detail in the next section.

The normality assumption is used to derive confidence intervals for predictions. In data mining applications we have two distinct data sets, the training data set and the validation data set, both of which are representative of the relationship between the independent and dependent variables. The training data set is used to estimate the regression coefficients. The validation data set forms a hold-out sample that is not used in computing the coefficient estimates. This enables us to estimate the error in our predictions without having to assume that the noise variables follow a normal distribution. We use the training data to fit the model and to estimate the coefficients. These estimated coefficients are then used to make predictions for each case in the validation data set. The prediction for each case is compared with the actual value of the dependent variable in the validation data. The mean of the squared differences allows us to compare different models and to assess the prediction accuracy of each model.
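A minimal sketch of this training/validation procedure follows. It is an illustration only (the note itself uses Excel); the function name train_validate, the NumPy-based fitting, and the choice of taking the first n_train cases as training data are assumptions.

import numpy as np

def train_validate(X, y, n_train):
    """Fit by least squares on the first n_train cases, evaluate on the rest.

    Returns the estimated coefficients (intercept first), the mean error, and
    the mean squared error of the predictions on the validation cases.
    """
    X1 = np.column_stack([np.ones(len(y)), X])        # add the intercept column
    train, valid = slice(0, n_train), slice(n_train, None)

    beta_hat, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)  # training data only
    errors = X1[valid] @ beta_hat - y[valid]          # predicted minus actual, per validation case

    return beta_hat, errors.mean(), np.mean(errors ** 2)

In the worked example of the next section, the validation set consists of examples 21-30 of Table 2.3, so n_train would presumably be 20.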

2.2 An example of the regression process

We illustrate the process of multiple linear regression with an example from Chatterjee, Hadi, and Price on evaluating the performance of managers in a large financial organization.

The data shown in Table 2.1 come from a survey of the clerical employees of a department in a large financial organization. The dependent variable is a measure of the efficiency of the department as run by its manager. Both the dependent variable and the independent variables are totals of ratings, on a scale from 1 to 5, given by 25 employees on different aspects of the manager's work; as a result, for each variable the minimum possible value is 25 and the maximum is 125. The ratings are the survey responses of 25 employees in each of 30 departments. The purpose of the analysis is to explore the feasibility of using the questionnaire to predict departmental efficiency, thereby avoiding the effort of measuring efficiency directly. The variables are the answers to the survey questions and are described as follows:

Y    measure of the efficiency of the department's management;
X1   handling of employee complaints;
X2   does not allow special privileges;
X3   opportunity to learn new things;
X4   promotion based on performance;
X5   too critical of poor performance;
X6   rate of advancement to better jobs.

The multiple linear regression estimates, computed with the StatCalc plug-in for Excel, are shown in Table 2.2.

Table 2.2

The fitted equation for predicting efficiency is

    \hat{Y} = 13.182 + 0.583 X_1 - 0.044 X_2 + 0.329 X_3 - 0.057 X_4 + 0.112 X_5 - 0.197 X_6.
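As a quick check of how this equation is applied, the following sketch (an illustration, not part of the original note) evaluates it for validation example 21 of Table 2.3 below, whose ratings are X1 = 40, X2 = 33, X3 = 34, X4 = 43, X5 = 64, X6 = 33.

# Fitted coefficients from Table 2.2: intercept, then the coefficients of X1 ... X6.
coef = [13.182, 0.583, -0.044, 0.329, -0.057, 0.112, -0.197]

# Ratings given to the manager of example 21 (from Table 2.3).
x21 = [40, 33, 34, 43, 64, 33]

y_hat = coef[0] + sum(b * x for b, x in zip(coef[1:], x21))
print(round(y_hat, 2))  # about 44.45, matching the 44.46 in Table 2.3 up to coefficient rounding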

We use ten examples as validation data. The predictions obtained by applying the above equation to the validation data, together with the resulting errors, are shown in Table 2.3. The error, shown in the last column, is the difference between the predicted and the actual value. For example, the error for example 21 is 44.46 - 50 = -5.54.

Table 2.3

Example    Y    X1   X2   X3   X4   X5   X6   Predicted value   Error
21         50   40   33   34   43   64   33   44.46             -5.54
22         64   61   52   62   66   80   41   63.98             -0.02
23         53   66   52   50   63   80   37   63.91             10.91
24         40   37   42   58   50   57   49   45.87              5.87
25         63   54   42   48   66   75   33   56.75             -6.25
26         66   77   66   63   88   76   72   65.22             -0.78
27         78   75   58   74   80   78   49   73.23             -4.77
28         48   57   44   45   51   83   38   58.19             10.19
29         85   85   71   71   77   74   55   76.05             -8.95
30         82   82   39   59   64   78   39   76.10             -5.90

Average                                       62.38             -0.52
Std. deviation                                11.30              7.17

We note that the average error of the predictions is small (-0.52), so the predictions are approximately unbiased. Further, since the errors are roughly normally distributed, about 95% of this model's prediction errors can be expected to fall within about ±14.34 (two standard deviations, 2 × 7.17) of the true value.
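As a sanity check on these summary figures, the short sketch below (an illustration, not part of the original note) recomputes the mean, the sample standard deviation, and the two-standard-deviation band from the error column of Table 2.3.

import statistics

# Error column of Table 2.3 (predicted minus actual, examples 21-30).
errors = [-5.54, -0.02, 10.91, 5.87, -6.25, -0.78, -4.77, 10.19, -8.95, -5.90]

mean_error = statistics.mean(errors)   # about -0.52
sd_error = statistics.stdev(errors)    # sample standard deviation, about 7.17
print(mean_error, sd_error, 2 * sd_error)  # two standard deviations: roughly 14.34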

2.3 Subset selection in linear regression

A problem that frequently arises in data mining is that, when we build a regression equation to predict the value of the dependent variable, there are many variables available as candidate independent variables. Given fast modern algorithms for linear regression, one might be tempted in some situations to take an extremely pragmatic view: why bother selecting a subset? Just use all the variables in the model. There are several reasons why this is not ideal:

- It is costly to collect the full set of predictor variables;
- We may be able to measure a smaller number of variables more accurately (for example, in surveys);
- Parsimony is an important property of a good model: with fewer variables we gain better insight into the influence of the independent variables in the model;
- Multicollinearity among many variables in the model makes the estimated regression coefficients unstable; with fewer variables we gain better insight into the influence of the independent variables, since the regression coefficients of a simpler model are more stable;
- Independent variables that are unrelated to the dependent variable increase the variance of the predictions;
- Dropping independent variables with small regression coefficients reduces the average error of the predictions.

Let us illustrate the last two points with a simple example with two independent variables (a numerical sketch follows at the end of this section); the same reasoning remains valid with more than two independent variables.

2.3.1 Dropping irrelevant variables

Suppose that the true equation for Y is (Model 1)

    Y = \beta_1 x_1 + \varepsilon,

and suppose that we estimate Y with the equation (which includes the additional, actually irrelevant, variable x_2)

    Y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon.    (Model 2)

We use the data (y_i, x_{i1}, x_{i2}), i = 1, 2, ..., n. In this situation the least squares estimates \hat{\beta}_1, \hat{\beta}_2 have the following expected values and variances:

    E(\hat{\beta}_1) = \beta_1,    Var(\hat{\beta}_1) = \frac{\sigma^2}{(1 - r_{12}^2) \sum_{i=1}^{n} x_{i1}^2},

    E(\hat{\beta}_2) = 0,          Var(\hat{\beta}_2) = \frac{\sigma^2}{(1 - r_{12}^2) \sum_{i=1}^{n} x_{i2}^2},

where r_{12} is the correlation coefficient between x_1 and x_2. We note that \hat{\beta}_1 and \hat{\beta}_2 are unbiased estimates of \beta_1 and \beta_2 = 0, since their expected values are \beta_1 and 0, respectively. If instead we use Model 1, we obtain

    E(\hat{\beta}_1) = \beta_1,    Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_{i1}^2}.

Note that in this case the variance of \hat{\beta}_1 is lower. For an unbiased estimate, the variance is the expected value of the squared error. So when we make predict ...
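To illustrate the variance comparison in Section 2.3.1 numerically, here is a small simulation sketch. It is an illustration under assumed parameter values (n, beta1, sigma, rho and the use of NumPy are all assumptions, not part of the note): data are generated from Model 1, and the variability of \hat{\beta}_1 is compared with and without the irrelevant variable x_2 in the regression.

import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000
beta1, sigma, rho = 2.0, 1.0, 0.9                 # assumed illustrative values

# One fixed design: x2 is correlated with x1 but has no effect on Y.
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

b1_only, b1_both = [], []
for _ in range(reps):
    y = beta1 * x1 + sigma * rng.normal(size=n)   # Model 1 is the truth
    # Model 1: regress y on x1 alone (no intercept, as in the note).
    b1_only.append(np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0])
    # Model 2: regress y on both x1 and the irrelevant x2.
    b1_both.append(np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0][0])

r12 = x1 @ x2 / np.sqrt((x1 @ x1) * (x2 @ x2))    # correlation of the two design columns
print(np.var(b1_both) / np.var(b1_only))          # ratio of the two variances of beta1-hat
print(1 / (1 - r12 ** 2))                         # theoretical inflation factor; the two should be close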
