Saturday, June 09, 2007

correlation and regression

Chapter 17

The product moment correlation coefficient, r, measures the linear association between two metric (interval or ratio scaled) variables. Its square, r², measures the proportion of variation in one variable explained by the other. The partial correlation coefficient measures the association between two variables after controlling, or adjusting for, the effects of one or more additional variables. The order of the partial correlation indicates how many variables are being adjusted or controlled for. Partial correlations can be very helpful for detecting spurious relationships.
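As a rough illustration of the paragraph above, here is a minimal Python sketch that computes r, its square, and a first-order partial correlation (one variable controlled). The function names and the data are made up for this example and do not come from the chapter; the partial correlation uses the standard formula built from the three pairwise correlations.

```python
from math import sqrt

def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up illustrative data (not from the chapter)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
z = [1, 2, 2, 4, 5]

r = pearson_r(x, y)
print("r =", round(r, 3), " r squared =", round(r ** 2, 3))
print("partial r (controlling for z) =", round(partial_r(x, y, z), 3))
```

If the partial correlation drops toward zero once z is controlled, that would suggest the original association between x and y may be spurious.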

Bivariate regression derives a mathematical equation between a single metric criterion variable and a single metric predictor variable. The equation is derived in the form of a straight line by using the least squares procedure. When the regression is run on standardized data, the intercept assumes a value of 0, and the regression coefficients are called beta weights. The strength of association is measured by the coefficient of determination, r², which is obtained by computing the ratio of SSreg to SSy. The standard error of estimate is used to assess the accuracy of prediction and may be interpreted as a kind of average error made in predicting Y from the regression equation.
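The quantities named in this paragraph can be sketched in a few lines of Python. This is only an illustration on made-up data: the function name fit_line is hypothetical, and the standard error of estimate is computed with the usual n - 2 denominator for the bivariate case, which is an assumption since the notes do not spell out the formula.

```python
from math import sqrt

def fit_line(x, y):
    """Least squares fit of y = a + b*x; returns (a, b, r2, see)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept
    predicted = [a + b * xi for xi in x]
    ss_y = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ss_reg = sum((p - mean_y) ** 2 for p in predicted)       # explained variation
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
    r2 = ss_reg / ss_y                 # coefficient of determination
    see = sqrt(ss_res / (n - 2))       # standard error of estimate
    return a, b, r2, see

# Made-up illustrative data (not from the chapter)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope, r2, see = fit_line(x, y)
print("Y =", round(intercept, 2), "+", round(slope, 2), "* X")
print("r squared =", round(r2, 3), " standard error of estimate =", round(see, 3))
```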

Multiple regression involves a single dependent variable and two or more independent variables. The partial regression coefficient, b1, represents the expected change in Y when X1 is changed by one unit and X2 through Xk are held constant. The strength of association is measured by the coefficient of multiple determination, R². The significance of the overall regression equation may be tested by the overall F test. Individual partial regression coefficients may be tested for significance using the t test or the incremental F test. Scattergrams of the residuals, in which the residuals are plotted against the predicted values, time, or predictor variables, are useful for examining the appropriateness of the underlying assumptions and the regression model fitted.
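To make the partial regression coefficients, R², and the overall F statistic concrete, here is a small sketch that fits a multiple regression by solving the normal equations with plain Python. The helper names (solve, multiple_regression) and the data are invented for illustration, and in practice a statistics package would normally be used. The F statistic follows the standard form (SSreg / k) / (SSres / (n - k - 1)), which is an assumption here because the notes do not give the formula.

```python
def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def multiple_regression(rows, y):
    """rows: one tuple (X1..Xk) per observation. Returns (coefficients, R2, F)."""
    n, k = len(y), len(rows[0])
    design = [[1.0, *row] for row in rows]           # leading 1 for the intercept
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k + 1)]
           for i in range(k + 1)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k + 1)]
    coef = solve(xtx, xty)                           # [intercept, b1, ..., bk]
    predicted = [sum(c * v for c, v in zip(coef, d)) for d in design]
    mean_y = sum(y) / n
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
    ss_reg = ss_y - ss_res
    r2 = ss_reg / ss_y                               # coefficient of multiple determination
    f_stat = (ss_reg / k) / (ss_res / (n - k - 1))   # overall F test statistic
    return coef, r2, f_stat

# Made-up illustrative data (not from the chapter)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [5, 6, 11, 10, 17, 16]

coef, r2, f_stat = multiple_regression(list(zip(x1, x2)), y)
print("intercept, b1, b2 =", [round(c, 3) for c in coef])
print("R squared =", round(r2, 3), " overall F =", round(f_stat, 2))
```

Here b1 is read as the expected change in Y for a one-unit change in X1 with X2 held constant, matching the definition above.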

In stepwise regression, the predictor variables are entered into or removed from the regression equation one at a time for the purpose of selecting a smaller subset of predictors that account for most of the variation in the criterion variable. Multicollinearity, or very high intercorrelations among the predictor variables, can result in several problems. Because the predictors are correlated, regression analysis provides no unambiguous measure of the relative importance of the predictors. Cross validation examines whether the regression model continues to hold true for comparable data not used in estimation. It is a useful procedure for evaluating the regression model.
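One simple way to look for the multicollinearity described above is to inspect the pairwise correlations among the predictors. The sketch below flags pairs whose absolute correlation exceeds a cutoff. The cutoff of 0.9, the function names, and the variable names are all assumptions made for this example; the chapter does not prescribe a threshold.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_intercorrelations(predictors, threshold=0.9):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold.

    The default threshold is a hypothetical cutoff chosen for illustration.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Made-up predictors: price and list_price move together almost perfectly
predictors = {
    "price": [10, 12, 14, 16, 18, 20],
    "list_price": [11, 13, 15, 17, 19, 22],
    "ad_spend": [5, 3, 6, 2, 7, 4],
}

flagged = high_intercorrelations(predictors)
for name_a, name_b, r in flagged:
    print(name_a, "and", name_b, "are highly correlated: r =", round(r, 3))
```

Pairwise correlations only catch collinearity between two predictors at a time, so this is a first check rather than a complete diagnosis.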

Nominal or categorical variables may be used as predictors by coding them as dummy variables. Multiple regression with dummy variables provides a general procedure for the analysis of variance and covariance.
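Dummy coding can be sketched as follows. A categorical variable with k levels becomes k - 1 columns of 0s and 1s, with one level left out as the reference category. The function names and the data are hypothetical. With a single two-level dummy as the only predictor, the intercept equals the mean of the reference group and the slope equals the difference between the two group means, which hints at the link to analysis of variance.

```python
def dummy_code(values, reference):
    """Code a categorical variable as k-1 dummy (0/1) columns, omitting `reference`."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

def fit_line(x, y):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up illustrative data (not from the chapter)
region = ["north", "south", "north", "south", "south", "north"]
sales = [10, 14, 12, 15, 16, 11]

dummies = dummy_code(region, reference="north")
print(dummies)                       # one column, for "south"

a, b = fit_line(dummies["south"], sales)
print("intercept (mean of north):", round(a, 2))
print("slope (south mean minus north mean):", round(b, 2))
```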

Product moment correlation (r) -- a statistic summarizing the strength of association between two metric variables
covariance -- a systematic relationship between two variables in which a change in one implies a corresponding change in the other
partial correlation coefficient -- a measure of the association between two variables after controlling or adjusting for the effects of one or more additional variables
part correlation coefficient -- a measure of the correlation between Y and X when the linear effects of the other independent variables have been removed from X but not from Y
nonmetric correlation -- a correlation measure for two nonmetric variables that relies on rankings to compute the correlation
regression analysis -- a statistical procedure for analyzing associative relationships between a metric dependent variable and one or more independent variables
bivariate regression -- a procedure for deriving a mathematical relationship, in the form of an equation, between a single metric dependent variable and a single metric independent variable
least-squares procedure -- a technique for fitting a straight line to a scattergram by minimizing the sum of the squared vertical distances of all the points from the line
multiple regression -- a statistical technique that simultaneously develops a mathematical relationship between two or more independent variables and an interval-scaled dependent variable
multiple regression model -- an equation used to explain the results of multiple regression analysis
residual -- the difference between the observed value of Yi and the value predicted by the regression equation, Ŷi
stepwise regression -- a regression procedure in which the predictor variables enter or leave the regression equation one at a time
multicollinearity -- a state of very high intercorrelations among independent variables
Cross validation -- a test of validity that examines whether a model holds on comparable data not used in the original estimation
double cross validation -- a special form of validation in which the sample is split into halves. One half serves as the estimation sample and the other as the validation sample. The roles of the estimation and validation halves are then reversed, and the cross validation process is repeated
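
The double cross validation entry above can be sketched in Python as follows. This is a minimal illustration on made-up data using a bivariate model. It assumes the observations are already in random order so that a simple first half / second half split is acceptable, and it scores each validation half with 1 - SSres / SSy, which is one possible measure of fit rather than a method prescribed by the chapter. All function names are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def validation_r2(a, b, x, y):
    """Share of variation in y explained when predicting with a + b*x."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_y

def double_cross_validate(x, y):
    """Estimate on one half, validate on the other, then swap the roles."""
    half = len(x) // 2
    first = (x[:half], y[:half])
    second = (x[half:], y[half:])
    scores = []
    for (est_x, est_y), (val_x, val_y) in ((first, second), (second, first)):
        a, b = fit_line(est_x, est_y)                    # estimation sample
        scores.append(validation_r2(a, b, val_x, val_y)) # validation sample
    return scores

# Made-up illustrative data (not from the chapter)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

scores = double_cross_validate(x, y)
print("validation fit for each half:", [round(s, 3) for s in scores])
```

If both scores stay close to the fit obtained on the estimation halves, the model appears to hold on data that was not used to estimate it.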