The lab for this chapter is at joshterrell805/OpenIntro_Statistics_Labs lab#7.
Linear regression should only be used when the data appear to have a linear relationship.
"A 'hat' on a y is used to signify that this is an estimate." ˆy is the estimate or predicted value for y.
"Residuals are the leftover variation in the data after accounting for the model fit: Data = Fit + Residuals."
residual - "the vertical distance from the observation to the line." If the point lies above the line, the residual is positive, if the point is on the line, the residual is 0, and if it is below the line, the residual is negative.
residual plot - plot a horizontal line at zero. For each observation, plot a point at its original x location, but with its height equal to the residual value. So a point with a residual of +2 sits two units above the zero line.
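A minimal sketch of computing residuals and drawing a residual plot with numpy and matplotlib; the x and y arrays are made-up illustration data, not from the book:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up illustration data (not from the book).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a least squares line: y_hat = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# Residual: vertical distance from each observation to the line.
residuals = y - y_hat

# Residual plot: each point keeps its original x, but its height
# is the residual; the horizontal reference line sits at zero.
plt.axhline(0, color="gray")
plt.scatter(x, residuals)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```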
"Correlation, which always takes values between -1 and 1, describes the strength of the linear relationship between two variables. We denote the correlation by R.
least squares regression minimizes the sum of the squared residuals.
conditions for the least squares line: linearity, nearly normal residuals, constant variability, and independent observations.
"The slope of the least squares line can be estimated by:"
b1 = (sy / sx) * R
"where R is the correlation between the two variables, and sx and sy are the sample standard deviations of the explanatory variable and the response, respectively."
"The point (ˉx,ˉy) is on the least squares line."
Point-slope form:
y−y0=slope(x−x0)
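Putting the last few facts together: the slope is b1 = (sy/sx) * R, and since the line passes through (x̄, ȳ), point-slope form gives the intercept b0 = ȳ − b1·x̄. A sketch on the same made-up data as above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

R = np.corrcoef(x, y)[0, 1]                 # correlation between x and y
b1 = (y.std(ddof=1) / x.std(ddof=1)) * R    # slope = (sy / sx) * R
b0 = y.mean() - b1 * x.mean()               # point-slope with (x_bar, y_bar)

# Matches np.polyfit(x, y, 1) up to floating point error.
print(b1, b0)
```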
When statistical software is used to fit a line to data, a table like the one below is generated. I copied this table from chapter 7 in the book. This table models the amount of student aid a student receives as a function of their family's income. The Estimate and Std. Error columns are in thousands of dollars (so the first cell is 25.3193 × $1000).
---------------------------------------------------------------
                 Estimate   Std. Error   t value   Pr(>|t|)
---------------------------------------------------------------
(Intercept)       25.3193       1.2915     18.83     0.0000
family_income     -0.0431       0.0108     -3.98     0.0002
---------------------------------------------------------------
The first row is the intercept of the line: it describes the output variable when all other variables are 0. The second row is the slope of the line for family_income. The first column is the estimate: when family_income is 0, the output is 25.3193 (the intercept), and for each unit family income increases, the output decreases by 0.0431. The second column is the standard error of each estimate. The third and fourth columns are the t-value and the two-sided p-value for the null hypothesis that the true coefficient (the intercept or the family_income slope) is 0.
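A table like this can be produced in Python with statsmodels. A minimal sketch, assuming the book's Elmhurst data is available as a CSV with columns family_income and gift_aid (the file name and column names here are assumptions, not given in the book's text):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names modeled on the book's Elmhurst example.
df = pd.read_csv("elmhurst.csv")  # assumed columns: family_income, gift_aid

# summary() prints a coefficient table with the estimate (coef),
# standard error, t value, and two-sided p-value for each row.
fit = smf.ols("gift_aid ~ family_income", data=df).fit()
print(fit.summary())
```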
extrapolation is "applying a model estimate to values outside the realm of the original data…If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed."
"The R2 of a linear model describes the amount of variation in the response that is explained by the least squares line."
An indicator variable is a binary variable. It is equal to 1 if the thing it represents is present, otherwise 0.
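A quick sketch of building an indicator variable with pandas; the column names echo the book's style (e.g., cond_new) but the data here are hypothetical:

```python
import pandas as pd

# Made-up data: convert a two-level category into a 0/1 indicator.
df = pd.DataFrame({"condition": ["new", "used", "new", "used"]})
df["cond_new"] = (df["condition"] == "new").astype(int)  # 1 if new, else 0
print(df)
```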
A high leverage outlier is a point that falls horizontally far from the center of the cloud of points; such points pull harder on the slope of the line.
"If one of these high leverage points does appear to actually invoke its influence on the slope of the line…then we call it an influential point. Usually we can say a point is influential if, had we fitted the line without it, the influential point would have been unusually far away from the least squares line."
Don't remove outliers without a very good reason. "Models that ignore exceptional (and interesting) cases often perform poorly." The answer to "Guided Practice 7.24" in this chapter suggests it's okay to remove outliers when they interfere with understanding the data we care about. That example removed two points that occurred during the Great Depression when modeling voting behavior over the last century. The two Great Depression points would have been influential on the model, but we don't care much about modeling voting behavior during the Depression.