1. From the Oregon Health Authority we have numbers for deaths attributable to the virus. We also have the proportion of each county’s residents who have had at least one dose of vaccine. This is less informative than the state-level data on doses per capita, but we can assume that the OHA variable is substantially correlated with the vaccination data we would like to have. From the US Census Bureau we have other information about Oregon counties: the percentage over 65 years of age, and the percentage of adults with a four-year college degree or higher. Finally, from various sources we can obtain the share of each county’s vote won by Donald Trump in the 2020 election. These are summarized in the SPSS Data Set. Load it into SPSS to begin working on the questions below:
(a) Construct and display a scatter plot using the vaccination and deaths per thousand variables. Which should you put on the y-axis? Why?
(b) How would you describe the relationship between these two variables based solely on the scatter plot?
(c) Now fit a linear regression with deaths per thousand as the dependent variable and vaccination as the independent variable. What is the adjusted R² of this equation? Say in words what it means.
(d) Let’s consider whether the dependent variable should be deaths per thousand or the natural logarithm of deaths per thousand. Construct and display a histogram for the deaths per thousand variable. Does this look like a power law-type distribution? Based on this one consideration, do you think it will improve the regression to log transform the dependent variable?
(e) Now let’s look at the Trump vote share variable. Fit a linear regression for which deaths per thousand is the dependent variable and the Trump vote is the sole independent variable. What is the adjusted R² for this equation? Does the t-statistic for this variable show that it is “statistically significant”? Say in words what the coefficient on the Trump vote share means.
(f) Based on (e), it might be possible for someone to fall into the Ecological Fallacy. Express an erroneous interpretation of the Trump vote share regression that illustrates this fallacy.
(g) It might be possible that counties with higher levels of education take more precautions against the virus and therefore have a lower death rate. To test this, fit a linear regression for which deaths per thousand is the dependent variable and BA is the sole independent variable. Is the coefficient statistically significant? What is the adjusted R² for this equation?
(h) With several such apparently powerful explanatory variables, we need more information to figure out how to combine them. Construct a correlation matrix for all the variables in the data set. Which of the potential explanatory variables have a very strong correlation, positive or negative, with deaths per thousand?
(i) So now fit a regression of deaths per thousand (dependent variable) on vaccination, Trump vote share and BA. What is the adjusted R² for this equation? Based on their t-statistics, what is the weakest independent variable?
(j) Eliminate this weakest variable, and now fit a regression of deaths per thousand on the two that remain. What is the adjusted R²? Compare the coefficients in this equation to the coefficients of these two variables when they were run separately. What do you see?
(k) Construct a standardized residual plot for this equation, putting the dependent variable on the y-axis and the standardized (Z) residual on the x-axis. Which counties, if any, are close to three standard deviations above or below their predicted value? Is Multnomah County above or below its prediction?
(l) Based on the correlation matrix, why do you think that neither of the two explanatory variables you used in (j), which were so powerful on their own, is statistically significant when used together?
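
For anyone who wants to check by hand what SPSS reports, the adjusted R² asked for in (c) and the standardized residuals asked for in (k) can be sketched in a few lines of Python. This is only an illustration and not part of the assignment. The function names and the five toy data points are invented for the example and are not taken from the OHA or Census data. The standardized residual is assumed here to be the raw residual divided by the regression's standard error of the estimate, which may differ in detail from other residual types SPSS offers.

```python
from math import sqrt
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def adjusted_r2(y, fitted, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    y_bar = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def standardized_residuals(y, fitted, n_predictors):
    """Residuals divided by the standard error of the estimate (assumed definition)."""
    n = len(y)
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    std_err = sqrt(sum(r ** 2 for r in resid) / (n - n_predictors - 1))
    return [r / std_err for r in resid]


# Toy numbers, made up for illustration only (NOT county data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = simple_ols(x, y)
fitted = [intercept + slope * xi for xi in x]
adj_r2 = adjusted_r2(y, fitted, n_predictors=1)
z_resid = standardized_residuals(y, fitted, n_predictors=1)

print(round(slope, 4), round(adj_r2, 4))
print([round(z, 2) for z in z_resid])
```

With one predictor and a small number of observations, the adjusted R² is noticeably lower than the plain R², which is why the questions above ask for the adjusted figure when comparing equations with different numbers of variables.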