---
title: "Recitation 4: Nonlinear regressions and variable interactions"
author: "Matthew Alampay Davis"
date: June 20, 2023
output:
pdf_document:
keep_tex: true
header-includes:
- \usepackage{booktabs}
- \usepackage{tikz}
- \usetikzlibrary{fit}
- \usepackage{colortbl}
---
```{r, include = F}
# Load all packages used in this Notebook
library(tidyverse) # Data cleaning
library(magrittr) # Lets us use the convenient %<>% pipe
library(haven) # read_dta
library(estimatr) # lm_robust, confint
library(car) # linearHypothesis and joint hypothesis tests
library(ggplot2) # Plotting
library(fixest) # multiple regressions simultaneously, displays in a nice table
# New
library(margins) # used for estimating marginal effects
# Load all datasets used in this Notebook
lead <- read_dta('data/lead_mortality.dta') # Practice Question 1
cps <- read_dta('data/CPS2015.dta') # Practice Question 2
```
# Regressions with interactions
Regressions with interaction terms are a type of non-linear regression and they involve multiplying two covariates together:
$$
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 (X_{1i}\times X_{2i}) + u_i
$$
Obviously, the last covariate is the interaction term. They are relevant when we think that $X_1$'s effect on $Y$ depends on the value of $X_2$. For example, in our data, we may think that having a college degree affects earnings and that being a woman affects earnings, but we may also suspect that the effect a college education has on earnings is different for men and women. Similarly, we may think the effect being a woman has on earnings is different for those with and without a college degree. If so, then we want to include an interaction term to capture this relationship.
There are a few equivalent ways to do this. Starting with what I think is the most convenient:
## Method (1)
The easiest is to put an asterisk '\*' in between the two variables you want to interact:
```{r}
int.model.1 <- lm_robust(ahe ~ female*bachelor, cps, se_type = 'HC1')
summary(int.model.1)
```
Conveniently, this automatically adds the two interacting variables separately so no need to think about whether you've included all the 'main' effects
## Method (2)
A second way is to use the colon ':' in between the two variables you want to interact. The difference here is doing so does not automatically include the main effects:
```{r}
int.model.2a <- lm_robust(ahe ~ female:bachelor, cps, se_type = 'HC1')
summary(int.model.2a)
```
We'll almost always want to include the main effects so a complete implementation has a longer formula:
```{r}
int.model.2b <- lm_robust(ahe ~ female + bachelor + female:bachelor, cps, se_type = 'HC1')
summary(int.model.2b)
```
## Method (3)
Finally, we could also define a new variable that is the product of the female and bachelor variables then include it as a regressor in our regression formula:
```{r}
cps.new <- mutate(cps, female.bachelor = female*bachelor)
int.model.3 <- lm_robust(ahe ~ female + bachelor + female.bachelor, cps.new, se_type = 'HC1')
summary(int.model.3)
```
A bit more circuitous but all three methods are equivalent: you can see the ones that include the main effects all produce the exact same estimates. The only difference is that in the third method, the name of the interactive term uses a period '.' instead of a colon ':'. I'll tend to favor method 1.
Here's how we'd conduct a linear Hypothesis test including female:bachelor":
```{r}
linearHypothesis(int.model.1, c('bachelor = 0', 'female:bachelor = 0'), test = 'F')
```
# Some more data cleaning
## Summarizing data by group
We know how to take means and standard deviations of variables. But suppose we want to calculate separate means and standard deviations for different subsets. See Practice Question 1 for an example that also demonstrates the efficiency and readability of using piping and the "group_by" function from tidyverse.
## Plotting data by group
Suppose we want to draw a plot with separate lines of best fit for different subsets of the data so we can compare their slopes. Or suppose we want to draw a scatter plot and color points according to different values they take on (for example, color female observations differently from male observations). ggplot2 lets us do this very efficiently as we'll see in Practice Question 1 part b-ii.
# Practice Question 1: Stock-Watson Empirical Exercise E8.1
Preview the data:
```{r}
head(lead)
```
We will be investigating the effects of early-20th century lead contamination on infant mortality.
## Part a: Compute the average infant mortality rate (*Inf*) for cities with lead pipes and for cities with nonlead pipes. Is there a statistically significant difference in the averages?
Question's simple enough so just for kicks, here's two equivalent ways of answering the question:
Method 1: using the group_by and reframe functions from tidyverse:
```{r}
# Method 1: in one command
lead %>% # The dataset lead
group_by(lead) %>% # Here, lead refers to the variable lead contained in the dataset also named lead
reframe(mean = mean(infrate),
sd = sd(infrate))
```
Method 2: doing separate calculations for each category of lead:
```{r}
# Lead pipes
filter(lead, lead == 1) %$% # %$% pipe to select the variable infrate
infrate %>%
mean
filter(lead, lead == 1) %$%
infrate %>%
sd
# Non-lead pipes
filter(lead, lead == 0) %$%
infrate %>%
mean
filter(lead, lead == 0) %$%
infrate %>%
sd
```
## Part b: The amount of lead leached from lead pipes depends on the chemistry of the water running through the pipes. The more acidic the water is (that is, the lower its *pH*), the more lead is leached. Run a regression of *Inf* on *Lead*, *pH*, and the interaction term $Lead\times pH$.
Running the regression:
```{r}
lead.mod <- lm_robust(infrate ~ lead*ph, lead, se_type = 'HC1')
summary(lead.mod)
```
### b-i) The regression includes four coefficients (the intercept and the three coefficients multiplying the regressors). Explain what each coefficient measures.
The first coefficient is the intercept, which shows the level of *Infrate* when *lead* = 0 and *pH* = 0. It dictates the level of the regression line. The second coefficient and fourth coefficients measure the effect of lead on the infant mortality rate. Comparing two cities, one with lead pipes (*lead* = 1) and one without lead pipes (*lead* = 0), but with the same *pH*, the difference in predicted infant mortality rate is
$$0.46180-0.05686\times pH$$ Thus, the effect of lead contamination depends on the level of acidity that we're holding fixed.
The third and fourth coefficients measure the effect of *pH* on the infant mortality rate. Comparing two cities, one with a *pH* of 6 and the other with a *pH* of 5, but the same 'leadedness', the difference in predicted infant mortality rate is
$$-0.075-0.057\times lead$$
so the infant mortality increase associated with a one-unit increase in *pH* is -0.075 for cities without lead pipes and -0.075-0.057=-0.132 for cities with lead pipes.
### b-ii) Plot the estimated regression function relating *Inf* to *pH* for *Lead* = 0 and for *Lead* = 1. Describe the differences in the regression functions, and relate these differences to the coefficients you discussed in b-i).
You can of course plot two separate graphs for the two cases. But here's an opportunity to try out plotting two lines on the same graph:
```{r}
ggplot(lead, aes(x = ph, y = infrate, fill = factor(lead), color = factor(lead))) +
theme_bw() +
geom_smooth(method = 'lm') +
# Give title to the axes and the legend
labs(x = 'Acidity (pH)',
y = 'Infant mortality rate',
fill = 'Lead piping', color = 'Lead piping')
```
Notice we have turned the variable *lead* into a factor. The factor function converts a numerical variable into a categorical variable so that all values that it takes on are distinct groups. In ggplot, this means that we set the *fill* and *color* colors to depend on this factor so that when we add any plot element, it will color the two groups differently. In addition, it means that if we use geom_smooth to plot lines of best fit, it'll create separate lines for the set of observations with lead == 1 and for lead == 0 (the only two values lead takes in this data) since geom_smooth() also takes *fill* and *color* as arguments.
The infant mortality rate is higher for cities with lead pipes, but the difference declines as the pH level increases.
### b-iii) Does *lead* have a statistically significant effect on *Infrate*?
```{r}
linearHypothesis(lead.mod, c('lead = 0', 'lead:ph = 0'), test = 'F')
```
The F-statistic for the coefficient on lead and the interaction term is F = 3.936, which has a p- value of 0.02, so the lead coefficients are jointly significant at the 5% but not the 1% significance level.
### b-iv) Does the effect of lead on infant mortality depend on pH? Is this dependence statistically significant?
```{r}
lead.mod$p.value
coeftable(lead.mod) # from fixest package
```
The interaction term has a t-statistic of t = -2.02, corresponding to a p-value of 0.448 so the coefficient is significant at the 5% but not the 1% significance level.
### b-v) Average value of pH
#### What is the average value of pH in the sample?
```{r}
mean(lead$ph)
```
#### At this pH level, what is the estimated effect of Lead on infant mortality?
Method 1:
```{r}
# Create two observations with the mean ph level but different values of lead:
mean.ph <- data.frame(lead = c(1,0),
ph = mean(lead$ph))
# Estimate their infant mortality
inf.meanph <- predict(lead.mod, newdata = mean.ph)
inf.meanph
# Take the difference between these estimates
inf.meanph[1]-inf.meanph[2]
```
Careful with the sign here: the interpretation here is that at the mean pH level, an observation with lead (our first observation) is predicted to have a 0.0454 *higher* infant mortality than an observation without lead (our second observation)
Method 2:
```{r}
# Estimated mortality with lead
lead.1 <- data.frame(lead = 1, ph = mean(lead$ph))
# Estimated mortality with lead
lead.0 <- data.frame(lead = 0, ph = mean(lead$ph))
# Difference between estimates
predict(lead.mod, newdata = lead.1)-predict(lead.mod, newdata = lead.0)
```
#### What is the standard deviation of pH?
The standard deviation of pH is
```{r}
ph.sd <- sd(lead$ph)
ph.sd
```
#### Suppose the pH level is one standard deviation lower than the average level of pH in the sample: What is the estimated effect of Lead on infant mortality?
The estimated effect of lead on infant mortality when the pH is one standard deviation lower than average level of pH in the sample is given by
```{r}
# Method 1
ph.1sd.lower <- data.frame(lead = c(1,0),
ph = mean(lead$ph)-sd(lead$ph))
preds.1sd.lower <- predict(lead.mod, newdata = ph.1sd.lower)
preds.1sd.lower[1]-preds.1sd.lower[2]
# Method 2
lead.sd1 <- data.frame(lead = 1, ph = mean(lead$ph)-sd(lead$ph))
lead.sd0 <- data.frame(lead = 0, ph = mean(lead$ph)-sd(lead$ph))
predict(lead.mod, newdata = lead.sd1)-predict(lead.mod, newdata = lead.sd0)
```
#### What if pH is one standard deviation higher than the average value?
```{r}
# Method 1
ph.1sd.higher <- data.frame(lead = c(1,0),
ph = mean(lead$ph)+sd(lead$ph))
preds.1sd.higher <- predict(lead.mod, newdata = ph.1sd.higher)
preds.1sd.higher[1]-preds.1sd.higher[2]
# Method 2
lead.sd1 <- data.frame(lead = 1, ph = mean(lead$ph)+sd(lead$ph))
lead.sd0 <- data.frame(lead = 0, ph = mean(lead$ph)+sd(lead$ph))
predict(lead.mod, newdata = lead.sd1)-predict(lead.mod, newdata = lead.sd0)
```
### b-vi) Constructing a 95% confidence interval for the effect of lead on infant mortality when pH = 6.5
The given regression equation is
$$
\text{Infrate} = \beta_0 + \beta_1\text{lead} + \beta_2\text{pH}+\beta_3(\text{lead}\times\text{pH})+u_i
$$
#### Method 1: the margins package
```{r}
mod.lead <- lm_robust(infrate ~ lead*ph, lead, se_type = 'HC1')
summary(mod.lead)
```
The given coefficient on lead is the marginal effect of lead but when pH = 0. We want the marginal effect of lead when pH = 6.5:
```{r}
marg.ph65 <- margins(mod.lead, at = list(ph = 6.5))
marg.ph65
```
The marginal effect of lead is estimated to be 0.092. We can construct the corresponding confidence intervals using confint():
```{r}
confint(marg.ph65, level = 0.95)
```
This gives a confidence interval of 0.028 to 0.157
#### Method 2: Transforming the regression
Referring to method 2 of section 7.3 of Stock-Watson, we add and subtract $6.5\beta_3\text{lead}$ to the regression:
$$
\text{Infrate} = \beta_0 + (\beta_1+6.5\beta_3)\text{lead} + \beta_2\text{pH}+\beta_3[\text{pH}\cdot\text{lead}-6.5\cdot\text{lead}] + u_i
$$
Estimating this regression
```{r}
lead %<>% mutate(lead, x3 = lead*ph-6.5*lead)
mod.lead2 <- lm_robust(infrate ~ lead + ph + x3, lead, se_type = 'HC1')
summary(mod.lead2)
```
Then we get exactly the same estimated effet, here presented as the coefficient on lead. The corresponding confidence interval is the one for the coefficient on lead: 0.027 to 0.157, essentially the same as through Method 1.
## Part c: The analysis in (b) may suffer from omitted variable bias because it neglects factors that affect infant mortality and that might potentially be correlated with Lead and pH. Investigate this concern, using the other variables in the data set.
There are several demographic variables in the dataset. You should add these and see if the conclusions from (b) change in an important way. (Skipping this)
# Practice Question 2: Stock-Watson Empirical Exercise E8.2
One thing to note here is that this data comes from 2015 whereas the solutions seem to use 2012 data so the estimates are slightly different. When I'm comparing models below, I'm copying and pasting the official answers provided so they may actually be incompatible with the results being displayed.
Transform data:
```{r}
head(cps)
# Creating new variables needed for the regressions
cps %<>% mutate(log.ahe = log(ahe),
log.age = log(age),
age2 = age^2)
```
This question asks us to run several regressions so I think it's convenient to just run them all at the beginning then refer to them as needed:
```{r}
# You can create five separate model objects as usual:
mod.a <- lm_robust(ahe ~ age + female + bachelor, cps, se_type = 'HC1')
mod.b <- lm_robust(log.ahe ~ age + female + bachelor, cps, se_type = 'HC1')
mod.c <- lm_robust(log.ahe ~ log.age + female + bachelor, cps, se_type = 'HC1')
mod.d <- lm_robust(log.ahe ~ age + age2 + female + bachelor, cps, se_type = 'HC1')
mod.i <- lm_robust(log.ahe ~ age + age2 + female*bachelor, cps, se_type = 'HC1')
# And then running five separate summary() commands
# Here's a convenient way to do this in one command and one object using the fixest package:
## Models a, b: different LHS, same RHS
mods.ab <- feols(c(ahe, log.ahe) ~ age + female + bachelor,
data = cps, se = 'HC1')
## Models c, d, i: same LHS, different RHS
mods.cdi <- feols(log.ahe ~ sw(log.age + female + bachelor,
age + age2 + female + bachelor,
age + age2 + female*bachelor,
female*bachelor + female*age + female*age2),
data = cps, se = 'HC1')
```
With the latter, we can call any of the models estimated using "mods\$" and selecting the relevant model or call mods[[number]] where number is the index of the model in order of estimation.
It also becomes convenient for making compact tables:
```{r, results = 'asis'}
etable(mods.ab, mods.cdi, markdown = T)
```
## Parts a-d:
All these subquestions also ask us to look at the effect of age increasing from 25 to 26 and from 33 to 34 for each of the different models so we also define those cases below. Since we will be interested in the age effect which does not interact with our control variables female and bachelor, we arbitrarily set sex to female and bachelor to 1.
```{r}
# Method 1
age.25 <- data.frame(age = 25) %>%
mutate(age2 = age^2, log.age = log(age),
female = 1, bachelor = 1)
age.26 <- data.frame(age = 26) %>%
mutate(age2 = age^2, log.age = log(age),
female = 1, bachelor = 1)
age.33 <- data.frame(age = 33) %>%
mutate(age2 = age^2, log.age = log(age),
female = 1, bachelor = 1)
age.34 <- data.frame(age = 34) %>%
mutate(age2 = age^2, log.age = log(age),
female = 1, bachelor = 1)
# Method 2
ages <- data.frame(age = c(25, 26, 33, 34)) %>%
mutate(age2 = age^2,
log.age = log(age),
female = 1,
bachelor = 1)
```
### Effect of age increases from 25 to 26 and from 33 to 34 on expected earnings:
```{r}
ages.preds <- data.frame(ages,
pred.a = predict(mods.ab[[1]], newdata = ages),
pred.b = predict(mods.ab[[2]], newdata = ages),
pred.c = predict(mods.cdi[[1]], newdata = ages),
pred.d = predict(mods.cdi[[2]], newdata = ages),
pred.i = predict(mods.cdi[[3]], newdata = ages))
ages.preds
```
Then we can get their predictions for the differences by subtracting row 2 by row 1 and row 4 by row 3:
```{r}
# Age effect from age 25 to age 26
(ages.preds[2,]-ages.preds[1,])[,6:10]
# Age effect from age 33 to 34
(ages.preds[4,]-ages.preds[3,])[,6:10]
```
## Part e: Do you prefer the regression in (c) to the regression in (b)? Explain.
## Part f: Do you prefer the regression in (d) to the regression in (b)? Explain.
## Part g: Do you prefer the regression in (d) to the regression in (c)? Explain.
Displaying them again:
```{r, results = 'asis'}
etable(mods.ab, mods.cdi, markdown = T)
```
The regressions differ in their choice of one of the regressors. They can be compared on the basis of the $R^2$. The regression in (3) has a (marginally) higher $R^2$, so it is preferred.
The regression in (4) adds the variable *Age2* to regression (2). The coefficient on *Age2* is not statistically significant (t = -1.72) and the estimated coefficient is very close to zero. This suggests that (2) is preferred to (4), the regressions are so similar that either may be used.
The regressions differ in their choice of the regressors (*ln(Age)* in (3) and *Age* and *Age2* in (4)). They can be compared on the basis of the $R^2$. The regression in (4) has a (marginally) higher $R^2$, so it is preferred.
## Part h: Plot the regression relation between *Age* and *ln(AHE)* from (b), (c), and (d) for males with a high school diploma. Describe the similarities and differences between the estimated regression functions. Would your answer change if you plotted the regression function for females with college degrees?
The regression functions are very similar, particularly for Age between 27 and 33 years. The quadratic regression shows somewhat more curvature than the log-log regression, but the difference is small. The regression functions for a female with a high school diploma will look just like these, but they will be shifted by the amount of the coefficient on the binary regressor Female. The regression functions for workers with a bachelor's degree will also look just like these, but they would be shifted by the amount of the coefficient on the binary variable Bachelor.
## Part i: Run a regression of *ln(AHE)* on *Age*, *Age2*, *Female*, *Bachelor*, and the interaction term *Female* \* *Bachelor.*
```{r, results = 'asis'}
etable(mods.cdi[[3]], markdown = T)
```
### What does the coefficient on the interaction term measure?
The coefficient on the interaction term $Female\cdot Bachelor$ shows the "extra effect" of *Bachelor* on *ln(AHE)* for women relative to that for men.
### Alexis is a 30-year-old female with a bachelor's degree. What does the regression predict for her value of *ln(AHE)*?
### Jane is a 30-year-old female with a high school diploma. What does the regression predict for her value of *ln(AHE)*?
### Bob is a 30-year-old male with a bachelor's degree. What does the regression predict for his value of ln(AHE)?
### Jim is a 30-year-old male with a high school diploma. What does the regression predict for his value of ln(AHE)?
```{r}
people <- data.frame(name = c('Alexis', 'Jane', 'Bob', 'Jim'),
age = c(30, 30, 30, 30),
female = c(1, 1, 0, 0),
bachelor = c(1, 0, 1, 0)) %>%
mutate(age2 = age^2)
preds <- data.frame(people,
predictions = predict(mods.cdi[[3]], newdata = people))
preds
```
### What is the predicted difference between Alexis's and Jane's earnings?
```{r}
preds$prediction[1]-preds$prediction[2]
```
Alexis' predicted earnings are 0.476 higher than Jane's
### What is the predicted difference between Bob's and Jim's earnings?
```{r}
preds$prediction[3]-preds$prediction[4]
```
Bob's predicted earnings are 0.452 higher than Jim's
## Part j: Is the effect of Age on earnings different for men than for women? Specify and estimate a regression that you can use to answer this question.
Parts j, k, and l all ask for additional regressions so combining them into one command:
```{r}
mods.jkl <- feols(log.ahe ~ sw(female*bachelor + female*age + female*age2,
female*bachelor + bachelor*age + bachelor*age2,
female*bachelor + female*age + female*age2 + bachelor*age + bachelor*age2),
cps, se = 'HC1')
```
For model j, we include two additional regressors: the interactions of female and the two age variables, age and age2.
```{r, results = 'asis'}
etable(mods.jkl[[1]], markdown = T)
```
```{r}
# Re-estimating model j asn lm_robust object so we can use linearHypothesis
mod.j <- lm_robust(log.ahe ~ female*bachelor + female*age + female*age2, cps, se_type = 'HC1')
# Testing joint significance of the age-female interactions
linearHypothesis(mod.j, c('female:age = 0', 'female:age2 = 0'), test = 'F')
```
The F-statistic testing the null hypothesis that the coefficients on these interaction terms are both equal to zero is F = 2.64 with a p-value of 0.07. This implies that there isn't sufficient evidence to reject the null hypothesis that there is a different effect of *Age* on *ln(AHE)* for men compared to women at the 1% confidence level.
## Part k: Is the effect of Age on earnings different for high school graduates than for college graduates? Specify and estimate a regression that you can use to answer this question.
Same as above but age interactions on bachelor instead:
```{r, results = 'asis'}
etable(mods.jkl[[2]], markdown = T)
# Re-estimating model k asn lm_robust object so we can use linearHypothesis
mod.k <- lm_robust(log.ahe ~ female*bachelor + bachelor*age + bachelor*age2, cps, se_type = 'HC1')
```
Testing the null hypothesis of no difference in the age association with log earnings between college-educated and non-college-educated people:
```{r}
linearHypothesis(mod.k, c('bachelor:age = 0', 'bachelor:age2 = 0'), test = 'F')
```
The associated p-value is 0.3559 so we cannot reject the null hypothesis at any reasonable confidence level.
## Part l: After running all these regressions (and any others that you want to run), summarize the effect of age on earnings for young workers.
We'll run an additional regression with both sets of age interaction terms:
```{r, results = 'asis'}
etable(mods.jkl[[3]], markdown = T)
# mods.l <- lm_robust(log.ahe ~ female*bachelor + female*age + female*age2 + bachelor*age + bachelor*age2, cps, se_type = 'HC1')
# summary(mods.l)
```
This is a weirdly demanding question so no need to know the following commands. Including here just for completion.
Let's create a table of predicted values using this model for earnings from ages 25-35 for all possible combinations of female and bachelor
```{r}
# Create a grid of combinations for the binary variables
combos <- expand.grid(female = c(0, 1), bachelor = c(0, 1))
# Create a data frame to store the predictions of the model for each combination from all ages from 25-35
preds <- expand.grid(age = 20:50,
female = c(0,1),
bachelor = c(0,1)) %>%
mutate(age2 = age^2)
# Generate predictions
preds$ahe.hat <- predict(mods.jkl[[3]], newdata = preds)
# Plot these for different subgroups
ggplot(preds, aes(x = age, y = ahe.hat, color = female==1, lty = bachelor==1)) +
theme_bw() +
geom_smooth(method = 'lm', formula = y ~ poly(x, 2, raw = TRUE), se = FALSE) +
labs(color = 'Female', lty = 'College-educated')
```