---
title: "Recitation 5: Panel Data and Binary Dependent Variables"
author: "Matthew Alampay Davis"
date: June 27, 2023
output:
pdf_document:
keep_tex: true
header-includes:
- \usepackage{booktabs}
- \usepackage{tikz}
- \usetikzlibrary{fit}
- \usepackage{colortbl}
---
```{r, include = F}
# Load all packages used in this Notebook
library(tidyverse) # Data cleaning
library(magrittr) # Lets us use the convenient %<>% pipe
library(haven) # read_dta
library(fixest) # panel data
```
```{r}
# Load all datasets used in this Notebook
guns <- read_dta('data/handguns.dta') # Practice Question 1
smoking <- read_dta('data/Smoking.dta') # Practice Question 2
```
# Practice Question 1: Stock-Watson Empirical Exercise 10.1
Some U.S. states have enacted laws that allow citizens to carry concealed weapons. These laws are known as "shall-issue" laws because they instruct local authorities to issue a concealed weapons permit to all applicants who are citizens, are mentally competent, and have not been convicted of a felony. (Some states have some additional restrictions.)
Proponents argue that if more people carry concealed weapons, crime will decline because criminals will be deterred from attacking other people. Opponents argue that crime will increase because of accidental or spontaneous use of the weapons. In this exercise, you will analyze the effect of concealed weapons laws on violent crimes.
## a) Estimate (1) a regression of ln(vio) against shall and (2) a regression of ln(vio) against shall, incarc_rate, density, avginc, pop, pb1064, pw1064, and pm1029.
We are using the feols function ("fixed effects estimation by OLS") in the fixest package. Note also that this is panel data where the unit is defined by the state variable and the time is defined by the year variable. In this context, we prefer to use standard errors clustered at the state level:
```{r}
guns %<>% mutate(log.vio = log(vio))
mod.1a1 <- feols(log.vio ~ shall, data = guns)
mod.1a2 <- feols(log.vio ~ shall, data = guns)
```
```{r, results = 'asis'}
etable(mod.1a1, mod.1a2, cluster = 'state', digits = 6, markdown = T)
```
### i. Interpret the coefficient on shall in regression (2). Is this estimate large or small in a real-world sense?
The coefficient is -0.368, which suggests that shall-issue laws is associated with a violent crime rate 36% lower. The p-value associated with this estimate is less than 0.001, indicating it is a statistically significant effect. The magnitude of this effect is also clearly large in a real-world sense.
### ii. Does adding the control variables in regression (2) change the estimated effect of a shall-issue law in regression (1) as measured by statistical significance? As measured by the real-world significance of the estimated coefficient?
The coefficient in (1) is -0.443; in (2) it is -0.369. Both are highly statistically significant. Adding the control variables results in a small drop in the coefficient.
### iii. Suggest a variable that varies across states but plausibly varies little or not at all over time and that could cause omitted variable bias in regression (2).
There are several examples. Here are two: Attitudes towards guns and crime, and quality of police and other crime-prevention programs.
## b) Do the results change when you add fixed state effects? If so, which set of regression results is more credible, and why?
```{r, results = 'asis'}
mod.1b <- feols(log.vio ~ shall + incarc_rate + density + avginc + pop + pb1064 + pw1064 + pm1029 | state, data = guns)
etable(mod.1a2, mod.1b, cluster = 'state', markdown = T)
```
Note that we here include state fixed effects just by referring to the "state" variable to the regression formula with a vertical bar "\|" separating it from the regular part of the formula.
In this new regression with state fixed effects, the coefficient on shall falls to -0.046, a large reduction. Evidently there was important omitted variable bias in (2). Further, the estimate is not statistically significantly different from zero.
## c) Do the results change when you add fixed time effects? If so, which set of regression results is more credible, and why?
```{r, results = 'asis'}
mod.1c <- feols(log.vio ~ shall + incarc_rate + density + avginc + pop + pb1064 + pw1064 + pm1029 | state + year, data = guns)
etable(mod.1a2, mod.1b, mod.1c, cluster = 'state', markdown = T)
```
The coefficient falls further to -0.028. The coefficient is insignificantly different from zero.
Testing the joint significance of the time fixed effects. There isn't an in-built function to do this (since it's an unusual test) but here's how you could do it:
```{r}
# Estimate the model but with time dummies explicitly included
mod.1c.dummies <- feols(log.vio ~ shall + incarc_rate + density + avginc + pop + pb1064 + pw1064 + pm1029 + factor(year) | state, data = guns, cluster = 'state')
# Test the linear hypothesis that all the dummies (variables beginning with factor) have a coefficient of 0
wald(mod.1c.dummies, cluster = 'state', keep = 'year', )
```
The year fixed effects are jointly statistically significant with an F-statistic of 21.6, so this regression seems better specified than (3).
## d) Repeat the analysis using ln(rob) and ln(mur) in place of ln(vio).
Here's a very compact way of displaying all 12 regressions that questions 1a-1d is asking for:
```{r, results = 'asis'}
guns %<>% mutate(log.rob = log(rob),
log.mur = log(mur))
mods.1d <- feols(c(log.vio, log.rob, log.mur) ~ # vector of all outcome variables of interest
# Cumulative stepwise function (csw) to run models, cumulatively adding more regressors
csw(shall,
incarc_rate + density + avginc + pop + pb1064 + pw1064 + pm1029,
# Include fixed effects here too as factor variables
factor(state),
factor(year)),
data = guns,
# Cluster SEs for all these regressions at the state level
cluster = ~ state)
etable(mods.1d,
# display 95% confidence interval rather than standard errors
coefstat = 'confint', ci = 0.95,
# cluster all regressions' standard errors at the state level
cluster = 'state',
# treat all factor variables (state and/or year) as fixed effects,
keepFactors = F,
# include more statistics
fitstat = ~ r2 + ar2 + rmse + wald + wf, markdown = T)
```
The quantitative results are similar to the results using violent crimes: there is a large estimated effect of concealed weapons laws in specifications (1) and (2). This effect is spurious and is due to omitted variable bias as specification (3) and (4) show.
## e) In your view, what are the most important remaining threats to the internal validity of this regression analysis?
There is potential two-way causality between this year's incarceration rate and the number of crimes. Because this year's incarceration rate is much like last year's rate, there is a potential two-way causality problem. There are similar two-way causality issues relating crime and shall.
## f) Based on your analysis, what conclusions would you draw about the effects of concealed weapons laws on these crime rates?
The most credible results are given by the two-way fixed effects model. The 95% confidence interval for shall contains the value of 0. Thus, there is no statistically significant evidence that concealed weapons laws have any effect on crime rates.
# Practice Question 2: Stock-Watson Empirical Exercise 11.2
Believe it or not, workers used to be able to smoke inside office buildings. Smoking bans were introduced in several areas during the 1990s. Supporters of these bans argued that in addition to eliminating the externality of secondhand smoke, they would encourage smokers to quit by reducing their opportunities to smoke.
In this assignment, you will estimate the effect of workplace smoking bans on smoking, using data on a sample of 10,000 U.S. indoor workers from 1991 to 1993. The dataset contains information on whether individuals were or were not subject to a workplace smoking ban, whether the individuals smoked, and other individual characteristics.
## a) Estimate the probability of smoking for (i) all workers, (ii) workers affected by workplace smoking bans, and (iii) workers not affected by workplace smoking bans.
Using a linear probability model:
```{r, results = 'asis'}
mods.2a <- feols(smoker ~ csw0(smkban),
data = smoking,
se = 'HC1')
etable(mods.2a, fitstat = ~ r2 + ar2 + pr2 + f, markdown = T)
```
There is a 24% chance of smoking for all workers, 29.0% for workers affected by the ban and 0.290-0.08 = 21.2% for those affected by the ban
## b) What is the difference in the probability of smoking between workers affected by a workplace smoking ban and workers not affected by a workplace smoking ban? Use a linear probability model to determine whether this difference is statistically significant.
We answered the first part above: 7.8 percentage points is the difference. This difference is significant as the associated p-value is less than 0.001.
## c) Estimate a linear probability model with smoker as the dependent variable and the following regressors: smkban, female, age, age2, hsdrop, hsgrad, colsome, colgrad, black, and hispanic. Compare the estimated effect of a smoking ban from this regression with your answer from (b). Suggest an explanation, based on the substance of this regression, for the change in the estimated effect of a smoking ban between (b) and (c).
## d) Test the hypothesis that the coefficient on smkban is 0 in the population version of the regression in (c) against the alternative that it is nonzero, at the 5% significance level.
## e) Test the hypothesis that the probability of smoking does not depend on the level of education in the regression in (c). Does the probability of smoking increase or decrease with the level of education?
## f) Repeat c-e using a probit model
## g) Repeat c-e using a logit model
For the linear probability model:
```{r, results = 'asis'}
smoking %<>% mutate(age2 = age^2)
mods2.lpm <- feols(smoker ~ csw0(smkban,
female + age + age2 + hsdrop + hsgrad + colsome + colgrad + black + hispanic),
data = smoking,
se = 'HC1')
etable(mods2.lpm, fitstat = ~ r2 + ar2 + pr2 + f, digits = 6, markdown = T)
```
From model (3) the estimated difference is -0.047, smaller than the effect in model (2). Evidently (2) suffers from omitted variable bias. That is, smkban may be correlated with the education/race/gender indicators or with age. For example, workers with a college degree are more likely to work in an office with a smoking ban than high-school dropouts, and college graduates are less likely to smoke than high-school dropouts.
The p-value on the coefficient of smkban is less than 0.001 so the coefficient is statistically significant at the 1% level.
```{r}
wald(mods2.lpm[[3]], keep = c('hsdrop', 'hsgrad', 'colsome', 'colgrad'))
```
The p-value of the joint hypothesis test is less than 2.2e-16 so the coefficients are significant. The omitted education status is "Masters degree or higher." Thus the coefficients show the increase in probability relative to someone with a postgraduate degree. For example, the coefficient on Colgrad is 0.045, so the probability of smoking for a college graduate is 0.045 (4.5%) higher than for someone with a postgraduate degree. Similarly, the coefficient on HSdrop is 0.323, so the probability of smoking for a college graduate is 0.323 (32.3%) higher than for someone with a postgraduate degree. Because the coefficients are all positive and get smaller as educational attainment increases, the probability of smoking falls as educational attainment increases.
The coefficient on Age2 is statistically significant. This suggests a nonlinear relationship between age and the probability of smoking. In fact it is a negative quadratic with a probability-maximizing age of
```{r}
-0.009674/(2*-0.000132)
```
For the probit and logit models:
```{r, results = 'asis'}
mods2.probit<- feglm(smoker ~ csw0(smkban,
female + age + age2 + hsdrop + hsgrad + colsome + colgrad + black + hispanic),
family = binomial(link = 'probit'),
data = smoking,
se = 'HC1')
etable(mods2.probit, fitstat = ~ r2 + ar2 + pr2 + f, digits = 6, markdown = T)
```
```{r, results = 'asis'}
mods2.logit<- feglm(smoker ~ csw0(smkban,
female + age + age2 + hsdrop + hsgrad + colsome + colgrad + black + hispanic),
family = binomial(link = 'logit'),
data = smoking,
se = 'HC1')
etable(mods2.logit, fitstat = ~ r2 + ar2 + pr2 + f, digits = 6, markdown = T)
```
## h) Predictions
i. Mr. A is white, non-Hispanic, 20 years old, and a high school dropout. Using the probit regression and assuming that Mr. A is not subject to a workplace smoking ban, calculate the probability that Mr. A smokes. Carry out the calculation again, assuming that he is subject to a workplace smoking ban. What is the effect of the smoking ban on the probability of smoking?
ii. Repeat for Ms. B, a female, black, 40-year-old college graduate.
iii. Repeat (i)--(ii) using the linear probability model.
iv. Repeat (i)--(ii) using the logit model.
```{r}
preds <- data.frame(names = rep(c('Mr. A', 'Ms. B'), each = 2),
smkban = c(0, 1, 0, 1),
female = c(0, 0, 1, 1),
age = c(20, 20, 40, 40),
hsdrop = c(1, 1, 0, 0),
hsgrad = c(0, 0, 0, 0),
colsome = c(0, 0, 0, 0),
colgrad = c(0, 0, 1, 1),
black = c(0, 0, 1, 1),
hispanic = c(0, 0, 0, 0)) %>%
mutate(age2 = age^2)
preds$probit.predictions <- predict(mods2.probit[[3]], newdata = preds)
preds$lpm.predictions <- predict(mods2.lpm[[3]], newdata = preds)
preds$logit.predictions <- predict(mods2.logit[[3]], newdata = preds)
preds.show <- select(preds, names, smkban, contains('predictions'))
preds.show
```
So differences for Mr. A and Ms. B respectively are
```{r}
# Mr. A
preds$probit.predictions[2]-preds$probit.predictions[1]
preds$lpm.predictions[2]-preds$lpm.predictions[1]
preds$logit.predictions[2]-preds$logit.predictions[1]
# Ms. B
preds$probit.predictions[4]-preds$probit.predictions[3]
preds$lpm.predictions[4]-preds$lpm.predictions[3]
preds$logit.predictions[4]-preds$logit.predictions[3]
```
v. Based on your answers to (i)--(iv), do the logit, probit, and linear probability models differ? If they do, which results make most sense? Are the estimated effects large in a real-world sense?
They differ a bit but are pretty consistent with one another. The estimated effects are on the order of 6 percentage points for Mr. A and 3 percentage points for Ms. B, which is large.