# 9 Hypothesis Tests and Uncertainty in Regression

## 9.1 Overview

In the lecture this week we continued our discussion of statistical inference, and particularly focussed on hypothesis tests and uncertainty in regression estimates. We learned about the different steps of conducting a hypothesis test, and about how to interpret both t-statistics and p-values. We saw the close connection between hypothesis tests and confidence intervals, and drew attention to the fact that observing a “statistically significant” result may not tell us anything about the substantive significance of that result. We also discussed uncertainty in regression models, and saw that our estimated regression coefficients are a quantity of interest that will vary from sample to sample, just as with the difference in means. Accordingly, we saw that we can also construct and interpret standard errors, t-statistics, p-values, and confidence intervals for our regression estimates.

In seminar this week, we will:

1. Practice conducting hypothesis tests for the difference in means.
2. Practice conducting hypothesis tests for regression coefficients.
3. Construct confidence intervals for regression coefficients.
4. Revisit fixed-effect models for panel data.

Before coming to the seminar:

1. Please read chapter 6, “Probability”, and chapter 7, “Uncertainty”, in Quantitative Social Science: An Introduction.

## 9.2 Seminar

In this seminar, we will examine survey data, which you can download from the link above, in order to investigate the size of the wage penalty that mothers face in the USA.

The data file is motherhood_revisited.csv, which is a CSV file. Store this file in your data folder as you have done in previous weeks. Then load the data into R:

motherhood <- read.csv("data/motherhood_revisited.csv")

The names and descriptions of variables are:

| Name | Description |
|------|-------------|
| PUBID | ID of woman |
| year | Year of observation |
| wage | Hourly wage, in dollars |
| numChildren | Number of children that the woman has (in this wave) |
| age | Age in years |
| region | Name of region (North East = 1, North Central = 2, South = 3, West = 4) |
| urban | Geographical classification (urban = 1, otherwise = 0) |
| marstat | Marital status |
| educ | Level of education |
| school | School enrollment (enrolled = TRUE, otherwise = FALSE) |
| experience | Experience since 14 years old, in days |
| tenure | Current job tenure, in years |
| tenure2 | Current job tenure in years, squared |
| fullTime | Employment status (employed full-time = TRUE, otherwise = FALSE) |
| firmSize | Size of the firm |
| multipleLocations | Multiple locations indicator (firm with multiple locations = 1, otherwise = 0) |
| unionized | Job unionization status (job is unionized = 1, otherwise = 0) |
| industry | Job’s industry type |
| hazardous | Hazard measure for the job (between 1 and 2) |
| regularity | Regularity measure for the job (between 1 and 5) |
| competitiveness | Competitiveness measure for the job (between 1 and 5) |
| autonomy | Autonomy measure for the job (between 1 and 5) |
| teamwork | Teamwork requirements measure for the job (between 1 and 5) |

Question 1

What years are included in the data? How many women are included, and how many person-years are included?

# Number of years
length(unique(motherhood$year))
## [1] 16

# Number of women
length(unique(motherhood$PUBID))
## [1] 1569

# Number of observations
nrow(motherhood)
## [1] 18214

There are 16 unique years in this dataset. There are 1569 women in the data and 18214 person-year observations.

Question 2

Create a new variable – isMother – that takes a value of 1 if the woman has at least one child and a value of 0 otherwise.

motherhood$isMother <- ifelse(motherhood$numChildren > 0, 1, 0)

# or

motherhood$isMother <- as.numeric(motherhood$numChildren > 0) 

a. Calculate the difference in mean wages between women with children and women without children.

wage_mothers <- mean(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
wage_not_mothers <- mean(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)
mother_not_mother_diff <- wage_mothers - wage_not_mothers
mother_not_mother_diff
## [1] 1.247316

In this sample, mothers earn on average 1.25 dollars more per hour than non-mothers.

b. Calculate the standard error for the difference in means.

The formula for the standard error of the difference in means is $$SE(\hat{Y}_{X=1} - \hat{Y}_{X=0}) = \sqrt{\frac{Var(Y_{X=1})}{n_{X=1}} + \frac{Var(Y_{X=0})}{n_{X=0}}}$$

## Standard error
treat_var <- var(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
control_var <- var(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)

treat_n <- sum(motherhood$isMother == 1, na.rm = TRUE)
control_n <- sum(motherhood$isMother == 0, na.rm = TRUE)

st_err <- sqrt(treat_var/treat_n + control_var/control_n)
st_err
## [1] 0.1007549

c. Calculate the t-statistic for the difference in means.

# T-statistic
t_stat <- mother_not_mother_diff/st_err
t_stat
## [1] 12.37971

d. At the 95% confidence level, can we reject the null hypothesis that there is no difference in the wage levels of mothers and non-mothers in the population?

Yes, the t-statistic is much greater than 1.96, implying that we can reject the null hypothesis of no difference. The intuition here is that it is extremely unlikely that we would observe a difference in means this large in our sample if it were true that there were no difference between mothers and non-mothers in the population.
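To see just how unlikely, we can translate the t-statistic into a two-sided p-value. A minimal sketch using the large-sample normal approximation, plugging in the t-statistic we computed above:

```r
# Two-sided p-value: probability of a test statistic at least this
# extreme if the null hypothesis of no difference were true
t_stat <- 12.38  # from the calculation above
p_value <- 2 * pnorm(-abs(t_stat))
p_value  # vanishingly small, far below 0.05
```

As a sanity check, a t-statistic of exactly 1.96 would give a p-value of almost exactly 0.05, which is why 1.96 is the 95% critical value.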

e. Use the t.test() function to conduct the same hypothesis test that you just conducted manually. What is the p-value? Does the 95% confidence interval include the value of 0?

t.test(x = motherhood$wage[motherhood$isMother == 1],
       y = motherhood$wage[motherhood$isMother == 0],
       conf.level = 0.95)
##
##  Welch Two Sample t-test
##
## data:  motherhood$wage[motherhood$isMother == 1] and motherhood$wage[motherhood$isMother == 0]
## t = 12.38, df = 13709, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.049823 1.444810
## sample estimates:
## mean of x mean of y
##  11.45436  10.20704

The p-value here is smaller than 2.2e-16 (i.e. 0.00000000000000022, the smallest value R reports by default), which is consistent with the large t-statistic we calculated above. The confidence interval also, of course, does not include zero. For a given confidence level, confidence intervals and hypothesis tests will always lead to the same conclusion.
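The confidence interval reported by t.test() can be approximately reconstructed from the quantities we computed by hand in parts a and b. A sketch (t.test() uses a critical value from the Welch t distribution rather than the normal 1.96, so the bounds match only to a close approximation):

```r
diff_means <- 1.247316   # difference in means from part a
st_err     <- 0.1007549  # standard error from part b

# 95% confidence interval: estimate plus/minus 1.96 standard errors
lower <- diff_means - 1.96 * st_err
upper <- diff_means + 1.96 * st_err
c(lower, upper)  # very close to the interval t.test() reports
```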

Question 3

a. Run a regression with wage as the outcome variable and numChildren as the explanatory variable. What is the estimated coefficient on the variable numChildren? Provide a brief substantive interpretation of the coefficient.

simple_ols_model <- lm(wage ~ numChildren, data = motherhood)

The coefficient on the variable numChildren implies that each additional child that a woman has is associated with an increase of 43 cents in a woman’s hourly wage.

b. What is the standard error of the coefficient for numChildren?

We can find the values of the standard error associated with each regression coefficient by using the summary() function:

summary(simple_ols_model)
##
## Call:
## lm(formula = wage ~ numChildren, data = motherhood)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.531  -4.138  -1.962   2.112  49.612
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.38796    0.05755 180.509   <2e-16 ***
## numChildren  0.43424    0.05052   8.596   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.583 on 18197 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.004044,   Adjusted R-squared:  0.003989
## F-statistic: 73.89 on 1 and 18197 DF,  p-value: < 2.2e-16

The standard error for the numChildren coefficient is 0.051.

c. Using the estimated coefficient and standard error for the numChildren variable, conduct a hypothesis test where the null hypothesis is that this coefficient is equal to zero in the population. Can you reject the null hypothesis at the 95% confidence level? Can you reject the null hypothesis at the 99% confidence level?

The formula for the test statistic for testing a null hypothesis that a regression coefficient is equal to zero is:

$t= \frac{\hat{\beta}-\beta_{H_0}}{\hat{\sigma}_{\hat{\beta}}} = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}$

So, to calculate $t$, we simply divide the estimated coefficient by the standard error:

t_stat <- 0.43424/0.05052

The test statistic for the numChildren variable is 8.595, which is far larger than the critical values for either the 95% (1.96) or 99% (2.58) confidence levels. Accordingly, we can easily reject the null hypothesis that the association between the number of children and a mother’s wage in the population is equal to zero.
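The critical values quoted here can themselves be recovered in R. A sketch using the standard normal quantile function (with samples this large, the t distribution is effectively normal):

```r
# Two-sided critical values for common confidence levels
qnorm(0.975)  # 95% level: approximately 1.96
qnorm(0.995)  # 99% level: approximately 2.58

# p-value implied by our test statistic for numChildren
2 * pnorm(-abs(0.43424 / 0.05052))
```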

d. What is the meaning of rejecting the null hypothesis in this exercise? Does this provide evidence of a causal relationship between the number of children and the wage level of mothers?

Whether or not we reject the null hypothesis of no effect is a different question to whether the coefficient represents a causal effect. Here, rejecting the null hypothesis means that we are confident that the relationship between the number of children and the wage of the mother that we observe in our sample of data is very unlikely to have arisen by chance if the association between those two quantities is zero in the population.

However, we should not forget that the association we observe in our sample, however precisely estimated it may be, is still subject to confounding by omitted variables. There are many ways in which women who have more children differ from women with fewer children. For instance, women with more children may also be older on average, or they may have more experience, or different living situations. Each of these characteristics may also be associated with higher wage levels, and therefore even though we can reject the null hypothesis, we cannot conclude that our regression estimate gives us an unbiased estimate of the causal effect of children on their mothers’ wages.
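The logic of confounding can be made concrete with a small simulation on made-up data (every variable and number below is hypothetical, not taken from the motherhood data): here wages depend only on age, and children have no effect at all, yet because children and age are correlated, a naive regression of wage on children picks up a spurious positive coefficient.

```r
set.seed(123)
n <- 1000
age      <- rnorm(n)            # the confounder
children <- age + rnorm(n)      # correlated with age, no effect on wage
wage     <- 2 * age + rnorm(n)  # wage is driven by age only

# Naive regression: biased, clearly non-zero coefficient on children
coef(lm(wage ~ children))["children"]

# Controlling for the confounder removes the spurious association
coef(lm(wage ~ children + age))["children"]
```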

Question 4

a. Create a box plot which depicts the distribution of wage for every year in the data. What do you observe?

boxplot(wage ~ year,
        data = motherhood,
        xlab = "Year",
        ylab = "Wage")

There is a clear association between wage and year – women are on average paid more in more recent years in the sample than in earlier years.

b. Create a box plot which depicts the distribution of numChildren for every year in the data. What do you observe?

boxplot(numChildren ~ year,
        data = motherhood,
        xlab = "Year",
        ylab = "Number of children")

There is a clear association between the number of children a woman has and the sample year – women on average have more children in more recent years of the sample than in earlier years.

Question 5

The analysis above reveals that there is substantial over-time variation in women’s average wages in our sample, and that there is also a strong relationship between time and the number of children a woman has. It is therefore probable that “time” is an important omitted variable in this analysis, and something that we might want to control for.

In addition, we saw last week that when we are working with panel data, a powerful strategy for overcoming omitted variable bias is to use a fixed-effect model, where we include a different intercept term for each of the units in our data. In this example, we have a panel where each woman represents a unit, and we have repeated observations of the same women over time. There may be many factors that vary across women, but are stable within women over time, that are related to both wage level and the number of children a woman has, and so a fixed-effect model may again be helpful for ruling out omitted variable bias here.

Given this discussion, it seems natural that we might want to include two sets of fixed-effects here: one set for units (women), and the other for time (year). This reflects a general form of model for working with panel data called the two-way fixed effects model, in which there is a fixed effect for each unit and a fixed effect for each time period.

Run a two-way fixed-effect regression where the outcome is the wage and the predictor is the number of children that a woman has. Include fixed effects for each woman and each year. To do this, include the relevant variables within the factor() function as a part of the model formula, as below:

two_way_fe_model <- lm(wage ~ numChildren + factor(PUBID) + factor(year), data = motherhood)

Note that this regression may take a minute or two to run!

Why do we use factor() here? Because both PUBID and year are stored as numeric variables in the motherhood data, R will treat these as regular explanatory variables by default. However, we want R to estimate a separate intercept term for each unique value of these variables, and that is what factor() tells R to do.
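You can see the effect of factor() on the design matrix with a tiny hypothetical example: model.matrix() expands a factor into one dummy column per level (minus a baseline), each of which gets its own intercept shift, whereas a numeric variable gets a single column and a single slope.

```r
year <- c(2000, 2001, 2000, 2002)  # hypothetical toy data

# Treated as numeric: intercept plus one slope column
colnames(model.matrix(~ year))

# Treated as a factor: intercept plus a separate dummy per year
colnames(model.matrix(~ factor(year)))
```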

Create a table of your fixed-effect model using screenreg() from the texreg package. To avoid printing out the coefficients for all of the fixed effects, set omit.coef = "year|PUBID" or use plm() from library(plm) to fit the fixed effects model. Interpret the coefficient associated with numChildren in both statistical and substantive terms.

library(texreg)

screenreg(list(two_way_fe_model),
          omit.coef = "year|PUBID")
##
## =========================
##              Model 1
## -------------------------
## (Intercept)     -0.00
##                 (1.68)
## numChildren     -1.04 ***
##                 (0.06)
## -------------------------
## R^2              0.60
## Num. obs.    18199
## =========================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The coefficient on the variable numChildren implies that each additional child that a woman has is associated with a decrease in wages of 1.041 dollars. The standard error for the numChildren coefficient is 0.065, which implies a test-statistic value of -16.019, and therefore that we can reject the null hypothesis of no effect at all conventional confidence levels.

It is important to note that in this model, where we control for baseline differences between women using the unit fixed-effects and differences in wages over time using the time fixed-effects, the numChildren coefficient is now negative. That is, once we account for these forms of omitted variable bias using the fixed-effect model, we find a negative and significant effect of children on women’s wages. This is the opposite of the conclusion we would have drawn from the naive analyses in questions 2 and 3.
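A useful way to see why unit fixed effects remove stable between-woman differences is the “within” interpretation: regressing with unit dummies gives exactly the same slope as first demeaning each variable within units. A toy sketch with made-up panel data (this shows the one-way, unit-only case; with both unit and year effects the logic is analogous but the demeaning is two-way):

```r
set.seed(42)
id <- rep(1:3, each = 4)      # three hypothetical units, four periods each
x  <- rnorm(12)
y  <- 2 * x + id + rnorm(12)  # unit-specific intercepts baked into y

# Dummy-variable (fixed-effect) estimate of the slope
b_lsdv <- coef(lm(y ~ x + factor(id)))["x"]

# Within estimate: demean x and y inside each unit, then regress
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
b_within <- coef(lm(y_dm ~ x_dm))["x_dm"]

all.equal(unname(b_lsdv), unname(b_within))  # the two slopes coincide
```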

Question 6

Estimate a new regression model, which still includes fixed effects for woman and year, but which also includes the following variables:

• Location (region, urban)
• Marital Status (marstat)
• Human Capital (educ, school, experience, tenure, tenure2)
• Job Characteristics (fullTime, firmSize, multipleLocations, unionized)

Report the coefficient and standard error associated with the numChildren variable in this model. Is the coefficient still statistically significant? Provide a brief substantive interpretation of this coefficient and the coefficients for any two other variables.

two_way_fe_model_2 <- lm(wage ~ numChildren + factor(region) + urban + marstat + educ + school +
                           experience + tenure + tenure2 + fullTime + firmSize + multipleLocations +
                           unionized + factor(year) + factor(PUBID), data = motherhood)
library(texreg)

screenreg(list(two_way_fe_model, two_way_fe_model_2),
          omit.coef = "year|PUBID")
##
## ====================================================
##                           Model 1       Model 2
## ----------------------------------------------------
## (Intercept)                  -0.00          3.46
##                              (1.68)        (2.05)
## numChildren                  -1.04 ***     -0.30 **
##                              (0.06)        (0.09)
## factor(region)2                            -2.22 ***
##                                            (0.46)
## factor(region)3                            -1.44 ***
##                                            (0.37)
## factor(region)4                            -0.07
##                                            (0.44)
## urban                                       0.20
##                                            (0.15)
## marstatMarried                              0.75 ***
##                                            (0.16)
## marstatNo romantic union                   -0.26
##                                            (0.14)
## educ2.High school                          -0.89 ***
##                                            (0.21)
## educ3.Some college                          0.33
##                                            (0.35)
## educ4.College                               3.24 ***
##                                            (0.31)
## schoolTRUE                                 -0.88 ***
##                                            (0.13)
## experience                                  0.33 ***
##                                            (0.04)
## tenure                                      0.31 ***
##                                            (0.06)
## tenure2                                    -0.02 ***
##                                            (0.01)
## fullTimeTRUE                                1.00 ***
##                                            (0.11)
## firmSize2. 30-299                          -0.06
##                                            (0.11)
## firmSize3. 300+                             1.32 ***
##                                            (0.15)
## multipleLocations                           0.37 ***
##                                            (0.11)
## unionized                                   1.24 ***
##                                            (0.18)
## ----------------------------------------------------
## R^2                           0.60          0.71
## *** p < 0.001; ** p < 0.01; * p < 0.05