9 Hypothesis Tests and Uncertainty in Regression

9.1 Overview

In the lecture this week we continued our discussion of statistical inference, and particularly focussed on hypothesis tests and uncertainty in regression estimates. We learned about the different steps of conducting a hypothesis test, and about how to interpret both t-statistics and p-values. We saw the close connection between hypothesis tests and confidence intervals, and drew attention to the fact that observing a “statistically significant” result may not tell us anything about the substantive significance of that result. We also discussed uncertainty in regression models, and saw that our estimated regression coefficients are a quantity of interest that will vary from sample to sample, just as with the difference in means. Accordingly, we saw that we can also construct and interpret standard errors, t-statistics, p-values, and confidence intervals for our regression estimates.

In seminar this week, we will:

  1. Practice conducting hypothesis tests for the difference in means.
  2. Practice conducting hypothesis tests for regression coefficients.
  3. Construct confidence intervals for regression coefficients.
  4. Revisit fixed-effect models for panel data.

Before coming to the seminar

  1. Please read chapter 6, “Probability”, and chapter 7, “Uncertainty”, in Quantitative Social Science: An Introduction.

9.2 Seminar

In this seminar, we will examine survey data to investigate the size of the wage penalty that mothers face in the USA. You can download the data from the link above.

The data file is motherhood_revisited.csv, which is a CSV file. Store this file in your data folder as you have done in previous weeks. Then load the data into R:

motherhood <- read.csv("data/motherhood_revisited.csv")

The names and descriptions of variables are:

Name Description
PUBID ID of woman
year Year of observation
wage Hourly wage, in dollars
numChildren Number of children that the woman has (in this wave)
age Age in years
region Name of region (North East = 1, North Central = 2, South = 3, West = 4)
urban Geographical classification (urban = 1, otherwise = 0)
marstat Marital status
educ Level of education
school School enrollment (enrolled = TRUE, otherwise = FALSE)
experience Experience since 14 years old, in days
tenure Current job tenure, in years
tenure2 Current job tenure in years, squared
fullTime Employment status (employed full-time = TRUE, otherwise = FALSE)
firmSize Size of the firm
multipleLocations Multiple locations indicator (firm with multiple locations = 1, otherwise = 0)
unionized Job unionization status (job is unionized = 1, otherwise = 0)
industry Job’s industry type
hazardous Hazard measure for the job (between 1 and 2)
regularity Regularity measure for the job (between 1 and 5)
competitiveness Competitiveness measure for the job (between 1 and 5)
autonomy Autonomy measure for the job (between 1 and 5)
teamwork Teamwork requirements measure for the job (between 1 and 5)

Question 1

What years are included in the data? How many women are included, and how many person-years are included?

Reveal answer

# Number of years
length(unique(motherhood$year))

# Number of women
length(unique(motherhood$PUBID))

# Number of observations
nrow(motherhood)
## [1] 16
## [1] 1569
## [1] 18214

There are 16 unique years in this dataset. There are 1569 women in the data and 18214 person-year observations.
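
The question also asks which years the data cover. To list them in order (output not shown here), you could run, for example:

# List the distinct years included in the data
sort(unique(motherhood$year))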

Question 2

Create a new variable – isMother – that takes a value of 1 if the woman has at least one child and a value of 0 otherwise.

motherhood$isMother <- ifelse(motherhood$numChildren > 0, 1, 0) 

# or

motherhood$isMother <- as.numeric(motherhood$numChildren > 0) 
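
As a quick sanity check on the new variable (a sketch; output omitted), you can tabulate it and confirm that it contains only 0s and 1s:

# Check the distribution of the new indicator, including any missing values
table(motherhood$isMother, useNA = "ifany")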

a. Calculate the difference in mean wages between women with children and women without children.

Reveal answer

wage_mothers <- mean(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
wage_not_mothers <- mean(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)
mother_not_mother_diff <- wage_mothers - wage_not_mothers
mother_not_mother_diff
## [1] 1.247316

In this sample, mothers earn on average 1.25 dollars more per hour than non-mothers.
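
An equivalent way to get both group means in one line, if you prefer (a sketch using base R's tapply(); it should reproduce the same two means as above):

# Mean wage by motherhood status (0 = not a mother, 1 = mother)
tapply(motherhood$wage, motherhood$isMother, mean, na.rm = TRUE)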

b. Calculate the standard error for the difference in means.

Reveal answer

The formula for the standard error of the difference in means is \(SE(\hat{Y}_{X=1} - \hat{Y}_{X=0}) = \sqrt{\frac{Var(Y_{X=1})}{n_{X=1}} + \frac{Var(Y_{X=0})}{n_{X=0}}}\)

## Standard error
treat_var <- var(motherhood$wage[motherhood$isMother == 1], na.rm = TRUE)
control_var <- var(motherhood$wage[motherhood$isMother == 0], na.rm = TRUE)

treat_n <- sum(motherhood$isMother == 1, na.rm = TRUE)
control_n <- sum(motherhood$isMother == 0, na.rm = TRUE)

st_err <- sqrt(treat_var/treat_n + control_var/control_n)
st_err
## [1] 0.1007549

c. Calculate the t-statistic for the difference in means.

Reveal answer

# T-statistic
t_stat <- mother_not_mother_diff/st_err
t_stat
## [1] 12.37971

d. At the 95% confidence level, can we reject the null hypothesis that there is no difference in the wage levels of mothers and non-mothers in the population?

Reveal answer

Yes, the t-statistic is much greater than 1.96, implying that we can reject the null hypothesis of no difference. The intuition here is that it is extremely unlikely that we would observe a difference in means this large in our sample if it were true that there were no difference between mothers and non-mothers in the population.
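
If you want to check the critical value or compute an approximate p-value by hand, here is a minimal sketch using the normal approximation (the t.test() function in part e instead uses the t distribution with Welch-adjusted degrees of freedom, so its p-value will differ very slightly):

# Critical value for a two-sided test at the 95% confidence level
qnorm(0.975)

# Approximate two-sided p-value for the t-statistic calculated in part c
2 * (1 - pnorm(abs(t_stat)))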

e. Use the t.test() function to conduct the same hypothesis test that you just conducted manually. What is the p-value? Does the 95% confidence interval include the value of 0?

Reveal answer

t.test(x = motherhood$wage[motherhood$isMother==1],
       y = motherhood$wage[motherhood$isMother==0],
       conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  motherhood$wage[motherhood$isMother == 1] and motherhood$wage[motherhood$isMother == 0]
## t = 12.38, df = 13709, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.049823 1.444810
## sample estimates:
## mean of x mean of y 
##  11.45436  10.20704

The p-value here is extremely small (less than 2.2e-16, i.e. less than 0.00000000000000022), which is consistent with the large t-statistic we calculated above. The 95% confidence interval also, of course, does not include zero. For a given confidence level, the confidence interval and the hypothesis test will always lead to the same conclusion.
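
If you want to work with these quantities programmatically rather than reading them off the printout, the object returned by t.test() stores them as named components; for example (the object name welch_test is just a choice):

# Store the test result and extract the p-value and confidence interval
welch_test <- t.test(x = motherhood$wage[motherhood$isMother == 1],
                     y = motherhood$wage[motherhood$isMother == 0],
                     conf.level = 0.95)

welch_test$p.value   # p-value of the test
welch_test$conf.int  # 95% confidence interval for the difference in means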

Question 3

a. Run a regression with wage as the outcome variable and numChildren as the explanatory variable. What is the estimated coefficient on the variable numChildren? Provide a brief substantive interpretation of the coefficient.

Reveal answer

simple_ols_model <- lm(wage ~ numChildren, data = motherhood)
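
To see the estimated coefficients before looking at the full summary() output in part b, you can print them directly (output omitted; the values match the Estimate column shown below):

# Extract the estimated intercept and slope
coef(simple_ols_model)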

The coefficient on the variable numChildren implies that each additional child that a woman has is associated with an increase of 43 cents in a woman’s hourly wage.

b. What is the standard error of the coefficient for numChildren?

Reveal answer

We can find the values of the standard error associated with each regression coefficient by using the summary() function:

summary(simple_ols_model)
## 
## Call:
## lm(formula = wage ~ numChildren, data = motherhood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.531  -4.138  -1.962   2.112  49.612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.38796    0.05755 180.509   <2e-16 ***
## numChildren  0.43424    0.05052   8.596   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.583 on 18197 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.004044,   Adjusted R-squared:  0.003989 
## F-statistic: 73.89 on 1 and 18197 DF,  p-value: < 2.2e-16

The standard error for the numChildren coefficient is 0.051.

c. Using the estimated coefficient and standard error for the numChildren variable, conduct a hypothesis test where the null hypothesis is that this coefficient is equal to zero in the population. Can you reject the null hypothesis at the 95% confidence level? Can you reject the null hypothesis at the 99% confidence level?

Reveal answer

The formula for the test statistic for testing a null hypothesis that a regression coefficient is equal to zero is:

\[ t= \frac{\hat{\beta}-\beta_{H_0}}{\hat{\sigma}_{\hat{\beta}}} = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}}\]

So, to calculate \(t\), we simply divide the estimated coefficient by the standard error:

t_stat <- 0.43424/0.05052

The test statistic for the numChildren variable is 8.595, which is far larger than the critical values for either the 95% (1.96) or 99% (2.58) confidence levels. Accordingly, we can easily reject the null hypothesis that the association between the number of children and a mother’s wage in the population is equal to zero.
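
As a sketch, you could also compute the corresponding p-value and a 95% confidence interval by hand, using the residual degrees of freedom (18197) reported in the summary() output; confint() gives the interval directly:

# Two-sided p-value for the t-statistic
2 * pt(-abs(t_stat), df = 18197)

# Approximate 95% confidence interval: estimate +/- 1.96 * standard error
0.43424 - 1.96 * 0.05052
0.43424 + 1.96 * 0.05052

# Or let R construct the confidence interval for us
confint(simple_ols_model, "numChildren", level = 0.95)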

d. What is the meaning of rejecting the null hypothesis in this exercise? Does this provide evidence of a causal relationship between the number of children and the wage level of mothers?

Reveal answer

Whether or not we reject the null hypothesis of no effect is a different question to whether the coefficient represents a causal effect. Here, rejecting the null hypothesis means that the relationship between the number of children and a mother’s wage that we observe in our sample of data would be very unlikely to have arisen by chance if the true association between those two quantities were zero in the population.

However, we should not forget that the association we observe in our sample, however precisely estimated it may be, is still subject to confounding by omitted variables. There are many ways in which women who have more children differ from women with fewer children. For instance, women with more children may also be older on average, or they may have more experience, or different living situations. Each of these characteristics may also be associated with higher wage levels, and therefore even though we can reject the null hypothesis, we cannot conclude that our regression estimate gives us an unbiased estimate of the causal effect of children on their mothers’ wages.

Question 4

a. Create a box plot which depicts the distribution of wage for every year in the data. What do you observe?

Reveal answer

boxplot(wage ~ year,
        data = motherhood,
        xlab = "Year",
        ylab = "Wage")

There is a clear association between wage and year – women are on average paid more in more recent years in the sample than in earlier years.

b. Create a box plot which depicts the distribution of numChildren for every year in the data. What do you observe?

Reveal answer

boxplot(numChildren ~ year,
        data = motherhood,
        xlab = "Year",
        ylab = "Number of children")

There is a clear association between the number of children a woman has and the sample year – women on average have more children in more recent years of the sample than in earlier years.
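
If you would like a numeric summary to back up the box plots, you can compute the yearly averages directly (a sketch; output not shown):

# Average wage and average number of children in each year
tapply(motherhood$wage, motherhood$year, mean, na.rm = TRUE)
tapply(motherhood$numChildren, motherhood$year, mean, na.rm = TRUE)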

Question 5

The analysis above reveals that there is substantial variation over time in women’s average wages in our sample, and that there is also a strong relationship between time and the number of children a woman has. It is therefore probable that “time” is an important omitted variable in this analysis, and something that we might want to control for.

In addition, we saw last week that when we are working with panel data, a powerful strategy for overcoming omitted variable bias is to use a fixed-effect model, where we include a different intercept term for each of the units in our data. In this example, we have a panel in which each woman represents a unit, and we have repeated observations of the same women over time. There may be many factors that vary across women but are stable within each woman over time, and that are related to both her wage level and the number of children she has, so a fixed-effect model may again be helpful for ruling out omitted variable bias here.

Given this discussion, it seems natural that we might want to include two sets of fixed-effects here: one set for units (women), and the other for time (year). This reflects a general form of model for working with panel data called the two-way fixed effects model, in which there is a fixed effect for each unit and a fixed effect for each time period.

Run a two-way fixed-effect regression where the outcome is the wage and the predictor is the number of children that a woman has. Include fixed effects for each woman and each year. To do this, include the relevant variables within the factor() function as a part of the model formula, as below:

two_way_fe_model <- lm(wage ~ numChildren + factor(PUBID) + factor(year), data = motherhood)

Note that this regression may take a minute or two to run!

Why do we use factor() here? Because both PUBID and year are stored as numeric variables in the motherhood data, R will treat these as regular explanatory variables by default. However, we want R to estimate a separate intercept term for each unique value of these variables, and that is what factor() tells R to do.
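
You can check this for yourself before running the model; for example, the following (output omitted) confirms that the variables are stored as numbers, and shows how many separate intercepts factor() will create:

# year and PUBID are stored as numbers...
class(motherhood$year)
class(motherhood$PUBID)

# ...so factor() is needed to create one dummy per woman and one per year;
# these counts should match the answers to question 1 (1569 women, 16 years)
nlevels(factor(motherhood$PUBID))
nlevels(factor(motherhood$year))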

Create a table of your fixed-effect model using screenreg() from the texreg package. To avoid printing out the coefficients for all of the fixed effects, set omit.coef = "year|PUBID", or use plm() from the plm package to fit the fixed-effects model instead. Interpret the coefficient associated with numChildren in both statistical and substantive terms.

Reveal answer

library(texreg)

screenreg(list(two_way_fe_model),
          omit.coef = "year|PUBID")
## 
## =========================
##              Model 1     
## -------------------------
## (Intercept)     -0.00    
##                 (1.68)   
## numChildren     -1.04 ***
##                 (0.06)   
## -------------------------
## R^2              0.60    
## Adj. R^2         0.56    
## Num. obs.    18199       
## =========================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The coefficient on the variable numChildren implies that each additional child that a woman has is associated with a decrease of 1.041 dollars in her hourly wage. The standard error for the numChildren coefficient is 0.065, which implies a t-statistic of -16.019, and therefore that we can reject the null hypothesis of no effect at all conventional confidence levels.

It is important to note that in this model, where we control for baseline differences between women using the unit fixed-effects and differences in wages over time using the time fixed-effects, the numChildren coefficient is now negative. That is, once we account for these forms of omitted variable bias using the fixed-effect model, we find that there is a negative and significant effect of children on women’s wages. This is the opposite of the conclusion that we would have drawn from the naive analysis in question 2.
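
As the question notes, the plm package offers a more convenient way to fit the same two-way fixed-effects model without estimating hundreds of dummy coefficients. A minimal sketch, assuming plm is installed (the within estimate for numChildren should match the lm() estimate above, up to differences in how missing observations are handled):

library(plm)

# Two-way ("twoways") within estimator, with women and years as the panel index
two_way_plm <- plm(wage ~ numChildren,
                   data = motherhood,
                   index = c("PUBID", "year"),
                   model = "within",
                   effect = "twoways")

summary(two_way_plm)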

Question 6

Estimate a new regression model, which still includes fixed effects for woman and year, but which also includes the following variables:

  • Location (region, urban)
  • Marital Status (marstat)
  • Human Capital (educ, school, experience, tenure, tenure2)
  • Job Characteristics (fullTime, firmSize, multipleLocations, unionized)

Report the coefficient and standard error associated with the numChildren variable in this model. Is the coefficient still statistically significant? Provide a brief substantive interpretation of this coefficient and the coefficients for any two other variables.

Reveal answer

two_way_fe_model_2 <- lm(wage ~ numChildren + factor(region) + urban + marstat + educ + school + 
               experience + tenure + tenure2 + fullTime + firmSize + multipleLocations + 
               unionized  + factor(year) + factor(PUBID), data = motherhood)
library(texreg)

screenreg(list(two_way_fe_model, two_way_fe_model_2),
          omit.coef = "year|PUBID")
## 
## ====================================================
##                           Model 1       Model 2     
## ----------------------------------------------------
## (Intercept)                  -0.00          3.46    
##                              (1.68)        (2.05)   
## numChildren                  -1.04 ***     -0.30 ** 
##                              (0.06)        (0.09)   
## factor(region)2                            -2.22 ***
##                                            (0.46)   
## factor(region)3                            -1.44 ***
##                                            (0.37)   
## factor(region)4                            -0.07    
##                                            (0.44)   
## urban                                       0.20    
##                                            (0.15)   
## marstatMarried                              0.75 ***
##                                            (0.16)   
## marstatNo romantic union                   -0.26    
##                                            (0.14)   
## educ2.High school                          -0.89 ***
##                                            (0.21)   
## educ3.Some college                          0.33    
##                                            (0.35)   
## educ4.College                               3.24 ***
##                                            (0.31)   
## schoolTRUE                                 -0.88 ***
##                                            (0.13)   
## experience                                  0.33 ***
##                                            (0.04)   
## tenure                                      0.31 ***
##                                            (0.06)   
## tenure2                                    -0.02 ***
##                                            (0.01)   
## fullTimeTRUE                                1.00 ***
##                                            (0.11)   
## firmSize2. 30-299                          -0.06    
##                                            (0.11)   
## firmSize3. 300+                             1.32 ***
##                                            (0.15)   
## multipleLocations                           0.37 ***
##                                            (0.11)   
## unionized                                   1.24 ***
##                                            (0.18)   
## ----------------------------------------------------
## R^2                           0.60          0.71    
## Adj. R^2                      0.56          0.66    
## Num. obs.                 18199         10688       
## ====================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

The coefficient for numChildren is -0.3 and the estimated standard error is 0.09. We can tell that this is statistically significant at the 95% confidence level by noting that the standard error is well under half the magnitude of the coefficient, that the t-statistic is well above 1.96 in absolute value, or that the p-value (0.001) is well below the standard 0.05 threshold (these three checks are equivalent). The coefficient suggests that each additional child that a woman has (holding constant all other characteristics included in the model) is associated with a decrease of 30 cents in her hourly wage.

This implies that even when accounting for these additional control variables, in addition to the time and unit fixed-effects, the effect of additional children on women’s wages appears to be negative.

The following is an example interpretation of marital status, a categorical variable. The baseline category is “Cohabiting”. The coefficient for “Married” is 0.75 and statistically significant, meaning that we expect married women to earn 75 cents more per hour than otherwise comparable cohabiting women. Women with “No romantic union”, by contrast, on average earn 26 cents less per hour than comparable cohabiting women in our sample. However, we can see from the small t-statistic (-1.9) and the relatively large p-value (0.057) that this coefficient is not significantly different from zero at the 95% confidence level. That is, the uncertainty around this estimate is too large for us to reject the null hypothesis that the true difference between cohabiting women and women with no romantic union is actually zero in the population.
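
If you prefer to read the exact t-statistics and p-values from R rather than computing them from the rounded table values, you can pull them out of the fitted model; a quick sketch for the numChildren coefficient (the same indexing works for any other coefficient name shown in the summary() output):

# Estimate, standard error, t value and p-value for numChildren in the full model
summary(two_way_fe_model_2)$coefficients["numChildren", ]

# 95% confidence interval for the same coefficient
confint(two_way_fe_model_2, "numChildren", level = 0.95)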

9.3 Homework

There is no homework this week.