Yesterday’s post showed how to save the output from non-estimate results in Stata to check the balance for an experiment. Today, we’ll talk about how we can show this for slightly more complicated situations.

### Situation 1 – You Have Categorical Variables

This is almost the same as the quantitative data we were handling yesterday, but you can’t use a t-test or regression because both of these techniques are developed for continuous quantitative variables. Instead, let’s use Pearson’s chi-squared test of independence. The null hypothesis of this test is that our categorical variable is independent of the treatment; that is, knowing the treatment gives us no additional information about how likely the categorical variable is to take any particular value. See this nice little site for more information, if you’re curious.

To calculate the chi-squared stat in Stata, we just use `tab treat categorical_var, chi2`.

Then, if we ask for the results with `ret li`, we get the number of observations, the number of rows and columns, our chi-squared stat, and our p-value. If we choose to keep the chi-squared stat, we *need* to keep the number of rows and columns too, because the reader needs the degrees of freedom, (rows - 1) × (columns - 1), to interpret it. Personally, I’d just save the p-value, since that’s what people would look up from the chi-squared statistic anyway.

To save our p-value, we do the same process as before:

`gen double p_value = .`

`replace p_value = r(p) in 1`
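Putting those pieces together, here’s a minimal sketch of the whole chi-squared check. The simulated data (a two-arm `treat` and a three-level `categorical_var`) is purely an assumption so the snippet runs on its own; swap in your real variables.

```stata
* Hypothetical data: 300 observations, two arms, a 3-level categorical covariate
clear
set seed 12345
set obs 300
gen byte treat = runiform() < 0.5
gen byte categorical_var = ceil(3 * runiform())

* Create the holding variable first, so nothing runs between tab and replace
gen double p_value = .

* Pearson's chi-squared test of independence
tab treat categorical_var, chi2

* Inspect what was stored: r(N), r(r), r(c), r(chi2), r(p)
ret li

* Save the p-value in the first row
replace p_value = r(p) in 1
```

I create `p_value` before running `tab` just to be cautious: that way the `replace` reads `r(p)` immediately after the command that produced it.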

### Situation 2 – You Have Three or More Treatment Arms

This is the situation I found myself in, as I was taking this employment test. I was paralyzed – I knew that a t-test was standard for comparing treatment and control, but I wasn’t sure what’s considered “standard” when you have more than two groups to compare.

However, regardless of what’s “standard,” statistics has a firm answer to this: you’re going to need an ANOVA, short for analysis of variance. This basically compares the variance that exists within each group to the variance that exists between the groups – if there’s far more variance between the groups than within them, that suggests to us that the group means are statistically different.

To run an ANOVA in Stata, you’d use `anova variable_to_check treatment_indicator`. But if we use `ret li` here to try and save our information, we’ll notice that we haven’t got any of the results that we care about!

Since ANOVA is classified under “Linear models and related” in Stata, it stores its results as estimation results, so we’ll instead need to look at `eret li`. Then, we’ll see that we have almost all of our results, with the notable (and frankly, perplexing) absence of our p-value. However, we can generate this by setting the cell we want to put it in equal to `Ftail(e(df_m), e(df_r), e(F))`, which gives the upper-tail probability of the F distribution at our F-stat.
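Here’s a rough end-to-end sketch of the ANOVA version. Again, the simulated three-arm data is an assumption for illustration only; the variable names match the ones used above.

```stata
* Hypothetical data: 300 observations spread across three treatment arms
clear
set seed 12345
set obs 300
gen byte treatment_indicator = ceil(3 * runiform())
gen double variable_to_check = rnormal()

* Holding variable for the result
gen double p_value = .

* One-way ANOVA of the covariate on the treatment arm
anova variable_to_check treatment_indicator

* Inspect the stored estimation results: e(F), e(df_m), e(df_r), and so on
eret li

* Rebuild the p-value from the F-stat and its degrees of freedom
replace p_value = Ftail(e(df_m), e(df_r), e(F)) in 1
```

With a single factor, the model F-stat in `e(F)` is the same as the F-stat for the treatment term, which is why it’s the right one to pass to `Ftail()` here.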

Note: if you’ve got categorical variables *and* three or more treatment arms, you’d still use a chi-squared test. That test can handle multiple treatments.

### One Wrap-Up Note

I briefly considered leaving this for another blog post, since I’m reaching my word limit here, but I’ll be real: I have a strong desire not to write about balance tables for a third day in a row. So here’s a bit of a warning: plenty of people have noted that it doesn’t actually make sense to do a balance check for an experiment.

The whole *point* of statistics like the t-stat, ANOVA’s F-stat, and so on is that they’re meant to give us information about the population as a whole from the sample that we’re looking at. However, when we’re trying to deduce whether the treated group is different from the control, *there is no larger population to make inferences about*. This is it. So, if the proportion of men in control is not equal to the proportion of men in treatment, we know that our randomization is not balanced on gender!

Then, the question should be: what is a large enough difference along one covariate to be evidence that this *isn’t* random? I can’t remember off the top of my head whether I’ve ever learned this, but my intuition is that you could figure this out by bootstrapping.

Here’s what I’m imagining: you’d throw all of your observations into a bucket and you’d draw from that bucket with replacement to make one group that is marked as “control” and another group that is marked as “treatment.” You could repeat this 10,000 times and then compare your data to this generated set of distributions – if your data looks pretty weird compared to the generated data, then that’s indicative that something went wrong. But I’m just spit-balling. We’d have to prove it.
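To make the spit-balling a little more concrete, here’s an unproven sketch of that bucket idea. Everything in it is an assumption: the simulated `treat` and `male` variables, the choice of the difference in proportions as the thing to compare, and 1,000 repetitions instead of 10,000 to keep it quick.

```stata
* Hypothetical data: 200 observations, a binary treatment and a binary covariate
clear
set seed 12345
set obs 200
gen byte treat = runiform() < 0.5
gen byte male = runiform() < 0.5

* Observed difference in the proportion of men, and the treatment group size
count if treat == 1
local n_treat = r(N)
summarize male if treat == 1, meanonly
local m1 = r(mean)
summarize male if treat == 0, meanonly
local obs_diff = `m1' - r(mean)

* Collect one simulated difference per repetition
tempname sims
tempfile results
postfile `sims' double diff using `results'
forvalues i = 1/1000 {
    preserve
    bsample                            // draw _N rows from the bucket, with replacement
    gen double u = runiform()
    sort u                             // shuffle the drawn rows
    gen byte fake_treat = (_n <= `n_treat')   // label the first n_treat rows "treatment"
    summarize male if fake_treat == 1, meanonly
    local f1 = r(mean)
    summarize male if fake_treat == 0, meanonly
    local d = `f1' - r(mean)
    post `sims' (`d')
    restore
}
postclose `sims'

* How often is a simulated difference at least as large as the observed one?
use `results', clear
count if abs(diff) >= abs(`obs_diff')
display "Share at least as extreme: " r(N) / _N
```

If that share is tiny, the observed split looks weird relative to the generated ones. One design note: dropping the `bsample` line and only reshuffling the labels would turn this into a permutation test rather than a bootstrap, which may be closer to how the randomization actually happened.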

**Second point:** yes, balance checks don’t make a whole lot of sense for experiments. However, they certainly do make sense for *natural* experiments. If you have something distributed as-if-randomly, it’s a good idea to check it.

**Third point:** I’m not sure whether checking that the means are not significantly different is the best way to do this. Why don’t we check that the distributions aren’t significantly different? We do have statistics for that…
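One candidate, and this is my pick for illustration rather than an established standard for balance tables, is the two-sample Kolmogorov-Smirnov test, which compares the whole distribution of a continuous variable across two groups. A minimal sketch on simulated (assumed) data:

```stata
* Hypothetical data: 300 observations, two arms, one continuous covariate
clear
set seed 12345
set obs 300
gen byte treat = runiform() < 0.5
gen double variable_to_check = rnormal()

* Holding variable for the result
gen double p_value = .

* Two-sample Kolmogorov-Smirnov test of equal distributions across arms
ksmirnov variable_to_check, by(treat)

* Inspect what was stored, then save the combined p-value
ret li
replace p_value = r(p) in 1
```

This follows the same save-the-p-value pattern as the chi-squared check, so it should slot into the same table.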