Jobs I Can’t Have: Data Journalist

I have a confession to make. I actually really hate finagling with visualizations on various data analysis platforms.

I reached this conclusion about 10 minutes ago, and I think that it’s already changing my life. Up until 10 minutes ago, if you’d asked me whether I’d like to spend time learning how to more effectively visualize data, I would’ve said hell yes.

And who could blame me for not knowing myself! Just look at statisticians of any public acclaim whatsoever. Nate Silver’s FiveThirtyEight is chockablock with effective visualizations. The data folks at the New York Times create works of such staggering beauty and simplicity that you can just look at the graph without reading the article and know what they’re saying. And then there’s whatever the hell this sorcery is:

Crazy Scatterplot
Here’s the original link, if you’d like to spend the next five hours trying to assemble this bad boy.

Since I’m trained as a statistician, there’s always going to be a part of me that whispers, “You should be able to make this graph too.”

But I think at a certain point, maybe part of growing up is admitting that there are some things you would like to be able to do that you are just never, ever going to be able to do. I’m a firm believer in the idea that just about anyone can learn just about anything, but I think we also have to recognize how much it’ll cost you to get to an expert level.

Case in point: I just spent about an hour making this dumb scatterplot [1]:

ugly scatterplot
Honestly? What even is this?

It probably would take another two hours to get it to a point where it makes any sense. And in the meantime, though fiddling with graphics does make time pass quickly, it’s a little alarming to look up and see that an entire evening has passed you by and all you have to show for it is some crappy scatterplot.

This is a life-changing revelation because, in my mind, I was holding open the door to becoming a data visualization expert. If that door is firmly shut, I’m probably not going to become a data journalist. I could still work with data, but I’m always going to need help from someone with better artistic sense and ggplot2 skills than I have.

This is a relaxing kind of realization. I can stop castigating myself for being such shit at graphics.

1. In my defense though, I did create this in Python, which I am only just learning. But it would still need a lot of work to be intelligible.

Creating a Balance Table in Stata (Part 2)

Yesterday’s post showed how to save the results of non-estimation commands in Stata to check the balance for an experiment. Today, we’ll talk about how to do this in slightly more complicated situations.

Situation 1 – You Have Categorical Variables

This is almost the same as the quantitative data we were handling yesterday, but you can’t use a t-test or regression because both of these techniques were developed for continuous quantitative variables. Instead, let’s use Pearson’s Chi-Squared test for independence. The null hypothesis of this test is that our categorical variable is independent of the treatment; that is, knowing the treatment doesn’t give us additional information about the likelihood that the categorical variable takes any particular value. See this nice little site for more information, if you’re curious.

To calculate the chi-squared stat in Stata, we just use tab treat categorical_var, chi2.

Then, if we ask for the results, ret li, we get the number of observations, the number of rows and columns, our chi-squared stat, and our p-value. If we choose to keep the chi-squared stat, we need to keep the number of rows and columns so that our reader can interpret it. Personally, I’d just save the p-value, since that’s what people will look up with the chi-squared statistic anyway.

To save our p-value, we do the same process as before:

gen double p_value = .
replace p_value = r(p) in 1
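
Put together, the whole thing might look something like the sketch below. This is only an illustration, not code from the job test: I’m assuming Stata’s bundled auto dataset as stand-in data, with foreign playing the role of treat and rep78 playing the role of the categorical variable.

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: foreign plays the role of treat, rep78 the categorical covariate.
sysuse auto, clear

* Pearson chi-squared test of independence
tabulate foreign rep78, chi2

* Show what Stata stored: r(N), r(r), r(c), r(chi2), r(p)
return list

* Save the p-value in the first row of a results variable
gen double p_value = .
replace p_value = r(p) in 1
```

The one thing to watch is that r(p) only sticks around until the next command that stores its own r() results, so save it before running another test.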

Situation 2 – You Have Three or More Treatment Arms

This is the situation I found myself in, as I was taking this employment test. I was paralyzed – I knew that a t-test was standard for comparing treatment and control, but I wasn’t sure what’s considered “standard” when you have more than two groups to compare.

However, regardless of what’s “standard,” statistics has a firm answer to this: you’re going to need an ANOVA, short for analysis of variance. This basically compares the variance that exists within groups to that which exists between them – if there’s far more variance between the groups than within them, that suggests to us that maybe these groups are statistically different.

To run an ANOVA in Stata, you’d use anova variable_to_check treatment_indicator. But if we use ret li here to try and save our information, we’ll notice that we haven’t got any of the results that we care about!

Since ANOVA is classified under “Linear models and related” in Stata, we’ll instead need to look at eret li. Then, we’ll see that we have almost all of our results, with the notable (and frankly, perplexing) absence of our p-value. However, we can generate this by setting the cell we want to put it in equal to Ftail(e(df_m), e(df_r), e(F)), which returns the upper-tail probability of the F distribution for our F-stat and its two degrees of freedom.
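
Here’s a minimal sketch of those steps. The data are an assumption on my part: I’m using Stata’s bundled auto dataset, with rep78 (five categories) standing in for a multi-arm treatment indicator and price standing in for the variable to check. The names f_stat and p_value are just labels I picked.

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: rep78 plays the role of a treatment indicator with several arms.
sysuse auto, clear

* One-way ANOVA of price across the rep78 groups
anova price rep78

* ANOVA is an estimation command, so its results live in e(), not r()
ereturn list

* Save the F-stat and build the p-value from the stored pieces
gen double f_stat = .
gen double p_value = .
replace f_stat  = e(F) in 1
replace p_value = Ftail(e(df_m), e(df_r), e(F)) in 1
```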

Note: if you’ve got categorical variables and three or more treatment arms, you’d still use a chi-squared test. That test can handle multiple treatments.

One Wrap-Up Note

I briefly considered leaving this for another blog post, since I’m reaching my word limit here, but I’ll be real: I have a strong desire not to write about balance tables for a third day in a row. So here’s a bit of a warning: plenty of people have noted that it doesn’t actually make sense to do a balance check for an experiment.

The whole point of statistics like the t-stat, ANOVA’s F-stat, and so on is that they’re meant to give us information about the population as a whole from the sample that we’re looking at. However, when we’re trying to deduce whether the treated group is different from the control, there is no larger population to make inferences about. This is it. So, if the proportion of men in control is not equal to the proportion of men in treatment, we know that our randomization is not balanced on gender!

Then, the question should be: what is a large enough difference along one covariate to be evidence that this isn’t random? I can’t remember off the top of my head whether I’ve ever learned this, but my intuition is that you could figure this out by bootstrapping.

Here’s what I’m imagining: you’d throw all of your observations into a bucket and you’d draw from that bucket with replacement to make one group that is marked as “control” and another group that is marked as “treatment.” You could repeat this 10,000 times and then compare your data to this generated set of distributions – if your data looks pretty weird compared to the generated data, then that’s indicative that something went wrong. But I’m just spit-balling. We’d have to prove it.
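
For what it’s worth, Stata has a built-in command that does something very close to this. To be clear about the swap: permute shuffles the treatment labels across observations (drawing without replacement, so group sizes stay fixed) rather than drawing with replacement as I described, which makes it a permutation test rather than a bootstrap. A rough sketch, assuming the bundled auto dataset with foreign as a stand-in for treatment and price as the covariate:

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: foreign plays the role of treatment, price the covariate.
sysuse auto, clear

* Reshuffle the foreign labels 10,000 times, recomputing the difference
* in group means each time, and compare the observed difference to the
* reshuffled ones. seed() just makes the run reproducible.
permute foreign diff=(r(mu_2)-r(mu_1)), reps(10000) seed(12345) nodots: ttest price, by(foreign)
```

The output reports how often the reshuffled differences were at least as extreme as the one in the real data, which is the “does my data look weird?” comparison.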

Second point: yes, balance checks don’t make a whole lot of sense for experiments. However, they certainly do make sense for natural experiments. If you have something distributed as-if-randomly, it’s a good idea to check it.

Third point: I’m not sure whether checking that the means are not significantly different is the best way to do this. Why don’t we check that the distributions aren’t significantly different? We do have statistics for that…
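
One candidate, and this is my pick for illustration rather than anything standard for balance tables, is the two-sample Kolmogorov-Smirnov test, which compares whole distributions instead of means. It only handles two groups. A sketch with the bundled auto dataset, again assuming foreign as the stand-in treatment and with ks_p as a name I made up:

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: foreign plays the role of a two-arm treatment indicator.
sysuse auto, clear

* Two-sample Kolmogorov-Smirnov test of equal distributions
ksmirnov price, by(foreign)

* The combined p-value is stored in r(p)
return list
gen double ks_p = .
replace ks_p = r(p) in 1
```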

Creating a Balance Table in Stata (Part 1)

I recently had the extremely uncomfortable experience of taking a timed test for employment that wanted me to create a balance table for an experiment with three treatments and having no idea how to do it.

This was uncomfortable because I know, in theory, how a balance table ought to work. I’m trained as a statistician. It’s remarkably embarrassing to stumble on your supposed “core competency.” I wound up turning in some half-assed tables glued together in Excel. Needless to say, I didn’t get a call back.

So, this post is for those of you who find yourselves in the unenviable position of having a little too much book knowledge and a little too little practical know-how when it comes to statistical analysis. (Let’s be real – this post is also for me to prove to myself that I do know some of the things!)

To start off: the basic idea of a balance table is that we want to assess whether our randomization worked. We’re interested in assigning people to two or more treatments, and a balance table is a nice check that we haven’t assigned all of the men to treatment 1 and all of the women to treatment 2, or some nonsense like that.

Then, for an experiment that just has treatment and control, we usually just conduct a t-test on a variety of participant traits that we’ve gathered data for.

In the balance table shown below, for the “Incentives Work” paper by Duflo, Hanna, and Ryan, the authors take an equivalent tack – they regress each characteristic on a variable that is 0 for control units and 1 for treated. Then, the coefficient that they get from that regression is just the difference in means between treatment and control on this trait, and the size of that coefficient relative to its standard error tells them whether there’s a significant difference between the two groups.
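
To see the equivalence for yourself, here’s a small sketch. It uses Stata’s bundled auto dataset rather than the paper’s data, with foreign (0/1) assumed as the stand-in treatment dummy and price as the trait.

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: foreign (0/1) plays the role of the treatment dummy.
sysuse auto, clear

* Regression approach: the coefficient on foreign is the difference in means
regress price foreign
display _b[foreign]
display _se[foreign]

* t-test approach: same difference, same standard error (equal variances)
ttest price, by(foreign)
display r(mu_2) - r(mu_1)
display r(se)
```

Note that the ttest output itself reports the difference as group 0 minus group 1, so its sign is flipped relative to the regression coefficient.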

Duflo balance table
It sure would be nice if I knew how to make one of these…

Unfortunately, understanding this doesn’t get us very close to being able to construct a table that’s suitable for publication. And indeed, the replication files for this particular table don’t shed any light on how it’s constructed. My guess is that the log file that the replication do-file spits out is then reformatted by hand for LaTeX?

If we’re handling a regression type output in Stata, we could use estout to package it up nicely, but what if we wanted to more explicitly compare means? Estout doesn’t seem to see t-test outputs, since they’re not considered “estimates.”

Instead, after a t-test, we can type “ret li” (short for “return list”) and that’ll spit out the numbers that we care about. Specifically, we want to preserve r(mu_1) and r(mu_2), our estimates of the means; r(N_1) and r(N_2), the number of observations in each treatment group; and most importantly, either r(t) and r(df_t), the t-stat and its degrees of freedom, or r(p), the two-sided p-value (the probability of seeing a difference at least this large if the two means were actually equal).

To save these, we’re going to generate results variables.

Code for balance table
Still to come: I learn how to use WordPress so that I can type code straight in instead of print-screening it like some kind of goon.
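
Since the screenshot doesn’t travel well, here’s a minimal sketch of what code along these lines might look like. It is not the original code: I’m assuming Stata’s bundled auto dataset, with foreign standing in for the treatment indicator and price for the variable being checked, so the numbers it produces won’t match the school data. The results variable names follow the column headers in the output.

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: foreign plays the role of treatment, price the trait to check.
sysuse auto, clear

* Empty results variables, one per number we want to keep
gen double mean_1  = .
gen double n_1     = .
gen double mean_2  = .
gen double n_2     = .
gen double t_stat  = .
gen double df_t    = .
gen double p_value = .
gen str32 varname  = ""
gen str80 varlabel = ""

* Run the t-test, then copy its stored results into row 1
ttest price, by(foreign)
replace mean_1  = r(mu_1) in 1
replace n_1     = r(N_1)  in 1
replace mean_2  = r(mu_2) in 1
replace n_2     = r(N_2)  in 1
replace t_stat  = r(t)    in 1
replace df_t    = r(df_t) in 1
replace p_value = r(p)    in 1

* Record which variable this row describes
local lbl : variable label price
replace varname  = "price" in 1
replace varlabel = "`lbl'" in 1
```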

This creates the following:

mean_1  n_1  mean_2  n_2  t_stat  df_t  p_value  varname  varlabel
.641    39   .658    41   -.162   78    .871     open     Proportion of Schools Open

This is honestly already so much better than what I ended up with on this job exam that I feel a little silly for not having learned it before.

Clearly, you’d want to check the balance of more than one variable, so you’d loop over variables and add a local macro to keep a tally of which row to put things in. See this article in the Stata Journal for an example of that – I basically pulled the code from that.
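
A rough sketch of that loop, which is my own illustration and not the Stata Journal code: it assumes the bundled auto dataset, with foreign as the stand-in treatment and price, mpg, and weight as the traits to check. The local macro row is the tally.

```stata
* Stand-in data: Stata's bundled auto dataset.
* Assumption: foreign plays the role of treatment.
sysuse auto, clear

* One empty results variable per statistic, plus two string columns
foreach s in mean_1 n_1 mean_2 n_2 t_stat df_t p_value {
    gen double `s' = .
}
gen str32 varname  = ""
gen str80 varlabel = ""

* row keeps a tally of which row the next variable's results go in
local row = 1
foreach v of varlist price mpg weight {
    quietly ttest `v', by(foreign)
    quietly replace mean_1   = r(mu_1) in `row'
    quietly replace n_1      = r(N_1)  in `row'
    quietly replace mean_2   = r(mu_2) in `row'
    quietly replace n_2      = r(N_2)  in `row'
    quietly replace t_stat   = r(t)    in `row'
    quietly replace df_t     = r(df_t) in `row'
    quietly replace p_value  = r(p)    in `row'
    quietly replace varname  = "`v'"   in `row'
    quietly replace varlabel = "`: variable label `v''" in `row'
    local ++row
}

* One row per variable checked
list mean_1 n_1 mean_2 n_2 t_stat df_t p_value varname in 1/3
```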