Prediction Models for the Democratic 2016 Primary

Many people are building models for the 2016 primaries. I decided to focus on the Democratic primary because the outcome should be easier to predict in what is primarily a two-person race. While I do prefer Sanders over the other candidates, I am not currently planning to vote for him: I prefer to vote for left-wing candidates (Green Party and occasionally Socialists), and I prefer the long-term plan of building a third party over working within the Democratic Party.


Election modeling is hard work, and it is easy to make massive mistakes, so take all of my models with a large grain of salt. I have a background in sociology and web development. I have not reviewed the political science literature on election prediction; rather, I've been following election cycles (US, Canada, UK, and more) for over twenty years.


Experts (like 538), polling averages (Pollster), and prediction markets (PredictIt) are likely to be more accurate than my attempts.


My first attempt at predicting the primaries was for March 15. The polling average ended up being better than my predictions 5 out of 5 times. If my predictions and the polling average were equally good, there would be only a 1 in 32 chance of that happening.
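The 1 in 32 figure follows from treating each of the five contests as an independent coin flip between my prediction and the polling average. A minimal sketch of that arithmetic (the function name is made up for illustration, and the independence of the contests is an assumption):

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials,
    each succeeding with probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance the polling average wins all 5 comparisons if each is a coin flip.
print(prob_at_least(5, 5))  # 0.03125, i.e. 1 in 32
```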


County Model

While most people care about the outcome at the state level, I chose a county-level approach for two reasons. First, it produces county predictions, which in turn let you predict the state-wide result from early county results. Second, it increases the sample size, which allows you to include more variables in your model.


I haven't seen anyone else's county model, so I made my own.



Election Data

You can get election data from the CNN JSON.


For Primaries

http://data.cnn.com/ELECTION/2016primary/MI/county/D.json


For Caucuses

http://data.cnn.com/ELECTION/2016primary/IA/county/E.json
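The two URLs above differ only in the state code and the final letter (D for the primary, E for the caucus). A small sketch of a helper that builds them, assuming the same pattern holds for the other states (the helper and constant names are hypothetical, and the layout of the returned JSON is not described here, so no parsing is shown):

```python
BASE_URL = "http://data.cnn.com/ELECTION/2016primary"

# Letter codes taken from the two example URLs above.
CONTEST_CODES = {"primary": "D", "caucus": "E"}

def county_results_url(state: str, contest: str) -> str:
    """Build the county-level results URL for a two-letter state code."""
    code = CONTEST_CODES[contest]  # raises KeyError for an unknown contest type
    return f"{BASE_URL}/{state.upper()}/county/{code}.json"

print(county_results_url("MI", "primary"))
print(county_results_url("ia", "caucus"))
```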


This county level data includes the following states with primaries: AR, GA, FL, IL, LA, MI, MS, NC, OH, OK, SC, TN, TX, and VA. It excludes VT and MA as I was unable to find county level election results for them (email me if you have this data!).

I included the following states with caucuses: IA, NV, CO, and NE.
I wasn't able to find county level data for caucuses in KS and ME.

I manually added the NH county results.

I do not have AL counties in my results (due to an oversight).



My model has 1822 counties.

For my dependent variable I chose to use Bernie's share of the total Bernie + Hillary vote.
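As a sketch, the dependent variable can be computed per county like this (the function name is hypothetical; returning None for a county with no two-way votes is my own choice, not something the data requires):

```python
from typing import Optional

def two_way_share(sanders_votes: int, clinton_votes: int) -> Optional[float]:
    """Sanders' share of the combined Sanders + Clinton vote.

    Votes for other candidates are ignored. Returns None when neither
    candidate received any votes, so the county can be dropped.
    """
    total = sanders_votes + clinton_votes
    if total == 0:
        return None
    return sanders_votes / total

print(two_way_share(600, 400))  # 0.6
```

The same ratio form applies to the search-trend and polling variables described later, which are also expressed as Sanders / (Sanders + Clinton).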



Variables

The next step is figuring out what variables you should put in your model. For the most part, I used American Community Survey demographic variables that I got from the Census Reporter API.



Race

Race appears to be the most important factor. I am using the following variables from the American Community Survey (ACS, 2010-2014): percent Black, percent White non-Hispanic, percent Hispanic (includes whites), percent Asian, percent Other, percent Islander (Pacific and Hawaiian), percent Multi-Racial, and percent Native American.



Age and Sex

I have a simple median age variable (ACS). I also compiled data from the ACS age/sex table to create age ranges for each sex (18-29, 30-44, 45-64, 65+) under the hypothesis that men and women in different age groups would have different levels of support.
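Here is a rough sketch of how narrow age bins for one sex could be collapsed into the four ranges. The input format, the function name, and the choice of total county population as the denominator are all assumptions made for illustration; only the range boundaries and the m/f naming style (as in m3044 or f1829) come from my data set.

```python
# Target ranges from the text; None marks an open-ended upper bound.
AGE_RANGES = {"1829": (18, 29), "3044": (30, 44), "4564": (45, 64), "65plus": (65, None)}

def collapse_age_bins(bins, total_population, prefix):
    """Collapse {(low, high): count} bins into shares of total_population.

    prefix is "m" or "f". A bin is counted in a range only if it lies
    entirely inside that range.
    """
    shares = {}
    for label, (low, high) in AGE_RANGES.items():
        count = 0
        for (bin_low, bin_high), people in bins.items():
            upper_ok = high is None or (bin_high is not None and bin_high <= high)
            if bin_low >= low and upper_ok:
                count += people
        shares[prefix + label] = count / total_population
    return shares

# Made-up counts for men in a county of 2,000 people.
male_bins = {(18, 19): 50, (20, 24): 100, (25, 29): 50,
             (30, 44): 300, (45, 64): 200, (65, None): 100}
print(collapse_age_bins(male_bins, 2000, "m"))
```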



Unionization Rate

I got the unionization rate (percent of employed who are union members) at the state level from the latest Bureau of Labor Statistics figures.



Google Search Trends

I used Google Trends for “Bernie Sanders” and “Hillary Clinton” at the state level for January 1, 2016 to March 20, 2016. I currently prefer to use the ratio Sanders / (Sanders + Clinton), as this controls for the fact that both search terms were more popular in early states, though in practice it doesn't seem to make a big difference.



Education

I created several ranges from the ACS: below high school, high school or GED, some college (includes associate degrees), college degree, and post-college degree (master's, PhD, law, etc.).



Income

I got median household income from the ACS. I also tried using the log of income, which can add a little to the model, but I feel unsure about the justification. I also tested the poverty rate. Finally, I created an income-change variable that compares the median household income in 2007-2011 (ACS) to 2010-2014 (ACS), but that wasn't significant.
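A sketch of the income variables, for illustration only. The function and key names are hypothetical, and defining the income change as a relative change is an assumption: a simple difference or a ratio would be equally plausible ways to compare the two ACS periods.

```python
import math

def income_features(median_2007_2011: float, median_2010_2014: float) -> dict:
    """Income variables for one county from two ACS 5-year medians."""
    return {
        "income": median_2010_2014,
        # Log income compresses the long right tail of county incomes.
        "log_income": math.log(median_2010_2014),
        # Assumed definition: relative change between the two periods.
        "income_change": (median_2010_2014 - median_2007_2011) / median_2007_2011,
    }

features = income_features(50_000, 52_500)
print(round(features["income_change"], 3))  # 0.05
```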



Obama 2008 Presidential Vote

I couldn't find reliable county votes for the 2012 presidential election, so I used the 2008 election.



Density

I got population density from the ACS. It generally was not a factor.



Cyclist Commuters

I included this ACS variable mostly for fun, but it actually is often significant. My data set (like the ACS) lacks any variable that measures where an area sits on the liberal-conservative spectrum. My theory is that liberals are more likely to be cyclists, but cycling could also just be heavily correlated with areas with a high student population. Maybe I should test walking commuters as well?



Bernie Polling Average

I got this from Pollster, or if Pollster didn't have an average due to a lack of polls I used 538's number. Then I converted it to Bernie Support / (Bernie Support + Hillary Support). Models that include this variable exclude Nebraska – as I did not find any polling data for the Democratic caucus.



Caucus

I sometimes used the caucus variable (1 = caucus, 0 = primary). Bernie does slightly worse in caucuses than in primaries, but this isn't always significant.



Race Only Models

Let's start off with a simple model using percent black, percent white, and percent hispanic.

Adjusted R^2: 0.579


Coefficients (Model 1, dependent variable: pBernie)

| Variable | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | .274 | .015 | | 18.363 | .000 |
| pblack | -1.345 | .053 | -1.436 | -25.289 | .000 |
| phispanic | .879 | .054 | .824 | 16.183 | .000 |
| pwhite2 | -.669 | .053 | -.862 | -12.657 | .000 |


This model is actually pretty surprising. Relative to the other races, Bernie does very poorly in areas with more blacks, which makes sense. However, he also does poorly in areas with more whites, and he does better in areas with more hispanics. Throughout all the models I've made, Bernie consistently performs badly with blacks, but the result for hispanics is sometimes positive and sometimes negative. I'm confused by this result.



All the Races

This model compares all of the races to whites.

Adj R^2: 0.589




Coefficients (Model 1, dependent variable: pBernie)

| Variable | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | .205 | .020 | | 10.350 | .000 |
| pblack | -.680 | .015 | -.726 | -46.829 | .000 |
| phispanic | .268 | .020 | .251 | 13.062 | .000 |
| pindian | .324 | .091 | .064 | 3.569 | .000 |
| pasian | .892 | .153 | .091 | 5.818 | .000 |
| pisland | 2.950 | 1.541 | .029 | 1.915 | .056 |
| pother | .229 | .096 | .045 | 2.381 | .017 |
| pmulti | 1.100 | .150 | .134 | 7.335 | .000 |


You can see that adding additional races only marginally increases the R^2 (by 0.01).
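The adjusted R^2 values I report already penalize a model for each extra predictor, so a gain that survives the adjustment is not just an artifact of adding variables. A sketch of the standard adjustment formula, using made-up inputs rather than my actual raw R^2 values or sample sizes:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p predictors fit on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2, different numbers of predictors (illustrative values only).
print(round(adjusted_r2(0.60, 1000, 3), 4))  # 0.5988
print(round(adjusted_r2(0.60, 1000, 7), 4))  # 0.5972
```

With well over a thousand counties, the penalty for a handful of extra race variables is tiny, which is why the adjusted and raw values stay close.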

Here you again have Bernie doing poorly with blacks. Surprisingly, Bernie does better with every other race than he does with whites. His support from Islanders is only marginally significant. In some of my models I have noticed very strong support from multi-racial people (as in this model).



Race and Polling Model

This model includes both race and the Bernie Poll Average variable.

Adj R^2: 0.701




Coefficients (Model 1, dependent variable: pBernie)

| Variable | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | -.020 | .020 | | -1.038 | .300 |
| pblack | -.508 | .015 | -.555 | -34.531 | .000 |
| phispanic | .186 | .018 | .179 | 10.411 | .000 |
| pindian | .312 | .090 | .057 | 3.480 | .001 |
| pasian | .790 | .131 | .083 | 6.055 | .000 |
| pisland | .510 | 1.344 | .005 | .380 | .704 |
| pother | .157 | .082 | .032 | 1.907 | .057 |
| pmulti | .552 | .136 | .069 | 4.061 | .000 |
| bpoll | .734 | .033 | .359 | 21.985 | .000 |

The Bernie poll average is the second most important predictor, after percent black, judging by the standardized coefficients. Islander is washed out of the model, and Other race is only marginally significant.
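Ranking predictors by importance relies on the standardized coefficients (Beta), which rescale each B by the spread of its variable relative to the spread of the outcome. For bpoll, B = .734 and Beta = .359, which implies that the standard deviation of bpoll is roughly 0.49 times that of pBernie. A sketch of the conversion with made-up data (the function name is hypothetical):

```python
from statistics import stdev

def standardized_beta(b: float, x_values, y_values) -> float:
    """Convert an unstandardized coefficient b into a standardized one:
    Beta = b * sd(x) / sd(y)."""
    return b * stdev(x_values) / stdev(y_values)

# Toy data where y is exactly 2 * x, so the standardized slope is 1.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(round(standardized_beta(2.0, x, y), 6))  # 1.0
```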



The Big Model
This includes pretty much everything.

Adj R^2: 0.788




Coefficients (Model 1, dependent variable: pBernie)

| Variable | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | -.861 | .042 | | -20.549 | .000 |
| pblack | -.489 | .013 | -.534 | -36.419 | .000 |
| phispanic | .153 | .016 | .147 | 9.738 | .000 |
| pindian | .309 | .076 | .057 | 4.065 | .000 |
| pother | .245 | .070 | .049 | 3.495 | .000 |
| pmulti | .299 | .116 | .038 | 2.580 | .010 |
| bpoll | .177 | .038 | .086 | 4.624 | .000 |
| gpSanders | 2.200 | .125 | .322 | 17.639 | .000 |
| Income | -9.636E-7 | .000 | -.073 | -4.032 | .000 |
| m3044 | .462 | .095 | .060 | 4.851 | .000 |
| m4564 | 1.066 | .146 | .123 | 7.324 | .000 |
| f1829 | 1.072 | .113 | .155 | 9.477 | .000 |
| somecollege | .224 | .038 | .070 | 5.854 | .000 |
| college | .520 | .052 | .184 | 10.083 | .000 |
| bike | .634 | .254 | .030 | 2.492 | .013 |


Race continues to play the most critical role. At the county level, the demographics are more significant than the state-wide polling average. This makes some sense because the two are measured over different geographic areas, and a county can be very different from its state as a whole. It makes less sense considering that the Google search trend (gpSanders) is also measured at the state level and yet is more significant than the polling data.



Sanders support increases in counties with lower median household income. I generally found this to be true, though not a very strong effect.



Sanders does well in counties with more men aged 30-44 (though not in all models), men aged 45-64 (this was found in most models), and women aged 18-29 (also found in most models). I'm surprised that he does better with women aged 18-29 than with men of that age group. It is possible that this is a proxy variable for the presence of colleges and universities (as more women attend them than men, and young men might also be found on army bases and in prisons, where prisoners will often or always be unable to vote).



Sanders does better in counties where more people have a partial or full college education. He does poorly with high school grads, those below high school, and those with post-college degrees. This is seen in all my models.



In most of my models, Sanders does better in counties with more people who commute by bicycle. This is hilarious. It might be a proxy for liberals or college students.



This final model leaves a lot of the variation in the outcome (21.2%) unexplained. You can apply this model to every county in a state and try to predict the state result, but the model only accounts for 78.8% of the county-level variation.
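One possible way to roll county predictions up into a state prediction is a vote-weighted average. This is a sketch under assumptions: the function name is hypothetical, and using each county's expected two-way vote total as the weight is my own choice here, since turnout has to be estimated separately.

```python
def state_share(counties) -> float:
    """Vote-weighted average of county-level predicted shares.

    counties is an iterable of (predicted_share, expected_two_way_votes).
    """
    counties = list(counties)
    total_votes = sum(votes for _, votes in counties)
    if total_votes == 0:
        raise ValueError("expected votes must sum to a positive number")
    return sum(share * votes for share, votes in counties) / total_votes

# Two made-up counties: a small one leaning Sanders, a large one leaning Clinton.
print(round(state_share([(0.6, 1000), (0.4, 3000)]), 3))  # 0.45
```

Because large counties dominate the weighted sum, errors in a few populous counties matter far more for the state result than the county-level R^2 suggests.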



Conclusion

I've spent a lot of time on this model, and while I have learned a lot, I do not feel that it is accurate enough to make meaningful predictions. I think it may be much easier to predict a state-level outcome, where polling should be a much stronger predictor than it is at the county level.



I would be very happy to hear from other people regarding their own models, possible data sources and variables that I should add, and any other advice that you might have.



Please <a href='mailto:aaron@campusactivism.org'>Email Me</a>