Prediction Models for the Democratic 2016 Primary

Submitted by akreider on Tue, 03/22/2016 - 15:54.

There are a lot of people creating models for the 2016 primary. I decided to focus on the Democratic primary, as it should be easier to predict the outcome in what is primarily a two-person race. While I do prefer Sanders over the other candidates, I am currently not planning on voting for him, as I prefer to vote for left-wing candidates (Green Party and occasionally Socialists) and prefer the long-term plan of building a third party over working within the Democratic Party.

Election modeling is hard work and it is easy to make massive mistakes, so take all of my models with a large grain of salt. I have a background in sociology and web development. I have not reviewed the political science literature on election prediction; rather, I've been following election cycles (US, Canada, UK, and more) for over twenty years.

Experts (like 538), polling averages (Pollster), and prediction markets (PredictIt) are likely to be more accurate than my attempts.

My first attempt at predicting the primaries was March 15. The poll average ended up being better than my predictions 5 out of 5 times. If each contest were a coin flip between the two, there would be only a 1 in 32 chance of that happening at random.
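As a quick check of the 1 in 32 figure, here is a minimal sketch. It assumes (and this is only an assumption) that each of the five contests is an independent 50/50 coin flip between the poll average and my prediction; the function name is made up for illustration.

```python
from fractions import Fraction

def chance_of_sweep(n_contests: int) -> Fraction:
    """Chance that one side wins every contest, assuming each contest is an
    independent 50/50 coin flip (an assumption, not something measured)."""
    return Fraction(1, 2) ** n_contests

print(chance_of_sweep(5))  # 1/32
```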

County Model

While most people care about the outcome at the state level, I have chosen a county-level approach because it lets you predict individual counties and then update the state-wide prediction as early county results come in. It also increases the sample size, which allows you to include more variables in your model.
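One way to turn county predictions into a state-wide number is a vote-weighted average, swapping in the actual result for any county that has already reported. The sketch below is only an illustration of that idea, not the exact procedure used here; the field names (`expected_votes`, `predicted_share`, `actual_share`) and the sample numbers are hypothetical.

```python
def project_state_share(counties):
    """Vote-weighted Sanders share for a state.

    Each county is a dict with hypothetical keys:
      expected_votes  - expected Sanders + Clinton votes in the county
      predicted_share - model prediction of Sanders' two-way share
      actual_share    - reported two-way share, or missing/None if not yet in
    """
    sanders_votes = 0.0
    total_votes = 0.0
    for county in counties:
        share = county.get("actual_share")
        if share is None:
            # County has not reported yet, so fall back on the model.
            share = county["predicted_share"]
        sanders_votes += share * county["expected_votes"]
        total_votes += county["expected_votes"]
    return sanders_votes / total_votes

# Hypothetical two-county state: one county reported, one still outstanding.
example = [
    {"expected_votes": 1000, "predicted_share": 0.40, "actual_share": 0.50},
    {"expected_votes": 3000, "predicted_share": 0.60},
]
print(project_state_share(example))
```

A more careful version would also shift the predictions for outstanding counties by however much the model missed in the counties that have reported.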

I haven't seen anyone else's county model, so I made my own.

Election Data

You can get county-level election data from CNN's JSON feeds.

For Primaries

http://data.cnn.com/ELECTION/2016primary/MI/county/D.json

For Caucuses

http://data.cnn.com/ELECTION/2016primary/IA/county/E.json
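Judging from the two URLs above, the pattern is the state abbreviation followed by `D.json` for a primary or `E.json` for a caucus. Here is a small sketch that builds the URL for a given state. The helper name is made up, and the pattern is only inferred from these two examples, so treat it as an assumption; the structure of the JSON itself is not covered here.

```python
BASE_URL = "http://data.cnn.com/ELECTION/2016primary"

# Inferred from the two example URLs: D for primaries, E for caucuses.
CONTEST_CODES = {"primary": "D", "caucus": "E"}

def county_results_url(state: str, contest: str) -> str:
    """Build the county-level results URL for a state abbreviation.

    contest must be "primary" or "caucus".
    """
    code = CONTEST_CODES[contest]
    return f"{BASE_URL}/{state.upper()}/county/{code}.json"

print(county_results_url("MI", "primary"))
print(county_results_url("IA", "caucus"))
```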

This county level data includes the following states with primaries: AR, GA, FL, IL, LA, MI, MS, NC, OH, OK, SC, TN, TX, and VA. It excludes VT and MA as I was unable to find county level election results for them (email me if you have this data!).

I included the following states with caucuses: IA, NV, CO, and NE.
I wasn't able to find county level data for caucuses in KS and ME.

I manually added the NH county results.

I do not have AL counties in my results (due to an oversight).

My model has 1822 counties.

For my dependent variable I chose to use Bernie's share of the total Bernie + Hillary vote.
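In code, the dependent variable (pBernie in the tables below) is just a two-way share. A minimal sketch, with a made-up function name and made-up vote counts:

```python
def two_way_share(sanders_votes, clinton_votes):
    """Sanders' share of the combined Sanders + Clinton vote.

    Returns None when neither candidate has any votes, so that empty
    counties can be dropped instead of causing a division by zero.
    """
    total = sanders_votes + clinton_votes
    if total == 0:
        return None
    return sanders_votes / total

print(two_way_share(600, 400))  # 0.6
```

Votes for other candidates and uncommitted delegates are ignored, so the share is always measured against Clinton only.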

Variables

The next step is figuring out what variables you should put in your model. For the most part, I used American Community Survey demographic variables that I got from the Census Reporter API.

Race

Race appears to be the most important factor. I am using the following variables from the American Community Survey (ACS, 2010-2014): percent Black, percent White non-Hispanic, percent Hispanic (includes white Hispanics), percent Asian, percent Other, percent Islander (Pacific and Hawaiian), percent Multi-Racial, and percent Native American.

Age and Sex

I have a simple median age variable (ACS). I also compiled data from the ACS age/sex table to create age ranges for each sex (18-29, 30-44, 45-64, 65+) under the hypothesis that men and women in different age groups would have different levels of support.

Unionization Rate

I got the unionization rate (percent of employed who are union members) at the state level from the latest Bureau of Labor Statistics data.

Google Search Trends

I used Google Trends for “Bernie Sanders” and “Hillary Clinton” at the state level for January 1 to March 20, 2016. I currently prefer to use the ratio Sanders / (Sanders + Clinton), as this controls for the fact that both search terms were more popular in early states, though in practice it doesn't seem to make a big difference.
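To see why the ratio controls for overall search volume, here is a small sketch with invented interest numbers (they are not real Google Trends values). If both terms are searched four times as much in an early state but in the same proportion, the ratio comes out the same.

```python
def search_share(sanders_interest, clinton_interest):
    """Sanders' share of combined search interest for the two terms."""
    return sanders_interest / (sanders_interest + clinton_interest)

# Invented numbers: an early state where both terms are searched a lot,
# and a later state where both are searched a quarter as much.
early_state = search_share(80, 40)
late_state = search_share(20, 10)

print(early_state, late_state)
```

The raw Sanders number differs by a factor of four between the two states, while the share is the same in both.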

Education

I created several ranges from the ACS including below high school, high school or GED, some college (includes associate degrees), college degree, and post-college degree (master's, PhD, law, etc.).

Income

I got median household income from the ACS. I also tried using the log of income, which can add a little to the model, but I feel unsure about the justification. I also tested the poverty rate. Finally, I created an income-change variable that compares the median household income in 2007-2011 (ACS) to 2010-2014 (ACS), but that wasn't significant.
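Here is a rough sketch of how the income variables could be built. The usual argument for the log is that it treats equal percentage differences in income as equal steps, rather than equal dollar differences. The income-change formula below (relative change between the two ACS periods) is my assumption for illustration, as are the function name and the sample incomes.

```python
import math

def income_features(median_2007_2011, median_2010_2014):
    """Income variables for one county from two ACS median incomes.

    income_change is computed as relative change, which is an assumed
    definition; a plain difference or a ratio would also be reasonable.
    """
    return {
        "income": median_2010_2014,
        "log_income": math.log(median_2010_2014),
        "income_change": (median_2010_2014 - median_2007_2011) / median_2007_2011,
    }

features = income_features(50000, 55000)
print(features)
```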

Obama 2008 Presidential Vote

I couldn't find reliable county votes for the 2012 presidential election, so I used the 2008 election.

Density

I got population density from the ACS. It generally was not a factor.

Cyclist Commuters

I included this ACS variable mostly for fun, but it actually is often significant. My data set (like the ACS) lacks any kind of variable that measures liberal-conservative ideology. My theory is that liberals are more likely to be cyclists, but it could also just be heavily correlated with areas with a high student population. Maybe I should test walking commuters as well?

Bernie Polling Average

I got this from Pollster, or if Pollster didn't have an average due to a lack of polls I used 538's number. Then I converted it to Bernie Support / (Bernie Support + Hillary Support). Models that include this variable exclude Nebraska – as I did not find any polling data for the Democratic caucus.

Caucus

I sometimes used the caucus variable (1 = caucus, 0 = primary). Bernie does slightly worse in caucuses, but this isn't always significant.

Race Only Models

Let's start off with a simple model using percent Black, percent White, and percent Hispanic.

Adjusted R^2: 0.579

Coefficients (dependent variable: pBernie)
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Variable      B           Std. Error   Beta      t          Sig.
(Constant)    .274        .015                   18.363     .000
pblack        -1.345      .053         -1.436    -25.289    .000
phispanic     .879        .054         .824      16.183     .000
pwhite2       -.669       .053         -.862     -12.657    .000

This model is actually pretty surprising. Relative to the other races, Bernie does very poorly in areas with more blacks, which makes sense. However, he also does poorly in areas with more whites and does better in areas with more Hispanics. Throughout all the models I've made, Bernie consistently performs badly with blacks, but the result for Hispanics is sometimes positive and sometimes negative. I'm confused by this result.

All the Races

This model compares all of the races to whites (percent White is left out, so it is the reference group).

Adjusted R^2: 0.589

Coefficients (dependent variable: pBernie)
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Variable      B           Std. Error   Beta      t          Sig.
(Constant)    .205        .020                   10.350     .000
pblack        -.680       .015         -.726     -46.829    .000
phispanic     .268        .020         .251      13.062     .000
pindian       .324        .091         .064      3.569      .000
pasian        .892        .153         .091      5.818      .000
pisland       2.950       1.541        .029      1.915      .056
pother        .229        .096         .045      2.381      .017
pmulti        1.100       .150         .134      7.335      .000

You can see that adding additional races only marginally increases the R^2 (by 0.01).

Here you again have Bernie doing poorly with blacks. Surprisingly, Bernie does better with every other race than he does with whites. His support from Islanders is only marginally significant. In some of my models I have noticed very strong support from multi-racial people (as in this model).
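The figures reported with each table are adjusted R^2, which discounts the ordinary R^2 for the number of predictors in the model. Below is a sketch of the standard formula. The unadjusted R^2 of 0.59 is a made-up input, and using all 1822 counties as the sample size is an assumption (models that include polling drop Nebraska, so their count is smaller).

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Standard adjusted R^2: penalizes R^2 for each extra predictor."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical unadjusted R^2 of 0.59 with 7 race variables and 1822 counties.
print(adjusted_r2(0.59, 1822, 7))
```

With this many counties the penalty for a handful of extra variables is tiny, which is part of the appeal of working at the county level.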

Race and Polling Model

This model includes both race and the Bernie Poll Average variable.

Adjusted R^2: 0.701

Coefficients (dependent variable: pBernie)
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Variable      B           Std. Error   Beta      t          Sig.
(Constant)    -.020       .020                   -1.038     .300
pblack        -.508       .015         -.555     -34.531    .000
phispanic     .186        .018         .179      10.411     .000
pindian       .312        .090         .057      3.480      .001
pasian        .790        .131         .083      6.055      .000
pisland       .510        1.344        .005      .380       .704
pother        .157        .082         .032      1.907      .057
pmulti        .552        .136         .069      4.061      .000
bpoll         .734        .033         .359      21.985     .000

The Bernie Poll Average is the second most important predictor (going by the standardized Beta), after percent Black. Islander is washed out of the model (it is no longer significant), and Other race is only marginally significant.

The Big Model

This includes pretty much everything.

Adjusted R^2: 0.788

Coefficients (dependent variable: pBernie)
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Variable      B           Std. Error   Beta      t          Sig.
(Constant)    -.861       .042                   -20.549    .000
pblack        -.489       .013         -.534     -36.419    .000
phispanic     .153        .016         .147      9.738      .000
pindian       .309        .076         .057      4.065      .000
pother        .245        .070         .049      3.495      .000
pmulti        .299        .116         .038      2.580      .010
bpoll         .177        .038         .086      4.624      .000
gpSanders     2.200       .125         .322      17.639     .000
Income        -9.636E-7   .000         -.073     -4.032     .000
m3044         .462        .095         .060      4.851      .000
m4564         1.066       .146         .123      7.324      .000
f1829         1.072       .113         .155      9.477      .000
somecollege   .224        .038         .070      5.854      .000
college       .520        .052         .184      10.083     .000
bike          .634        .254         .030      2.492      .013

Race continues to play the most critical role. At the county level, the demographics are more significant than the state-wide polling average. This makes sense in that the two measure different geographical areas (and one might expect counties to be very different from the state as a whole). It makes less sense when you consider that the Google Search Trend variable (gpSanders) is also measured at the state level, yet it is more significant than the polling data.
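One rough way to compare the predictors is to sort them by the absolute size of their standardized Beta. The sketch below does that with the Beta column copied from the Big Model table; the helper name is made up, and ranking by Beta is only one of several ways to judge importance (the t values are another).

```python
# Standardized Betas copied from the Big Model coefficients table.
BIG_MODEL_BETAS = {
    "pblack": -0.534, "phispanic": 0.147, "pindian": 0.057, "pother": 0.049,
    "pmulti": 0.038, "bpoll": 0.086, "gpSanders": 0.322, "Income": -0.073,
    "m3044": 0.060, "m4564": 0.123, "f1829": 0.155, "somecollege": 0.070,
    "college": 0.184, "bike": 0.030,
}

def rank_by_abs_beta(betas):
    """Variable names sorted from largest to smallest absolute Beta."""
    return sorted(betas, key=lambda name: abs(betas[name]), reverse=True)

ranking = rank_by_abs_beta(BIG_MODEL_BETAS)
print(ranking[:3])  # ['pblack', 'gpSanders', 'college']
```

By this measure percent Black comes first and the Google Trends share second, while the polling average (bpoll) lands well down the list.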

Sanders' support increases in counties with lower median household income. I generally found this to be true, though it is not a very strong effect.

Sanders does well in counties with more men aged 30-44 (though not in all models), men aged 45-64 (this was found in most models), and women aged 18-29 (also found in most models). I'm surprised that he does better with women 18-29 than with men of that age group. It is possible that this is a proxy variable for the presence of colleges and universities (as more women attend them than men, and young men might also be found in army bases and prisons, where the prisoners will often or always be unable to vote).

Sanders does better with counties with people with a partial or full college education. He does poorly with high school grads, below high school, and above college. This is seen in all my models.

Sanders does better with counties with people who cycle commute in most of my models. This is hilarious. It might be a proxy for liberals or college students.

This final model leaves a lot of the variation in the county results (21.2%) unexplained. You can apply the model to every county in a state and try to predict the state result, but since it explains only 78.8% of the county-level variation, the prediction could easily be off.

Conclusion

I've spent a lot of time on this model and, while I have learned a lot, I do not feel that it is accurate enough to make meaningful predictions. I think it may be much easier to predict a state-level outcome, where polling should be a much stronger predictor than it is at the county level.

I would be very happy to hear from other people regarding their own models, possible data sources and variables that I should add, and any other advice that you might have.

Please email me at aaron@campusactivism.org.
