Saturday, 12 March 2016

How Race may Shape the Race

Next in our 3-part series of presidential primaries analysed by demographics, we'll look at race. Again, these are based on the 2012 data linked to in this post despite the fact that 2014 data is available here. This is partly so we can look at the voting data from the 2012 presidential campaign rather than the 2014 mid-terms, but mostly because I've set up my spreadsheets with the 2012 data and it'd take numerous hours to unpick the restructured lists, add in the 2014 data and redo all the graphs. Basically I'm lazy.

The data we're using lets us look at the racial profile of registered voters as well as actual voters. Race is broken down into White, Black, Asian and Hispanic. For Hispanic data, we are using the Hispanic (not White) numbers to prevent double-counting individuals. We are ignoring the other racial categories, including various mixed race categories, for simplicity. This analysis is too blunt by far for such nuanced factors to be reliably included.

There is a lot of discussion of the Republican data here, and no really useful conclusion. The Democratic summary below is more succinct and has some interesting data. And then there's a TL;DR summary at the bottom if you can't even stomach that.

Registered Voter Race in the Republican Primaries

I'd like to start with the Hispanic data for registered voters, because it raises some methodological questions. Here is the graph of Trump, Cruz and Rubio support against Hispanic voter registration:

Line of best fit (Trump): y = -0.25x + 43.89
Line of best fit (Cruz): y = 0.28x + 33.20
Line of best fit (Rubio): y = -0.06x + 22.96

It is worth noting to begin with that Ted Cruz, as a sitting senator for Texas, has a home-state advantage in the Texan primary. Rubio has the same kind of advantage in Florida, where he is a senator, and Kasich (not considered below) in Ohio, where he is governor, though both of these contests are this week and so are not yet included in the data. Trump has no home-state advantage of this kind anywhere, for obvious reasons.

This is relevant because Texas is the most outlying data point, with 24.67% of registered voters in 2012 being recorded as Hispanic. Because this point sits at the extreme of the graph, it has a lot of leverage on the line of best fit. If we ignore Texas as an outlier, the graph looks like this:

Line of best fit (Trump): y = 0.54x + 41.88
Line of best fit (Cruz): y = -0.62x + 35.50
Line of best fit (Rubio): y = 0.10x + 22.56

This inverts every trend. With Cruz's Texan support gone his positive 0.28x slope drops 0.9 to -0.62x, while Trump gains 0.79 and Rubio the remaining 0.16 (these changes sum to +0.05 due to rounding), reversing both of their negative trends. Is this more accurate for predicting the remaining states?
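The slope changes quoted above can be checked straight from the two sets of equations. This is only a quick sketch; the variable names are mine and the slopes are the rounded ones printed under the graphs, not the underlying spreadsheet values.

```python
# Slopes read off the two sets of lines of best fit quoted above
# (Hispanic share of registered voters, Republican primaries).
with_texas = {"Trump": -0.25, "Cruz": 0.28, "Rubio": -0.06}
without_texas = {"Trump": 0.54, "Cruz": -0.62, "Rubio": 0.10}

# Change in slope for each candidate once Texas is dropped.
changes = {
    name: round(without_texas[name] - with_texas[name], 2)
    for name in with_texas
}
total = round(sum(changes.values()), 2)

print(changes)  # {'Trump': 0.79, 'Cruz': -0.9, 'Rubio': 0.16}
print(total)    # 0.05
```

If the coefficients were not rounded, the three changes would be expected to cancel out, since the three shares always add up to the same total.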

Part of the problem is data volatility. If we remove Nevada, the other outlier, the lines shift dramatically again:
Line of best fit (Trump): y = 0.34x + 42.32
Line of best fit (Cruz): y = 0.09x + 33.92
Line of best fit (Rubio): y = -0.34x + 23.53

However, not even the remaining data points are helpful here. Part of the problem is what I call agitated data. The data is spread so far from a nice linear progression that each point sits a long way from the line and thus exerts a considerable angular pull. Once the stabilising influence of Texas and Nevada is removed, there's something of a free-for-all in the data pool, and removing a single data point can drastically shift the line of best fit:

Texas and Nevada certainly tie this data down by exerting a disproportionate influence on it, but that raises the issue of how accurate these two points are. A consistent line of best fit is not particularly meaningful if it is consistently wrong:

Lines of best fit missing various states and (darker) with all data points
Plotted without (left) and with (right) Texas and Nevada included
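For anyone wanting to try the "delete one state and refit" exercise themselves, here is a rough sketch of how it could be done. The data points are HYPOTHETICAL, invented purely to show the effect of one far-out point; they are not the state figures behind the graphs, and the function names are my own.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out(points):
    """Refit the line once per point, each time with that point removed."""
    return {
        name: fit_line([p for other, p in points.items() if other != name])
        for name in points
    }


# HYPOTHETICAL (x = % of voters in a group, y = % support) pairs,
# invented for illustration; these are not the post's state data.
states = {
    "A": (1, 40), "B": (2, 42), "C": (3, 44), "D": (4, 46),
    "Outlier": (24, 30),
}

full_slope, _ = fit_line(list(states.values()))
print(f"all points: slope {full_slope:+.2f}")
for name, (slope, _) in leave_one_out(states).items():
    print(f"without {name}: slope {slope:+.2f}")
```

In this made-up set the four clustered points rise steadily, yet the single point far along the x-axis is enough to drag the overall slope negative, which is the kind of leverage Texas and Nevada have in the Hispanic graphs.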

Because outliers will vary from graph to graph (a state with an unusually large Hispanic population may have an Asian population close to the national median, for example) and excluding them raises issues of where we draw the cutoff, they will remain in the data set. The exception to this rule will be Texas, because of Ted Cruz's unique standing there. Where this has only a small impact on the data, Texas will be included in the graph and the linear equation for the chart without Texas will be provided in parentheses. Where there is a more dramatic impact (determined subjectively) a second graph will be provided.

Registered Voter Race in the Republican Primaries (For Real This Time)

So, to the Hispanic data:

Line of best fit (Trump): y = -0.25x + 43.89
Line of best fit (Cruz): y = 0.28x + 33.20
Line of best fit (Rubio): y = -0.06x + 22.96

Firstly, Rubio does not do anywhere close to as well with the Hispanic vote as some commentators had suggested. Because his parents immigrated from Cuba, many expected Rubio to perform well with this minority so often maligned by the Republican Party. In fact, we see a slight negative trend as Hispanic registration increases. Initially the corresponding Trump trend makes sense, after his anti-Hispanic comments early in the campaign, which leaves Cruz to claim the Hispanic lead.

Why? Texas. This is one of the graphs where Texas, as a major outlier in the Hispanic population bell curve, really shakes up the data. If we accept that Cruz did well in Texas due to name recognition and remove this state:

Line of best fit (Trump): y = 0.54x + 41.88
Line of best fit (Cruz): y = -0.62x + 35.5
Line of best fit (Rubio): y = 0.10x + 22.56

There, that looks... wait, what? Rubio has shifted from mildly negative to mildly positive. Fair enough. Take away Texas and Cruz's support among strong Hispanic states falls. Makes sense. But Trump with a greater than 1:2 slope in the positive?

It is important to remember that Republicans don't get a huge slice of the Hispanic vote, and given the primaries have an even lower turn-out, this is probably not the result of Hispanic voters supporting Trump. What it might be, as at least one broadcaster has suggested, is the swell of anti-Hispanic sentiment from non-Hispanics in states with a high Hispanic population.

In fact, as the White population increases at the expense of racial diversity and interracial tensions presumably decrease, Trump actually loses support:
 Line of best fit (Trump): y = -0.09x + 50.04        (Excluding Texas: y = -0.15x + 55.51)
Line of best fit (Cruz): y = -0.02x + 36.23        (Excluding Texas: y = 0.03x + 30.99)
Line of best fit (Rubio): y = 0.12x + 13.81        (Excluding Texas: y = 0.12x + 13.94)

The votes lost by Trump flow on in full to Rubio, which is interesting in that he is the most moderate candidate and the racially diverse option. I'm not entirely sure why Cruz doesn't benefit more from Trump's losses, but I have heard it suggested that Cruz would be the Trump of the Republican field if Trump were not the Trump of the Republican field. Perhaps that's relevant here.

What is also interesting is that this data is very stable. Whether by coincidence or an actual pattern, the only outlier (Hawaii) places its datapoints very close to where the linear equations would lead without them:

Line of best fit (Trump): y = -0.08x + 49.22
Line of best fit (Cruz): y = 0.01x + 33.03
Line of best fit (Rubio): y = 0.08x + 16.60

Even without the Hawaiian data to stabilise the lines, the data is reasonably consistent, with Trump trending neutral to negatively, Rubio neutral to positively and Cruz with slight trends positive or negative approximating a neutral polling.

While this data finally looks stable enough to hazard a prediction from, it has the problem that even if 120% of registered voters were White - a mathematical impossibility - Trump would still win the state, followed by Cruz, then Rubio.
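That claim is easy to check by plugging every percentage into the three all-states equations for the White data. A minimal sketch, assuming the rounded coefficients printed under the graph (the function name is mine):

```python
# Lines of best fit for White share of registered voters (all states),
# as (slope, intercept) pairs copied from the equations above.
lines = {
    "Trump": (-0.09, 50.04),
    "Cruz": (-0.02, 36.23),
    "Rubio": (0.12, 13.81),
}


def ranking(x):
    """Candidates ordered by predicted vote share at x% White voters."""
    predicted = {name: m * x + c for name, (m, c) in lines.items()}
    return sorted(predicted, key=predicted.get, reverse=True)


# The finishing order never changes anywhere from 0% up to the impossible 120%.
orders = {tuple(ranking(x)) for x in range(0, 121)}
print(orders)  # {('Trump', 'Cruz', 'Rubio')}
```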

A look at the Black registered voters results in a similar situation:

Line of best fit (Trump): y = 0.20x + 40.49        (Excluding Texas: y = 0.22x + 40.95)
Line of best fit (Cruz): y = -0.12x + 35.70        (Excluding Texas: y = -0.13x + 35.19)
Line of best fit (Rubio): y = -0.11x + 23.98        (Excluding Texas: y = -0.10x + 24.06)

The data is scattered evenly enough for there to be no certain outliers, so the lines should be pretty stable. Trump does well, again probably not so much off the Black vote as off non-Black voters who regularly conflict with the Black community, and Rubio and Cruz both suffer as a result. Whether the Black population forms 0% of registered voters (we're looking at you Idaho) or 100%, Trump will finish first, increasingly ahead of Cruz and then Rubio. Of course, if the explanation above is correct, that the increase in Trump support comes from non-Black voters, this trend cannot hold: by 100%, Trump should have the bulk of his vote eroded. This is, at best, what may be termed a limited-range trend: the pattern holds reasonably reliably up to a point. Just as Newtonian models of motion work perfectly until speeds approach lightspeed, or physics tends to break down at "extremes" like black holes, very low temperatures or the early universe, the rule isn't broken - it just applies in particular situations.

So, again we have a stable graph with poor predictive power (always giving the state to Trump, which we know is not always the case). This just notes the general observation that Trump is winning many states, with Cruz in second place and Rubio in third.

So that leaves the Asian data:
Line of best fit (Trump): y = 0.13x + 42.45        (Excluding Texas: 0.13x + 43.05)
Line of best fit (Cruz): y = 0.00x + 34.28        (Excluding Texas: 0.01x + 33.59)
Line of best fit (Rubio): y = -0.12x + 23.14        (Excluding Texas: -0.12x + 23.27)

Which gives us very volatile data anchored by Hawaii out on the extreme of the graph. Just out of interest, here's the graph without Hawaii:

Line of best fit (Trump): y = 0.28x + 42.22
Line of best fit (Cruz): y = -2.39x + 38.01
Line of best fit (Rubio): y = 2.09x + 19.69

Cruz nosedives immediately into the ground, with Rubio winning any state with more than around 12% of the registered voters being of Asian descent. Either the Asian population of the United States is the most influential political demographic discovered, or the data is too chaotic for a meaningful linear equation. Hawaii suggests the latter. So do I:
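The "around 12%" figure is just the point where Rubio's line crosses Trump's in the no-Hawaii equations. A small sketch of that calculation, using the rounded coefficients above (the helper name is my own):

```python
def crossover(line_a, line_b):
    """x at which two lines y = m*x + c meet; lines are (m, c) pairs."""
    (m_a, c_a), (m_b, c_b) = line_a, line_b
    return (c_b - c_a) / (m_a - m_b)


# Asian share of registered voters, Hawaii excluded (equations above).
trump = (0.28, 42.22)
cruz = (-2.39, 38.01)
rubio = (2.09, 19.69)

rubio_passes_trump = crossover(rubio, trump)
cruz_hits_zero = crossover(cruz, (0.0, 0.0))

print(round(rubio_passes_trump, 1))  # 12.4
print(round(cruz_hits_zero, 1))      # 15.9
```

The second number is where Cruz's line reaches 0% of the vote, which gives a sense of how quickly this particular fit stops describing anything real.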

The variation in predicted Trump support after deleting one state or another certainly is eccentric, with the extremes for both him and Rubio coming from the exclusion of Minnesota and Nevada. However, the trend for Rubio is reasonably solid, with him not only leading but taking over 50% of the vote before the Asian population reaches 20% of the registered voters in all scenarios. More impressively, Cruz is consistently buried head-first in the ground by between 10% and 20%.

Perhaps, then, this shows a limited-range trend as explained previously, which may only hold for populations where the Asian registered voters do not exceed around 5% or so. Beyond that other factors come into play or become exaggerated and alter the data. Though maybe not, given the minimal impact such a small population is likely to have on the state as a whole.

And even if that were the case, between 0% and 5% we, again, have Trump safely in first place, with Cruz normally outperforming Rubio. In other words, this method summarises the votes so far rather than predicting the outcome.

Another issue became evident in constructing that last graph. Although the nature of the data is such that Trump + Cruz + Rubio = 100% of the 3-candidate result, there is no accounting for negative numbers. In this extreme case where Cruz bottoms out quickly, we get absurdities like those shown for Minnesota: after around 17% of the registered voters are of Asian descent, both Trump and Rubio win over 50% of the vote at the same time. This is because although Trump + Rubio > 100%, the fact that Cruz is on a negative number of votes ensures that Trump + Cruz + Rubio = 100%.

In short, any form of these graphs where a candidate drops below 0% of the vote is inherently broken.
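The same breakage can be seen in numbers. The Minnesota-excluded equations aren't printed here, so this sketch assumes the no-Hawaii Asian lines instead; the exact shares differ from the Minnesota case, but the mechanism is the same.

```python
# Asian share of registered voters, Hawaii excluded, as (slope, intercept).
lines = {
    "Trump": (0.28, 42.22),
    "Cruz": (-2.39, 38.01),
    "Rubio": (2.09, 19.69),
}

x = 17  # % of registered voters of Asian descent
shares = {name: round(m * x + c, 2) for name, (m, c) in lines.items()}

print(shares)  # {'Trump': 46.98, 'Cruz': -2.62, 'Rubio': 55.22}
print(round(shares["Trump"] + shares["Rubio"], 2))  # 102.2
print(round(sum(shares.values()), 2))               # 99.58
```

Trump and Rubio together pass 100% only because Cruz has gone negative; the three still add up to roughly 100 (not exactly, because the printed coefficients are rounded).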

2012 Voter Race in the Republican Primaries

Here is the data for the Republican primary race, but based on voters at presidential elections rather than registered voters. Whether this is closer to or further from the demographic attending primaries can only be speculated upon at this stage. However, the trends for White, Black and Asian voters are almost identical, so the distinction is minor at best in these cases:

 Line of best fit (Trump): y = -0.11x + 51.68        (Excluding Texas: -0.17x + 56.66)
Line of best fit (Cruz): y = -0.00x + 34.53        (Excluding Texas: 0.05x + 29.68)
Line of best fit (Rubio): y = 0.11x + 13.84        (Excluding Texas: 0.11x + 14.04)

Line of best fit (Trump): y = 0.20x + 40.47        (Excluding Texas: 0.21x + 40.90)
Line of best fit (Cruz): y = -0.12x + 35.73        (Excluding Texas: -0.13x + 35.26)
Line of best fit (Rubio): y = -0.10x + 23.99        (Excluding Texas: -0.10x + 24.05)

Line of best fit (Trump): y = 0.13x + 42.47        (Excluding Texas: 0.13x + 43.07)
Line of best fit (Cruz): y = 0.01x + 34.27        (Excluding Texas: 0.02x + 33.57)
Line of best fit (Rubio): y = -0.12x + 23.13        (Excluding Texas: -0.12x + 23.26)

Both the White and Asian data also respond the same way they did previously when the Hawaiian outlier was removed:

Line of best fit (Trump): y = -0.12x + 16.65
Line of best fit (Cruz): y = 0.05x + 52.07
Line of best fit (Rubio): y = 0.08x + 30.11

Line of best fit (Trump): y = 0.22x + 42.34
Line of best fit (Cruz): y = -2.69x + 38.30
Line of best fit (Rubio): y = 2.46x + 19.29

So all of the above explanations can be directly applied here: Trump doing well on high minority participation, possibly as a result of non-minority attitudes in those states, and poorer in states with a large White population; the Asian data predicting a quick game over for Cruz, then messing up the data beyond that point with negative values; and all models predicting a Trump victory in all states.

The only slightly different plot comes from the Hispanic data:

Line of best fit (Trump): y = -0.18x + 43.57
Line of best fit (Cruz): y = 0.19x + 33.61
Line of best fit (Rubio): y = -0.03x + 22.96

And even this is largely the same, with Rubio on a slight negative slope and Cruz negating Trump thanks to a Texan skew. The only real difference is that the intersection of Cruz and Trump occurs at ~25% rather than ~20%. With the Texan data removed, the data behaves just like the registered voter data: strong Trump gain, Rubio slightly positive and Cruz heavily negative to provide these gains.

Line of best fit (Trump): y = 0.72x + 41.48
Line of best fit (Cruz): y = -0.86x + 36.06
Line of best fit (Rubio): y = 0.17x + 22.40

In essence this data is pretty much the same as the first batch. The same issues and (lack of) conclusions follow.

Registered Voter Race and 2012 Votes in the Democrat Primaries

The Democrat race is far simpler than the Republican race for many reasons. For one thing, the fact that there are only two major candidates means that any loss for one candidate is a gain for the other and vice versa. In fact, if the line of best fit for Clinton is given by y = mx + c, then the fit for Sanders will be y = -mx + (100 - c). Furthermore, just as with the Republican charts, there is little significant distinction between the data for registrations and voter turnout.
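As a quick illustration of that mirror-image relationship, here is a sketch using Clinton's line for the Black share of registered voters, which appears further down this section. The helper name is mine.

```python
def mirror(m, c):
    """Opponent's line in a two-way race where the shares sum to 100."""
    return -m, 100 - c


# Clinton's line for Black share of registered voters: y = 1.41x + 36.43
m, c = mirror(1.41, 36.43)
print(f"Sanders: y = {m:.2f}x + {c:.2f}")  # Sanders: y = -1.41x + 63.57
```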

As for possible outlier states, none stand out visually in the data spread. However, Sanders is a senator from Vermont and Clinton used to be a senator for New York. Exactly how a former senator's state's backing compares to that of a recent senator is uncertain, but since New York has not held its primary yet we only have to exclude Vermont (which, incidentally, was a huge win for Sanders who gained 86% of the vote and all 16 delegates). Vermont's exclusion does not drastically change any of the graphs, but the line of best fit is recalculated in parentheses for its exclusion.

There has been a lot of talk about how well Clinton has been doing among the Black vote - much to the dismay of Sanders supporters who often cite that Sanders was arrested several times in the 60s for his civil rights activism. Here is the data for that much discussed Black vote:

 Line of best fit (Clinton): y = 1.41x + 36.43        (Excluding Vermont: 1.29x + 39.27)
Line of best fit (Sanders): y = -1.41x + 63.57        (Excluding Vermont: -1.29x + 60.73)

Line of best fit (Clinton): y = 1.36x + 36.34        (Excluding Vermont: 1.25x + 39.17)
Line of best fit (Sanders): y = -1.36x + 63.66        (Excluding Vermont: -1.25x + 60.83)

There are several nice things about these graphs. Firstly, there is a solid trend. You can see just looking at the scatter of plots that there is a strong correlation between Black voters and support for Clinton. But also, the lines actually intersect - in both graphs where Black people make up around 10% or 11% of the population. Which means there is actual potential for meaningful predictions here.

The flip-side to this trend is evident in the White vote:

Line of best fit (Clinton): y = -1.15x + 145.45        (Excluding Vermont: -1.01x + 136.04)
Line of best fit (Sanders): y = 1.15x – 45.45        (Excluding Vermont: 1.01x – 36.04)

Line of best fit (Clinton): y = -1.18x + 147.66        (Excluding Vermont: -1.05x + 138.77)
Line of best fit (Sanders): y = 1.18x – 47.66        (Excluding Vermont: 1.05x – 38.77)

I cannot describe the joy (and following serious reevaluation of my sanity) that I gained from these graphs. After examining the Republican data to find no meaningful data I was expecting to have to write yet another post about how my methods have once again yielded no interesting data. This series of graphs shows not only lines that provide more than a blanket "player 1 wins" statement on the contest, but discrete data with far stronger correlations. The lines of best fit are mapping actual trends, not just a mathematically random laser shot through a particulate gas!

I can only speculate on why race plays a more significant role in the Democratic primaries, but the obvious hypothesis has to be the fact that people of colour tend to support Democratic candidates, and are therefore better represented in Democratic primaries. Instead of a vague and possibly imagined causal chain for the Republicans (racial diversity leads to racial tension leads to Republican voting patterns), the correlation is direct. Black people, for whatever reason, support Hillary Clinton.

The Asian and Hispanic graphs are less informative, as the data points are scattered further from any assumed trend, as with the Republican data.

 Line of best fit (Clinton): y = -1.05x + 56.01        (Excluding Vermont: -1.46x + 58.77)
Line of best fit (Sanders): y = 1.05x + 43.99        (Excluding Vermont: 1.46x + 41.23)

Line of best fit (Clinton): y = -0.69x + 55.39        (Excluding Vermont: -1.25x + 58.33)
Line of best fit (Sanders): y = 0.69x + 44.61        (Excluding Vermont: 1.25x + 41.67)

Interestingly, Asian voters seem to slightly favor Sanders. However, predictions based on these lines are very inaccurate because of the very small Asian populations in the sampled states. Using the registered voter data, Sanders should lose states below roughly 6% Asian and win those over this threshold. The only state over this point, however, was Nevada, won by Clinton. And, of course, all 9 states won by Sanders were below this point. The voting record data is similarly off, with the % of voters being Asian for Sanders to win exceeding any state so far.
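Those thresholds are simply where Clinton's line drops through 50%. A rough sketch of the arithmetic, assuming the first pair of Asian equations above is the registered-voter graph and the second is the 2012 voting graph (the function name is mine):

```python
def fifty_percent_point(m, c):
    """x at which the line y = m*x + c crosses 50% in a two-way race."""
    return (50 - c) / m


# Clinton's lines for Asian share, all states (equations above).
registered = fifty_percent_point(-1.05, 56.01)
voted_2012 = fifty_percent_point(-0.69, 55.39)

print(round(registered, 1))  # 5.7
print(round(voted_2012, 1))  # 7.8
```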

However, this does provide some insight into Hawaii where the Asian population is so large (~42%) that it leaves the Black vote (2%-3%) too low to push Clinton over her Black vote threshold and the White vote (29%-30%) below the level needed by Sanders. Although it is unreliable to extrapolate so far from this data, I would suggest Sanders has an advantage in Hawaii.

 Line of best fit (Clinton): y = 0.07x + 54.04        (Excluding Vermont: -0.15x + 57.08)
Line of best fit (Sanders): y = -0.07x + 45.96        (Excluding Vermont: 0.15x + 42.92)

Line of best fit (Clinton): y = 0.04x + 54.18        (Excluding Vermont: -0.19x + 57.20)
Line of best fit (Sanders): y = -0.04x + 45.82        (Excluding Vermont: 0.19x + 42.80)

This data, however, is very unhelpful. The data is too scattered to produce a meaningful trend line, and its predictions are a very broad "Clinton wins everything" generalisation. The removal of Sanders's safe state of Vermont, counterintuitively, slides him from a slight negative to a slight positive trend in both graphs. This is because Vermont was a strong win in a low-Hispanic state, suggesting (perhaps falsely) Sanders performs better in non-Hispanic areas.


So the Republican primary data was not very useful at all. These are its "predictions":

This was calculated by taking each line of best fit (for each racial group, using both the registration data on the left and actual voting data from 2012 on the right), plugging in the corresponding population (e.g. the percentage of registered voters recorded as White) as the x value and seeing which candidate would win. In other words, consulting the graphs for the demographic of each state and seeing where the respective candidates placed.
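In code form, the lookup described above might be sketched like this. The equations are the all-states registered-voter lines from earlier in the post; the state make-up is HYPOTHETICAL, invented only to show the mechanics, and the function name is mine.

```python
# Registered-voter lines of best fit (all states), as (slope, intercept).
lines = {
    "Hispanic": {"Trump": (-0.25, 43.89), "Cruz": (0.28, 33.20), "Rubio": (-0.06, 22.96)},
    "White": {"Trump": (-0.09, 50.04), "Cruz": (-0.02, 36.23), "Rubio": (0.12, 13.81)},
    "Black": {"Trump": (0.20, 40.49), "Cruz": (-0.12, 35.70), "Rubio": (-0.11, 23.98)},
    "Asian": {"Trump": (0.13, 42.45), "Cruz": (0.00, 34.28), "Rubio": (-0.12, 23.14)},
}

# HYPOTHETICAL state make-up (% of registered voters), not a real state.
state = {"Hispanic": 30, "White": 55, "Black": 10, "Asian": 5}


def predicted_winner(group, share):
    """Candidate with the highest line at this group's share of voters."""
    scores = {name: m * share + c for name, (m, c) in lines[group].items()}
    return max(scores, key=scores.get)


predictions = {group: predicted_winner(group, share) for group, share in state.items()}
print(predictions)
# {'Hispanic': 'Cruz', 'White': 'Trump', 'Black': 'Trump', 'Asian': 'Trump'}
```

Each racial group gives its own answer, which is why the prediction tables have one column per group rather than a single verdict per state.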

How accurate is this? Well, I personally doubt Trump will win all of those states, since his record so far has been lower. But for an actual number, we can apply the same method to the primaries already passed:

So 70% accurate in most cases, and that's with the data it's optimised for. Future data is likely to fare even worse.

The same predictive method can be used on the more informative Democratic data:

With slightly better results for the Black and White vote, which were our most telling data sets:

This is approaching a useable system, but it's still far from perfect. Next I tried something more complex - I tried to aggregate the equations for all four racial groups in hopes that the average would be closer than the parts. My formula, for those intrigued, was as follows:

Taking the linear equation in the form y = mx + c for each racial group:

Where pop = the population.

This is best demonstrated by the absolutely fictional and in no way based on a real place "Ruerto Pico". Let's say Ruerto Pico is 50% Hispanic, 30% White, 15% Black and 5% Asian. Each has a linear equation for both the registration and electoral data for each candidate in both parties. Let's assume we're using the equation from the registration data for Trump in the Republican race.

Hispanic: y = -0.25x + 43.89
White: y = -0.09x + 50.04
Black: y = 0.20x + 40.49
Asian: y = 0.13x + 42.45

To combine the lines, we take 50% of the Hispanic equation, 30% of the White equation and so on:

y = (-0.25*50% + -0.09*30% + 0.20*15% + 0.13*5%) + (43.89*0.5 + 50.04*0.3 + 40.49*0.15 + 42.45*0.05)

y = -11.55% + 45.153

y = 33.6% of the vote.
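The Ruerto Pico arithmetic can be reproduced in a few lines. This sketch follows the worked numbers above as I read them (each slope multiplied by its group's percentage, each intercept weighted by its group's fraction); the variable names are my own and it is not lifted from the actual spreadsheet.

```python
# Trump's registered-voter lines, as (slope, intercept) per racial group.
trump_lines = {
    "Hispanic": (-0.25, 43.89),
    "White": (-0.09, 50.04),
    "Black": (0.20, 40.49),
    "Asian": (0.13, 42.45),
}

# The fictional Ruerto Pico: % of the population in each group.
population = {"Hispanic": 50, "White": 30, "Black": 15, "Asian": 5}

# Slope term: each slope times its group's percentage (-11.55 in the text).
slope_term = sum(trump_lines[g][0] * pop for g, pop in population.items())
# Intercept term: each intercept weighted by its group's fraction (45.153).
intercept_term = sum(trump_lines[g][1] * pop / 100 for g, pop in population.items())

estimate = slope_term + intercept_term
print(round(slope_term, 2), round(intercept_term, 3), round(estimate, 1))
# -11.55 45.153 33.6
```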

The problem, however, is best demonstrated by the Democratic race, where Clinton's c value for the White Vote is over 140, and with this vote being the vast majority, this system was consistently granting Clinton more than 100% of the vote in each state.

So we're stuck with the data provided so far. First off, let's just accept that there is no useful information to be taken from the Republican race here. Next, let's accept that California and Hawaii are special cases, having a large Asian population which hands both states to Sanders in the Asian column of our predictions:

As shown previously
We'll come back to those later. Finally let's accept that with these two exceptions, all of our useful data comes from the Black and White votes. These graphs were the only ones with evident trends just by looking at the data points, and their predictions were the only ones to break 80% accuracy on decided states.

So let's boil the entire Democratic race down to Black v White, by plotting candidate success against the Black:White ratio like this...

 Line of best fit: y = 0.87x + 0.39

 Line of best fit: y = 0.87x + 0.39

These two very similar trends can be applied to the undecided states to give us one half-decent chance at a prediction:

We can back-check this method on won states to find it still holds its (comparably) high accuracy:

And thus we can consolidate our predictions:

Yes, this technically exceeds the allowed quota of tossups, but hopefully future analysis will help fill these in.
Note that both California and Hawaii go to Sanders before we even need to consider the Asian vote.


The racial demographics provide no useful data for the Republican nomination.
The racial demographics provide some useful data on the Democrat nomination.
Relatively high Asian populations favor Sanders.
The real useful data is that as the Black : White ratio increases, so does support for Clinton.
Predictions as shown in the right-hand column of the final table.

1 comment:

  1. Between starting crunching the data for this post and its final publication, Cruz won Wyoming with 66% of the vote and Rubio won Washington DC with 37% of the vote. (Kasich -- written off in this analysis before we even started -- came a close second with 36%, his highest poll yet.)