Monday, 22 August 2016

Producing a scatter plot with Pandas and Seaborn

I've written before about Python and its sublime extension Numpy, which vastly extends its capabilities for handling chunky amounts of data. Numpy exists as part of a wider collection of software, often referred to as "the Scipy stack". As well as Scipy itself, two other components of this collection which sit on top of Numpy are Pandas and Matplotlib.

Pandas is a library of routines for manipulating PANel DisplAyS: two dimensional grids of data, called DataFrames, in which, like in a spreadsheet, data is arranged in to rows and columns. Pandas allows the creation and manipluation of these: reshaping, filtering and querying them.

Matplotlib is a powerful set of routines for producing all manner of plots and other visualisations from the data. 

The problem

As an exercise in learning Pandas, I decided to analyse the results of the 2015 UK General Election. The reason for choosing this rather dry subject was that it was:
  • A moderately complex data set
  • In the public domain
  • I already understood it to an extent
  • Any conclusions I reached could be easily verified 
I wanted to see if there was any correlation between how "safe" a seat was and voter turnout, on the assumption that in a really safe seat turnout would be lower, on the basis that people happy with the status quo wouldn't bother voting because they knew they didn't have to and those unhappy with the status quo wouldn't bother voting because they felt they couldn't change anything.

First get your data


So the first thing to do was to find and download the data set. Fortunately in the UK such data is publicly available from the Electoral Commission, the government body responsible for the adminstration of elections in the UK which also publishes scrupulously accurate data.

This was available as an Excel spreadsheet. This is somewhat irritating as Excel is, of course, a proprietary standard and the UK government committed to using open standards (e.g. ODS) some years ago. But never mind.

Now clean it up


The Electoral Commission data is excellent, but the Excel spreadsheet needs a little tidying up before importation. So I exported each of its different worksheets as CSV format files. (Pandas can read Excel, but this seemed cleaner). Once the CSVs were exported and things like columns headings in the first row of each were correct and any totals at the bottom of the files were removed, then the importation was simple enough.

(Note here that when importing the files it was also necessary to specify the encoding scheme used on the file, in this case ISO-8859-1). Soon these were all importing cleanly. There was, however, one important piece of information missing. The files refer to constituency and region IDs. These are administered by the Office of National Statistics, another government body which publishes a huge amount of excellent data for public consumption. So I downloaded this, cleaned it up as before, and it is now available for import as a CSV.

The code


import numpy as np
import pandas as pd
import seaborn as sns
from  matplotlib import pyplot
from matplotlib.ticker import FuncFormatter

Candidates = pd.read_csv('Candidates.csv', encoding = "ISO-8859-1")
Constituencies = pd.read_csv('Constituencies.csv', encoding = "ISO-8859-1")
Results = pd.read_csv('Results.csv', encoding = "ISO-8859-1")
Referendum = pd.read_csv('Referendum.csv', encoding = "ISO-8859-1")
Regions = pd.read_csv('Regions.csv', encoding = "ISO-8859-1")

# Identify those parties which scored more than `threshold` votes in any 
# constituencies to weed out the frivolous ones

# The CSV file has commas in the Total and Electorate fields. Convert from str 
# to int

# Tidy things up a bit. Give some of the columns better names. Merge the
# Labour and Labour & Co-operative figures. Count the speaker as a Conservative
# rather than being a party on his own. 

Results = Results.rename(columns={'Press Association ID Number': 'PAID', 
                  'Constituency Name': 'Constituency', 
                  'Constituency ID': 'ID', 
                  'Constituency Type': 'Type', 
                  'Election Year': 'Year',
                  ' Total number of valid votes counted ': 'Total',
                  'C': 'Con'})
                  
Results['Lab'] = Results.fillna(0)['Lab'] + Results.fillna(0)['Lab Co-op']
Results['Con'] = Results.fillna(0)['Con'] + Results.fillna(0)['Speaker']
Results = Results.drop(['Year', 'Lab Co-op', 'Speaker'], axis=1)
Results.Total = Results.Total.str.replace(',', '').astype(int)
Results.Electorate = Results.Electorate.str.replace(',', '').astype(int)
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].fillna(0)

q=pd.DataFrame(np.sort(RawResults.values)[:,-2:], columns=['2nd','1st'])
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = q['1st'].astype(int)
Results['Majority'] = (q['1st']-q['2nd']).astype(int)
Results['Share'] = Results.Votes/Results.Total
Results['Turnout'] = Results.Total/Results.Electorate
Results['Safeness'] = Results.Majority/Results.Electorate

Results = Results.drop(RawResults, axis=1)
Results = Results.set_index('PAID')

Seats = pd.DataFrame(Results.groupby('Party').ID.nunique())
Seats = Seats.rename(columns={'ID':'Seats'})

# So to keep things uncluttered, only consider parties with > 10 seats 

MinSeats = 10
Parties = Seats[Seats.Seats > MinSeats].index

p=Results[Results.Party.isin(Parties)]
Apathy = p[['Turnout', 'Safeness', 'Party']]

sns.set_palette(['red', 'blue', 'yellow', 
                 'orange', 'purple', 'cyan', 
                 'magenta', 'green', 'brown'])

fg = sns.FacetGrid(Apathy, hue='Party')
fg.map(pyplot.scatter, "Turnout", "Safeness", s=2).add_legend()

fg.ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
fg.ax.xaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
fg.ax.set_title("2015 UK General Election\n")
fg.ax.xaxis.set_ticks(np.arange(0.4, 1.0, 0.1))
fg.ax.yaxis.set_ticks(np.arange(0.0, 0.6, 0.1))
fg.ax.set_xlabel('Turnout')
fg.ax.set_ylabel('Safeness')
fg.savefig('Apathy.png', figsize=(16,12), dpi=1200)

So the first thing I did is import various libraries - e.g. pandas, matplotlib - in to Python so I could make use of them later. Having imported pandas, I then imported the various CSV files in to their own DataFrames.

Next I tidied the frames up a little. So I altered some of the column names, aggregated the "Labour" and "Labour and Co-operative" votes as just Labour votes. (The Electoral Commission distinguishes between the two, but they sit as a single bloc in parliament). I also allocated the speaker (who according to the Electoral Commission is his own party) to the Conservative Party.

Because in the CSV file the numbers are formatted with a comma as a thousands separator, Pandas imported the Total and Electorate fields as strings, so I then had to remove the commas and convert the number from string (str) format to integer (int).

Now Pandas starts to come in to its own.

Munging the data

The raw election results are contained in 137 columns of the Results DataFrame, one column for each political party which fielded any candidates in the election. The first of these was "Unnamed: 9" and the last "Zeb". So to make handling them easier, I sliced them as  RawResults.

Next, for each of the 650 electoral constituencies (one per row) I found the largest and second largest votes (i.e. the winner and runner-up) from which I could calculate the turnout and the size of the majority. This then allowed me to calculate, amongst other things the "safeness" of the seat. This is a measure I have made up which is simply the majority divided by the total electorate which in theory gives a number between 0.0 for an extremely marginal seat to 1.0 for a seat where everybody voted and they all voted for one candidate - e.g. Pyongyang North.

The next step was to find the name (column index) of the winning party, using the idxmax() call. (Note that axis=1 means to do this row by row, rather than column by column which would be axis=0).

Then to simplify matters, having done with them, I dropped the various RawResults from the Results DataFrame and then re-indexed the DataFrame using the Press Association's IDfor each constituency.This isn't really necessary, but I prefer things tidy.

Next, I worked out the number of seats won by each party by grouping the results by party and then counting the number of unique IDs. I then filtered the results, excluding all the parties who won 10 seats or fewer, and then produced a DataFrame, Apathy, which had a row for each constituency and columns giving the turnout, "safeness" of the seat and the winning party.

Producing the scatter plot

Matplotlib has very powerful plotting capabilities, but there is a separate set of libraries, called Seaborn, which work with Pandas (and Scipy) using Matplotlib and the underlying Numpy. So the "stack" looks something like this.

Because Seaborn builds on the underlying Matplotlib, we can also use many of Matplotlib's features to, for example, alter the formatting of the axes to percentages and set the range of ticks on the two axes.

Seaborn would, by default, provide a suitably tasteful palette of colours, but I wanted to make sure that the colour used reflected those usually used by the parties - red for Labour etc., and so I set my own palette. (It contains some additional colours in case I wanted to increase the number of parties represented on the scatter plot).

This is what it produced.

which I think is quite beautiful.

Conclusions

The analysis doesn't come to any startling political conclusions: turnout is lower in Labour seats than Conservative ones. Both Labour and Conservative have a roughly equivalent distribution of safe and marginal seats. Turnout was typically higher in SNP seats and there are few safe ones, presumably both of these being because of the very hard-fought nature of the 2015 election in Scotland and the fact that the incumbent Labour MPs were displaced by an SNP insurgence. So no surprises there, then.

But, from a technical perspective it's quite remarkable. In a little over a screen of code we have performed a hgihly complex analysis and produced a publication-quality visualisation. The really remarkable thing, though, is that there are no nested loops or similar which we would normally require to iterate over the rows and columns extracting counts, totals, maxima etc. In any other language - including "ordinary" Python - the code would have been significantly more complex, longer and much more difficult to debug. In something like C the job would literally have taken several days worth of programming.

Finally, because Pandas uses the underlying Numpy libraries, which are written in native FORTRAN and C, to do all the heavy lifting, the code is blindingly fast.


No comments:

Post a Comment