Tuesday 30 August 2016

The joy of (data) sets

Background

I've recently been analysing the data from the 2015 UK General Election, which can be downloaded from the Electoral Commission's website.

One problem which interests me at the moment is the Boundary Commissions' proposal to reduce the number of MPs from 650 to 600 and to redraw constituency boundaries, redressing imbalances in constituency size which have grown up over the years.

First catch your data

Now although the data from the Electoral Commission divides constituencies into 12 regions, there is nothing in it which tells us which constituencies are adjacent (that is, which share any point on their border). To work this out, it's necessary to get hold of data on the shape and position of the 650 constituencies in the UK. This is available (along with a lot of other wonderful data) from the Ordnance Survey.

The data comes in the form of a shapefile: a standard format used throughout the world for encoding cartographic information. Fundamentally, a shapefile consists of a series of records, each pairing attribute data (a key and other fields) with shape data describing the geography of that record.

So the first thing to do was to check that the data we had downloaded was sensible, and the easiest way to do that is to use it to draw a map. In order to do this, I first converted the shapefile data to JSON format, a standard text-based format for information interchange. I did this by adapting a couple of scripts I found online to cope with some of the idiosyncrasies of the data.


"""
Convert shapefile to JSON

"""
import glob
import os
import shapefile
from json import dumps


files = glob.glob(r"Z:\Projects\OS Electoral Map Data\Data\GB\*.shp")

for i, f in enumerate(files):
    print("{}: Found <{}>".format(i+1, f))
    newname=os.path.split(f)[1].replace(".shp", ".json")
    if os.path.exists(newname):
        print("{} already exists.".format(newname))
    else:
        print("Creating {}".format(newname))
        
        reader = shapefile.Reader(f)
        fields = reader.fields[1:]
        field_names = [field[0] for field in fields]
    
        data = []
        for sr in reader.shapeRecords():
            raw = dict(zip(field_names, sr.record))
            for k in raw:
                if type(raw[k]) is bytes:
                    raw[k]=raw[k].decode('utf-8')
            geom = sr.shape.__geo_interface__
            data.append(dict(type="Feature", geometry=geom, properties=raw))
    
        # The with statement closes the file for us when the block ends
        with open(newname, "w") as geojson:
            geojson.write(dumps({"type": "FeatureCollection", "features": data}, indent=2) + "\n")

I then used the JSON to plot a choropleth map of the data, showing the "safeness" of each constituency. (This is an ongoing activity and there will be a further post about it.)


import pandas as pd
import geopandas as gp
import matplotlib.pyplot as plt


# Constituency boundaries (GeoJSON) and the election results (HDF5)
geofile = 'westminster_const_region.json'
datafile = r'Z:\Projects\Election 2015\results.h5'

results = pd.read_hdf(datafile, 'apathy')
geodata = gp.read_file(geofile)

# Join the geometry to the results on the ONS constituency code
chorodata = pd.merge(geodata, results, left_on='CODE', right_on='ID')

# Plot the choropleth, colouring each constituency by its "safeness"
ax = chorodata.plot(column='Safeness', cmap='Blues', figsize=(12,8))
fig = ax.get_figure()
plt.xticks(rotation='vertical')
fig.savefig('uk.png', dpi=1200)
plt.show()

This produced the following output, a choropleth map of the constituencies, which looks about right.

The shape of politics in the UK


# -*- coding: utf-8 -*-
"""
Program to find adjacent electoral constituencies based on the OS
shapefile data, more as an exercise in learning about shapefiles
than anything in particular.

"""

import shapefile
import itertools
import pandas as pd

class Constituency:
    
    def __init__(self, name, onsid, geo):
        self.name = name
        self.onsid = onsid
        self.shape = geo
        
    def __repr__(self):
        return("{} ({}): {} points.".format(self.name, self.onsid, len(self.shape)))

    def IsAdjacent(self, other):
        """
        Tests two constituencies to see if any of the points
        in their shape coincide. As the shape structures are 
        lists of lists of floats, convert them to lists of tuples
        then to sets. This makes it literally a couple of orders of
        magnitude faster than schlepping through the lists manually.
        
        Returns True if any points coincide, otherwise False
        """
        a = set([tuple(r) for r in self.shape])
        b = set([tuple(r) for r in other.shape])
        return bool(a & b)

file = "Z:\Projects\OS Electoral Map Data\Data\GB\westminster_const_region.shp"
h5store = "Z:\Projects\Election 2015\election2015.h5"

hdf = pd.HDFStore(h5store)
print("Reading HDF5 store {}".format(h5store))
results=hdf.get('results')

print("Reading shapefile {}".format(file))
reader = shapefile.Reader(file)
records = reader.shapeRecords()

# Set up the list of constituencies. Note that r.record[8] is the ONS constituency
# ID - e.g. W07000049 is the code for Aberavon

codes = results[['ID', 'Constituency']].set_index('ID')
c = [Constituency(codes.ix[r.record[8], 'Constituency'], r.record[8], r.shape.points) for r in records]
# Now work out the adjacents
adjacencies = {f: {f: False for f in codes.values.flatten()} for f in codes.values.flatten()}
for i in range(len(c)-1):
    print("({}/{}) {}".format(i+1, len(c)-1, c[i].name))
    k = 0
    for j in range(i+1, len(c)):
        if c[i].IsAdjacent(c[j]):
            k += 1
            print(" {}: {}".format(k, c[j].name))
            adjacencies[c[i].name][c[j].name]=True
            adjacencies[c[j].name][c[i].name]=True

# Now write the adjacencies dictionary to the HDF store
hdf.put('adjacencies', pd.DataFrame(adjacencies))
hdf.close()


The next thing to do was to test the shapefile for adjacency. This is a computationally intensive task which a few years ago would have required a supercomputer. There are 632 constituencies in the OS data I downloaded (Northern Ireland is a separate case). So each of the 632 constituencies needs to be compared with the other 631.

In the worst case this would involve 632² (just short of 400,000) comparisons. Now we can short-circuit that significantly by relying on the fact that if constituency S is adjacent to T, then T must also be adjacent to S. Therefore if we were comparing constituencies A to Z we would only have to compare AB, AC, AD .. AX, AY, AZ, then BC, BD, BE and so on. So this essentially halves the number of comparisons, but it still works out to about 200,000. Ordinarily this wouldn't be a problem, but each comparison of two constituencies involves comparing two lists, each containing several hundred coordinate pairs, to see if they have common elements.
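
(As an aside, this triangular pattern of pairwise comparisons is exactly what the standard library's itertools.combinations generates, so the nested loop in the script above could equally have been written along these lines. This is only a sketch, assuming the list c of Constituency objects and the adjacencies dictionary from the script above.)

from itertools import combinations

# Visit every unordered pair (first, second) exactly once - equivalent
# to the explicit i/j nested loop in the script above
for first, second in combinations(c, 2):
    if first.IsAdjacent(second):
        adjacencies[first.name][second.name] = True
        adjacencies[second.name][first.name] = True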

Ordinarily in Python one would simply use a for loop to test each element of list s for membership of list t. But because that comparison is carried out step by step by the Python interpreter, it is very slow.

The answer is to use Python's set type, converting the shape data from lists to sets. This means that Python does all the heavy lifting internally and the comparison is much faster (by my estimate, in this case, literally thousands of times faster). There are, however, a couple of wrinkles.
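
To make the difference concrete, here is a minimal sketch of the two approaches on a pair of made-up point lists (the data and function names are invented for illustration):

# Two made-up shapes: lists of [easting, northing] pairs sharing one point
shape_a = [[274462.8, 188515.9], [274479.5, 188521.2], [274496.6, 188524.9]]
shape_b = [[274496.6, 188524.9], [274533.4, 188527.6]]

# Naive approach: every point of a is tested against every point of b,
# step by step, in the interpreter
def adjacent_slow(a, b):
    for point in a:
        if point in b:
            return True
    return False

# Set-based approach: convert the (mutable) lists to (hashable) tuples
# and let the set intersection do the work internally
def adjacent_fast(a, b):
    return bool({tuple(p) for p in a} & {tuple(p) for p in b})

print(adjacent_slow(shape_a, shape_b), adjacent_fast(shape_a, shape_b))  # True True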

Firstly, the shape data for a constituency consists of a list containing lists of pairs of floats. (This, for example, is the first 10 points for Aberavon, the first constituency in the list.)


c[0].shape[:10]
Out[42]: 
[[274462.8, 188515.9],
 [274479.5, 188521.2],
 [274496.6, 188524.9],
 [274533.4, 188527.6],
 [274592.5, 188526.8],
 [274602.8, 188532.9],
 [274621.7, 188536.4],
 [274630.2, 188536.6],
 [274638.6, 188536.2],
 [274647.0, 188534.1]]

Because Python sets can't contain mutable types, we have to convert the lists of floats to tuples.

Secondly, we are essentially comparing floating-point quantities for equivalence. This is obviously much faster than using some sort of epsilon ("close enough") test, which would not only involve calculation rather than simple comparison but would also prevent us from using the set-based solution. But in fact we are importing text-based values, rounded to one decimal place, and the coordinates along one constituency's border are generated by exactly the same system which generated them for its neighbour. So the comparison should be safe and, by inspection of a subset of the results, we know it is.
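
For comparison, this is roughly what an epsilon-based test would look like. It is only a sketch, with an arbitrary tolerance, and it forces us back to pairwise comparison because "nearly equal" points don't hash to the same value and so can't go into a set:

# Sketch of the epsilon-based alternative: treat two points as equal if
# they lie within some (arbitrary, illustrative) tolerance of each other
EPSILON = 0.05

def points_close(p, q, eps=EPSILON):
    return abs(p[0] - q[0]) < eps and abs(p[1] - q[1]) < eps

# No sets possible here: every point of a has to be compared with
# every point of b
def adjacent_epsilon(a, b):
    return any(points_close(p, q) for p in a for q in b)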

HDF, I think I love you

Now although we've written the code as efficiently as we might, it still takes some time, about 25 minutes, to run even on a relatively fast (Intel Core i7) PC. Just loading the shapefile (which weighs in at 193MB uncompressed) takes several seconds, so having calculated the data we need to save it for future use: we don't want to go through this rigmarole every time we want to use it.

The code generates a dictionary of dictionaries. Ordinarily we would just save this to disk using Python's pickle module. However, we will be doing the downstream analysis using pandas and numpy, so it makes sense to convert it to a pandas DataFrame and use the superb, extremely fast HDF5 format to store the data.

The conversion from a nested dictionary into a DataFrame is trivially simple, and results in a table with one row and one column for each of the 632 constituencies, and at the intersection of a particular row and column a boolean value which is True if the constituencies are adjacent and False if they are not.
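
A toy example with three imaginary constituencies shows the shape of the result:

import pandas as pd

# Hypothetical adjacency data for three made-up constituencies
adj = {'Ambridge':     {'Ambridge': False, 'Borsetshire': True,  'Casterbridge': False},
       'Borsetshire':  {'Ambridge': True,  'Borsetshire': False, 'Casterbridge': True},
       'Casterbridge': {'Ambridge': False, 'Borsetshire': True,  'Casterbridge': False}}

df = pd.DataFrame(adj)
print(df.loc['Ambridge', 'Borsetshire'])   # True - they share a border in this toy data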

This is literally as easy as opening an HDF store (which will be created by the pd.HDFStore() call if it doesn't exist, or opened for appending if it does), and then getting and putting the data as we see fit. It really is that easy.
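
A minimal round trip, with an illustrative file name, looks like this:

import pandas as pd

df = pd.DataFrame({'A': [True, False], 'B': [False, True]})

store = pd.HDFStore('example.h5')      # created if it doesn't already exist
store.put('adjacencies', df)           # write the DataFrame under the key 'adjacencies'
restored = store.get('adjacencies')    # read it straight back
store.close()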

Conclusions

There's a whole load of free, very high quality data sets out there, along with all sorts of free (Open Source) tools to use them, such as Python/pandas or, if you want to slice and dice the data without having to bother with coding, something like LibreOffice.

All you need is a computer and an internet connection (which you must already have to be reading this). So if something interests you, or you actually want to bring facts into an argument about something, then get downloading and munging.

Although modern computers are extremely fast, we still need to consider the efficiency of the algorithms we choose, and even in a high-level language like Python we can make massive efficiency savings by applying a bit of thought to what would otherwise be an ugly, brute force and ignorance approach.

Saving our output in some machine-readable format post-calculation makes a huge amount of sense. HDF5 offers a number of advantages over Python's pickle module, not least that HDF5 data can be read by many languages other than Python. However, not all data is amenable to being written to HDF5 easily.

Monday 22 August 2016

Producing a scatter plot with Pandas and Seaborn

I've written before about Python and its sublime extension Numpy, which vastly extends its capabilities for handling chunky amounts of data. Numpy exists as part of a wider collection of software, often referred to as "the Scipy stack". As well as Scipy itself, two other components of this collection which sit on top of Numpy are Pandas and Matplotlib.

Pandas is a library of routines for manipulating panel data: two-dimensional grids of data, called DataFrames, in which, as in a spreadsheet, data is arranged into rows and columns. Pandas allows the creation and manipulation of these: reshaping, filtering and querying them.
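
For example, a small DataFrame can be built directly from a dictionary of columns, and new columns can be derived from existing ones (the figures here are invented):

import pandas as pd

# Three made-up rows, one per (invented) constituency
df = pd.DataFrame({'Constituency': ['Ambridge', 'Borsetshire East', 'Casterbridge'],
                   'Electorate':   [65000, 72000, 58000],
                   'Total':        [42000, 51000, 39000]})

# A new column computed directly from existing ones
df['Turnout'] = df.Total / df.Electorate
print(df)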

Matplotlib is a powerful set of routines for producing all manner of plots and other visualisations from the data. 

The problem

As an exercise in learning Pandas, I decided to analyse the results of the 2015 UK General Election. The reasons for choosing this rather dry subject were that:
  • It was a moderately complex data set
  • It was in the public domain
  • I already understood it to an extent
  • Any conclusions I reached could be easily verified 
I wanted to see if there was any correlation between how "safe" a seat was and voter turnout. My assumption was that in a really safe seat turnout would be lower: people happy with the status quo wouldn't bother voting because they knew they didn't have to, and people unhappy with the status quo wouldn't bother voting because they felt they couldn't change anything.

First get your data


So the first thing to do was to find and download the data set. Fortunately in the UK such data is publicly available from the Electoral Commission, the government body responsible for the administration of elections in the UK, which also publishes scrupulously accurate data.

This was available as an Excel spreadsheet. This is somewhat irritating as Excel is, of course, a proprietary standard and the UK government committed to using open standards (e.g. ODS) some years ago. But never mind.

Now clean it up


The Electoral Commission data is excellent, but the Excel spreadsheet needs a little tidying up before importation. So I exported each of its worksheets as a CSV file. (Pandas can read Excel, but this seemed cleaner.) Once the CSVs were exported, the column headings in the first row of each were correct, and any totals at the bottom of the files had been removed, the importation was simple enough.

(Note that when importing the files it was also necessary to specify the encoding scheme used, in this case ISO-8859-1.) Soon these were all importing cleanly. There was, however, one important piece of information missing. The files refer to constituency and region IDs. These are administered by the Office for National Statistics, another government body which publishes a huge amount of excellent data for public consumption. So I downloaded this, cleaned it up as before, and it is now available for import as a CSV.

The code


import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from matplotlib.ticker import FuncFormatter

Candidates = pd.read_csv('Candidates.csv', encoding = "ISO-8859-1")
Constituencies = pd.read_csv('Constituencies.csv', encoding = "ISO-8859-1")
Results = pd.read_csv('Results.csv', encoding = "ISO-8859-1")
Referendum = pd.read_csv('Referendum.csv', encoding = "ISO-8859-1")
Regions = pd.read_csv('Regions.csv', encoding = "ISO-8859-1")

# Frivolous parties are weeded out further down by keeping only those
# which won more than MinSeats seats

# The CSV file has commas in the Total and Electorate fields. Convert from str 
# to int

# Tidy things up a bit. Give some of the columns better names. Merge the
# Labour and Labour & Co-operative figures. Count the speaker as a Conservative
# rather than being a party on his own. 

Results = Results.rename(columns={'Press Association ID Number': 'PAID', 
                  'Constituency Name': 'Constituency', 
                  'Constituency ID': 'ID', 
                  'Constituency Type': 'Type', 
                  'Election Year': 'Year',
                  ' Total number of valid votes counted ': 'Total',
                  'C': 'Con'})
                  
Results['Lab'] = Results.fillna(0)['Lab'] + Results.fillna(0)['Lab Co-op']
Results['Con'] = Results.fillna(0)['Con'] + Results.fillna(0)['Speaker']
Results = Results.drop(['Year', 'Lab Co-op', 'Speaker'], axis=1)
Results.Total = Results.Total.str.replace(',', '').astype(int)
Results.Electorate = Results.Electorate.str.replace(',', '').astype(int)
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].fillna(0)

q=pd.DataFrame(np.sort(RawResults.values)[:,-2:], columns=['2nd','1st'])
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = q['1st'].astype(int)
Results['Majority'] = (q['1st']-q['2nd']).astype(int)
Results['Share'] = Results.Votes/Results.Total
Results['Turnout'] = Results.Total/Results.Electorate
Results['Safeness'] = Results.Majority/Results.Electorate

Results = Results.drop(RawResults.columns, axis=1)
Results = Results.set_index('PAID')

Seats = pd.DataFrame(Results.groupby('Party').ID.nunique())
Seats = Seats.rename(columns={'ID':'Seats'})

# So to keep things uncluttered, only consider parties with > 10 seats 

MinSeats = 10
Parties = Seats[Seats.Seats > MinSeats].index

p=Results[Results.Party.isin(Parties)]
Apathy = p[['Turnout', 'Safeness', 'Party']]

sns.set_palette(['red', 'blue', 'yellow', 
                 'orange', 'purple', 'cyan', 
                 'magenta', 'green', 'brown'])

fg = sns.FacetGrid(Apathy, hue='Party')
fg.map(pyplot.scatter, "Turnout", "Safeness", s=2).add_legend()

fg.ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
fg.ax.xaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
fg.ax.set_title("2015 UK General Election\n")
fg.ax.xaxis.set_ticks(np.arange(0.4, 1.0, 0.1))
fg.ax.yaxis.set_ticks(np.arange(0.0, 0.6, 0.1))
fg.ax.set_xlabel('Turnout')
fg.ax.set_ylabel('Safeness')
fg.fig.set_size_inches(16, 12)   # savefig() has no figsize argument, so size the figure itself
fg.savefig('Apathy.png', dpi=1200)

So the first thing I did was to import the various libraries - e.g. pandas, matplotlib - into Python so I could make use of them later. Having imported pandas, I then read the various CSV files into their own DataFrames.

Next I tidied the frames up a little. I renamed some of the columns and aggregated the "Labour" and "Labour and Co-operative" votes as just Labour votes. (The Electoral Commission distinguishes between the two, but they sit as a single bloc in parliament.) I also allocated the Speaker (who according to the Electoral Commission is a party of his own) to the Conservative Party.

Because the numbers in the CSV file are formatted with a comma as a thousands separator, Pandas imported the Total and Electorate fields as strings, so I then had to remove the commas and convert the numbers from string (str) to integer (int).

Now Pandas starts to come into its own.

Munging the data

The raw election results are contained in 137 columns of the Results DataFrame, one column for each political party which fielded any candidates in the election. The first of these is "Unnamed: 9" and the last "Zeb". So to make handling them easier, I sliced them out as RawResults.

Next, for each of the 650 electoral constituencies (one per row) I found the largest and second largest votes (i.e. the winner and runner-up), from which I could calculate the size of the majority, and from the total votes and electorate the turnout. This then allowed me to calculate, amongst other things, the "safeness" of the seat. This is a measure I have made up: simply the majority divided by the total electorate, which in theory gives a number between 0.0 for an extremely marginal seat and 1.0 for a seat where everybody voted and they all voted for one candidate - e.g. Pyongyang North.

The next step was to find the name (column index) of the winning party, using the idxmax() call. (Note that axis=1 means to do this row by row, rather than column by column, which would be axis=0.)
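
On a toy frame of invented vote counts, those two tricks look like this:

import numpy as np
import pandas as pd

# Invented vote counts: one row per constituency, one column per party
votes = pd.DataFrame({'Con': [20000, 9000], 'Lab': [15000, 22000], 'LD': [5000, 7000]},
                     index=['Seat A', 'Seat B'])

# np.sort sorts each row in ascending order, so the last two columns hold
# the runner-up and the winner respectively
top_two = np.sort(votes.values)[:, -2:]
print(top_two)                # [[15000 20000], [9000 22000]]

# idxmax(axis=1) gives the column label of the row-wise maximum,
# i.e. the name of the winning party
print(votes.idxmax(axis=1))   # Seat A: Con, Seat B: Lab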

Then, to simplify matters, having done with them, I dropped the various RawResults columns from the Results DataFrame and re-indexed the DataFrame using the Press Association's ID for each constituency. This isn't really necessary, but I prefer things tidy.

Next, I worked out the number of seats won by each party by grouping the results by party and then counting the number of unique IDs. I then filtered the results, excluding all the parties who won 10 seats or fewer, and then produced a DataFrame, Apathy, which had a row for each constituency and columns giving the turnout, "safeness" of the seat and the winning party.
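
Again, a toy example with invented data shows the pattern:

import pandas as pd

results = pd.DataFrame({'ID': ['E1', 'E2', 'E3', 'E4'],
                        'Party': ['Con', 'Lab', 'Con', 'Green']})

# Seats per party: group by Party and count the distinct constituency IDs
seats = results.groupby('Party').ID.nunique()

# Keep only the rows belonging to parties with more than one seat
big_parties = seats[seats > 1].index
print(results[results.Party.isin(big_parties)])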

Producing the scatter plot

Matplotlib has very powerful plotting capabilities, but there is a separate library, called Seaborn, which works with Pandas (and Scipy) using Matplotlib and the underlying Numpy. So the "stack" looks something like this: Numpy at the bottom, Scipy, Pandas and Matplotlib built on top of it, and Seaborn sitting above those.

Because Seaborn builds on the underlying Matplotlib, we can also use many of Matplotlib's features to, for example, alter the formatting of the axes to percentages and set the range of ticks on the two axes.

Seaborn would, by default, provide a suitably tasteful palette of colours, but I wanted to make sure that the colours used reflected those usually associated with the parties - red for Labour etc. - and so I set my own palette. (It contains some additional colours in case I wanted to increase the number of parties represented on the scatter plot.)

This is what it produced, a scatter plot of Safeness against Turnout for each constituency, which I think is quite beautiful.

Conclusions

The analysis doesn't come to any startling political conclusions: turnout is lower in Labour seats than Conservative ones, and both Labour and the Conservatives have roughly equivalent distributions of safe and marginal seats. Turnout was typically higher in SNP seats and there are few safe ones, presumably because of the very hard-fought nature of the 2015 election in Scotland and the fact that the incumbent Labour MPs were displaced by an SNP insurgence. So no surprises there, then.

But from a technical perspective it's quite remarkable. In a little over a screen of code we have performed a highly complex analysis and produced a publication-quality visualisation. The really remarkable thing, though, is that there are no nested loops or similar, which we would normally require to iterate over the rows and columns extracting counts, totals, maxima etc. In any other language - including "ordinary" Python - the code would have been significantly more complex, longer and much more difficult to debug. In something like C the job would literally have taken several days' worth of programming.

Finally, because Pandas uses the underlying Numpy libraries, which are written in native FORTRAN and C, to do all the heavy lifting, the code is blindingly fast.