Tuesday 30 August 2016

The joy of (data) sets

Background

I've recently been analysing the data from the 2015 UK General Election, which can be downloaded from the Electoral Commission's website.

One problem which interests me at the moment is the proposal by the Boundaries Commission to reduce the number of MPs from 650 to 600 and redraw constituency boundaries to redress an historic imbalance in size which has grown up over several years.

First catch your data

Now although the data from the Electoral Commission divides constituencies into 12 regions, there is nothing in the data which tells us which constituencies are adjacent (that is, which share at least one point on their border). To work that out, it's necessary to get hold of the data on the shape and position of the 650 constituencies in the UK. This is available (along with a lot of other wonderful data) from the Ordnance Survey.

The data comes in the form of a shapefile: a standard format used throughout the world for encoding cartographic information. Fundamentally, a shapefile consists of a series of records, each with a key and some attribute data, followed by shape data describing the geography of the record.

So the first thing to do was to check that the data we had downloaded was sensible, and the easiest way to do that was to use it to draw a map. In order to do this, I first converted the shapefile data to JSON format, a standard text-based format for information interchange. I did this by adapting a couple of scripts I found online to cope with some of the idiosyncrasies of the data.


"""
Convert shapefile to JSON

"""
import glob
import os
import shapefile
from json import dumps


files = glob.glob(r"Z:\Projects\OS Electoral Map Data\Data\GB\*.shp")

for i, f in enumerate(files):
    print("{}: Found <{}>".format(i+1, f))
    newname = os.path.split(f)[1].replace(".shp", ".json")
    if os.path.exists(newname):
        print("{} already exists.".format(newname))
    else:
        print("Creating {}".format(newname))
        
        reader = shapefile.Reader(f)
        # Skip the DeletionFlag field that pyshp puts first
        fields = reader.fields[1:]
        field_names = [field[0] for field in fields]

        data = []
        for sr in reader.shapeRecords():
            # Pair field names with record values, decoding any byte strings
            raw = dict(zip(field_names, sr.record))
            for k in raw:
                if type(raw[k]) is bytes:
                    raw[k] = raw[k].decode('utf-8')
            # __geo_interface__ exposes the geometry as a GeoJSON-style mapping
            geom = sr.shape.__geo_interface__
            data.append(dict(type="Feature", geometry=geom, properties=raw))
    
        with open(newname, "w") as geojson:
            geojson.write(dumps({"type": "FeatureCollection", "features": data}, indent=2) + "\n")
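As a quick sanity check (this isn't part of the script above), the generated file can be loaded back with the standard json module to confirm it contains the expected 632 GB constituencies:

import json

# Load the converted Westminster constituency layer and poke at it
with open("westminster_const_region.json") as f:
    collection = json.load(f)

print(collection["type"])              # FeatureCollection
print(len(collection["features"]))     # 632 constituencies in the GB data
print(sorted(collection["features"][0]["properties"]))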

I then used the JSON data to plot a choropleth map showing the "safeness" of each constituency. (This is an ongoing activity and there will be a further post about it.)


import pandas as pd
import geopandas as gp
import matplotlib.pyplot as plt


geofile = 'westminster_const_region.json'
datafile = r'Z:\Projects\Election 2015\results.h5'

# Election results (including the 'Safeness' measure) from the HDF5 store
results = pd.read_hdf(datafile, 'apathy')

# Constituency boundaries from the GeoJSON produced above
geodata = gp.read_file(geofile)

# Join boundaries to results on the constituency code and plot the choropleth
chorodata = pd.merge(geodata, results, left_on='CODE', right_on='ID')
ax = chorodata.plot(column='Safeness', cmap='Blues', figsize=(12, 8))
fig = ax.get_figure()
plt.xticks(rotation='vertical')
fig.savefig('uk.png', dpi=1200)
plt.show()

This produced the following output.

Which looks about right.

The shape of politics in the UK
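Below is the script I used to work out which constituencies are adjacent; I'll walk through the thinking behind it afterwards.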


# -*- coding: utf-8 -*-
"""
Program to find adjacent electoral constituencies based on the OS
shapefile data, more as an exercise in learning about shapefiles
than anything in particular.

"""

import shapefile
import itertools
import pandas as pd

class Constituency:
    
    def __init__(self, name, onsid, geo):
        self.name = name
        self.onsid = onsid
        self.shape = geo
        
    def __repr__(self):
        return("{} ({}): {} points.".format(self.name, self.onsid, len(self.shape)))

    def IsAdjacent(self, other):
        """
        Tests two constituencies to see if any of the points
        in their shape coincide. As the shape structures are 
        lists of lists of floats, convert them to lists of tuples
        then to sets. This makes it literally a couple of orders of
        magnitude faster than schlepping through the lists manually.
        
        Returns True if any points coincide, otherwise False
        """
        a = set([tuple(r) for r in self.shape])
        b = set([tuple(r) for r in other.shape])
        return bool(a & b)

file = "Z:\Projects\OS Electoral Map Data\Data\GB\westminster_const_region.shp"
h5store = "Z:\Projects\Election 2015\election2015.h5"

hdf = pd.HDFStore(h5store)
print("Reading HDF5 store {}".format(h5store))
results = hdf.get('results')

print("Reading shapefile {}".format(file))
reader = shapefile.Reader(file)
records = reader.shapeRecords()

# Set up the list of constituencies. Note that r.record[8] is the ONS constituency
# ID - e.g. W07000049 is the code for Aberavon

codes = results[['ID', 'Constituency']].set_index('ID')
c = [Constituency(codes.loc[r.record[8], 'Constituency'], r.record[8], r.shape.points) for r in records]
# Now work out the adjacencies: a name-by-name matrix, initialised to False
adjacencies = {f: {f: False for f in codes.values.flatten()} for f in codes.values.flatten()}
for i in range(len(c)-1):
    print("({}/{}) {}".format(i+1, len(c)-1, c[i].name))
    k = 0
    for j in range(i+1, len(c)):
        if c[i].IsAdjacent(c[j]):
            k += 1
            print(" {}: {}".format(k, c[j].name))
            adjacencies[c[i].name][c[j].name] = True
            adjacencies[c[j].name][c[i].name] = True

# Now write the adjacencies dictionary to the HDF store
hdf.put('adjacencies', pd.DataFrame(adjacencies))
hdf.close()


The next thing to do was to test each pair of constituencies for adjacency. This is a computationally intensive task which a few years ago would have required a supercomputer. There are 632 constituencies in the OS data I downloaded (Northern Ireland is a separate case), so each of the 632 constituencies needs to be compared with the other 631.

In the worst case this would involve 632² (just short of 400,000) comparisons. Now we can short-circuit that significantly by relying on the fact that if constituency S is adjacent to T, then T must also be adjacent to S. Therefore if we were comparing constituencies A to Z we would only have to compare AB, AC, AD .. AX, AY, AZ, then BC, BD, BE and so on. This essentially halves the number of comparisons, but it still works out to about 200,000. Ordinarily this wouldn't be a problem, but each comparison of two constituencies involves comparing two lists, each containing several hundred coordinate pairs, to see if they have any common elements.
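The listing above does this with a pair of nested index loops. Since it already imports itertools, the same "each unordered pair exactly once" idea could equally be sketched with itertools.combinations (this is just a sketch; it doesn't print progress or fill in the adjacencies dictionary):

from itertools import combinations

# Each unordered pair is generated exactly once, giving n*(n-1)/2
# comparisons: about 200,000 for the 632 constituencies.
adjacent_pairs = [(a.name, b.name)
                  for a, b in combinations(c, 2)
                  if a.IsAdjacent(b)]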

Ordinarily in Python one would simply use a for loop to test each element of list s for membership of list t. But because the comparison is done step by step by the Python interpreter, this is very slow.

The answer is to use Python's set type, converting the shape data from lists to sets. This means that Python does all the heavy lifting internally and the comparison is much faster (by my estimate, in this case, literally thousands of times faster). There are, however, a couple of wrinkles.

Firstly, the shape data for a constituency consists of a list of lists, each containing a pair of floats. (These, for example, are the first 10 points for Aberavon, the first constituency in the list.)


c[0].shape[:10]
Out[42]: 
[[274462.8, 188515.9],
 [274479.5, 188521.2],
 [274496.6, 188524.9],
 [274533.4, 188527.6],
 [274592.5, 188526.8],
 [274602.8, 188532.9],
 [274621.7, 188536.4],
 [274630.2, 188536.6],
 [274638.6, 188536.2],
 [274647.0, 188534.1]]

Because Python sets can't contain mutable types, we have to convert the lists of floats to tuples.
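A minimal illustration of the problem and the fix:

point = [274462.8, 188515.9]   # a point as read from the shapefile

# set([point])                 # TypeError: unhashable type: 'list'
set([tuple(point)])            # fine: tuples are hashable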

Secondly, we are essentially comparing floating-point quantities for equivalence. Exact comparison is obviously much faster than using some sort of epsilon function, which would not only involve calculation rather than simple comparison but would also mean that we couldn't use the set-based solution. But in practice we are importing text-based values, rounded to one decimal place, and the coordinates on one side of a shared border are generated by exactly the same system which generated them for the other. So the comparison should be safe and, by inspection of a subset of the results, we know it is.
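To give a feel for the size of the speed-up, here's a toy benchmark along the same lines; the point lists are synthetic stand-ins for constituency boundaries and the exact figures will vary from machine to machine:

import random
import timeit

# Two synthetic "boundaries" of 1,000 points each (no shared points, so both
# approaches do the full amount of work before returning False).
a = [[random.random(), random.random()] for _ in range(1000)]
b = [[random.random(), random.random()] for _ in range(1000)]

def naive(s, t):
    # Pure-Python scan: up to len(s) * len(t) list comparisons
    return any(p in t for p in s)

def with_sets(s, t):
    # Hash the points (as tuples) and intersect the two sets
    return bool({tuple(p) for p in s} & {tuple(p) for p in t})

print(timeit.timeit(lambda: naive(a, b), number=10))
print(timeit.timeit(lambda: with_sets(a, b), number=10))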

HDF, I think I love you

Now although we've written the code as efficiently as we might, it still takes some time, about 25 minutes, to run even on a relatively fast (Intel Core i7) PC. Just loading the shapefile (which weighs in at 193MB uncompressed) takes several seconds, so having calculated the data we need to save it for future use, as we don't want to have to go through this rigmarole every time we want to use it.

The code generates a dictionary of dictionaries. Ordinarily we would just save this to disk using Python's pickle module. However, we will be doing the downstream analysis using pandas and numpy, so it makes sense to convert it to a pandas DataFrame and use the superb, extremely fast HDF5 format to store the data.

The conversion from a nested dictionary to a DataFrame is trivially simple, and results in a table with one row and one column for each of the 632 constituencies; at the intersection of a particular row and column is a boolean value which is True if the two constituencies are adjacent and False if they are not.
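For example, with a toy three-constituency version of the nested dictionary (the names here are just placeholders), the conversion is a single call:

import pandas as pd

adjacencies = {
    "A": {"A": False, "B": True,  "C": False},
    "B": {"A": True,  "B": False, "C": False},
    "C": {"A": False, "B": False, "C": False},
}

print(pd.DataFrame(adjacencies))
#        A      B      C
# A  False   True  False
# B   True  False  False
# C  False  False  False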

This is literally as easy as opening an HDF store (the pd.HDFStore() call creates the file if it doesn't exist, or opens it for appending if it does) and then getting and putting the data as we see fit. It really is that easy.
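Reading the table back later for the downstream analysis is just as terse. A minimal sketch, using the store and key from the script above (Aberavon is just an example row):

import pandas as pd

# Re-open the store created by the script above and pull the table back out
with pd.HDFStore(r"Z:\Projects\Election 2015\election2015.h5") as hdf:
    adjacencies = hdf.get('adjacencies')

# Which constituencies border Aberavon?
print(adjacencies.index[adjacencies['Aberavon']].tolist())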

Conclusions

There's a whole load of free, very high quality data sets out there, along with all sorts of free (Open Source) tools to use them, such as Python/pandas or, if you want to slice and dice the data without having to bother with coding, something like LibreOffice.

All you need is a computer and an internet connection (which you must already have to be reading this). So if something interests you, or you actually want to bring facts into an argument about something, then get downloading and munging.

Although modern computers are extremely fast, we still need to consider the efficiency of the algorithms we choose, and even in a high-level language like Python we can make massive efficiency savings by applying a bit of thought to what would otherwise be an ugly, brute force and ignorance approach.

Saving our output in some machine-readable format post-calculation makes a huge amount of sense. HDF5 offers a number of advantages over Python's pickle module, not least the fact that HDF5 data can be read from many languages other than Python. However, not all data lends itself to being written to HDF5 easily.
