Saturday 24 September 2016

Screen scraping a web-based table into Pandas with BeautifulSoup

In an ideal world, if we want to munge some data, it is readily accessible online or downloadable as XML or CSV files. Sometimes this is not the case, and we may wish to download some data directly from a web page.

This might be for a number of reasons. The author of the data just hasn't bothered making it downloadable. Or the data is changing rapidly, e.g. exchange rates. Or the author of the data wants to make you visit his page in order to read the data, so that he gets page impressions and possible click-through revenue.

We could, if we wished, point a web browser at the appropriate page and either save the page to a file and then load that file into an editor or, in an extreme case, copy the information we want by hand. However, another way is to scrape the information automatically. Now, let's be honest, scraping is a slightly desperate and grubby way of getting hold of the data, but it's much faster than attempting to transcribe it by hand.

As with so many things, Python is your friend and Swiss-Army knife here. This post covers scraping a table of information from an arbitrary website, importing it into Pandas and cleaning it up. The site I have used is http://ukpollingreport.co.uk, which contains a page with a table of historical opinion poll figures.

First look at the page

The first thing to do is to visit the page using your chosen browser and then look at its structure. If we scroll through it, we eventually find the division we want, containing the data table. In the source this is the part that begins as follows:


<div class='polltable'>
<table border='1' cellspacing='0' cellpadding='0' rules='rows' frame='hsides'>
<tr>
<td width='220'  rowspan='2' bgcolor='#F7F7F7'><span>&nbsp;</span></td>
<td width='105' rowspan='2' bgcolor='#F7F7F7'><strong><span>Survey End Date</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>CON (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>LAB (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>LD (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>UKIP (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>Grn (%)</span></strong></td>
<td width='61' rowspan='2' bgcolor='#F7F7F7'><strong><span>Con Lead</span></strong></td>
</tr>
<tr></tr>
<tr>
<td align=left>Ipsos-MORI/Evening Standard </td>
<td align=center> 2016-09-14</td>
<td align=center> 40 </td>
<td align=center>34 </td>
<td align=center> 6</td>
<td align=center>9 </td>
<td align=center>5 </td>
<td class='conlead' align=center> 6</td>
</tr>

(or at least it did when I wrote this post).

Let your Python romp

Some webpages will only respond to certain browsers. This may be for commercial or security reasons, or it may be because the page feeds slightly different content depending upon the client browser (e.g. HSBC attempts to bully you into downloading their ghastly Rapport "security" software if you log in with a Windows-based browser). Our browser would normally send suitable header information to the web-server at the far end, from which it would determine the browser type.

So if we simply attempt to request this page, we get an error message (403) returned telling us that it is forbidden. Fortunately, though, Python allows us to override its default headers with whatever we like and so we can essentially masquerade as whatever browser takes our fancy.

import urllib.request

url = 'http://ukpollingreport.co.uk/voting-intention-2/'
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

request = urllib.request.Request(url, None, header)


One minor point of etiquette, though, is that the author might have some sound commercial reasons for not wanting some cheap punk like me to scrape his data. Please consider that.



Make soup


If we were to download the raw HTML using a simple response.read(), we would simply get the block of HTML code which is normally decoded by your browser and rendered into a web page. HTML is fundamentally a hierarchical language, and so we can use the sublime BeautifulSoup Python library. This parses the HTML and gives us all manner of spiffy ways of searching and navigating the resulting structure.

So having made soup of the HTML, we now simply have to find the polltable division and then process it.

From inspection we see that the first row is the column headings, and that there is a blank column on either side which can be discarded. There is then a blank row, after which the data starts.

from bs4 import BeautifulSoup

with urllib.request.urlopen(request) as response:
    page = response.read()
    soup = BeautifulSoup(page, "lxml")
    # Inspection of the page source tells us that the polling table is in a
    # <div 'class'='polltable'>, so search on that. Note that we can
    # discard the leftmost and two rightmost columns
    rows = []
    table = soup.find("div", attrs={'class': 'polltable'})
    for i, row in enumerate(table.find_all('tr')):
        if i == 0:
            colnames = row.text.split("\n")[1:-2]
            colnames[0] = "Source"
        elif i > 1:
            rows.append(row.text.split("\n")[1:-2])


Unleash the pandas

Having now got a list of column headings and a table (a list of lists) of data, it's simply a question of creating the appropriate pandas DataFrame, cleaning things up a bit, and then we can play. So we convert the column names (extracted from the first row of the scraped table) into a list, and correct the first (missing) one.

Now as things stand, all the various cells in the table we have created are just str types - that is, "17" rather than 17. So we now convert them to datetime or int as appropriate.

import datetime
import pandas as pd

df = pd.DataFrame(rows)
df.columns = colnames
df['Survey End Date']=pd.to_datetime(df['Survey End Date'])
for c in df.columns[2:]:
    df[c] = pd.to_numeric(df[c], errors='coerce')

Next we clean up the column names, removing the trailing " (%)" from them, as this doesn't add anything. In this instance I am only interested in results from May 2015 onwards, so we filter the DataFrame.
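The column-name cleanup isn't shown as a separate snippet, but it amounts to something like this (a minimal sketch, run before the date filter below):

df.columns = [c.replace(' (%)', '') for c in df.columns]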

earliest = datetime.date(2015, 5, 1)
df = df[df['Survey End Date'] > earliest]



Saturday 3 September 2016

Pity the plight of the poor peripatetic pythonista

The senior management at Casa Snowy has taken over the office this weekend to research his trip to Hong Kong next weekend. So I have been relegated to the dining table and my laptop, rather than the proper keyboard and 24" monitor I normally use.

I've recently started to use Pycharm as my IDE for Python, as its debugging facilities suit my tastes more than Spyder's. (They're both very good, just different.)

Now as ever, all the projects I am working on at the moment are stored under subversion on the server. So working on them should be as simple as pointing Pycharm at the appropriate network share and away we go. Now I really was not intending this post to be a rant. But...

One of Python's great strengths is its open-source background: no individual or corporation owns Python, and this is one of the secrets of its success, for a couple of reasons. Firstly, Python is available on pretty much any platform you are ever likely to use, from a Raspberry Pi to the largest supercomputer. Secondly, in addition to the standard libraries which come with "standard" Python as downloaded from python.org, there is a whole ecosystem of freely available libraries for more specialised stuff. For example the sublime numpy/scipy for number crunching, ReportLab for producing great-looking PDFs, astropy for performing all manner of astronomical calculations and OpenCV for computer vision applications.

These are written and maintained by specialists in their particular fields, and many thousands of pythonistas worldwide are profoundly grateful for their expertise and knowledge. However, herein lies a problem. Some packages, for example numpy/scipy, are fantastically well maintained. Others less so. So, for example, there are still some packages which haven't quite made it over the line to Python 3, e.g. (at the time of writing) OpenCV. Now, there are instructions out there as to how to build such things manually from source. But I'd hoped we had got to the stage where such things were no longer required. I'm lazy. I just want to be able to do a

$ sudo apt-get install blah

or a

$ sudo pip install blah

But these problems are a mere bagatelle compared with the egregious mucking-about I have had this afternoon as I tried to get Pycharm working cleanly. First it wouldn't recognise my SVN version control system. I spent much of last weekend attempting to get geopandas working on a couple of Windows systems, with limited success due to the lack of a Microsoft DLL, complete with a "help" page on the Microsoft website which managed to be both smug and utterly useless.

Today has been spent uninstalling, reinstalling and rebooting as I have attempted to overcome various strange, subtle interactions between 32-bit and 64-bit applications. I know that eventually this will all just work, but this really does seem to be a problem peculiar to the Microsoft universe: on Linux, MacOS and everything else stuff just works. Why on earth does it have to be so difficult in the Microsoft universe?

Anyway, after several hours of messing about, downgrading everything to 32-bit, finally it works. But what a palaver.

Tuesday 30 August 2016

The joy of (data) sets

Background

I've recently been analysing the data from the 2015 UK General Election, which can be downloaded from the Electoral Commission's website.

One problem which interests me at the moment is the proposal by the Boundary Commissions to reduce the number of MPs from 650 to 600 and redraw constituency boundaries to redress an historic imbalance in size which has grown up over several years.

First catch your data

Now although the data from the Electoral Commission divides constituencies into 12 regions, there is nothing in the data which tells us which constituencies are adjacent (that is, share any point on their border). In order to work this out, it's necessary to get hold of data on the shape and position of the 650 constituencies in the UK. This is available (along with a lot of other wonderful data) from the Ordnance Survey.

The data comes in the form of a shapefile: a standard format used throughout the world for encoding cartographic information. Fundamentally, a shapefile consists of a series of records which have some sort of key and other information and then shape data describing the geography of the record.

So the first thing to do was to check that the data we had downloaded was sensible, and the easiest way to do that is to use it to draw a map. In order to do this, I first converted the shapefile data to JSON format, a standard text-based format for information interchange. I did this by adapting a couple of scripts I found online to cope with some of the idiosyncrasies of the data.


"""
Convert shapefile to JSON

"""
import glob
import os
import shapefile
from json import dumps


files = glob.glob(r"Z:\Projects\OS Electoral Map Data\Data\GB\*.shp")

for i, f in enumerate(files):
    print("{}: Found <{}>".format(i+1, f))
    newname=os.path.split(f)[1].replace(".shp", ".json")
    if os.path.exists(newname):
        print("{} already exists.".format(newname))
    else:
        print("Creating {}".format(newname))
        
        reader = shapefile.Reader(f)
        fields = reader.fields[1:]
        field_names = [field[0] for field in fields]
    
        data = []
        for sr in reader.shapeRecords():
            raw = dict(zip(field_names, sr.record))
            for k in raw:
                if type(raw[k]) is bytes:
                    raw[k]=raw[k].decode('utf-8')
            geom = sr.shape.__geo_interface__
            data.append(dict(type="Feature", geometry=geom, properties=raw))
    
        with open(newname, "w") as geojson:
            geojson.write(dumps({"type": "FeatureCollection", "features": data}, indent=2) + "\n")

I then used it to plot a choropleth map of the data, showing the "safeness" of a constituency. (This is an ongoing activity and there will be a further post about it.)


import pandas as pd
import geopandas as gp
import matplotlib.pyplot as plt


geofile = 'westminster_const_region.json'
datafile = r'Z:\Projects\Election 2015\results.h5'
results = pd.read_hdf(datafile, 'apathy')
geodata = gp.read_file(geofile)
chorodata = pd.merge(geodata, results, left_on='CODE', right_on='ID')
ax=chorodata.plot(column='Safeness', cmap='Blues', figsize=(12,8))
fig=ax.get_figure()
plt.xticks(rotation='vertical')
fig.savefig('uk.png', dpi=1200)
plt.show()

Which produced the following output.

Which looks about right.

The shape of politics in the UK


# -*- coding: utf-8 -*-
"""
Program to find adjacent electoral constituencies based on the OS
shapefile data, more as an exercise in learning about shapefiles
than anything in particular.

"""

import shapefile
import pandas as pd

class Constituency:
    
    def __init__(self, name, onsid, geo):
        self.name = name
        self.onsid = onsid
        self.shape = geo
        
    def __repr__(self):
        return("{} ({}): {} points.".format(self.name, self.onsid, len(self.shape)))

    def IsAdjacent(self, other):
        """
        Tests two constituencies to see if any of the points
        in their shape coincide. As the shape structures are 
        lists of lists of floats, convert them to lists of tuples
        then to sets. This makes it literally a couple of orders of
        magnitude faster than schlepping through the lists manually.
        
        Returns True if any points coincide, otherwise False
        """
        a = set([tuple(r) for r in self.shape])
        b = set([tuple(r) for r in other.shape])
        if a&b:
            return True
        else:
            return False

file = "Z:\Projects\OS Electoral Map Data\Data\GB\westminster_const_region.shp"
h5store = "Z:\Projects\Election 2015\election2015.h5"

hdf = pd.HDFStore(h5store)
print("Reading HDF5 store {}".format(h5store))
results=hdf.get('results')

print("Reading shapefile {}".format(file))
reader = shapefile.Reader(file)
records = reader.shapeRecords()

# Set up the list of constituencies. Note that r.record[8] is the ONS constituency
# ID - e.g. W07000049 is the code for Aberavon

codes = results[['ID', 'Constituency']].set_index('ID')
c = [Constituency(codes.ix[r.record[8], 'Constituency'], r.record[8], r.shape.points) for r in records]
# Now work out the adjacents
adjacencies = {f: {f: False for f in codes.values.flatten()} for f in codes.values.flatten()}
for i in range(len(c)-1):
    print("({}/{}) {}".format(i+1, len(c)-1, c[i].name))
    k = 0
    for j in range(i+1, len(c)):
        if c[i].IsAdjacent(c[j]):
            k += 1
            print(" {}: {}".format(k, c[j].name))
            adjacencies[c[i].name][c[j].name]=True
            adjacencies[c[j].name][c[i].name]=True

# Now write the adjacencies dictionary to the HDF store
hdf.put('adjacencies', pd.DataFrame(adjacencies))
hdf.close()


The next thing to do was to test the shapefile for adjacency. This is a computationally intensive task which a few years ago would have required a supercomputer. There are 632 constituencies in the OS data I downloaded (Northern Ireland is a separate case). So each of the 632 constituencies needs to be compared with the other 631.

In the worst case this would involve 632² (just short of 400,000) comparisons. Now we can short-circuit that significantly by relying on the fact that if constituency S is adjacent to T, then T must also be adjacent to S. Therefore if we were comparing constituencies A to Z we would only have to compare AB, AC, AD .. AX, AY, AZ, then BC, BD, BE and so on. So this essentially halves the number of comparisons. But it still works out at about 200,000. Ordinarily this wouldn't be a problem, but each comparison of two constituencies involves comparing two lists, each containing several hundred coordinate pairs, to see if they have common elements.

Ordinarily in Python one would simply use a for loop to test each element of list s for membership of list t. But because the comparison is done step by step by the Python interpreter, this is very slow.
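In other words, something like this (a sketch of the slow, pure-Python membership test being described, not code from the project):

def is_adjacent_slow(s, t):
    # Check each point in s for membership of t. Every comparison goes
    # through the interpreter, so for lists of several hundred points
    # repeated roughly 200,000 times this is painfully slow.
    for point in s:
        if point in t:
            return True
    return False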

The answer is to use Python's set type, converting the shape data from lists to sets. This means that Python will then do all the heavy lifting internally and the comparison is much faster (by my estimate, in this case, literally thousands of times faster). There are, however, a couple of wrinkles.

Firstly the shape data for a constituency consists of a list containing lists of pairs of floats. (These, for example, are the first 10 points for Aberavon, the first constituency in the list.)


c[0].shape[:10]
Out[42]: 
[[274462.8, 188515.9],
 [274479.5, 188521.2],
 [274496.6, 188524.9],
 [274533.4, 188527.6],
 [274592.5, 188526.8],
 [274602.8, 188532.9],
 [274621.7, 188536.4],
 [274630.2, 188536.6],
 [274638.6, 188536.2],
 [274647.0, 188534.1]]

Because Python sets can't contain mutable types, we have to convert the lists of floats to tuples.

Secondly we are essentially comparing floating-point quantities for equivalence. This is obviously much faster than using some sort of epsilon function, particularly as doing so would not only involve calculation rather than simple comparison, but would also mean that we couldn't use the set-based solution. But actually, we are importing text-based values, rounded to one decimal place, and the data used in one adjacent constituency would be generated by exactly the same system which generated it for another. So the comparison should be safe and, by inspection of a subset of the results, we know it is.

HDF, I think I love you

Now although we've written the code as efficiently as we might, it still takes some time, about 25 minutes, to run even on a relatively fast (Intel Core i7) PC. Just loading the shapefile (which weighs in at 193 MB uncompressed) takes several seconds, and so having calculated the data we need to save it for future use, as we don't want to have to go through this rigmarole every time we want to use it.

The code generates a dictionary of dictionaries. Ordinarily we would just save this to disk using the Python pickle module. However, we will be doing the downstream analysis using pandas and numpy, and so it makes sense to convert it to a pandas DataFrame and use the superb, extremely fast HDF5 format to store the data.

The conversion from a nested dictionary into a DataFrame is trivially simple, and results in a table with one row and one column for each of the 632 constituencies, and at the intersection of a particular row and column a boolean value which is True if the constituencies are adjacent and False if they are not.

This is literally as easy as opening an HDF store which will be created by the pd.HDFStore() call if it doesn't exist or opened for append if it does, and then getting and putting the data as we see fit. It really is that easy.
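Getting the table back in a later session is just as simple (a minimal sketch, reusing the store path and key from the code above):

import pandas as pd

hdf = pd.HDFStore(r"Z:\Projects\Election 2015\election2015.h5")
adjacencies = hdf.get('adjacencies')   # the DataFrame written earlier with put()
hdf.close()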

Conclusions

There's a whole load of free, very high quality data sets out there, along with all sorts of free (open-source) tools to use them, such as Python/pandas or, if you want to slice and dice the data without having to bother with coding, something like LibreOffice.

All you need is a computer and an internet connection (which you must already have to be reading this). So if something interests you, or you actually want to bring facts into an argument about something, then get downloading and munging.

Although modern computers are extremely fast, we still need to consider the efficiency of the algorithms we choose, and even in a high-level language like Python we can make massive efficiency savings by applying a bit of thought to what would otherwise be an ugly, brute force and ignorance approach.

Saving our output in some machine-readable format post-calculation makes a huge amount of sense. HDF5 offers a number of advantages over the Python pickle module, not least the fact that HDF5 format data can be read by many languages other than Python. However, not all data is amenable to being written easily to HDF5.

Monday 22 August 2016

Producing a scatter plot with Pandas and Seaborn

I've written before about Python and its sublime extension Numpy, which vastly extends its capabilities for handling chunky amounts of data. Numpy exists as part of a wider collection of software, often referred to as "the Scipy stack". As well as Scipy itself, two other components of this collection which sit on top of Numpy are Pandas and Matplotlib.

Pandas is a library of routines for manipulating panel data: two-dimensional grids of data, called DataFrames, in which, as in a spreadsheet, data is arranged into rows and columns. Pandas allows the creation and manipulation of these: reshaping, filtering and querying them.

Matplotlib is a powerful set of routines for producing all manner of plots and other visualisations from the data. 

The problem

As an exercise in learning Pandas, I decided to analyse the results of the 2015 UK General Election. The reasons for choosing this rather dry subject were that:
  • It is a moderately complex data set
  • It is in the public domain
  • I already understood it to an extent
  • Any conclusions I reached could be easily verified
I wanted to see if there was any correlation between how "safe" a seat was and voter turnout. My assumption was that in a really safe seat turnout would be lower: people happy with the status quo wouldn't bother voting because they knew they didn't have to, and people unhappy with the status quo wouldn't bother voting because they felt they couldn't change anything.

First get your data


So the first thing to do was to find and download the data set. Fortunately in the UK such data is publicly available from the Electoral Commission, the government body responsible for the administration of elections in the UK, which also publishes scrupulously accurate data.

This was available as an Excel spreadsheet. This is somewhat irritating as Excel is, of course, a proprietary standard and the UK government committed to using open standards (e.g. ODS) some years ago. But never mind.

Now clean it up


The Electoral Commission data is excellent, but the Excel spreadsheet needs a little tidying up before importation. So I exported each of its different worksheets as CSV format files. (Pandas can read Excel, but this seemed cleaner.) Once the CSVs were exported, and things like the column headings in the first row of each were correct and any totals at the bottom of the files had been removed, the importation was simple enough.

(Note here that when importing the files it was also necessary to specify the encoding scheme used on the file, in this case ISO-8859-1.) Soon these were all importing cleanly. There was, however, one important piece of information missing. The files refer to constituency and region IDs. These are administered by the Office for National Statistics, another government body which publishes a huge amount of excellent data for public consumption. So I downloaded this, cleaned it up as before, and it is now available for import as a CSV.

The code


import numpy as np
import pandas as pd
import seaborn as sns
from  matplotlib import pyplot
from matplotlib.ticker import FuncFormatter

Candidates = pd.read_csv('Candidates.csv', encoding = "ISO-8859-1")
Constituencies = pd.read_csv('Constituencies.csv', encoding = "ISO-8859-1")
Results = pd.read_csv('Results.csv', encoding = "ISO-8859-1")
Referendum = pd.read_csv('Referendum.csv', encoding = "ISO-8859-1")
Regions = pd.read_csv('Regions.csv', encoding = "ISO-8859-1")

# Identify those parties which scored more than `threshold` votes in any 
# constituencies to weed out the frivolous ones

# The CSV file has commas in the Total and Electorate fields. Convert from str 
# to int

# Tidy things up a bit. Give some of the columns better names. Merge the
# Labour and Labour & Co-operative figures. Count the speaker as a Conservative
# rather than being a party on his own. 

Results = Results.rename(columns={'Press Association ID Number': 'PAID', 
                  'Constituency Name': 'Constituency', 
                  'Constituency ID': 'ID', 
                  'Constituency Type': 'Type', 
                  'Election Year': 'Year',
                  ' Total number of valid votes counted ': 'Total',
                  'C': 'Con'})
                  
Results['Lab'] = Results.fillna(0)['Lab'] + Results.fillna(0)['Lab Co-op']
Results['Con'] = Results.fillna(0)['Con'] + Results.fillna(0)['Speaker']
Results = Results.drop(['Year', 'Lab Co-op', 'Speaker'], axis=1)
Results.Total = Results.Total.str.replace(',', '').astype(int)
Results.Electorate = Results.Electorate.str.replace(',', '').astype(int)
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].fillna(0)

q=pd.DataFrame(np.sort(RawResults.values)[:,-2:], columns=['2nd','1st'])
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = q['1st'].astype(int)
Results['Majority'] = (q['1st']-q['2nd']).astype(int)
Results['Share'] = Results.Votes/Results.Total
Results['Turnout'] = Results.Total/Results.Electorate
Results['Safeness'] = Results.Majority/Results.Electorate

Results = Results.drop(RawResults, axis=1)
Results = Results.set_index('PAID')

Seats = pd.DataFrame(Results.groupby('Party').ID.nunique())
Seats = Seats.rename(columns={'ID':'Seats'})

# So to keep things uncluttered, only consider parties with > 10 seats 

MinSeats = 10
Parties = Seats[Seats.Seats > MinSeats].index

p=Results[Results.Party.isin(Parties)]
Apathy = p[['Turnout', 'Safeness', 'Party']]

sns.set_palette(['red', 'blue', 'yellow', 
                 'orange', 'purple', 'cyan', 
                 'magenta', 'green', 'brown'])

fg = sns.FacetGrid(Apathy, hue='Party')
fg.map(pyplot.scatter, "Turnout", "Safeness", s=2).add_legend()

fg.ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
fg.ax.xaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
fg.ax.set_title("2015 UK General Election\n")
fg.ax.xaxis.set_ticks(np.arange(0.4, 1.0, 0.1))
fg.ax.yaxis.set_ticks(np.arange(0.0, 0.6, 0.1))
fg.ax.set_xlabel('Turnout')
fg.ax.set_ylabel('Safeness')
fg.savefig('Apathy.png', figsize=(16,12), dpi=1200)

So the first thing I did was to import various libraries - e.g. pandas, matplotlib - into Python so I could make use of them later. Having imported pandas, I then imported the various CSV files into their own DataFrames.

Next I tidied the frames up a little. I altered some of the column names and aggregated the "Labour" and "Labour and Co-operative" votes as just Labour votes. (The Electoral Commission distinguishes between the two, but they sit as a single bloc in parliament.) I also allocated the Speaker (who according to the Electoral Commission is his own party) to the Conservative Party.

Because in the CSV file the numbers are formatted with a comma as a thousands separator, Pandas imported the Total and Electorate fields as strings, so I then had to remove the commas and convert the number from string (str) format to integer (int).

Now Pandas starts to come into its own.

Munging the data

The raw election results are contained in 137 columns of the Results DataFrame, one column for each political party which fielded any candidates in the election. The first of these was "Unnamed: 9" and the last "Zeb". So to make handling them easier, I sliced them out as RawResults.

Next, for each of the 650 electoral constituencies (one per row) I found the largest and second largest votes (i.e. the winner and runner-up), from which I could calculate the turnout and the size of the majority. This then allowed me to calculate, amongst other things, the "safeness" of the seat. This is a measure I have made up which is simply the majority divided by the total electorate, which in theory gives a number between 0.0 for an extremely marginal seat and 1.0 for a seat where everybody voted and they all voted for one candidate - e.g. Pyongyang North.

The next step was to find the name (column index) of the winning party, using the idxmax() call. (Note that axis=1 means to do this row by row, rather than column by column which would be axis=0).

Then to simplify matters, having done with them, I dropped the various RawResults columns from the Results DataFrame and then re-indexed the DataFrame using the Press Association's ID for each constituency. This isn't really necessary, but I prefer things tidy.

Next, I worked out the number of seats won by each party by grouping the results by party and then counting the number of unique IDs. I then filtered the results, excluding all the parties who won 10 seats or fewer, and then produced a DataFrame, Apathy, which had a row for each constituency and columns giving the turnout, "safeness" of the seat and the winning party.

Producing the scatter plot

Matplotlib has very powerful plotting capabilities, but there is a separate set of libraries, called Seaborn, which work with Pandas (and Scipy) using Matplotlib and the underlying Numpy. So the "stack" looks something like this.

Because Seaborn builds on the underlying Matplotlib, we can also use many of Matplotlib's features to, for example, alter the formatting of the axes to percentages and set the range of ticks on the two axes.

Seaborn would, by default, provide a suitably tasteful palette of colours, but I wanted to make sure that the colour used reflected those usually used by the parties - red for Labour etc., and so I set my own palette. (It contains some additional colours in case I wanted to increase the number of parties represented on the scatter plot).

This is what it produced.

which I think is quite beautiful.

Conclusions

The analysis doesn't come to any startling political conclusions: turnout is lower in Labour seats than Conservative ones. Both Labour and Conservative have a roughly equivalent distribution of safe and marginal seats. Turnout was typically higher in SNP seats and there are few safe ones, presumably both of these being because of the very hard-fought nature of the 2015 election in Scotland and the fact that the incumbent Labour MPs were displaced by an SNP insurgence. So no surprises there, then.

But, from a technical perspective it's quite remarkable. In a little over a screen of code we have performed a highly complex analysis and produced a publication-quality visualisation. The really remarkable thing, though, is that there are no nested loops or similar which we would normally require to iterate over the rows and columns extracting counts, totals, maxima etc. In any other language - including "ordinary" Python - the code would have been significantly more complex, longer and much more difficult to debug. In something like C the job would literally have taken several days' worth of programming.

Finally, because Pandas uses the underlying Numpy libraries, which are written in native FORTRAN and C, to do all the heavy lifting, the code is blindingly fast.


Wednesday 6 April 2016

Numpy efficiency and lumpy sigmoids

The problem

Some time ago I wrote about the relative efficiency of the Python numpy library when compared with Octave and C, and I thought it would be interesting to see how numpy compares with Python itself.

For those unfamiliar with numpy, it is a wonderful open-source extension to standard Python which allows powerful multidimensional array handling.

One of numpy's most powerful features is "broadcasting". This allows a single operation to operate on entire arrays. This is important as the underlying code for manipulating arrays in numpy is actually written in highly optimised FORTRAN and C. As a result, although there is an overhead associated with creating the numpy array and any calls to the underlying code, the actual array manipulation is extremely efficient and fast.

Python is, of course, an interpreted (byte compiled) language, and so doesn't have the efficiency associated with compiled languages. Back in the days of yore, in order to perform an operation (say) multiplying two arrays of random numbers together, one might in the absolute worst case do something like:

   import random

   # n is the length of the arrays being multiplied
   x = []
   y = []
   for m in range(n):
     x.append(random.random())
     y.append(random.random())
   z = []
   for a, b in zip(x, y):
     z.append(a * b)

As these sorts of operations are quite commonplace, list comprehensions were introduced in Python 2.0, which allow us to write the same code as above much more elegantly and efficiently.

   x = [random.random() for k in range(n)]  
   y = [random.random() for k in range(n)]  
   z = [a*b for a, b in zip(x, y)]  

List comprehensions have the additional advantage that they are much faster (about 1.8x to 1.9x faster on my system running Python 3.4).

So I decided to compare the performance of the list comprehension above with that of numpy broadcasting, where the list comprehensions above would be replaced with the wonderfully clear and understandable

   import numpy

   x = numpy.random.rand(n)
   y = numpy.random.rand(n)
   z = x*y

We would expect the relative performance of numpy versus list comprehension to be dependent on the size of the arrays involved. With very small arrays, the overhead of creating the numpy array and handling the calls to the underlying code wouldn't be worth it and numpy might even in a worse case be slightly less efficient. But as the array length, n, increases we would expect the performance to increase and plateau out to a maximum, with the performance curve being a classic sigmoid shape.

So I performed some tests to determine the average time to multiply arrays of random floats together using list comprehensions and numpy broadcasting, for arrays of a range of lengths from 1 to 5,000,000. Timing is to the nearest microsecond, using the Python datetime library.
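The exact timing harness isn't reproduced here, but it was along these lines (a sketch of the approach described: datetime-based timing, averaged over many runs; the function names are illustrative):

import datetime
import random
import numpy

def average_time(func, runs=1000):
    # Average wall-clock time for func() over a number of runs
    total = datetime.timedelta()
    for _ in range(runs):
        start = datetime.datetime.now()
        func()
        total += datetime.datetime.now() - start
    return total / runs

n = 10000   # array length under test

def listcomp():
    x = [random.random() for k in range(n)]
    y = [random.random() for k in range(n)]
    return [a * b for a, b in zip(x, y)]

def broadcast():
    x = numpy.random.rand(n)
    y = numpy.random.rand(n)
    return x * y

print("list comprehension:", average_time(listcomp))
print("numpy broadcasting:", average_time(broadcast))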

The results

The results, as expected, give a classic sigmoid shape. For larger array sizes we see a performance ratio of about 12:1.

I'd originally intended to write this post purely about the relative efficiency of numpy and list comprehensions. But I got quite a surprise when I looked at the graphs of the performance tests.

So as you can see, we get a broadly sigmoid curve, with little or no advantage for arrays shorter than about 10³ elements. (Note the logarithmic scale.) We would expect some quantisation noise, caused by the fact that we are only timing to the nearest microsecond and, for lower values of n, really short arrays will be processed in less time than that. But the figures were averaged over 1,000 runs, which should smooth things out somewhat.

But it's very noisy. Also the curve for Windows is much noisier than that for Linux. There are two possible causes of this.

Garbage collection. This is when Python periodically pauses to release objects from memory which are no longer required. This can be turned off by importing the Python gc library and enabling and disabling garbage collection at the appropriate times using gc.enable() and gc.disable(). Note that both systems have an abundance - 16 GB - of memory.
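In practice that just means wrapping the timed section, something like this (a minimal sketch):

import gc

gc.disable()    # suspend automatic garbage collection while timing
# ... run the timed array multiplication here ...
gc.enable()     # restore normal behaviour afterwards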

This is far and away the most likely cause. But it could also be...

Multiple cores. Python programs are sometimes moved between CPU cores during operation: the operating system, for whatever reason, might decide to unload the process from one core and move it to another, or spread an application across multiple cores (and the Global Interpreter Lock adds its own scheduling effects). This can obviously make attempting to time stuff accurate to the microsecond difficult. It's possible to control this using the task manager in Windows and by using the taskset -c command in Linux.
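On Linux the same pinning can also be done from within Python itself, which is handy for a self-contained benchmark (a sketch; os.sched_setaffinity is Linux-only):

import os

os.sched_setaffinity(0, {0})   # pin the current process (pid 0 = self) to CPU core 0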

So what effect does this have? First, let's try disabling garbage collection.


Clearly there's a dramatic improvement in the Windows curve but the Linux one is fundamentally much the same. So what effect would binding the Python interpreter to a single CPU core have? To be honest, very little.

 

Curiously, on all the runs, there is an inflexion in the Linux curve at about n = 10⁵ which can't be explained by garbage collection etc. This consistently occurs in the same place every time the software is run, so something must be causing it but I have no idea what. (I've ruled out obvious things like paging/swapping, due to the huge amount of physical memory).

Conclusions

  1. We do, as expected, get a sigmoid curve.
  2. Numpy begins to get more efficient in this simple case at n > 10³, but if we were performing significantly more complex calculations than generating and multiplying random numbers, numpy would begin to win at lower n. (Perhaps this is something to be tested another time.)
  3. For large n (> 10⁵), numpy outperforms a Python list comprehension typically by a factor of 12.
  4. Quantisation noise occurs at lower values of n but can be smoothed out by averaging over several runs. The noise and discontinuity in the results is significantly worse under Windows. This is unlikely to be related to external system load, as both systems are relatively lightly loaded, and is more likely to be due to Linux coping better under load. (It is noticeable, for example, that even at 100% load, most Linux systems will have a decent interactive response. In fact the Linux system on which these tests were performed is significantly more heavily loaded than the Windows system, has a slower CPU and actually completed the tests approximately 30% faster than the Windows one.)
  5. Clearly something must be causing the inflexion in the Linux curve at n ~ 10⁶. But I don't know what.


Tuesday 23 February 2016

Making a timelapse video with a webcam and Linux

Out of my window I can see a building site on the site of the old BBC studios on Oxford Road, Manchester. As there's going to be a few years building work, I thought it would be interesting to produce some sort of time-lapse film of the works from my office window.

This post explains how I did it.

I have an existing Ubuntu Linux system which runs 24x7 doing a variety of different things, and as this was likely to have far better uptime than my Windows system I decided to use it to do the hard work.

First connect your webcam

So the first thing I did was to install a (cheap) USB web-cam bought from my local PC dealer for about a tenner. Make sure that the webcam you get is either marked as Linux compatible or compliant with the UVC standard. Mine just worked straight out of the box.

The most difficult thing was physically fixing the webcam in place. Most of them come with a fairly flimsy plastic clip designed to secure them to the top of a monitor, but that was unsuitable for my purposes. As I want to leave it in fundamentally the same position for several months, I also needed to be confident that if it was moved, say for cleaning, then I could reposition it reasonably accurately.

I ended up securing the webcam to the top of an old tin of Chinese tea with elastic bands and wedging the entire thing against the window fittings, so it shouldn't move overly much. Not particularly elegant. But it works beautifully.

Now select your software

I wanted the webcam to take a photo every minute or so, and tried various bits of client software to do this, looking for the one with the best combination of features for my requirements. Eventually I settled on the fswebcam client, which can be installed using the command

$ sudo apt-get install fswebcam

After a bit of experimentation I found the command that gave me the best results was

$ /usr/bin/fswebcam  --background --device /dev/video0 --loop 60 --resolution 1280x720 --skip 10 --frames 3 --delay 5 --bottom-banner /home/tim/archive/tim/pix/\%Y-\%m-\%d_\%H:\%M:\%S.jpeg

The options do the following things:

--background flag tells the device to run in the background
--device specifies which device to use for capture, in this case the (rather obvious) /dev/video0
--loop makes the camera fire every 60 seconds
--resolution sets the image resolution. 1280 x 720 results in JPEG image of anywhere between about 40 and 140 KB depending on how much the image can be compressed
--skip 10 --frames 3 --delay 5 makes the camera delay for 5 seconds, then skip the first 10 frames, then take an average image over the next 3 frames. I did this because I found that the (cheap) camera I had used had difficulty coping with changing light levels, and so, for example, photos taken in the morning as the sun rose were completely overexposed. I recommend playing around with these settings until you get something which suits your requirements.
--bottom-banner produces a timestamped banner at the bottom of the image.

This is the sort of image it produces.



So I now had something which was taking an image every minute and placing it in the /home/tim/archive/pix directory with a filename something like 2016-02-23_10:12:00.jpeg (this is important as we want to be able to stitch many images together sequentially).

Note, however, that a 140 KB file being generated every minute does tend to chew your disk up: it will generate about 200 MB of output a day or, if you prefer, about 70 GB a year. My /home/tim/archive directory is actually a mount-point for a 1 TB disk, and so there's plenty of room, but if you have limited space then either reduce the image resolution or the frequency with which images are captured. I also periodically manually thin out the images (using rm), discarding, for example, any which are taken between 18:00 and 08:00 the following day as there is no building work going on during these periods. This reduces the daily output from 200 MB to about 80 MB.

Note also that if the frequency with which you want to take photos is measured in minutes, then rather than using the --loop option you could just as easily do it with a cron job, by creating a crontab using the

$ crontab -e

command. This would have the advantage that you could have much finer-grained control over the times at which pictures were taken - e.g. Monday to Friday, 08:00 to 18:00.

Although my system typically has uptimes of several months, it is rebooted occasionally. So I wanted to make sure that the image capture started automatically when it did so. I therefore put the fswebcam command above in the /etc/rc.local configuration file.

One bit of strangeness is that fswebcam doesn't work if you aren't logged in, even if it is run as a cron job or with something like nohup. This doesn't matter on my system as I am automatically logged in at boot, but be aware of it and, if you find a solution, please let me know!

Now stitch your frames together


I think that, really, most people won't be interested in a time-lapse video longer than about one or two minutes in duration. So assuming that we're showing maybe 24 frames per second (FPS), that's about 1,440 frames per minute of video. Now, taking a picture every minute for eight hours a day means that my setup produces about 480 frames per day or, if you prefer, 20 seconds of video if I were to show all the frames.

Assuming that the timelapse is to be over a longer period, then it's necessary either to take fewer frames in the first place, by controlling the frequency with which fswebcam acquires images as discussed above, or alternatively to select only certain images. I chose the latter option, as I have plenty of storage space and didn't know whether, at some stage, I might want to do a more granular video.

Now, my preferred way of stitching these frames together is ffmpeg. This is no longer included in the standard Ubuntu distribution; instead, there is a fork of it called avconv.

This has a really cool feature which allows sequentially numbered frames to be stitched together into a video. Now as I only want to select some frames, I wrote a very simple Python script which selects the frames I want between certain hours of the day and days of the week, and also introduces a "gap", so only selecting one frame in (say) 20 in order to keep things manageable.

The script creates sequentially numbered symbolic links to the image files in a separate directory for them to be subsequently stitched together using avconv. As I want to keep this blog post short, please contact me if you want a copy of the Python script.
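To give a flavour, the script does something along these lines (a sketch rather than the actual script; the selection rules shown - Monday to Friday, 08:00 to 18:00, one frame in twenty - are just examples):

import datetime
import glob
import os

SRC = "/home/tim/archive/tim/pix"        # where fswebcam writes its images
DST = os.path.join(SRC, "enumerated")    # avconv will read %06d.jpeg frames from here
GAP = 20                                 # keep only one frame in every 20

def keep(path):
    # Filenames are of the form 2016-02-23_10:12:00.jpeg, so the timestamp
    # can be recovered directly from the name
    stamp = datetime.datetime.strptime(os.path.basename(path)[:-5], "%Y-%m-%d_%H:%M:%S")
    return stamp.weekday() < 5 and 8 <= stamp.hour < 18

os.makedirs(DST, exist_ok=True)
frames = sorted(glob.glob(os.path.join(SRC, "*.jpeg")))
for i, frame in enumerate([f for f in frames if keep(f)][::GAP]):
    os.symlink(frame, os.path.join(DST, "{:06d}.jpeg".format(i)))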

Once the appropriate frames have been selected and sequentially numbered and placed in a subdirectory called enumerated, the separate images are then combined using avconv as follows:

$ /usr/bin/avconv -y -i enumerated/%06d.jpeg -r 25 -c:v libx264 -crf 20 -pix_fmt yuv420p output.mp4

I then use the wonderfully easy to use OpenShot 1.4.3 video editing suite. I've actually found that although the interface isn't nearly as slick-looking as OpenShot 2.0, the older version is much easier to use and so have reverted to using that. The advantage of using something like OpenShot is that it's also possible to add things like credits, background music and other effects as well as being able to save the resultant film in a number of formats or even upload it directly to YouTube.

Sunday 15 November 2015

Configuring a Windows 8 / Secure Boot laptop for dual-boot Linux use

I have a new assignment coming up which means I will be spending a lot of time on customer sites, and so I decided that it was time I got a new laptop. I saw a great deal on an HP Pavilion ???? and so I bought it. Although I'm an open-source advocate, self-employment makes me realistic enough to realise that there are certain applications which customers insist I use (e.g. Microsoft Word) or for which there is no credible open-source equivalent (e.g. SmartDraw).

Typically in the past I've configured my laptops to be dual-boot Windows and (usually Ubuntu) Linux, allowing me to do customer work using the Windows incarnation and Linux for everything else. Sadly the new laptop runs Windows 8.1 and also has UEFI firmware with Secure Boot enabled.

This post explains how I configured my new laptop to be dual boot. These steps worked for me and are provided in good faith. But they come with absolutely NO warranty whatsoever: if you end up turning your laptop into an expensive door-stop then that's not my worry.

  1. So the first thing to do was to generate a recovery disk which - if things went horribly wrong - would at least allow me to restore the laptop back to its factory state. It is really important to do this. Seriously. The HP laptop ships with a recovery partition predefined on the disk, and generating the recovery disk on the HP is trivially easy and essentially involves transferring the image from the disk to a 32 GB USB drive which is now safely in a drawer.
  2. The recovery partition predefined on the HP Pavilion is removed as part of the process of producing the recovery disk. This left space at the top-end of the disk. I then used the disk-management utility in Windows 8.1 (right-click the Windows icon on the bottom left of the screen) to shrink the main (C:) partition to allow sufficient space for Linux to reside comfortably.
  3. Again using the Windows disk manager I split the available space in to two partitions: one for swap-space (16 GB) and the remainder for the Linux file systems. When you do this I recommend you don't format the new partitions or assign them drive-letters.