Saturday 24 September 2016

Screen scraping a web-based table into Pandas with BeautifulSoup

In an ideal world, any data we want to munge is readily accessible online or downloadable as XML or CSV files. Sometimes this is not the case, and we may wish to pull the data directly from a web page.

This might be for a number of reasons. The author of the data just hasn't bothered making it downloadable. Or the data is changing rapidly, e.g. exchange rates. Or the author wants to make you visit his page in order to read the data, so that he gets page impressions and possible click-through revenue.

We could, if we wished, point a web browser at the appropriate page and either save the page to a file and then load that file into an editor or, in an extreme case, copy the information we want by hand. Another way, however, is to scrape the information automatically. Now, let's be honest, scraping is a slightly desperate and grubby way of getting hold of the data, but it's much faster than attempting to transcribe it by hand.

As with so many things, Python is your friend and Swiss-Army knife here. This post covers scraping a table of information from an arbitrary website, importing it into Pandas and cleaning it up. The site I have used is http://ukpollingreport.co.uk, which contains a page with a table of historical opinion poll figures.

First look at the page

The first thing to do is to visit the page using your chosen browser and then look at its structure. If we scroll through it, we eventually find the division we want, containing the data table. In the source this is the part that begins as follows:


<div class='polltable'>
<table border='1' cellspacing='0' cellpadding='0' rules='rows' frame='hsides'>
<tr>
<td width='220'  rowspan='2' bgcolor='#F7F7F7'><span>&nbsp;</span></td>
<td width='105' rowspan='2' bgcolor='#F7F7F7'><strong><span>Survey End Date</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>CON (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>LAB (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>LD (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>UKIP (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>Grn (%)</span></strong></td>
<td width='61' rowspan='2' bgcolor='#F7F7F7'><strong><span>Con Lead</span></strong></td>
</tr>
<tr></tr>
<tr>
<td align=left>Ipsos-MORI/Evening Standard </td>
<td align=center> 2016-09-14</td>
<td align=center> 40 </td>
<td align=center>34 </td>
<td align=center> 6</td>
<td align=center>9 </td>
<td align=center>5 </td>
<td class='conlead' align=center> 6</td>
</tr>

(or at least it did when I wrote this post).

Let your Python romp

Some webpages will only respond to certain browsers. This may be for commercial or security reasons, or it may be because the page serves slightly different content depending upon the client browser (e.g. HSBC attempts to bully you into downloading their ghastly Rapport "security" software if you log in with a Windows-based browser). Our browser would normally send suitable header information to the web server at the far end, from which it would determine the browser type.

So if we simply attempt to request this page, we get an error message (403) returned telling us that it is forbidden. Fortunately, though, Python allows us to override its default headers with whatever we like and so we can essentially masquerade as whatever browser takes our fancy.

import urllib.request

url = 'http://ukpollingreport.co.uk/voting-intention-2/'

# Masquerade as a regular desktop browser so that the server doesn't return a 403
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

request = urllib.request.Request(url, None, header)


One minor point of etiquette, though, is that the author might have some sound commercial reasons for not wanting some cheap punk like me to scrape his data. Please consider that.



Make soup


If we were just to download the raw HTML using a simple response.read(), we would get the block of HTML code which is normally decoded by your browser and rendered into a webpage. HTML is fundamentally a hierarchical language, and so we can use the sublime BeautifulSoup Python library, which parses the HTML and gives us all manner of spiffy ways of searching and navigating the resulting structure.

So having made soup of the HTML, we now simply have to find the polltable division and then process it.

From inspection we see that the first row contains the column headings, and that there is a blank column on either side which can be discarded. There is then a blank row, after which the data starts.

from bs4 import BeautifulSoup

with urllib.request.urlopen(request) as response:
    page = response.read()
    soup = BeautifulSoup(page, "lxml")
    # Inspection of the page source tells us that the polling table is in a
    # <div class='polltable'>, so search on that. Note that we can
    # discard the leftmost and two rightmost columns
    rows = []
    table = soup.find("div", attrs={'class': 'polltable'})
    for i, row in enumerate(table.find_all('tr')):
        if i == 0:
            # The first row holds the column headings; the leftmost cell is blank
            colnames = row.text.split("\n")[1:-2]
            colnames[0] = "Source"
        elif i > 1:
            # Row 1 is blank, so skip it; everything after it is data
            rows.append(row.text.split("\n")[1:-2])
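
At this point it is worth a quick check that we have picked up what we expect; a couple of print statements will do (the exact row count will, of course, depend on when the page is scraped):

# Quick check on the scraped column names and the number of data rows
print(colnames)
print(len(rows), "rows scraped")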


Unleash the pandas

Having now got a list of column headings and a table (a list of lists) of data, it's simply a question of creating the appropriate pandas DataFrame and cleaning things up a bit, and then we can play. So we convert the column names (extracted from the first row of the scraped table) into a list, and correct the first (missing) one.

Now, as things stand, all the various cells in the table we have created are just str types - that is, "17" rather than 17. So we now convert them to datetime or numeric types as appropriate.

import pandas as pd

df = pd.DataFrame(rows)
df.columns = colnames
# Convert the date column to datetimes and the poll figures to numbers
df['Survey End Date'] = pd.to_datetime(df['Survey End Date'])
for c in df.columns[2:]:
    df[c] = pd.to_numeric(df[c], errors='coerce')
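
If you want to reassure yourself that the conversions have worked, a quick look at the dtypes and the first few rows will do (just a sanity check on the DataFrame built above):

# Dates should now be datetime64 and the poll figures numeric
print(df.dtypes)
print(df.head())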

Next we clean up the column names, removing the trailing " (%)" from them, as this doesn't add anything (a sketch of this is shown below). In this instance I am only interested in results from May 2015, so we then filter the dataframe.
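
The rename is a one-line job; something like the following should do it (a minimal sketch, assuming the suffix appears in the scraped headings exactly as " (%)"):

# Strip the trailing " (%)" from the poll-figure column names
df.columns = [c.replace(" (%)", "") for c in df.columns]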

import datetime

# Keep only polls taken since the May 2015 general election
earliest = datetime.date(2015, 5, 1)
df = df[df['Survey End Date'] > earliest]



Saturday 3 September 2016

Pity the plight of the poor peripatetic pythonista

The senior management at Casa Snowy has taken over the office this weekend to research his trip to Hong Kong next weekend. So I have been relegated to the dining table and my laptop rather than the proper keyboard and 24" monitor I normally use.

I've recently started to use Pycharm as my IDE for Python, as its debugging facilities suit my tastes better than Spyder's. (They're both very good, just different.)

As ever, all the projects I am working on at the moment are stored under Subversion on the server, so working on them should be as simple as pointing Pycharm at the appropriate network share and away we go. Now, I really was not intending this post to be a rant. But...

One of Python's great strengths is its open-source background: the fact that no individual or corporation owns Python is one of the secrets of its success, for a couple of reasons. Firstly, Python is available on pretty much any platform you are ever likely to use, from a Raspberry Pi to the largest supercomputer. Secondly, in addition to the standard libraries which come with "standard" Python as downloaded from python.org, there is a whole ecosystem of freely available libraries for more specialised stuff: for example, the sublime numpy/scipy for number crunching, ReportLab for producing great-looking PDFs, astropy for performing all manner of astronomical calculations, and OpenCV for computer vision applications.

These are written and maintained by specialists in their particular fields, and many thousands of pythonistas worldwide are profoundly grateful for their expertise and knowledge. However, herein lies a problem. Some packages, for example numpy/scipy, are fantastically well maintained. Others less so. So, for example, there are still some packages which haven't quite made it over the line to Python 3, e.g. (at the time of writing) OpenCV. Now, there are instructions out there as to how to build such things manually from source. But I'd hoped we had now got to the stage where such things are no longer required. I'm lazy. I just want to be able to do a

$ sudo apt-get install blah

or a

$ sudo pip install blah

But these problems are a mere bagatelle compared with the egregious mucking-about I have had this afternoon trying to get Pycharm working cleanly. First it wouldn't recognise my SVN version control system. And I had already spent much of last weekend attempting to get geopandas working on a couple of Windows systems, with limited success due to the lack of a Microsoft DLL, complete with a "help" page on the Microsoft website which managed to be both smug and utterly useless.

Today has been spent uninstalling, reinstalling and rebooting as I have attempted to overcome various strange, subtle interactions between 32-bit and 64-bit applications. I know that eventually this will all just work, but it really does seem to be a problem peculiar to Windows: on Linux, MacOS and everything else, stuff just works. Why on earth does it have to be so difficult in the Microsoft universe?

Anyway, after several hours of messing about, downgrading everything to 32-bit, finally it works. But what a palaver.