Saturday 24 September 2016

Screen scraping a web-based table into Pandas with BeautifulSoup

In an ideal world, if we want to munge some data, it is readily accessible online or downloadable as XML or CSV files. Sometimes this is not the case, and we may wish to extract some data directly from a web page.

This might be for a number of reasons. The author of the data may simply not have bothered to make it downloadable. The data may be changing rapidly, e.g. exchange rates. Or the author may want to make you visit his page in order to read the data, so that he gets page impressions and possible click-through revenue.

We could, if we wished, point a web browser at the appropriate page and either save the page to a file and then load that file into an editor or, in an extreme case, copy the information we want by hand. However, another way is to scrape the information automatically. Now, let's be honest, scraping is a slightly desperate and grubby way of getting hold of data, but it's much faster than attempting to transcribe it by hand.

As with so many things, Python is your friend and Swiss Army knife here. This post covers scraping a table of information from an arbitrary website, importing it into Pandas and cleaning it up. The site I have used is http://ukpollingreport.co.uk, which contains a page with a table of historical opinion poll figures.

First look at the page

The first thing to do is to visit the page in your chosen browser and then look at its structure. If we scroll through the source, we eventually find the division we want, containing the data table. It begins as follows:


<div class='polltable'>
<table border='1' cellspacing='0' cellpadding='0' rules='rows' frame='hsides'>
<tr>
<td width='220'  rowspan='2' bgcolor='#F7F7F7'><span>&nbsp;</span></td>
<td width='105' rowspan='2' bgcolor='#F7F7F7'><strong><span>Survey End Date</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>CON (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>LAB (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>LD (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>UKIP (%)</span></strong></td>
<td width='51' rowspan='2' bgcolor='#F7F7F7'><strong><span>Grn (%)</span></strong></td>
<td width='61' rowspan='2' bgcolor='#F7F7F7'><strong><span>Con Lead</span></strong></td>
</tr>
<tr></tr>
<tr>
<td align=left>Ipsos-MORI/Evening Standard </td>
<td align=center> 2016-09-14</td>
<td align=center> 40 </td>
<td align=center>34 </td>
<td align=center> 6</td>
<td align=center>9 </td>
<td align=center>5 </td>
<td class='conlead' align=center> 6</td>
</tr>

(or at least it did when I wrote this post).

Let your Python romp

Some webpages will only respond to certain browsers. This may be for commercial or security reasons, or it may be because the page serves slightly different content depending upon the client browser (e.g. HSBC attempts to bully you into downloading their ghastly Rapport "security" software if you log in with a Windows-based browser). A browser normally sends suitable header information to the web server at the far end, from which the server determines the browser type.

So if we simply attempt to request this page with urllib's default headers, we get a 403 (Forbidden) error back. Fortunately, though, Python allows us to override the default headers with whatever we like, and so we can essentially masquerade as whatever browser takes our fancy.

import urllib.request

url = 'http://ukpollingreport.co.uk/voting-intention-2/'

# Headers copied from a real browser session, so that the server treats us
# as an ordinary desktop Chrome client rather than a script.
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

# Build the request with our spoofed headers (the second argument is POST
# data, which we don't need, hence None).
request = urllib.request.Request(url, None, header)


One minor point of etiquette, though, is that the author might have some sound commercial reasons for not wanting some cheap punk like me to scrape his data. Please consider that.



Make soup


If we were just to download the raw HTML using a simple response.read(), we would simply get the block of HTML code which is normally decoded by your browser and rendered into a webpage. HTML is fundamentally a hierarchical language, and so we can use the sublime BeautifulSoup Python library. This parses the HTML and gives us all manner of spiffy ways of searching and navigating the resulting structure.

So, having made soup of the HTML, we now simply have to find the polltable division and then process it.

From inspection we see that the first row contains the column headings, and that splitting each row's text on newlines leaves blank entries at the edges which can be discarded. There is then a blank row, after which the data starts.

from bs4 import BeautifulSoup

with urllib.request.urlopen(request) as response:
    page = response.read()
    soup = BeautifulSoup(page, "lxml")
    # Inspection of the page source tells us that the polling table is in a
    # <div class='polltable'>, so search on that. Splitting each row's text
    # on newlines leaves blank entries at the edges, which the slice
    # [1:-2] discards.
    rows = []
    table = soup.find("div", attrs={'class': 'polltable'})
    for i, row in enumerate(table.find_all('tr')):
        if i == 0:
            # Row 0 holds the column headings; the first heading cell is
            # blank, so supply a name for it ourselves.
            colnames = row.text.split("\n")[1:-2]
            colnames[0] = "Source"
        elif i > 1:
            # Row 1 is the blank spacer row; everything after it is data.
            rows.append(row.text.split("\n")[1:-2])
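
At this point colnames should hold something like ['Source', 'Survey End Date', 'CON (%)', 'LAB (%)', 'LD (%)', 'UKIP (%)', 'Grn (%)', 'Con Lead'] (the exact strings come from the table headings above), and rows is a list of lists of strings, one per poll.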


Unleash the pandas

Having now got a list of column headings and a table (a list of lists) of data, it's simply a question of creating the appropriate pandas DataFrame and cleaning things up a bit, and then we can play. The column names (extracted from the first row of the scraped table) are already a list, with the first (blank) one corrected in the loop above.

Now, as things stand, all the various cells in the table we have created are just str types - that is, "17" rather than 17. So we now convert them to datetime or numeric types as appropriate.

import pandas as pd

df = pd.DataFrame(rows)
df.columns = colnames
# Parse the survey dates, then coerce the remaining columns to numbers
# (errors='coerce' turns any stray non-numeric cell into NaN).
df['Survey End Date'] = pd.to_datetime(df['Survey End Date'])
for c in df.columns[2:]:
    df[c] = pd.to_numeric(df[c], errors='coerce')

Next we clean up the column names, removing the trailing " (%)" from them, as this doesn't add anything.
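
A minimal sketch of that renaming, using a plain list comprehension over the column names:

# Strip the trailing " (%)" from headings such as "CON (%)".
df.columns = [c.replace(' (%)', '') for c in df.columns]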

In this instance I am only interested in results from May 2015 onwards, so we also filter the dataframe:

# Keep only polls whose fieldwork ended after the May 2015 general election.
earliest = pd.Timestamp('2015-05-01')
df = df[df['Survey End Date'] > earliest]
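
And now we can indeed play. As a quick check that everything has worked, and assuming you have matplotlib installed, something like this plots the Conservative lead over time (the 'Con Lead' column name comes from the table heading above):

import matplotlib.pyplot as plt

# Scatter the Conservative lead against the survey end date.
df.plot(x='Survey End Date', y='Con Lead', marker='.', linestyle='none')
plt.ylabel('Con lead (%)')
plt.show()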


