Wednesday 6 April 2016

Numpy efficiency and lumpy sigmoids

The problem

Some time ago I wrote about the relative efficiency of the Python numpy library compared with Octave and C, and I thought it would be interesting to see how numpy compares with plain Python itself.

For those unfamiliar with numpy, it is a wonderful open-source extension to standard Python which provides powerful multidimensional array handling.

One of numpy's most powerful features is "broadcasting", which allows a single operation to act on entire arrays at once. This is important because the underlying code for manipulating arrays in numpy is written in highly optimised Fortran and C. As a result, although there is an overhead associated with creating the numpy array and with each call down to the underlying code, the actual array manipulation is extremely efficient and fast.
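For instance, a single expression can multiply two arrays element by element, or combine an array with a scalar, with the looping happening in the compiled library code rather than in Python. A trivial sketch (the values are just for illustration):

   import numpy

   a = numpy.array([1.0, 2.0, 3.0, 4.0])
   b = numpy.array([10.0, 20.0, 30.0, 40.0])

   # one operation multiplies the arrays element by element,
   # giving 10.0, 40.0, 90.0, 160.0
   print(a * b)

   # broadcasting proper: the scalar is "stretched" to match the array's shape,
   # giving 101.0, 102.0, 103.0, 104.0
   print(a + 100.0)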

Python is, of course, an interpreted (byte-compiled) language, and so doesn't have the efficiency associated with compiled languages. Back in the days of yore, in order to perform an operation such as multiplying two arrays of random numbers together, one might in the absolute worst case have done something like this:

   import random

   x, y, z = [], [], []
   for m in range(n):
     x.append(random.random())
     y.append(random.random())
   for a, b in zip(x, y):
     z.append(a * b)

As these sorts of operation are quite commonplace, list comprehensions were introduced back in Python 2.0, allowing the code above to be written much more elegantly:

   x = [random.random() for k in range(n)]  
   y = [random.random() for k in range(n)]  
   z = [a*b for a, b in zip(x, y)]  

List comprehensions have the additional advantage of being much faster (about 1.8x to 1.9x on my system running Python 3.4).
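If you want to check that figure on your own machine, the standard timeit module makes it easy. A rough sketch, in which n and the repeat count of 100 are arbitrary choices of mine:

   import random
   import timeit

   n = 100000

   def with_loops():
       x, y, z = [], [], []
       for m in range(n):
           x.append(random.random())
           y.append(random.random())
       for a, b in zip(x, y):
           z.append(a * b)
       return z

   def with_comprehensions():
       x = [random.random() for k in range(n)]
       y = [random.random() for k in range(n)]
       return [a * b for a, b in zip(x, y)]

   t_loops = timeit.timeit(with_loops, number=100)
   t_comps = timeit.timeit(with_comprehensions, number=100)
   print("loops take %.2fx as long as comprehensions" % (t_loops / t_comps))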

So I decided to compare the performance of the list comprehensions above with that of numpy broadcasting, replacing them with the wonderfully clear and understandable:

   import numpy

   x = numpy.random.rand(n)
   y = numpy.random.rand(n)
   z = x*y

We would expect the relative performance of numpy versus list comprehensions to depend on the size of the arrays involved. With very small arrays, the overhead of creating the numpy array and handling the calls to the underlying code wouldn't be worth it, and numpy might even, in the worst case, be slightly less efficient. But as the array length n increases we would expect the relative performance to improve and then plateau at a maximum, giving a classic sigmoid-shaped performance curve.

So I performed some tests to determine the average time taken to multiply arrays of random floats together using list comprehensions and using numpy broadcasting, for array lengths ranging from 1 to 5,000,000. Timing was to the nearest microsecond, using the Python datetime library.
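The timing harness itself was nothing clever. Something along these lines gives the general idea; this is a sketch rather than the exact test code, the function names are mine, and in practice the number of runs would be scaled down for the very largest arrays:

   import random
   from datetime import datetime

   import numpy

   def time_comprehension(n, runs=1000):
       # average time in seconds to build and multiply two lists of length n
       start = datetime.now()
       for r in range(runs):
           x = [random.random() for k in range(n)]
           y = [random.random() for k in range(n)]
           z = [a * b for a, b in zip(x, y)]
       return (datetime.now() - start).total_seconds() / runs

   def time_numpy(n, runs=1000):
       # average time in seconds to build and multiply two numpy arrays of length n
       start = datetime.now()
       for r in range(runs):
           x = numpy.random.rand(n)
           y = numpy.random.rand(n)
           z = x * y
       return (datetime.now() - start).total_seconds() / runs

   for n in (1, 10, 100, 1000, 10000, 100000, 1000000, 5000000):
       print(n, time_comprehension(n) / time_numpy(n))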

The results

The results, as expected, give a classic sigmoid shape. For larger array sizes we see a performance ratio of about 12:1.

I'd originally intended to write this post purely about the relative efficiency of numpy and list comprehensions. But I got quite a surprise when I looked at the graphs of the performance tests.

So as you can see, we get a broadly sigmoid curve, with little or no advantage for arrays shorter than about 10³ elements (note the logarithmic scale). We would expect some quantisation noise at the lower values of n, since timing is only accurate to the microsecond and really short arrays are processed faster than that, but the figures were averaged over 1,000 runs, which should smooth things out somewhat.

But the curves are still very noisy, and the one for Windows is much noisier than the one for Linux. There are two possible causes of this.

Garbage collection. This is when Python periodically pauses to release objects from memory that are no longer required. It can be turned off by importing the Python gc library and disabling and re-enabling collection at the appropriate times using gc.disable() and gc.enable(). Note that both systems have an abundance of memory (16 GB).

This is far and away the most likely cause. But it could also be...

Multiple cores. A Python process can be moved between CPU cores during execution: the operating system may, for whatever reason, decide to migrate it from one core to another, or to spread its threads across several cores, and contention around the Global Interpreter Lock adds jitter of its own. This can obviously make timing things accurately to the microsecond difficult. It's possible to pin the interpreter to a single core using the task manager (processor affinity) in Windows and the taskset -c command in Linux.
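On Linux the same pinning can also be done from inside the script itself, which saves remembering to launch everything through taskset. A sketch using os.sched_setaffinity, which is Linux-only; pinning to core 0 is just an arbitrary choice here:

   import os

   # pin the current process (pid 0 means "this process") to CPU core 0,
   # roughly equivalent to launching the script with "taskset -c 0"
   os.sched_setaffinity(0, {0})

   print("Running on cores:", os.sched_getaffinity(0))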

So what effect does this have? First, let's try disabling garbage collection.
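In the timing code that simply means switching the collector off around the timed section; a minimal sketch (exactly where you re-enable it is a matter of taste):

   import gc

   gc.disable()       # no garbage collection pauses during the timed section
   try:
       pass           # ... run the timing loops here ...
   finally:
       gc.enable()    # restore normal behaviour afterwards
       gc.collect()   # and clear anything that has accumulated in the meantime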


Clearly there's a dramatic improvement in the Windows curve but the Linux one is fundamentally much the same. So what effect would binding the Python interpreter to a single CPU core have? To be honest, very little.

 

Curiously, on all the runs, there is an inflexion in the Linux curve at about n = 10⁵ which can't be explained by garbage collection etc. This consistently occurs in the same place every time the software is run, so something must be causing it, but I have no idea what. (I've ruled out obvious things like paging/swapping, due to the huge amount of physical memory.)

Conclusions

  1. We do, as expected, get a sigmoid curve.
  2. Numpy begins to get more efficient in this simple case at n > 10³, but if we were performing significantly more complex calculations than generating and multiplying random numbers, numpy would begin to win at lower n. (Perhaps this is something to test another time.)
  3. For large n (> 10⁵), numpy outperforms a Python list comprehension typically by a factor of 12.
  4. Quantisation noise occurs at lower values of n but can be smoothed out by averaging over several runs. Noise, and general discontinuity in the results, is significantly worse under Windows. This is unlikely to be related to external system load, as both systems are relatively lightly loaded; more likely, Linux simply copes better with whatever load there is. (It is noticeable, for example, that even at 100% load most Linux systems retain a decent interactive response. In fact the Linux system on which these tests were performed is significantly more heavily loaded than the Windows system, has a slower CPU, and still completed the tests approximately 30% faster than the Windows one.)
  5. Clearly something must be causing the inflexion in the Linux curve at n ≈ 10⁶, but I don't know what.