Python dis module and constant folding

Hi people! Recently, I was super confused when I found out that:

>>> pow(3,89)

runs slower than:

>>> 3**89

I tried to think of a suitable answer but couldn’t find any. I timed the execution of both of these statements using the timeit module in Python3:

$ python3 -m timeit 'pow(3,89)'
500000 loops, best of 5: 688 nsec per loop

$ python3 -m timeit '3**89'
500000 loops, best of 5: 519 nsec per loop

The difference is not big. It is only 0.1µs but it was still bugging me. If I can’t explain something in programming, I usually end up having sleepless nights 😅

I found the answer through the Python IRC channel on Freenode. The reason why pow is slightly slower is that in CPython there is an additional step of loading pow from the namespace. Whereas, in 3**9 there is no such loading required. This also means that the difference will remain more or less constant even if the input numbers get bigger and bigger.

The hypothesis is true:

$ python3 -m timeit 'pow(3,9999)'
5000 loops, best of 5: 58.5 usec per loop

$ python3 -m timeit '3**9999'
5000 loops, best of 5: 57.3 usec per loop

During the process of exploring the solution to this question I also got to learn about the dis module. It allows you to decompile the Python Bytecode and inspect it. This was a super exciting discovery mainly because I am learning more about reverse engineering binaries now-a-days and this module fits right in.

I disassembled the bytecode of the above statements like this in Python:

>>> import dis
>>> dis.dis('pow(3,89)')
#  1           0 LOAD_NAME                0 (pow)
#              2 LOAD_CONST               0 (3)
#              4 LOAD_CONST               1 (89)
#              6 CALL_FUNCTION            2
#              8 RETURN_VALUE

>>> dis.dis('3**64')
#  1           0 LOAD_CONST               0 (3433683820292512484657849089281)
#              2 RETURN_VALUE

>>> dis.dis('3**65')
#  1           0 LOAD_CONST               0 (3)
#              2 LOAD_CONST               1 (65)
#              4 BINARY_POWER
#              6 RETURN_VALUE

You can learn about how to understand the output of dis.dis by reading this answer on Stackoverflow

Ok back to the code. The disassembly of pow makes sense. It is loading pow from the namespace and then loading 3 and 89 to registers and finally calling the pow function. But why does the output of the next two disassemblies differ from each other? The only thing we changed is the exponent value from 64 to 65!

This question introduced me to another new concept of “constant folding“. It just means that when we have a constant expression Python evaluates the value of that expression at compile time so that when you actually run the program it doesn’t take as long to run because Python uses the already computed value. Think of it like this:

def one_plue_one():
    return 1+1

# --vs--

def one_plue_one():
    return 2

Python compiles the first function to the second one and uses that when you run the code. Neat, eh?

So why is constant folding working for 3**64 and not 3**65? Well, I don’t know. This probably has something to do with a limit to how many powers the system has pre-computed in memory. I can be totally wrong. The next step in my mind is to dabble in the Python source code in my free time and try to figure out what is happening. I am still trying to figure out an answer for this so if you have any suggestions please drop them in the comments section below.

What I want you to take away from this post is that you should seek solutions to basic questions. You never know where the answers will lead you. You might end up learning something entirely new (like I just did)! I hope you guys keep this flame of curiosity burning. See you later 🙂

PS: If you want to learn more about Python Bytecode and how to play around with it, someone (njs on #python) suggested me this talk. I personally haven’t watched it but I probably will once I get some free time.

Advertisements

2 thoughts on “Python dis module and constant folding”

  1. There’s also a nice quirk in that since Python evaluates left to right 1 + 2 + 3 + x would get fold optimised but x + 1 + 2 + 3 would not.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s