The Pass-Once Bug

The Background

Docstring conventions are one of my pet peeves. My view is that if you need to pick some arbitrary convention, you might as well pick an official one. In that spirit, I’m an advocate of PEP257, the official Python docstring convention guide. Much like with PEP8, I said to myself “surely there’s a script to enforce it and report violations”. Lo and behold – GreenSteam/pep257.

Since finding the pep257 script, I’ve been adding a few features I needed. Yesterday I noticed that the build had been failing for some time. The CI build is configured to run against several different Python versions (2.6.x, 2.7.x, etc.), and the tests failed for some of them. I decided to try and fix them.

The Process

Get the Facts

Well, my first action was to run the tests and see the failures. I ran the py.test command and the tests passed. I looked over at the TravisCI build log and saw that the command run there was py.test --pep8. Indeed, the log also included a failed PEP8 check. Easy peasy – fixing PEP8 violations is pretty straightforward and doesn’t require an intimate familiarity with the code base. I ran the tests with the PEP8 checks and fixed them; there were just a few comment formatting problems. I eagerly typed the test command again and… it failed.

The weird part was that it wasn’t the PEP8 check that failed – it was an actual test. I tried debugging it for some time, but I didn’t find the bug. So – what next?

My First Operation

Nurse, hand me the git bisect please!

So, since I’m only mildly familiar with the code and I failed to debug it, I took this opportunity to use a tool I’ve never tried before – git bisect. For those who aren’t familiar with it, here’s the basic workflow:

  1. Start the bisect process (git bisect start).
  2. Find and mark at least one good and one bad commit, manually (by doing git bisect good or git bisect bad).
  3. Let git drive the search. Git updates the working directory to a commit in the middle of the good–bad range (in effect, it runs a binary search) and you tell it whether that commit is “good” or “bad”. After you answer, it checks out the next commit in the search, until you find the commit that ruined the codebase. (If you have a test command that exits non-zero on failure, git bisect run <command> can automate the answering.)

Except I never got to the third step – I couldn’t find a single bad commit.

THIS IS IMPOSSIBLE.
I CAN SEE THE BUILD IS FAILING FOR THESE COMMITS IN TRAVIS.
WHAT GIVES?

Realization

After a long time of jumping between commits and running the tests, I had an idea. When I first ran the tests on the master branch, they passed. On the second run, they didn’t. Maybe that’s what happened in all these other commits as well? Maybe the commits weren’t good – maybe they just passed because I only ran them once? I jumped to an early commit and ran the tests. They passed. Arrow up and Enter. They failed. What the hell is going on?

Well, in one of the test files there was the following line:

result = list(pep257.check([__file__]))

The key thing to see here is the __file__ parameter. It’s pretty convenient to use it to refer to the file containing it, but here it caused the bug. The reason is that __file__ doesn’t always hold the same value. On the first run, it referred to the source file – i.e., <codebase>/test.py. On the second run it referred to the bytecode file, i.e., <codebase>/test.pyc. The bytecode doesn’t exist at the beginning of the first run, which is why __file__ can’t refer to it then. But since the file didn’t change between two consecutive runs, Python didn’t have to rewrite the bytecode file, and so __file__ could refer to it.

But why did this behavior cause the bug?

In most cases, code using __file__ works just fine whether it refers to the source or the bytecode file. However, pep257 is a static analysis tool, which means it only accepts source files as input. Docstrings have no meaning in bytecode form, so pep257 doesn’t return any errors when handed such a file. The solution was to simply pass the path to the source file, and not just the executing file (which is either source or bytecode).
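In sketch form, the fix looks something like this (my reconstruction, not the project’s exact patch):

# __file__ may be 'test.py' on a fresh run or 'test.pyc' once the
# bytecode exists; normalize it so pep257 always gets the source file.
source_file = __file__
if source_file.endswith('.pyc'):
    source_file = source_file[:-1]  # drop the trailing 'c'
result = list(pep257.check([source_file]))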

The Conclusion

I manage a small software team at work. One of the members of my team was assigned a debugging task, and he kept complaining about it. About two hours after being assigned, he came to me and said “I looked at the code and I have no idea what’s wrong. I suck at debugging. Please give this task to someone else!”

Here’s my answer to these kinds of claims:

If you know what the bug is, you’re not debugging – you’re fixing the code. There are such things as “good” bugs. “Good” bugs are bugs that change your understanding of something; they’re bugs you call other people over to see – you tell them the problem, then the solution, and wait for the realization to hit them just as it hit you; they’re bugs you’re proud of solving, and sometimes they’re bugs that make you face-palm. They all start with looking at the code and shouting “THIS DOESN’T MAKE SENSE!”.

This is what we do.

Document for the Impatient

TL;DR?

A few days ago I got one of the best assignments I could think of at work – to scour the web for interesting Django packages and find out whether any of them could be of use to us. It’s… pretty awesome.

So, in the last few days I’ve been reading a lot of package documentation. My goal was to get a list of popular and recommended Django packages, briefly review each one to get a feel of what it does and decide which ones are worth taking a deeper look into. The “briefly” part is where my expectations met a surprising reality.

What I found is that most (if not all) Django package developers are pretty responsible and provide pretty comprehensive documentation. Thumbs up for that!

Unfortunately, while the docs are mostly comprehensive (which is great if you’re already using the package and looking for help with something), they rarely give a simple, high-level explanation of what the package actually does.

Case Study

Let’s look at an example. There’s a Django package called reversion.1 Here’s the introduction from its README file:

Django Reversion

django-reversion is an extension to the Django web framework that provides comprehensive version control facilities.

Features

  • Roll back to any point in a model’s history – an unlimited undo facility!
  • Recover deleted models – never lose data again!
  • Admin integration for maximum usability.
  • Group related changes into revisions that can be rolled back in a single transaction.
  • Automatically save a new version whenever your model changes using Django’s flexible signalling framework.
  • Automate your revision management with easy-to-use middleware.

This looks nice – except that I couldn’t for the life of me understand what “version control” for models is. I figured it’s probably either keeping track of data changes in your database over time (a sort of database backup, so you could revert it to two weeks ago, before that damn bug occurred) or keeping track of Django model changes, i.e., model schemas, like south does. There is nothing in this introduction to give me a hint, so let’s look a bit further. There’s a “Getting Started” section in the docs, and that’s always helpful, so we’ll check it out for quick examples, right?

Getting started with django-reversion

To install django-reversion, follow these steps:

[..]

Admin integration

django-reversion can be used to add a powerful rollback and recovery facility to your admin site. To enable this, simply register your models with a subclass of reversion.VersionAdmin:

import reversion

class YourModelAdmin(reversion.VersionAdmin):

    pass

admin.site.register(YourModel, YourModelAdmin)

Whenever you register a model with the VersionAdmin class, be sure to run the ./manage.py createinitialrevisions command to populate the version database with an initial set of model data. Depending on the number of rows in your database, this command could take a while to execute.

For more information about admin integration, please read the admin integration documentation.

Low Level API

You can use django-reversion’s API to build powerful version-controlled views. For more information, please read the low level API documentation.

More information

[..]

There’s a code snippet here, which is excellent. Examples are great when you want to know what using a certain package looks like. I could even infer that reversion probably refers to model data and not schema, from that bit about the rows in the database – but that’s purely incidental.

But that “Admin Integration” bit is a little weird. It seems reasonable that reverting to an earlier version of a model would be done in the admin page, but from this paragraph it’s not clear what that would look like or how usable it is. Or to put it another way – pics or it didn’t happen. Why can’t I get a nice screenshot here? A solid screenshot of the admin page would probably tell me more about what reversion does than the many pages in this documentation.

All in all, it took me about 20 minutes of looking around the documentation to get a feel for what it does. That includes searching Google for “django reversion screenshot” so I could see how it looks (partial luck with that search, by the way). This is way too long.

Be Better

A package documentation should serve (at least) two purposes:

  1. It should act as a “pitch” – allowing new users to quickly assess the purpose, usage and benefits of the package.
  2. It should act as a reference for users already using it, in case they have questions or issues with the package.

Most packages focus on the second point, but it’s only useful once you get people to use your software – and you get that with the pitch. So how do you do it?

Pitch Recipe

  1. There should be a specific section of your documentation dedicated to the pitch. It should be named clearly as such – “Getting Started”, “Quick Start”, “Tutorial” are all good, indicative names.

  2. That section should either be in the landing page for your docs or clearly linked from it (think bold letters).

  3. The pitch should clearly state what the software does. Ideally, it should also state what problems it solves. Think about drug commercials. Start with symptoms, then the solution.

  4. If there are different packages that do the same as yours – this is the place to let the user know what makes your package different.

  5. If your package has ANY graphical interface you HAVE to include a screenshot. You wouldn’t buy a painting just from a description, would you?

  6. You should include a minimum working example of your package. The example should be simple, yet meaningful. If you’re declaring a class called Foo, you’re doing it wrong. The example doesn’t only answer the question of HOW to use your software, it should also answer WHY.

A good example for this is the django-mptt tutorial.

You make great software – help people use it!




  1. Don’t take this as anything against the reversion package – I haven’t tried it and I can’t attest to its quality. The documentation is probably pretty good as reference material.

Python Tips: Iterate With a Sentinel Value

TL;DR 1

When f is a file, a socket, or anything you read from until you get an empty string (or another set value), you can use the iter function in a for loop to loop until that value is returned:

from functools import partial

# f is any object with a read method, e.g. an open file or socket
blocks = []
read_block = partial(f.read, 32)
for block in iter(read_block, ''):
    blocks.append(block)

The Problem

If you’ve ever had to write code that uses sockets, reads blocks of bytes from a file, or runs any other I/O read loop, you probably recognize the structure of the following snippet:

blocks = []
while True:
    block = f.read(32)
    if block == '':
        break
    blocks.append(block)

This boilerplate code is very common, very ugly and – as you’ll find out in a minute – very avoidable. First let’s understand what we’re doing here, in plain language:

  1. Do forever:
    1.1. Read a block from f.
    1.2. If the value was '', break from the loop.
    1.3. Do something with the read value.

Why is this bad? There are two reasons:

  1. Usually, when we iterate over objects or until a condition occurs, we understand the scope of the loop from its first line. E.g., when reading a loop that starts with for book in books we realize we’re iterating over all the books. When we see a loop that starts with while not battery.empty() we realize the loop runs for as long as we still have battery.
    When we say “Do forever” (i.e., while True), it’s obvious that this scope is a lie. It requires us to hold that thought in our heads and search the rest of the code for the statement that gets us out of the loop. We enter the loop with less information, and so it is less readable.

  2. We are essentially iterating over chunks of bytes. Out of the four lines in the loop, only one refers to those bytes. That’s a bad signal-to-noise ratio, which also hurts readability. For a reader unfamiliar with this code-form, it isn’t clear that if block == '' is a technical, implementation-driven detail; it might look like a semantic value returned from the read.

The Solution

You might recall there’s a function called iter. It can accept an argument that supports iteration and return an iterator for it. Used like that it seems pretty useless, since you could just iterate over the collection without iter. But it also accepts another argument – a sentinel value:

In computer programming, a sentinel value [..] is a special value whose presence guarantees termination of a loop that processes structured (especially sequential) data. The sentinel value makes it possible to detect the end of the data when no other means to do so (such as an explicit size indication) is provided. The value should be selected in such a way that it is guaranteed to be distinct from all legal data values, since otherwise the presence of such values would prematurely signal the end of the data.

The sentinel value in this case is an empty string – since any successful read from an I/O device will return a non-empty string, it is guaranteed that no successful read will return this value.

When a sentinel value is supplied, iter still returns an iterator, but it interprets its first argument differently – it assumes it’s callable (without arguments) and calls it repeatedly until it returns the sentinel value, at which point the iterator stops.
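Here’s a self-contained toy example before we get back to file reading (the dice rolling is my own illustration, not from the talk):

import random

def roll():
    return random.randint(1, 6)

# iter() calls roll() over and over; 6 is the sentinel, so the loop
# ends (without yielding the 6) as soon as a six is rolled.
for value in iter(roll, 6):
    print value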

The trouble is that read functions usually do take an argument – typically the size to read (in bytes, lines, etc.) – so we need to create a new function that takes no input and reads a constant size. We have two main tools for the job: partial (imported from functools) and lambda (a built-in keyword). The two following lines are equivalent2:

read = partial(f.read, 32)
read = lambda: f.read(32)

partial is specifically designed to take a function that accepts arguments and create a “smaller” function with some of those arguments fixed as constants. lambda is a little more flexible, but it can be used for the same thing.

The only thing left is to tie all this together:

blocks = []
read_block = partial(f.read, 32)
for block in iter(read_block, ''):
    blocks.append(block)

If you’re not doing anything except appending the blocks together, you can even make it shorter:

blocks = ''.join(iter(partial(f.read, 32), ''))

Remember: Readability counts!


  1. I learned this Python tip from Raymond Hettinger’s excellent talk “Transforming Code into Beautiful, Idiomatic Python”. I use his examples as well, and you should really just watch the talk instead of reading this. I’m putting this out there for two reasons: one – because writing about something helps me remember it, and two – because text is more searchable and skimmable than video.

  2. There are some minor differences between the two generated functions. Alon Horev’s blog post on the subject is a very interesting read.

I Want to Fix It NOW

I work on Django projects both at work and at home, in the form of side projects (I created HTMLify as a first experiment in full-stack development and am now working on a minimalistic feed reader).
The other day I got really bummed out at work. At first, I didn’t really understand why – I was just depressed. I got talking with a co-worker and he suggested that I introspect and try to pinpoint the cause. After a while of thinking about it, I realized what was wrong: I am embarrassed by the product I’m making.

Grave Injustice

Now, this isn’t to say we’re building a bad product, or that it doesn’t work. My problem is that it has very rough edges regarding user experience and we almost never get around to fixing these issues. Sure, if there are bugs in functionality we usually solve them first, but there is a class of “comfort” problems which aren’t really bugs and we have a lot of them.
This is partly because the backend and frontend are split between two different software groups. But I’m not blaming the frontend guys. The problem is the attitude that more features that are more-or-less stable are better than fewer features that are rock solid.
Now, I get that sometimes there are time-critical features that give value to the customer (a big phrase for us, as we work in Scrum), I do. But I realized that I work very differently at work from how I work at home. When I’m working on something at home and I see something that bothers me, I usually fix it immediately (after, of course, finishing what I’m currently working on). At work, it’s a completely different story. When I see an “injustice”, I send an email to our product owner. He adds it to the backlog. We discuss it in meetings and prioritize it.
Whenever I “walk past” the bit of code that’s responsible, I get upset. I want to fix it NOW! The existence of this issue is like a thorn in my side. I hate it.

The Heart Wants What the Heart Wants

I feel it’s like technical debt, but it’s not exactly the same. These issues are bugs to me, but our product owner doesn’t see them this way.
I don’t really know how to deal with being bummed out about this. I’d appreciate opinions from other developers who experience similar feelings. How do you reconcile the need to log, discuss and prioritize issues with your need to fix things as you see them? How do you get others to feel the same? If you practice Scrum, I would very much like to hear your methods and ideas.
Feel free to give advice in the comment section below or on Hacker News.

Python Importing

When you start work on even a rudimentary Python application, the first thing you usually do is import some package you’re using. There are many ways to import packages and modules – some are extremely common (found in pretty much every Python file ever written) and some less so. In this post I will cover different ways to import or reload modules, some conventions regarding importing, import loops and some import easter eggs you can find in Python.

Cheat Sheet

import foo
import foo.bar
from foo import bar
from foo import bar, baz
from foo import *
from foo import bar as fizz
from .foo import bar
foo = __import__("foo")
reload(foo)

Different Ways to Import

1. import foo

The basic Python import. The statement import foo looks for a foo module, loads it into memory and creates a module object called foo. How does Python know where to find the foo module?

When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:

  • the directory containing the input script (or the current directory).
  • PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
  • the installation-dependent default.

(from the documentation)

If foo has a bar member (which could be anything from a function to a submodule), it can be accessed as an attribute: foo.bar. You can also import several modules in one line by doing import foo, bar, but it is considered good practice to put each import on its own line.

2. import foo.bar

This makes foo.bar available without importing other stuff from foo. The difference from import foo is that if foo also had a baz member, it won’t be accessible.

3. from foo import bar

This statement imports bar, which could be anything declared in the module: a function definition, a class (albeit a non-conventionally-named one) or even a submodule (which would make foo a package). Notice that if bar is a submodule of foo, this statement acts as if we had simply imported bar (had it been on Python’s search path). This means that a bar object is created, and its type is 'module'. No foo object is created by this statement.

Multiple members of foo can be imported in the same line like so:

4. from foo import bar, baz

The meaning of this is pretty intuitive: it imports both bar and baz from the module foo. bar and baz aren’t necessarily the same type: baz could be a submodule and bar a function, for that matter. Unlike importing unrelated modules, it’s perfectly acceptable to import several members of one module on the same line.

5. from foo import *

Sometimes foo contains so many things that it becomes cumbersome to import them by name. Instead, you can just import * to get them all at once. Don’t do this unless you know what you’re doing! It may seem convenient to import * instead of specific members, but it is considered bad practice, because you are in fact “contaminating” your global namespace. Imagine that you do import * on a package where someone unwittingly declared the following function:

def list():
    raise RuntimeError("I'm rubber, you're glue!")

When you do import *, this list definition will shadow the built-in list type and you’ll get very, very unexpected errors. So it’s always better to know exactly what you’re importing. If you’re importing too much stuff from a certain package, you can either suck it up or just import the package itself (import foo) and use the foo qualifier for every use. One interesting, legitimate use of import * is in Django’s settings file hierarchy, where you actually do want to manipulate the global namespace with imported settings.

6. from foo import bar as fizz

This one is far less common than what we’ve covered so far, but still well known. It acts like from foo import bar, except that instead of creating a bar object, it creates one named fizz bound to the same thing. There are two main reasons to use this kind of statement. The first is when you’re importing two similarly named objects from two different modules; you then use import as to differentiate them, like so:

from xml import Parser as XmlParser
from json import Parser as JsonParser

The other reason, which I’ve seen used a few times, is when you import a long-named function (or class), use it extensively throughout your code and want to shorten its name.

7. from .foo import bar

Well, this escalated quickly.

This one is pretty rare and a lot of people are completely unaware of it. The only difference in this statement is that it uses a modified search path: instead of searching the entire PYTHONPATH, it searches the directory where the importing file lives. So if you have two files called fizz.py and foo.py side by side, you can use this import in fizz and it will import the correct file, even if you have another foo module on your PYTHONPATH. What is this good for? Well, sometimes you create modules with generic names like common, but you might also have a common package at the base of your project. Instead of inventing different names, you can explicitly import the one closest to you. You can also use this method to load modules from an ancestor in the directory tree by adding more dots: for example, from ..foo import Foo will search one directory up, from ...foo import Foo will search two directories up, etc.
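A hypothetical layout to make this concrete (all of these names are made up):

# project/
#     common.py        <- a generically-named module, also on the path
#     pkg/
#         __init__.py
#         common.py    <- the sibling we actually want
#         fizz.py
#
# Inside pkg/fizz.py:
from .common import helper  # resolves to pkg/common.py, the one closest to us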

8. foo = __import__("foo")

Ever wondered how you can import a module dynamically? This is how. Obviously you wouldn’t use it with an explicit string, but rather with a variable of some kind. Also notice that you have to explicitly assign the imported module to a variable, or you won’t have access to its attributes.
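For example (json here is just a stand-in for a module name that arrives at runtime, say from a config file):

module_name = "json"  # imagine this string is computed, not hard-coded
json_module = __import__(module_name)
print json_module.dumps({"imported": "dynamically"})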

9. reload(foo)

This statement does exactly what it looks like: it reloads the foo module. It’s pretty useful when you have a console open, playing with a bit of code you’re tweaking, and want to continue without resetting your interpreter.

Note: If you used from foo import bar, it’s not enough to reload foo for bar to update. You need to both reload foo and call from foo import bar again.
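In sketch form (assuming some module foo you’re editing on disk):

import foo
reload(foo)            # picks up your edits to foo.py

from foo import bar
reload(foo)            # refreshes the foo module object...
bar                    # ...but this name still points at the old bar
from foo import bar    # re-import to rebind bar to the new definition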

Import Loops

An import loop occurs in Python when you import two or more modules in a cycle. For example, if in foo.py you would from bar import Bar and in bar.py you would from foo import Foo, you will get an import loop:

Traceback (most recent call last):
  File "foo.py", line 1, in <module>
    from bar import Bar
  File "/tmp/bar.py", line 1, in <module>
    from foo import Foo
  File "/tmp/foo.py", line 1, in <module>
    from bar import Bar
ImportError: cannot import name Bar

When this happens to you, the solution is usually to move the common objects from foo.py and bar.py to a third file (say, common.py). However, sometimes there’s an actual loop of dependencies – for example, a method of Bar needs to create a Foo instance and vice versa. When the dependency is in a limited scope, remember that you can use the import statement wherever you want. Putting imports at the top of the file is the common convention, but sometimes you can solve import loops by importing in a smaller scope, like a method definition – see the sketch below.
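A minimal sketch of that fix, using the module names from the example above:

# foo.py
class Foo(object):
    def make_bar(self):
        # Deferred import: by the time this method runs, both modules
        # have finished loading, so the cycle never bites.
        from bar import Bar
        return Bar()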

Easter Eggs

What fun would an easter egg be if you didn’t try it by yourself? Try these and have fun!

import antigravity
import this
from __future__ import braces
import __hello__
from __future__ import barry_as_FLUFL

A Pretty Good Work Day

Edit: Emotions were running high when I wrote this and the original blog post that led to it. I realize now it was unprofessional to publicly vent like that, so I shelved the original post (which was 100% venting) and edited this one to contain less venting and more constructive content. I hope you enjoy it.

Those of you who follow my blog may remember I recently posted about the worst work day of my life. To get you all up to speed, the story is that I work on the backend of a web application and the frontend guys weren’t willing to work with our source control (Mercurial) or even operating system (Linux). When we handled the source control for them, they aggressively accused us of ruining their workspace (which we did, by mistake).

Well, I am happy to report that this saga has come to an end. Here’s what happened.

The frontend guys suggested this solution to our problem (excerpt from the original post):

The frontend guys would work on the VM we gave them. We’re still “in charge” of it, but we’re not allowed to modify any files. When we want to share versions of our code, we’ll use a third, public directory and copy our code manually to and from it. The merges will be done by hand from their side, and with Mercurial from our side.

This solution is bad. It’s bad because it reinvents the idea of source control, only manually, so the upshot is that we would spend hours on end tinkering with diffs (removed files, especially, would be a nuisance) instead of actually developing. My team was pretty upset in the days after this meeting. We were all frustrated with the new system we had to put up with, and we felt it existed only because of internal politics and not for any actual technological reason. We didn’t want to work with the frontend guys anymore and we were unhappy. We wanted to escalate the issue to upper management so we wouldn’t have to work like this.

But our group leader (actually, his replacement) insisted that we give their way a chance. Not because their suggestion was good, but because it wasn’t just bad – it was painstakingly bad. It was so bad that it could not possibly work out for anyone. And that’s what he was counting on.

So we gave it a chance. The first thing we did was to show we were doing our part. We asked them to send us their version of the code and we merged it into our code base. It was annoying, complicated and it took a lot of time, but we did it without complaining (to them – we complained the hell out of it to our GL). Then, we sent them our code base in its entirety. While they are only responsible for a small directory with some HTML/CSS/JS files, which they could email us, our project is several hundred thousand lines of code, containing an entire Linux framework, with C, Cython, Python and a custom build system that also compiles the Linux kernel. They needed it all because the backend uses this framework and they couldn’t run the server otherwise (as we explained to them in the original meeting). So, the zip was pretty big…

And now, we waited. They got the zip and used a diff tool to do the merge. After several hours, they started calling us for help. At day’s end, they asked us, on a “one-time” basis, to use Mercurial to recreate their project for them. We refused. We told them it wasn’t what we agreed to, and that it’s their way we’re trying now. The next day, they still hadn’t gotten around to building the project, as they were still having trouble unzipping the directory. At one point they backed up their directory, wiped it clean and started over – and still they were unsuccessful.

Another day went by, and we agreed to recreate their project. We sent out an email calling for another meeting to work out a new solution, but we got the following response:

There’s no need to set up another meeting.

We didn’t realize the size of the project we were dealing with. We’ll be happy if you teach us how to use Mercurial so we can work with that from now on.

Wasn’t that a sight for sore eyes.

We felt victorious at last. Not only did we teach them Mercurial, they also started using it by connecting (via NX) to their dedicated Ubuntu VM, so in a way they’re also using Linux now. And you know what? They’re pretty happy with the new situation. They got up to speed really fast and started pushing changes to the code base.

The lesson to be learned here (other than the fact that source control is a necessity) is that sometimes the best way to make someone follow your path is to let him make his own mistakes and learn from them. I think very highly of my GL for keeping his cool and resolving this issue in such a peaceful and smart way.

Damn, was that a good day.

Bash Gibberish - Type Less, Do More

Over my several years of experience with Bash I’ve found some really useful tips and tricks. This post covers some of the more obfuscated-looking “variables” that Bash provides.

Comic courtesy of http://themagnificentwhatever.com/

$?: Check the status of the last command

$ hg branch
abort: no repository found in '<location>' (.hg not found)!
$ echo $?
255  <---- The "hg branch" command failed!
$ echo $?
0    <---- The "echo" command succeeded!

!!: Repeat the last command entirely

$ cat secret-file
cat: secret-file: Permission denied
$ sudo !!
sudo cat secret-file
DON'T READ THIS FILE!!!

I specifically love to use this with sudo because it looks like you’re yelling at your computer and it caves.

!$: Repeat the last parameter of the last command

While this seems more niche and less useful than the previous one, I actually use it the most. It’s useful when you want to run two commands on the same file, which is very common.

$ mkdir -p /tmp/a/really/complex/path/you/dont/want/to/repeat
$ cd !$
cd /tmp/a/really/complex/path/you/dont/want/to/repeat

$$: Get your process id

$ echo $$
2122
$ ps
  PID TTY          TIME CMD
 2122 pts/1    00:00:00 bash
 2632 pts/1    00:00:00 ps

Conventions Are Arbitrary, So Use This One

Despite the title, my intention is not to start a flame war. I want to discuss docstring conventions in Python, but just as a case study for conventions in general.

I suggested a while ago in my team at work that we should probably agree on docstring conventions. Each of us had different conventions in mind, me included. Instead of just, you know, thinking for myself, I Googled “python docstring conventions” and lo and behold – the first result was PEP 257:

  • Triple quotes are used even though the string fits on one line. This makes it easy to later expand it.
  • The closing quotes are on the same line as the opening quotes. This looks better for one-liners.
  • There’s no blank line either before or after the docstring.
  • The docstring is a phrase ending in a period. It prescribes the function or method’s effect as a command (“Do this”, “Return that”), not as a description; e.g. don’t write “Returns the pathname …”.
  • The one-line docstring should NOT be a “signature” reiterating the function/method parameters (which can be obtained by introspection).
  • […]
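A one-liner that follows all of these rules might look like this (my own toy example, not taken from the PEP):

import os

def path_exists(path):
    """Return True if the given path exists on disk."""
    return os.path.exists(path)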

The list goes on. Now, you may agree with some of these items and disagree with others (personally – the “phrase as a command” rule I love; the “put a blank line before the ending quotes” one, not so much), but I think the value of this document is that it simply exists and is agreed upon. Caring about how many spaces to put before a curly brace is a phase. It’s one of the steps of teaching yourself how to program:

  • Get involved in a language standardization effort. It could be the ANSI C++ committee, or it could be deciding if your local coding style will have 2 or 4 space indentation levels. Either way, you learn about what other people like in a language, how deeply they feel so, and perhaps even a little about why they feel so.
  • Have the good sense to get off the language standardization effort as quickly as possible.

However, caring about which standard to adopt is wholly different from caring about whether you’re complying with one. Standards are arbitrary, yes, but they’re important. I think PEP 257 is great – not because it gets the spacing right, but because it’s important to have an agreed-upon document (signed by the BDFL – a bonus) that does just this. So when I review other people’s code, I correct them and refer them to this PEP. If they complain I say “I don’t care where we put our spaces, but we should be consistent. Somebody already did the work of putting a document together, so why not use it?”.

Django QuerySets: Fucking Awesome? Yes

Django QuerySets are pretty awesome.

In this post I’ll explain a bit about what they are and how they work (if you’re already familiar with them, you can jump to the second part). I’ll argue that you should return a QuerySet object whenever possible, and I’ll talk about how to do just that.

QuerySets Are Awesome

A QuerySet, in essence, is a list of objects of a given model. I say ‘list’ and not ‘group’ or the more formal ‘set’ because it is ordered. In fact, you’re probably already familiar with how to get QuerySets, because that’s what the various Book.objects.XXX() methods return. For example, consider the following statement:

Book.objects.all()

What all() returns is a QuerySet of Book instances which happens to include all Book instances that exist. There are other calls which you probably already know:

# Return all books published since 1990
Book.objects.filter(year_published__gt=1990)

# Return all books *not* written by Richard Dawkins
Book.objects.exclude(author='Richard Dawkins')

# Return all books, ordered by author name, then
# chronologically, with the newer ones first.
Book.objects.order_by('author', '-year_published')

The cool thing about QuerySets is that, since each of these functions both operates on and returns a QuerySet, you can chain them up:

# Return all books published after 1990, except for
# ones written by Richard Dawkins. Order them by
# author name, then chronologically, with the newer
# ones first.
Book.objects.filter(year_published__gt=1990) \
            .exclude(author='Richard Dawkins') \
            .order_by('author', '-year_published')

And that’s not all! It’s also fast:

Internally, a QuerySet can be constructed, filtered, sliced, and generally passed around without actually hitting the database. No database activity actually occurs until you do something to evaluate the queryset.
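To see that laziness in practice, here’s a quick sketch using the Book model from above:

qs = Book.objects.filter(year_published__gt=1990)  # no database hit yet
qs = qs.exclude(author='Richard Dawkins')          # still no hit
books = list(qs)                                   # the query runs here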

So we’ve established that QuerySets are cool. Now what?

Return QuerySets Wherever Possible

I recently worked on a Django app where I had a Model that represented a tree (the data structure, not the Christmas decoration). It meant that every instance had a link to its parent in the tree. It looked something like this:

class Node(models.Model):
    parent = models.ForeignKey(to='self', null=True, blank=True)
    value = models.IntegerField()

    def __unicode__(self):
        return 'Node #{}'.format(self.id)

    def get_ancestors(self):
        if self.parent is None:
            return []
        return [self.parent] + self.parent.get_ancestors()

This worked pretty well. Trouble was, I had to add another method, get_larger_ancestors, which should return all the ancestors whose value is larger than the value of the current node. This is how I could have implemented it:

    def get_larger_ancestors(self):
        ancestors = self.get_ancestors()
        return [node for node in ancestors if node.value > self.value]

The problem with this is that I’m essentially going over the list twice – once by Django and once by me. It got me thinking – what if get_ancestors returned a QuerySet instead of a list? Then I could do this:

    def get_larger_ancestors(self):
        return self.get_ancestors().filter(value__gt=self.value)

Pretty straightforward. The important thing here is that I’m not looping over the objects. I can perform however many filters I want on what get_larger_ancestors returns and feel safe that I’m not re-running over a list of objects of unknown size. The key advantage is that I keep using the same interface for querying: when the user gets a bunch of objects, we don’t know how he’ll want to slice and dice them, and by returning QuerySet objects we guarantee that he’ll know how to handle them.

But how do I implement get_ancestors to return a QuerySet? That’s a little trickier. It’s not possible to collect the data we want with a single query, nor with any pre-determined number of queries – the nature of what we’re looking for is dynamic. The alternative implementation will look pretty similar to what we have now. Here is the alternative, better implementation:

class Node(models.Model):
    parent = models.ForeignKey(to='self', null=True, blank=True)
    value = models.IntegerField()

    def __unicode__(self):
        return 'Node #{}'.format(self.id)

    def get_ancestors(self):
        if self.parent is None:
            return Node.objects.none()
        return Node.objects.filter(pk=self.parent.pk) | self.parent.get_ancestors()

    def get_larger_ancestors(self):
        return self.get_ancestors().filter(value__gt=self.value)

Take a while, soak it in. I’ll go over the specifics in just a minute.

The point I’m trying to make here is that whenever you return a bunch of objects, you should always try to return a QuerySet instead. Doing so allows the user to freely filter, slice and order the result in a way that’s easy, familiar and better-performing.

(On a side note – I am hitting the database in get_ancestors, since I’m using self.parent recursively. There’s an extra hit on the database here – once when executing the function and again in the future, when actually inspecting the results. We do get the performance upside when we perform further filters on the results, which would otherwise have meant more database hits or heavy in-memory operations. The point of the example is to show how to turn non-trivial operations into QuerySets.)

Common QuerySet Manipulations

So, returning a QuerySet where we perform a simple query is easy. When we want to implement something with a little more zazz, we need to perform relational operations (and use some helpers, too). Here’s a handy cheat sheet (as an exercise, try to understand my implementation of get_ancestors above).

  • Union – The union operator for QuerySets is |, the pipe symbol. qs1 | qs2 returns a QuerySet with all the items from qs1 and all the items in qs2 while handling duplicates (items that are in both QuerySets will only appear once in the result).

  • Intersection – there is no special operator for intersection, because you already know how to do it! Chaining functions like filter and exclude in fact performs an intersection between the original QuerySet and the new filter.

  • Difference – a difference (mathematically written as qs1 \ qs2) is all the items in qs1 that do not exist in qs2. Note that this operation is asymmetrical (as opposed to the previous operations). I’m afraid there is no built-in operator for this in Django, but you can do: qs1.exclude(pk__in=qs2).

  • Nothing – seems useless, but it actually isn’t, as the example above shows. A lot of the time, when you’re dynamically building a QuerySet with unions, you need to start off with what would have been an empty list. This is how to get it: MyModel.objects.none() – see the sketch below.
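Here’s a sketch of that dynamic-building pattern, reusing the Node model from earlier (the function is mine, for illustration):

def nodes_with_values(values):
    # Start from an empty QuerySet and union in one filter per value.
    result = Node.objects.none()
    for value in values:
        result = result | Node.objects.filter(value=value)
    return result

# The result is still a QuerySet, so it chains like any other:
big_nodes = nodes_with_values([1, 2, 3]).filter(value__gt=1)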

Python: Common Newbie Mistakes, Part 2

Scoping

The focus of this part is a set of problems that stem from misunderstanding how scoping works in Python. Usually, when we have global variables (okay, I’ll say it because I have to – global variables are bad), Python understands it if we access them within a function:

bar = 42
def foo():
    print bar

Here we’re using, inside foo, a global variable called bar and it works as expected:

>>> foo()
42

This is pretty cool. Usually we’ll use this feature for constants that we want to use throughout the code. It also works if we call a method on a global, like so:

bar = [42]
def foo():
    bar.append(0)

foo()

>>> print bar
[42, 0]

But what if we want to change bar?

>>> bar = 42
... def foo():
...     bar = 0
... foo()
... print bar
42

We can see that foo ran fine and without exceptions, but if we print the value of bar we’ll see that it’s still 42! What happened here is that the line bar = 0, instead of changing bar, created a new, local variable also called bar and set its value to 0. This is a tough bug to find, and it causes some grief to newbies (and veterans!) who aren’t really sure how Python’s scoping works. To understand when and how Python decides whether a variable is global or local, let’s look at a less common, but probably more baffling, version of this mistake and add an assignment to bar after we print it:

bar = 42
def foo():
    print bar
    bar = 0

This shouldn’t break our code, right? We added an assignment after the print, so there’s no way it should affect it (Python is an interpreted language after all), right? Right??

>>> foo()
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    foo()
  File "<pyshell#3>", line 3, in foo
    print bar
UnboundLocalError: local variable 'bar' referenced before assignment

WRONG.

How is this possible? Well, there are two parts to this misunderstanding. The first misconception is that Python, being an interpreted language (which is awesome, I think we can all agree), is executed line by line. In truth, Python is executed statement by statement. To get a feel for what I mean, go to your favorite shell (you aren’t using the default one, I hope) and type the following:

def foo():

Press Enter. As you can see, the shell didn’t offer any output and it’s clearly waiting for you to continue with your function definition. It will continue to do so until you finish declaring your function. This is because a function declaration is a statement – well, a compound statement that includes many other statements within it, but a statement nonetheless. The content of your function isn’t executed until you actually call it; what is executed is the creation of a function object.

This leads us to the second point. Again, Python’s dynamic and interpreted nature leads us to believe that when the line print bar is executed, Python will look for a variable bar first in the local scope and then in the global scope. What really happens is that the local scope is in fact not completely dynamic. When the def statement is executed, Python statically gathers information about the local scope of the function. When it reaches the line bar = 0 (not when it executes it, but when it reads the function definition), it adds “bar” to the list of local variables of foo. When foo is executed and Python tries to execute the line print bar, it looks for the variable in the local scope; it finds it, since “bar” was statically registered as a local, but it knows it wasn’t assigned yet – it has no value. So the exception is raised.
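One way to see this static bookkeeping for yourself (Python 2 function internals; co_varnames lists the names Python decided were local at def time):

>>> def foo():
...     print bar
...     bar = 0
... 
>>> foo.func_code.co_varnames
('bar',)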

You could ask “why couldn’t an exception be raised when we declared the function? Python could have known in advance that bar was referenced before assignment”. The answer is that Python can’t know whether the local bar will be assigned before it’s used. Look at the following:

bar = 42
def foo(baz):
    if baz > 0:
        print bar
    bar = 0

Python is playing a delicate game between static and dynamic. The only thing it knows for sure is that bar is assigned to somewhere in the function, but it doesn’t know that it’s referenced before assignment until that actually happens. Wait – in fact, it doesn’t even know whether it will be assigned at all!

bar = 42
def foo():
    print bar
    if False:
        bar = 0

When running foo, we get:

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    foo()
  File "<pyshell#16>", line 3, in foo
    print bar
UnboundLocalError: local variable 'bar' referenced before assignment

While we, intelligent beings that we are, can clearly see that the assignment to bar will never happen, Python ignores that fact and still registers bar as a local.

I’ve been babbling about the problem long enough. We want solutions, baby! I’ll give you two.

>>> bar = 42
... def foo():
...     global bar
...     print bar
...     bar = 0
... 
... foo()
42
>>> bar
0

The first is using the global keyword. It’s pretty self-explanatory: it lets Python know that bar is a global variable and not a local one.

The second, preferred solution is – don’t. In the sense of – don’t use a global that isn’t constant. In my day-to-day work I deal with a lot of Python code, and there isn’t a single use of the global keyword. It’s nice to know about, but in the end it’s avoidable. If you want to keep a value that is used throughout your code, define it as a class attribute of a new class. That way the global keyword is redundant, since you qualify your variable access with the class name:

>>> class Baz(object):
...     bar = 42
... 
... def foo():
...     print Baz.bar  # global
...     bar = 0  # local
...     Baz.bar = 8  # global
...     print bar
... 
... foo()
... print Baz.bar
42
0
8
