Saturday, 26 June 2010

Flattening lists of lists in Python; List comprehension edition

List comprehensions are a powerful part of Python and I am forever finding new ways to abuse them.

I'm writing a toy package for finding duplicate files across a directory structure, just to build up confidence in using Python. Basically we have a function to generate a list of files to be found in a given directory (and any sub-trees if so required), and several other functions that take a list of files and produce a list of lists of files where each sublist contains the files that match each other on some criteria such as filesize or MD5 hash of the first kilobyte of file or by byte-by-byte comparison.

The later file-matchers produce more accurate matches between files with less false positive matches, but execute much more slowly as a result. In order to reduce the time taken to get a good set of matches across a set of files, you feed the output of the loose-but-fast matchers into the accurate-but-slow matchers, the idea being that each step on the chain reduces the size of input for the next function.

This is where we hit an interesting problem. Each matcher expects a flat list of files to compare whilst producing a list of nested lists. Feeding the output of a matcher as-is into the input of another matcher will cause an error unless the list is flattened first.

For example:

>>> nested_list = [['a','b','c],['d','e','f'],['g','h','i']]

>>> flattened_list = ['a','b','c','d','e','f','g','h','i']


So how do we flatten the list? Python doesn't have a built in flatten operation, presumably because what exactly a programmer wants to flatten in a data structure depends on the data structure and what it is used for. Fortunately for me in my current problem, I want a very basic flatten- so basic it can be done with list comprehensions.

To flatten the nested list in the previous example, we can do this:


>>> comprehensively_flattened_list = [i for sublist in nested_list for i in sublist]
>>> print comprehensively_flattened_list
 ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']


Hooray!  But wait, will it work if the sublists are variable in length or will the comprehension try to treat each sublist as being the same length?

Only one way to find out...

>>> horrible_list =  [['b','a','d'],['t'],['e','s'],['t','l','i','s','t']]
>>> print horrible_list
[['b', 'a', 'd'], ['t'], ['e', 's'], ['t', 'l', 'i', 's', 't']]
>>> less_horrible_list = [i for sublist in horrible_list for i in sublist]
['b', 'a', 'd', 't', 'e', 's', 't', 'l', 'i', 's', 't']


Great stuff! For nested lists of known, uniform nesting depth, it looks like flattening can be neatly handled by list comprehensions alone. For more complex nested structures, I'm afraid it's back to boring old recursive traverse functions.

I give it a day before I find out that flattening through list comprehensions is a terrible idea.

Wednesday, 16 June 2010

10 PRINT "HELLO WORLD"

I started a general blog about a year ago, talking about life, the universe and my place in it. One of the minor issues is that whenever I get the urge to post something overly programmatic, I feel like I'm derailing the blog. As a result, I've started this new blog for all my hard-core technical gibberings with the other blog remaining vaguely general interest. In theory