In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name. Every named column contains values of the same type and every row corresponds to related datums. I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type. Database tables have named and typed columns, but they typically only support a limited set of operations. Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.
Stata is one commercial program that supports a variety of operations on data rectangles. In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk. Each column has a name and a data type. The rectangles also support missing values, which is essential when working with real world data.
Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool. Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.
I want to perform high level rectangular operations within a more flexible language like Python. I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package. Numpy supports named, typed columns via recarray, but recarray is generally incomplete. For example, there’s no built in way to merge (join) two recarrays. More importantly, recarray does not support masked values!
I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations. My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays. I can’t share the code because I wrote it at work.
I’m happy with my implementation, but I wish it were faster. Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward. Both of these do too much work outside of optimized space of numpy. I’m thinking of rewriting these functions in C or Cython.
I’d love to hear comments about this general data structure. How are other people dealing with rectangles of real world data?
Robert Kern said
You should take a look at tabular, which extends recarrays for much the same purposes: http://pypi.python.org/pypi/tabular
Also, look at numpy.lib.recfunctions.join_by() for a builtin join operation.
Greg said
Thanks for your comment. I had not heard about either tabular or recfunctions. One key requirement for me that these options don’t meet (please correct me if I’m wrong) is support for masked values. I will look at these options more closely though!
yabonn said
I have been pulling my hair out about this too.Worse, I fear – having seen your maskedarray/timeseries code, that I don’t have much useful Python to add.
Notes from the wreckage of my various personal attempts :
- with a little bit of luck, there’s a hist function somewhere (matplotlib?) that’s fast enough to be of use. I’m thinking about something like binning on diff(dates) for fill operations?
- thanks to Greg for the pointers. Please tell if you experiment further with tabular/join_by.
- personally, I came back to a low low-tech list of dictionaries , plus a join(list_dics) function as my main tool. I began aiming for more rectangular things, but the non-series part of the data just kept on growing. One day, the tail began to move the dog, and each entity needed its dic, with one field for the series.
Chris said
I’ve written a Python package that works with “rectangular” data that has worked out pretty well for my needs. My team has been using it for some number of years now – it was written before the recarray constructs appeared. I always referred to this type of data as “record oriented” data. The package was written at work so I can’t share code, but maybe the concept will be useful.
I created a class called “DictOfArray” – where I store each “column” of data as a dictionary entry (in other words, retaining the “name” of the coumn) and then store the “rows” of similar data as a numpy array. In this way, anything that can be represented as a numpy array can be used – int, complex, float, text. Getting to a particular record is a very fast dictionary lookup, while operating on the contents uses the fast vectorized numpy constructs. Once I had the container class I wrote a number of analysis and plot functions to display and manipulate the data, such as select or eliminate data from the set based on a plot, compute stats, plot data on maps, etc.
I never needed to merge contents of more than one DictOfArray, but imagine that would be pretty fast as it would involve creating a new instance with preallocated numpy array sizes, then moving the contents of the original numpy arrays into the new longer array. The only real complaint I’ve had with the package is that it takes time to load the initial class. I never figured out how to write C code that could populate the DictOfArray contents. When loading dataset with more that a few 100k points, this can take a couple minutes. Once loaded, though, the package runs pretty fast.
Greg said
Thanks for your reply — very interesting. I took the same approach, except for two differences:
I’m also not satisfied with the performance of initializing a rectangle from a CSV file. My initialization code only takes a few seconds though (for 100k rows) — not several minutes. What’s your bottleneck?
Rob said
Hi Greg,
I’ve just started playing with PyTables (http://www.pytables.org/moin) which answers some of my needs for timeseries far better than a relational database… essentially it allows you to store data as numpy arrays (either fixed or extendable). Are you aware of this package? Any thoughts on its usefulness?
Regards,
Rob
Greg said
Hi Rob,
Yes, I’ve used PyTables — I think they’re great. I was even able to store scipy Timeseries objects in them using an extension which was in the development branch at the time That was about 2 years ago though, so it might be merged into the main PyTables by now.
Thanks for the note.
Greg