In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name. Every named column contains values of the same type and every row corresponds to related datums. I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type. Database tables have named and typed columns, but they typically only support a limited set of operations. Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.
Stata is one commercial program that supports a variety of operations on data rectangles. In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk. Each column has a name and a data type. The rectangles also support missing values, which is essential when working with real world data.
Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool. Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.
I want to perform high level rectangular operations within a more flexible language like Python. I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package. Numpy supports named, typed columns via recarray, but recarray is generally incomplete. For example, there’s no built in way to merge (join) two recarrays. More importantly, recarray does not support masked values!
I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations. My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays. I can’t share the code because I wrote it at work.
I’m happy with my implementation, but I wish it were faster. Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward. Both of these do too much work outside of optimized space of numpy. I’m thinking of rewriting these functions in C or Cython.
I’d love to hear comments about this general data structure. How are other people dealing with rectangles of real world data?