Posts Tagged performance

Efficient Calculations on Named Columns of Realworld Data

In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name.  Every named column contains values of the same type and every row corresponds to related datums.  I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type.  Database tables have named and typed columns, but they typically only support a limited set of operations.  Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.

Stata is one commercial program that supports a variety of operations on data rectangles.  In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk.  Each column has a name and a data type.  The rectangles also support missing values, which is essential when working with real world data.

Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool.  Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.

I want to perform high level rectangular operations within a more flexible language like Python.  I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package.  Numpy supports named, typed columns via recarray, but recarray is generally incomplete.  For example, there’s no built in way to merge (join) two recarrays.  More importantly, recarray does not support masked values!

I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations.  My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays.  I can’t share the code because I wrote it at work.

I’m happy with my implementation, but I wish it were faster.  Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward.  Both of these do too much work outside of optimized space of numpy.  I’m thinking of rewriting these functions in C or Cython.

I’d love to hear comments about this general data structure.  How are other people dealing with rectangles of real world data?

Comments (7)

Manage your Django Transactions

I’m building a Django application at work that performs several thousand inserts.  I was frustrated when it was taking several minutes to populate the database.  For reference, I’m running on a Windows XP box with a 2GHz Xeon, 4GB of RAM, and I’m using SQLite as the database.  In one straightforward piece of the program, it was taking over 3 minutes to insert 5000 rows.

Since I’m new to Django, I wasn’t immediately aware that, by default, every call to an ORM object’s save() results in a separate transaction.  When you’re doing thousands of inserts, the overhead is unbearable.  So after reading this, this, and this, I realized that the @transaction.commit_on_success decorator would allow me to lump a collection of saves into a single transaction.

from django.db import transaction

@transaction.commit_on_success
def myfunction(request):
    for i in range(5000):
        myobj = MyObject(mynumber=i)
        myobj.save()

After I made this change, the 5000 inserts took roughly 15 seconds instead of over 3 minutes.

Comments (2)

Follow

Get every new post delivered to your Inbox.