Archive for Programming Languages

Efficient Calculations on Named Columns of Real-World Data

In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name.  Every named column contains values of the same type, and every row holds a set of related values.  I say “rectangle” instead of “matrix” because a matrix typically implies that every value has the same type.  Database tables have named and typed columns, but they typically support only a limited set of operations.  Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.

Stata is one commercial program that supports a variety of operations on data rectangles.  In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk.  Each column has a name and a data type.  The rectangles also support missing values, which is essential when working with real world data.

Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool.  Also, its language is so simple that operations like loops or fill-forward become a nightmare.

I want to perform high-level rectangular operations within a more flexible language like Python.  I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package.  Numpy supports named, typed columns via recarray, but recarray is generally incomplete.  For example, there’s no built-in way to merge (join) two recarrays.  More importantly, recarray does not support masked values!

I have built my own module that replicates and expands upon most of Stata’s rectangular operations.  My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays.  I can’t share the code because I wrote it at work.
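
To make the idea concrete, here’s a minimal sketch of that kind of structure.  It is not the work code; the class and method names (Rectangle, sort_by, where) are made up for illustration.

import numpy.ma as ma

class Rectangle(object):
    """Minimal sketch: named, maskable columns stored as a dict of
    equal-length masked arrays."""

    def __init__(self, columns):
        self.columns = dict(columns)
        lengths = set(len(col) for col in self.columns.values())
        assert len(lengths) <= 1, "all columns must have the same length"

    def sort_by(self, name):
        # Reorder every column by one column's values, Stata-style.
        order = self.columns[name].argsort()
        return Rectangle((k, v[order]) for k, v in self.columns.items())

    def where(self, rowmask):
        # Row filter: keep only the rows where rowmask is True.
        return Rectangle((k, v[rowmask]) for k, v in self.columns.items())

prices = Rectangle({"ticker": ma.array(["ibm", "aapl", "goog"]),
                    "close":  ma.array([100.0, 0.0, 500.0],
                                       mask=[False, True, False])})
by_close = prices.sort_by("close")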

I’m happy with my implementation, but I wish it were faster.  Although it’s faster than Stata, two key operations are slower than I’d like: 1) merge (join) and 2) fill-forward.  Both do too much work outside of numpy’s optimized routines.  I’m thinking of rewriting these functions in C or Cython.
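
For a sense of what staying inside numpy looks like, here is one vectorized sketch of fill-forward for a single masked column.  This is not the work code, just the common index-propagation trick.

import numpy as np
import numpy.ma as ma

def fill_forward(column):
    # Propagate the index of the last unmasked entry forward, then take.
    idx = np.where(~ma.getmaskarray(column), np.arange(column.size), 0)
    np.maximum.accumulate(idx, out=idx)
    # Positions before the first observation pick up index 0; if that entry
    # is itself masked, they simply stay masked.
    return column[idx]

prices = ma.array([1.0, 0.0, 0.0, 4.0, 0.0], mask=[0, 1, 1, 0, 1])
print(fill_forward(prices))   # [1.0 1.0 1.0 4.0 4.0]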

I’d love to hear comments about this general data structure.  How are other people dealing with rectangles of real world data?

Comments (7)

Automated Fantasy Pro Football Pickem Weights

I’m participating in the Yahoo! Sports Fantasy Pro Football Pickem with some family and friends.   I don’t regularly play fantasy sports, but I was coaxed into donating money to the pool during a family trip this summer.  I blame the beer and port.

For the uninitiated, each week you must pick a winner in all 16 games, assigning a confidence weight from 1 to 16 to each game.  If your pick wins, then you are awarded points equal to the confidence weight that you assigned to that game.

Now that my money’s in, I want to make a solid showing.  My strategy is to leverage the gambling lines, mapping the point spreads and over/under lines to my picks and weights.

I threw together a little Python script to parse the Yahoo odds web page.  For each game, the script averages the point spreads and over/under lines.  The favored team is selected, and the picks are ranked by point spread, with wider spreads getting larger weights.  Ties on point spread are broken by the over/under line, with a larger over/under line mapping to a larger weight.
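
Here’s roughly what that ranking logic looks like, sketched with hypothetical field names (the real script parses these values out of the Yahoo odds page).

def assign_weights(games):
    # Each game dict is assumed to hold the averaged lines, e.g.
    # {"favorite": "...", "underdog": "...", "spread": -9.83, "over_under": 45.17}
    # Wider spreads get larger weights; ties go to the larger over/under.
    ranked = sorted(games,
                    key=lambda g: (abs(g["spread"]), g["over_under"]),
                    reverse=True)
    weights = range(len(games), 0, -1)      # 16, 15, ..., 1 for a 16-game week
    return [(w, g["favorite"], g["underdog"], g["spread"], g["over_under"])
            for w, g in zip(weights, ranked)]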

Now I’m wondering how to improve this simple algorithm.  I might instead break ties by giving a smaller over/under line a higher weight; the reasoning would be that smaller over/under lines imply less variance.  Surely there are other ways to improve the rankings.

Here’s the table that my little script prints out.

weight | winner                         loser                     spread    o/u
================================================================================
    16 | Minnesota Vikings         over Detroit Lions              -9.83  45.17
    15 | Washington Redskins       over St. Louis Rams             -9.58  37.00
    14 | Green Bay Packers         over Cincinnati Bengals         -9.08  42.00
    13 | Tennessee Titans          over Houston Texans             -6.50  40.92
    12 | Atlanta Falcons           over Carolina Panthers          -6.08  42.50
    11 | Buffalo Bills             over Tampa Bay Buccaneers       -4.42  42.00
    10 | New England Patriots      over New York Jets              -3.42  46.17
     9 | Jacksonville Jaguars      over Arizona Cardinals          -3.08  42.50
     8 | Indianapolis Colts        over Miami Dolphins             -3.08  42.17
     7 | San Diego Chargers        over Baltimore Ravens           -3.00  40.33
     6 | Denver Broncos            over Cleveland Browns           -3.00  38.83
     5 | Kansas City Chiefs        over Oakland Raiders            -3.00  38.50
     4 | Pittsburgh Steelers       over Chicago Bears              -2.92  37.50
     3 | Dallas Cowboys            over New York Giants            -2.75  45.00
     2 | San Francisco 49ers       over Seattle Seahawks           -1.10  39.58
     1 | New Orleans Saints        over Philadelphia Eagles        -1.00  46.08

Leave a Comment

Analysis of Google Code Jam Qualifier Results

I wrote a quick Python script to pull down all of the results from the qualifying rounds of Google Code Jam from 2008 and 2009.  I have put the rank and time of each submitted solution for all participants into two Google Spreadsheets.

I also put a collection of summary statistics in this Google Spreadsheet.

Here are some conclusions drawn from these data:

  1. The number of participants grew by 1,449 people in 2009, a 20% increase.
  2. There were 2,572 people who participated both years.  This means 36% of the 2008 participants came back for more.
  3. The 2008 Qualifier had more difficult problems.  After normalizing the point totals (2008 had a max total of 75 points but 2009 had a max of 99 points), the average score was 15% higher in 2009.
  4. Problem C in 2008 was extremely hard.  Only 14% of the participants solved Problem C with the small data set, and only 9% solved Problem C with the large data set.
  5. The large data set for Problem C in 2009 was hard for many people.  Only 36% of people solved it.  Furthermore, of the people who solved the small data set, only 57% of those were able to solve the large data set.
  6. People worked roughly the same amount of time both years.  The 2009 times are all slightly larger, but the round was also extended by 2 hours because of technical problems early in the round.
  7. On average, for each participant, there is about a 3.5-hour gap between the submission time of the first solution and the submission time of the last solution.  Of course, some people (like me) probably chose to tackle the problems whenever they found free moments during their day.

If people are interested, I’ll post my code and all the data somewhere.

Leave a Comment

Google Code Jam 2009 Qualifier

I finished all 3 problems in this year’s Google Code Jam qualifier.  I found them to be much easier than last year’s.  (Although last year a friend told me about the qualifier less than two hours before it ended, so I was rushed.)

Last year I wrote my solutions in Haskell because I was learning the language.  This year I wasn’t so adventurous; I wrote my solutions using Python.

For the Alien Language challenge, I encoded the lexicon using a tree built with python dictionaries.  I imagine this could be done using a regex, but I haven’t thought that through.
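
Here’s a rough sketch of that dictionary-tree idea.  It isn’t my contest code, and the parsing of the pattern syntax into per-position character sets is left out.

def build_trie(words):
    # Nested-dict trie; the None key marks the end of a known word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return trie

def count_matches(pattern, trie):
    # pattern is a list of allowed-character sets, one per position,
    # e.g. [set("ab"), set("bc"), set("ca")] for the pattern (ab)(bc)(ca).
    def walk(node, i):
        if i == len(pattern):
            return 1 if None in node else 0
        return sum(walk(child, i + 1)
                   for ch, child in node.items()
                   if ch is not None and ch in pattern[i])
    return walk(trie, 0)

lexicon = build_trie(["abc", "bca", "dac", "dbc", "cba"])
print(count_matches([set("ab"), set("bc"), set("ca")], lexicon))   # 2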

For the Watersheds challenge, I simply built two 2d arrays: one to hold the topography and the other to hold the labels.  From each coordinate on the map, I followed the flow downhill until I reached a sink or a coordinate that was already labeled.  This memoization ensures that each cell is visited only once.
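
For reference, here’s a simplified sketch of that memoized flow-following.  It is not my contest code; it assumes the problem’s N, W, E, S tie-breaking and labels basins ‘a’, ‘b’, … in visiting order.

def label_watersheds(height):
    rows, cols = len(height), len(height[0])
    labels = [[None] * cols for _ in range(rows)]
    next_label = [ord('a')]

    def downhill(r, c):
        # Lowest strictly-lower neighbor, ties broken in N, W, E, S order.
        best = None
        for dr, dc in ((-1, 0), (0, -1), (0, 1), (1, 0)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and height[nr][nc] < height[r][c]:
                if best is None or height[nr][nc] < height[best[0]][best[1]]:
                    best = (nr, nc)
        return best

    def visit(r, c):
        if labels[r][c] is None:                 # memoization: label each cell once
            nxt = downhill(r, c)
            if nxt is None:                      # sink: start a new basin
                labels[r][c] = chr(next_label[0])
                next_label[0] += 1
            else:                                # otherwise inherit the downhill label
                labels[r][c] = visit(*nxt)
        return labels[r][c]

    for r in range(rows):
        for c in range(cols):
            visit(r, c)
    return labels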

For the Welcome to Code Jam challenge, I memoized a mapping from (string_index, substring) to the number of times that substring can be completed starting at the index.  Again, this was straightforward.
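
And a sketch of that memoization for the third problem (again, not my contest code; the real problem also asks for only the last four digits of the count).

def count_subsequences(text, phrase="welcome to code jam"):
    # memo maps (index into text, index into phrase) -> number of ways the
    # rest of the phrase can be completed from here.
    memo = {}

    def count(ti, pi):
        if pi == len(phrase):
            return 1
        if ti == len(text):
            return 0
        if (ti, pi) not in memo:
            ways = count(ti + 1, pi)              # skip this text character
            if text[ti] == phrase[pi]:
                ways += count(ti + 1, pi + 1)     # or use it to match
            memo[(ti, pi)] = ways
        return memo[(ti, pi)]

    return count(0, 0)

print(count_subsequences("welcome to codejam"))   # 0: only two spaces available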

I’m looking forward to seeing how other people solved each challenge.  I wonder how 16-year-old Neal Wu made such quick work of the problems, solving all three in under 26 minutes.  That’s amazing.

Comments (6)

I don’t use debuggers

I just read this post, which says “Just about every developer uses a debugger, at least occasionally.”  I don’t know if I’m inadequate or working on a peculiar class of problems, but I haven’t used a debugger in roughly 10 years.

In total, I have used a debugger twice.  Once I was chasing down a bug while working with a boutique embedded-systems compiler; the compiler itself had the bug, and staring at the C code didn’t cut it.  I also used the Perl debugger while writing my MS thesis (back when I wrote oodles of Perl code).  I used it because a friend showed it to me, and it seemed cool.

My primary debugging technique is code examination.  I simply look carefully at my code and think about it.  I also write unit tests.

I’m not fundamentally opposed to debuggers.  It’s just that I haven’t used one in the past 10 years.

Leave a Comment

Randomized optimization algorithm

I thought of a simple randomized optimization algorithm today while I was working on a problem. The algorithm’s working well for me, and while I doubt it’s truly novel, it’s new to me.

Broadly speaking, I’m searching for a vector of floating point numbers that optimizes my objective function. My objective function is complex enough to be thought of as a black box, so there’s no closed-form gradient. Although the function has no local minima, the parameters are subject to a number of boundary conditions.

I’m coding in python, so at first I tried using scipy.optimize.fmin. In the function passed to fmin, I returned a large number if the given vector violated the boundary conditions. Unfortunately those large numbers created singularities that caused fmin to get stuck. There are other functions within scipy.optimize that might work, but it wasn’t clear how to use them. This motivated me to code up my own algorithm.

The general approach is simple: at each step, generate a random neighbor. If that neighbor is superior to the current point, move to it. I generate a neighbor by perturbing each element of my vector by some fixed delta. Specifically, I multiply a vector produced by numpy.random.random_integers(low=-1, high=1, size=mysize) by delta. Then I normalize the result. Lastly, I find the closest point that satisfies the boundary conditions.

This approach has the succinctness that makes many randomized algorithms appealing. In addition, its performance on my problem was better than I expected. Two enhancements made it exceptional.

First, if I generate too many neighbors without finding a superior one, then I decrease delta by a factor of 10. The algorithm stops when delta drops below some threshold. This provides a natural transition between large, coarse steps for speed and small, fine steps for precision.

Second, when I’m at point x0, and I generate a superior point at x1, then I search for increasingly superior points along the trajectory from x0 to x1. Doing so yields the benefits of a poor man’s gradient.
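
Here’s a hedged sketch of the whole loop as described above.  It isn’t my actual code; the problem-specific pieces (the objective, the boundary projection, and the normalization step I mentioned) are passed in or omitted.

import numpy as np

def randomized_minimize(objective, project, x0,
                        delta=1.0, min_delta=1e-4, max_failures=200):
    # objective: black-box function to minimize.
    # project:   maps any vector to the closest point satisfying the
    #            boundary conditions (problem-specific).
    x = project(np.asarray(x0, dtype=float))
    fx = objective(x)
    while delta > min_delta:
        failures = 0
        while failures < max_failures:
            # Perturb each element by -delta, 0, or +delta, then re-project.
            step = delta * np.random.randint(-1, 2, size=x.shape)
            candidate = project(x + step)
            fc = objective(candidate)
            if fc < fx:
                # Poor man's gradient: keep stepping along x -> candidate
                # while the objective keeps improving.
                direction = candidate - x
                while True:
                    further = project(candidate + direction)
                    ff = objective(further)
                    if ff >= fc:
                        break
                    candidate, fc = further, ff
                x, fx = candidate, fc
                failures = 0
            else:
                failures += 1
        delta /= 10.0   # too many misses: switch to finer steps
    return x, fx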

[Figure: illustration of the randomized search (example_search)]

The figure gives a rough illustration of the search technique. We start at a point with value 8, and randomly pick a neighbor using delta 4. Next we search along that trajectory and find a better point with value 4. Once again using delta 4 we pick a random superior neighbor. Note that we hop over a boundary point when we move there. Unfortunately we don’t randomly generate a better neighbor with delta 4 from that point (even though one exists). So we switch delta to 2, and continue the algorithm.

My code is currently embedded with logic that’s specific to the problem I’m solving.  If I abstract it out, I’ll post the code here.  I’m curious to hear from anybody else who’s implemented a randomized optimization algorithm like this one.

Comments (1)

Pythonic Data Analysis with MaskedArray and Timeseries

In the financial world, most quants analyze time series data using languages such as Matlab, R, SAS, or Stata.  I’ve used those tools, but I’m much happier working with a more general purpose language.  Most recently, I’ve been writing most of my code in Python.  Although people have implemented interfaces from Python to R and other numerical libraries, I prefer to avoid hopping in and out of Python.  Fortunately, Python, NumPy, and SciPy provide an expressive, flexible, and efficient platform for analyzing data.

Regardless of what programming platform you choose, the biggest challenge is wrangling real world data into the platform’s data structures so that you can take advantage of their high level operators.  When you’re wrangling, you’re typically confronted with a Procrustean bed that forces you to crudely fit your real world inputs to the pristine shape and view of your data structure.  This painful process unfortunately risks generating incorrect and misleading results.

In the case of python’s numpy, you need to fit your data into numpy arrays.  Recently I’ve needed to analyze a lot of real world time series data, so I’ve been exploring two important extensions to numpy: masked arrays and the scikits.timeseries package.  I want to share my experiences and show some code.

I’m coding in Python 2.5.2, using numpy 1.3.0.dev6370, scipy 0.7.0, and scikits.timeseries 0.67.0.dev-r1480.  I’ve run my code both on Windows and OS X 10.5.  A single file containing all of the example code is located on github here.  Everything that I’m writing related to this post is accessible via git://github.com/nodogbite/maskedarray_timeseries.git or by browsing here.

Real World Example

To ground this discussion, let’s consider a purely random dataset that approximates some simple real world daily financial data.  The function ‘generateFakeStockData’ pumps out roughly 8 years of daily return and trading volume data for a universe of 2000 tickers.  Using this data, we’ll calculate:

  • the volume-weighted returns for a given basket of stocks, and
  • the minimum, maximum, and average daily returns or trading volumes.

Starting at the top of the class hierarchy, consider the basic numpy array.  Fundamentally, a numpy array is a data structure that stores a multi-dimensional collection of homogeneous data.  Once data is in a numpy array, it’s easy to crunch it using either regular python code or optimized functions from the numpy and scipy libraries.

For our toy example, it’s tempting to tap into those capabilities by loading the data into numpy arrays.  We could put the daily returns into a two-dimensional array of floats, while putting the daily trading volume into an array of integers.  In both numpy arrays, the columns contain data for a particular ticker, and the rows contain data for a particular day.  Then, as the figure shows, we could compute a vector of total returns by writing ‘NP.prod(dailyReturns + 1, axis=0) - 1’.

[Figure: stock data laid out in plain numpy arrays (numpyarray_stockdata1)]
Unfortunately, our data, and most real world data, aren’t perfectly aligned, and are interspersed with missing values.  These gaps aren’t easy to represent in the numpy array.  As a hack, we could insert a conspicuous flag value in place of missing values, and then derive a perfectly clean array or set of arrays every time we want to use a numpy operator.  This is both inefficient and error prone.  We won’t bother doing that with our example daily financial data.

Masked Arrays

The numpy.ma.MaskedArray, a subclass of the numpy array, was built for this situation.  Conceptually, a masked array is a numpy array coupled with a second array of booleans that has the same shape.  The first array is called the ‘data’, and the boolean array is called the ‘mask’.  The MaskedArray blanks out the data array value at every position where the mask array is ‘True’.
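
A tiny illustration of the data/mask pairing (toy numbers, not the fake stock data):

import numpy as np
import numpy.ma as ma

data = np.array([[0.010, 0.020, -0.015],
                 [0.005, 0.030,  0.000]])
mask = np.array([[False, False, True],     # third ticker missing on day one
                 [False, False, False]])
dailyReturns = ma.MaskedArray(data, mask=mask)

print(dailyReturns.mean(axis=0))   # the masked entry is simply ignored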

[Figure: the same stock data as a MaskedArray, with masked entries blanked out (maskedarray_stockdata1)]
The MaskedArray redefines most of the numpy array’s functions so that it handles missing data just as you expect.  For example, sum and product simply ignore the masked entries.  Since MaskedArray is a subclass of ndarray, the switch is fairly seamless.  There are also a bunch of new capabilities, as well as some potential pitfalls, which I’ll point out.

Returning to our example, we write a function that loads data from ‘generateFakeStockData’ into a couple of masked arrays.  Let’s name the function ‘loadDataIntoMaskedArrays’.  This function is fairly representative of most functions that I’ve written to build masked arrays when I’m processing CSV files.  (Usually I’m a bit more sophisticated so that I can process any number of columns, instead of hard-coding information about the columns, as I’ve done here.)  As we read each datum, we record both its value and a boolean that will become its mask.  Finally, we pass both the data and the mask to the MaskedArray constructor.
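
The shape of that loading pattern looks roughly like this; it’s a stripped-down, hypothetical stand-in for ‘loadDataIntoMaskedArrays’, handling a single column of raw strings.

import numpy.ma as ma

def load_column(raw_values, missing_token=""):
    data, mask = [], []
    for raw in raw_values:
        if raw == missing_token:
            data.append(0.0)        # placeholder value; it gets masked anyway
            mask.append(True)
        else:
            data.append(float(raw))
            mask.append(False)
    return ma.MaskedArray(data, mask=mask)

returns = load_column(["0.01", "", "-0.02"])   # the middle entry comes back masked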

Once the data is in the MaskedArray, we can use functions from its class and module to produce results that appropriately skip the missing data.  Similar to the numpy array, we write ‘MA.product(dailyReturns + 1, axis=0) - 1’ to generate a vector of total returns for every ticker.  Here we see a minor blemish in the API: the numpy package has both a ‘prod’ and ‘product’ function, but the numpy.ma package only has a ‘product’ function.

Timeseries

So far, we’ve only used the dates as a groupby key.  The ‘scikits.timeseries’ package enables us to conveniently couple a list of dates to our masked array.  Doing so will help us align data for different tickers, as well as produce subsets for a given date range.  There’s some nice documentation for the timeseries module here, but let’s write some timeseries code for our example.

Let’s call the function ‘loadDataIntoTimeseries’, which uses a helper function ‘makeTimeseriesGrid’.  This version of the loading function doesn’t have the bit of logic that fills in masked values for tickers missing from the initial dates.  Instead, we simply keep lists of observed dates, data values, and masks for every ticker.  Then we use functions from the timeseries package to properly align the rows and fill in missing values.

Creating a list of timeseries Date objects is the first step when constructing a timeseries object.  Every Date object has a frequency.  In this case we use the ‘B’ frequency, which stands for business days, or Monday through Friday.  Then, using that list of Date objects, you must create a DateArray object, which also has a frequency that’s equal to the frequency of the dates.  The time_series constructor also takes in a MaskedArray that has the same length as the DateArray.

The Date object has some useful functionality; for example, when you add 1 to a Friday Date with business frequency, you get the following Monday.  One feature that’s potentially harmful is that if you construct a business-frequency Date from a weekend date, you’ll get back a Date for the subsequent Monday.  I think a thrown exception would be more pythonic.
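
Here’s a minimal sketch of those construction steps.  The function names come from the package, but treat the exact argument spellings as approximate, since they vary a bit between scikits.timeseries versions.

import numpy.ma as ma
import scikits.timeseries as ts

# Three business-day dates, with Wednesday missing from the observations.
dates = [ts.Date('B', '2009-09-14'),
         ts.Date('B', '2009-09-15'),
         ts.Date('B', '2009-09-17')]
darray = ts.date_array(dates, freq='B')
values = ma.MaskedArray([0.010, -0.020, 0.005], mask=[False, True, False])
series = ts.time_series(values, dates=darray)

# Business-frequency arithmetic: Friday + 1 is the following Monday.
friday = ts.Date('B', '2009-09-18')
monday = friday + 1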

Let’s look at the function ‘makeTimeseriesGrid’.  Here we build up a timeseries for every ticker, align them to a common date list, and finally stack all of the data together into a single two-dimensional grid.  Note that I call ‘fill_missing_dates’ on each ticker timeseries.  This fills in all the missing dates in the DateArray and also inserts masked rows into the timeseries data at the corresponding locations.  I wish this were the default behavior because it leverages the principal point of MaskedArrays.  Either way, I think calling it is best practice.

The individual timeseries are not necessarily aligned on the same date list.  The timeseries package contains a function ‘aligned’, which behaves a bit like ‘fill_missing_dates’, to create a list of timeseries objects that contain exactly the same dates.  Its signature is a bit awkward because it takes its input as a variable-length argument list.

Finally, we use the numpy.ma.column_stack function to build a two-dimensional array.  Note that we access the ‘series’ property of each timeseries object, which returns its masked array.  The timeseries object also has a ‘data’ property which returns a plain, un-masked numpy array.  I wish ‘data’ returned the masked array because ‘series’ is a confusing name and I think a masked array should be the default.
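
Putting those three steps together, the core of a grid-building helper might look like this.  It’s a sketch in the spirit of ‘makeTimeseriesGrid’, not the function itself.

import numpy.ma as ma
import scikits.timeseries as ts

def make_grid(ticker_series):
    # 1. Insert masked rows for any dates missing within each series.
    filled = [ts.fill_missing_dates(s) for s in ticker_series]
    # 2. 'aligned' takes a variable-length argument list and returns
    #    timeseries that all cover the same dates.
    aligned_series = ts.aligned(*filled)
    # 3. Stack the underlying masked arrays ('.series', not '.data',
    #    which would silently drop the mask) into one 2-D grid.
    return ma.column_stack([s.series for s in aligned_series])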

To Be Continued

This post is pushing “too long didn’t read” length.  I will defer benchmarks and additional examples of using the constructed timeseries object to a subsequent post.

Comments (6)

Scatter Map of Madoff’s Victims

When I learned that the court had made a list of Madoff’s victims available, I immediately wanted to graph it.  I first forwarded the PDF to one of my best friends.  Within a couple of hours he had converted it to text, which he streamed through Yahoo Pipes to generate KML suitable for viewing with Google Maps.  The pipes filtered out duplicates and performed address discovery and geolocation.  Unfortunately, Google Maps can’t display more than 80 markers at once, which is far fewer than the thousands of people whom Madoff ripped off.  He eventually did pull off a single-view visualization, but it ran like a dog.

So last night I took a different approach.  I extracted all of the zip codes from the list (ignoring the international investors robbed by Madoff).  Then I used http://geocoder.us to map each zip code to a latitude and longitude.  Next I wrote a short script in NodeBox to draw a projection of the zip codes on a canvas.  Each point is color-coded to indicate the number of times that zip code occurs in the court document.  I was helped a lot by code and ideas in Ben Fry’s Visualizing Data book, which I have access to via Safari Bookshelf.  In particular, I stole his function for projecting the latitude and longitude numbers onto the canvas.
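
The projection itself is just a linear mapping from latitude/longitude to canvas coordinates.  Here’s a generic stand-in for the idea (not Ben Fry’s function, and the bounds and canvas size are made up).

def project(lat, lon, bounds, width, height):
    # bounds = (min_lat, max_lat, min_lon, max_lon) for the region being drawn.
    min_lat, max_lat, min_lon, max_lon = bounds
    x = (lon - min_lon) / (max_lon - min_lon) * width
    y = (max_lat - lat) / (max_lat - min_lat) * height   # canvas y grows downward
    return x, y

# e.g. roughly lower Manhattan on an 800x600 canvas of the NYC metro area
x, y = project(40.71, -74.00, (40.4, 41.2, -74.5, -73.3), 800, 600)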

Here are the resulting images.  The perfectionist in me would love to tinker with this for hours.

[Figure: scatter map of Madoff's victims in the NYC Metro area]

[Figure: scatter map of Madoff's victims in NYC Metro area]

[Figure: scatter map of Madoff's victims in New England]

[Figure: scatter map of Madoff's victims in Florida]

Comments (6)

Manage your Django Transactions

I’m building a Django application at work that performs several thousand inserts.  I was frustrated when it was taking several minutes to populate the database.  For reference, I’m running on a Windows XP box with a 2GHz Xeon, 4GB of RAM, and I’m using SQLite as the database.  In one straightforward piece of the program, it was taking over 3 minutes to insert 5000 rows.

Since I’m new to Django, I wasn’t immediately aware that, by default, every call to an ORM object’s save() results in a separate transaction.  When you’re doing thousands of inserts, the overhead is unbearable.  So after reading this, this, and this, I realized that the @transaction.commit_on_success decorator would allow me to lump a collection of saves into a single transaction.

from django.db import transaction

@transaction.commit_on_success
def myfunction(request):
    # All 5000 saves below share a single transaction instead of
    # one transaction per save().
    for i in range(5000):
        myobj = MyObject(mynumber=i)
        myobj.save()

After I made this change, the 5000 inserts took roughly 15 seconds instead of over 3 minutes.

Comments (2)

Hack for functools.partial and multiprocessing

Today I wrote some python code that mapped a function onto a sequence of financial time series data.  For the first version, I used itertools.imap to apply a partial function created with functools.partial to the sequence.  I needed a partial function with curried arguments because a couple of the inputs were created and fixed before the call to itertools.imap.

The program took 20 minutes to run because the dataset is rather large.  Since it’s an embarrassingly parallel problem, I decided to give the multiprocessing package a whirl.  I’ve used multiprocessing’s parallel map functions before, so I naively created a worker Pool with 4 processes and changed itertools.imap to pool.imap.  Unfortunately, I was smacked with this error: “TypeError: type ‘partial’ takes at least one argument.”

It seems that the functools.partial object can’t be pickled and sent to the multiprocessing workers.  I worked around this by using itertools.izip combined with itertools.repeat.  The itertools.repeat function mimics currying by repeating the fixed argument.  I’ve pasted a simple example of this below, and I’ve also put it here.  If anybody has a better way of doing this, please let me know.

import functools
import itertools
import multiprocessing

def uncurryDummyMultiply(t):
    # Unpack the (fixed_arg, varying_arg) tuple built by izip/repeat below.
    return dummyMultiply(t[0], t[1])

def dummyMultiply(a, b):
    return a * b

def serialMap():
    mathFun = functools.partial(dummyMultiply, 10)
    total = 0
    for x in itertools.imap(mathFun, xrange(1000)):
        total += x
    return total

def parallelMap_Error():
    """Generates an Error!
    TypeError: type 'partial' takes at least one argument
    """
    mathFun = functools.partial(dummyMultiply, 10)
    pool    = multiprocessing.Pool(processes=4)
    total = 0
    for x in pool.imap(mathFun, xrange(1000)):
        total += x
    return total

def parallelMap_NoError():
    """Parallel version of serialMap that doesn't generate a TypeError
    """
    pool    = multiprocessing.Pool(processes=4)
    total = 0
    for x in pool.imap(uncurryDummyMultiply,
                       itertools.izip(itertools.repeat(10),
                                      xrange(1000)),
                       chunksize=50):
        total += x
    return total

if __name__ == "__main__":
    print "serialMap result:   %d" % serialMap()
    print "parallelMap result: %d" % parallelMap_NoError()

Leave a Comment
