Posts Tagged data

Efficient Calculations on Named Columns of Realworld Data

In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name.  Every named column contains values of the same type and every row corresponds to related datums.  I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type.  Database tables have named and typed columns, but they typically only support a limited set of operations.  Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.

Stata is one commercial program that supports a variety of operations on data rectangles.  In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk.  Each column has a name and a data type.  The rectangles also support missing values, which is essential when working with real world data.

Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool.  Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.

I want to perform high level rectangular operations within a more flexible language like Python.  I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package.  Numpy supports named, typed columns via recarray, but recarray is generally incomplete.  For example, there’s no built in way to merge (join) two recarrays.  More importantly, recarray does not support masked values!

I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations.  My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays.  I can’t share the code because I wrote it at work.

I’m happy with my implementation, but I wish it were faster.  Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward.  Both of these do too much work outside of optimized space of numpy.  I’m thinking of rewriting these functions in C or Cython.

I’d love to hear comments about this general data structure.  How are other people dealing with rectangles of real world data?

Comments (5)

Fantasy Football Pickem Strategy Results after Week 4

After Brett Favre and the Vikings polished off the Packers last night, I found myself in 7th place out of 50 in my NFL Pickem pool.  As the following box plot shows, choosing weights using Vegas point spreads put me above the median in 3/4 weeks and in the top quartile 2/4 weeks.  In the second week, when I was below the median, I was still within the interquartile range.

weeklyperformance_boxplot

Here’s some of the data in the above box plot.

Week  |    Mean  StdDev     IQR     Min      lq  Median      uq     Max
=======================================================================
   1  |   114.5    8.22      11      95     109     115     120     133
   2  |    77.1   10.38      13      53      70      78      83     107
   3  |   104.6   15.76      14      15      99     108     113     126
   4  |    91.0    7.05       9      71      86      91      95     104

Leave a Comment

Probability NFL Favorite Wins, Given Point Spread

I’m continuing my investigation of what we can infer from NFL point spreads.  I wrote a little code to compute the conditional probability of the favorite team winning, given the size of the point spread.  Here are the results.

probwin_exact_spread

As you can see, the game is essentially a tossup until the point spread is at least a fieldgoal.  Once the spread exceeds a touchdown, the favorite wins nearly 4/5 times.  In the rare instance that the point spread exceeds two touchdowns, then the favorite is a lock to win.

Here’s the raw data used to generate this plot.

    spread       wins      games    probwin
============================================
      0.00         21         47      44.68
      1.00         70        138      50.72
      1.50         60        123      48.78
      2.00         87        159      54.72
      2.50        126        249      50.60
      3.00        351        587      59.80
      3.50        225        360      62.50
      4.00        101        155      65.16
      4.50         71        120      59.17
      5.00         97        133      72.93
      5.50         99        142      69.72
      6.00        108        168      64.29
      6.50        154        226      68.14
      7.00        177        238      74.37
      7.50         85        123      69.11
      8.00         75         95      78.95
      8.50         70         85      82.35
      9.00         71         91      78.02
      9.50         79         98      80.61
     10.00         66         95      69.47
     10.50         55         67      82.09
     11.00         40         48      83.33
     11.50         23         27      85.19
     12.00         20         25      80.00
     12.50         28         34      82.35
     13.00         20         24      83.33
     13.50         34         44      77.27
     14.00         26         33      78.79
     14.50         14         16      87.50
     15.00          9          9     100.00
     15.50          9          9     100.00
     16.00         14         14     100.00
     16.50          3          3     100.00
     17.00          5          5     100.00
     17.50          6          7      85.71
     18.00          2          2     100.00
     18.50          1          1     100.00
     19.00          1          1     100.00
     19.50          2          2     100.00
     20.00          1          1     100.00
     21.00          1          1     100.00
     22.00          1          1     100.00
     24.00          2          2     100.00

Leave a Comment

Simple Backtesting of Football Pickem Strategy

I was curious to see how my simple strategy for the weighted fantasy football pickem would do in previous years.  I found an archive of NFL point spreads, over under lines, and game results on goldsheet.com.  I don’t know how accurate the data are, but I wrote a little Python script to parse it.  The formatting of the data is pretty rough, so I’ve put a CSV version up on Google Docs.

Here’s a tabular summary of the results.  The key column is Percent, which shows the percentage of available points (MaxPosib) that my strategy got in each year.

    Year      Min      Max      Avg    Total   MaxPosib    Percent
==================================================================
    1993       27      103    60.94     1097       1466      74.83
    1994       46       95    64.82     1102       1500      73.47
    1995       50      114    81.18     1380       1747      78.99
    1996       51      119    82.82     1408       1748      80.55
    1997       52      114    82.24     1398       1745      80.11
    1998       56      120    86.94     1478       1734      85.24
    1999       65      115    91.06     1548       1959      79.02
    2000       62      113    90.53     1539       1959      78.56
    2001       43      117    90.29     1535       1960      78.32
    2002       62      136    95.59     1625       2088      77.83
    2003       72      134    98.53     1675       2088      80.22
    2004       67      129    95.47     1623       2088      77.73
    2005       77      127   101.47     1725       2088      82.61
    2006       65      134    89.59     1523       2091      72.84
    2007       56      136   103.41     1758       2091      84.07
------------------------------------------------------------------
 average    56.73   120.40    87.66  1494.27    1890.13      78.96
std.dev.    12.20    11.84    11.67   189.42     211.33       3.39
     min    27.00    95.00    60.94  1097.00    1466.00      72.84
     max    77.00   136.00   103.41  1758.00    2091.00      85.24

So I expect to score 79% of the total points available this year if I use my strategy of assigning weights to the favored team, ordered by point spread.

The next step is obviously to try some other strategies and see what the results are.  First I want to see how poorly a completely random strategy does.

Comments (1)

Analysis of Google Code Jam Qualifier Results

I wrote a quick Python script to pull down all of the results from the qualifying rounds of Google Code Jam from 2008 and 2009.  I have put the rank and time of each submitted solution for all participants in these two Google Spreadsheets:

I also put a collection of summary statistics in this Google Spreadsheet.

Here are some conclusions drawn from these data:

  1. The number of participants increased by 1,449 people in 2009, a 20% increase.
  2. There were 2,572 people who participated both years.  This means 36% of the 2008 participants came back for more.
  3. The 2008 Qualifier had more difficult problems.  After normalizing the point totals (2008 had a max total of 75 points but 2009 had a max of 99 points), the average score was 15% higher in 2009.
  4. Problem C in 2008 was extremely hard.  Only 14% of the participants solved Problem C with the small data set, and only 9% solved Problem C with the large data set.
  5. The large data set for Problem C in 2009 was hard for many people.  Only 36% of people solved it.  Furthermore, of the people who solved the small data set, only 57% of those were able to solve the large data set.
  6. People worked roughly the same amount of time both years.  The 2009 times are all slightly larger, but the round was also extended by 2 hours because of technical problems early in the round.
  7. On average, for each participant, there is about a 3.5 hour gap between the submission time of the first solution and the submission time of the last solution.  Of course some people (like me) probably choose to tackle the problems whenever they found free moments during their day.

If people are interested, I’ll post my code and all the data somewhere.

Leave a Comment

SPY Closing Price Update

Just to round out my quick post on dirty financial data, I came into work today and saw that Thursday’s closing price for SPY is now correctly stated as 94.15.  Sometimes I half-wonder if somebody intentionally causes these mistakes just to put a stick in the wheel of quantitative backtests.

spy_hp_20090720

Leave a Comment

Dirty Financial Data: SPY Closing Price

Before joining finance, my naive assumption was that the market’s high stakes would necessitate accurate, high quality data.  In particular, I expected frequently traded stocks and ETFs on public exchanges to have accurately quoted end of day prices.  In reality, even those data are noisy.

Yesterday’s (7/16/2009) closing price for SPY is one example.  The NYSE reported a composite closing price of 93.11 for SPY, but the intraday price graph makes it clear that 94.15 is much more accurate.  Here are two Bloomberg screens showing both the tabulated closing price and the intraday price plot.

spy_hp_20090717

spy_gip2_20090717

Over 200 million shares of SPY trade every business day, yet bad closing prices like this still pop up.

Leave a Comment

First version of RankBuzzard.com

I spent the holiday weekend implementing a website using the Django web framework.  In just a few days, I managed to put together a fully functional website that displays Google Hot Trends data and collects user comments.  The general point might be described as “Why are these searches popular?”  You can see the results at RankBuzzard.com.

I last built a public website roughly 4 years ago, when I wrote imwatching.net (which is no longer up).  At a broad level, the two websites are similar.  Both collect time series data, put it in a database, and present it via HTML.  I built imwatching.net using a collection of perl scripts that used the CGI module.  I had no ORM.  I had to build my own user management logic.  I had no templating system.  As a result, it took me at least 3 or 4 times as long to develop imwatching.net.  The end result also wasn’t nearly as tidy and well structured as my implementation of RankBuzzard.com.

What’s more, I hosted imwatching.net on a dedicated server that I rented from serverbeach.com for roughly $100/month.  Although I was happy with the quality of the server, the $100/month cost ultimately caused me to close the site.  Now I’m paying $20/month for a virtual server at linode.com.  Although it’s not a direct comparison, I’m been very happy so far with the service at linode.com.

Hopefully some people will find RankBuzzard interesting, informative, and fun.  I have several ideas for how to improve it, and I plan on adding to it when I find the free time over the next few weeks.

Leave a Comment

Review of OmniGraphSketcher

Last night I downloaded OmniGraphSketcher and gave it a test drive.  I’m a huge fan of OmniGraffle, OmniGroup’s diagramming program, so I was curious to see what their new graphing program had to offer.  In short, I like the usability that it inherited from OmniGraffle, but they need to make several improvements before I would even consider paying them $29.95.

For years I’ve been searching for a graphing platform that helps me quickly create Tufte-quality graphics.  I’ve had the best luck with programming-based environments such as NodeBox, Processing, or PSTricks (within LaTeX).  Back in 2000 I even wrote my own framework called TexLogo, which was a language and interactive system for illustrating Latex documents with logo turtle graphics. These frameworks are all drawing programs with an API.  As such, they give you the finest level of control with the unfortunate downside of being labor intensive.  Whenever you use them you risk diving down the rabbit hole.

The other end of the spectrum are programs that easily produce bad or misleading graphs.  Excel and gnuplot easily crank out ugly graphics.

So I like OmniGraphSketcher’s approach.  The program mixes hand-guided drawing with x-y data coordinate locations.  The default result is clean and sharp, just like diagrams that come out of OmniGraffle.  Unfortunately it’s not a finished product.  Specifically, I need it to include the following:

  1. Units other than foating point numbers.  Most importantly, they need to add support for dates.  You can’t draw a time series graph with the program.
  2. Titles.  Yes, you can double click to add a label, but there’s no built in support for title or sub-titles.
  3. Legends.  This can also be done manually, but they could facilitate it so that you can link data points or series with a legend.
  4. More options for the axes.  For example I’d like the option to only show tick marks.  And I’d like the ability to only show specific tick marks.  I don’t want to be forced to use evenly spaced tick marks.

If the OmniGroup developers make these improvements, then I’m likely to send them thirty bucks.

Here’s a quick example graph that I threw together.

omnigraphsketcher_visits1

Here’s something closer to what I’d like that I made with OmniGraffle.

omnigraffle_visits

Comments (3)

Pythonic Data Analysis with MaskedArray and Timeseries

In the financial world, most quants analyze time series data using languages such as Matlab, R, SAS, or Stata.  I’ve used those tools, but I’m much happier working with a more general purpose language.  Most recently, I’ve been writing most of my code in Python.  Although people have implemented interfaces from Python to R and other numerical libraries, I prefer to avoid hopping in and out of Python.  Fortunately, Python, NumPy, and SciPy provide an expressive, flexible, and efficient platform for analyzing data.

Regardless of what programming platform you choose, the biggest challenge is wrangling real world data into the platform’s data structures so that you can take advantage of their high level operators.  When you’re wrangling, you’re typically confronted with a Procrustean bed that forces you to crudely fit your real world inputs to the pristine shape and view of your data structure.  This painful process unfortunately risks generating incorrect and misleading results.

In the case of python’s numpy, you need to fit your data into numpy arrays.  Recently I’ve needed to analyze a lot of real world time series data, so I’ve been exploring two important extensions to numpy: masked arrays and the scikits timeseries.  I want to share my experiences and show some code.

I’m coding in Python 2.5.2, using numpy 1.3.0.dev6370, scipy 0.7.0, and scikits.timeseries 0.67.0.dev-r1480.  I’ve run my code both on Windows and OS X 10.5.  A single file containing all of the example code is located on github here.  Everything that I’m writing related to this post is accessible via git://github.com/nodogbite/maskedarray_timeseries.git or by browsing here.

Real World Example

To ground this discussion, let’s consider a purely random dataset that approximates some simple real world daily financial data.  The function ‘generateFakeStockData’ pumps out roughly 8 years of daily return and trading volume data for a universe of 2000 tickers.  Using this data, we’ll calculate:

  • the volume-weighted returns for a given basket of stocks, and
  • the minimum, maximum, and average daily daily returns or trading volumes.

Starting at the top of the class hierarchy, consider the basic numpy array.  Fundamentally, a numpy array is a data structure that stores a multi-dimensional collection of homogeneous data.  Once data is in a numpy array, it’s easy to crunch it using either regular python code or optimized functions from the numpy and scipy libraries.

For our toy example, it’s tempting to tap into those capabilities by loading the data into numpy arrays.  We could put the daily returns into a two-dimensional array of floats, while putting the daily trading volume into an array of integers.  In both numpy arrays, the columns contain data for a particular ticker, and the rows contain data for a particular day.  Then, as the figure shows, we could compute a vector of total returns by writing ‘NP.prod(dailyReturns + 1, axis=0) – 1′.

numpyarray_stockdata1
Unfortunately, our data, and most real world data, aren’t perfectly aligned, and are interspersed with missing values.  These gaps aren’t easy to represent in the numpy array.  As a hack, we could insert a conspicuous flag value in place of missing values, and then derive a perfectly clean array or set of arrays every time we want to use a numpy operator.  This is both inefficient and error prone.  We won’t bother doing that with our example daily financial data.

Masked Arrays

The numpy.ma.MaskedArray, a subclass of the numpy array, was built for this situation.  Conceptually, a masked array is a numpy array coupled with a second array of booleans that has the same shape.  The first array is called the ‘data’, and the boolean array is called the ‘mask’.  The MaskedArray blanks out the data array value at every position where the mask array is ‘True’.

maskedarray_stockdata1
The MaskedArray redefines most of the numpy array’s functions so that it handles missing data just as you expect.  For example, sum and product simply ignore the masked entries.  Since MaskedArray is a subclass of ndarray, the switch is fairly seamless.  There are also a bunch of new capabilities, as well as some potential pitfalls, which I’ll point out.

Returning to our example, we write a function that loads data from ‘generateFakeStockData’ into a couple of masked arrays.  Let’s name the function ‘loadDataIntoMaskedArrays’.  This function is fairly representative of most functions that I’ve written to build masked arrays when I’m processing CSV files.  (Usually I’m a bit more sophisticated so that I can process any number of columns, instead of hard-coding information about the columns, as I’ve done here.)  As we read each datum, we record both its value and a boolean that will become its mask.  Finally, we pass both the data and the mask to the MaskedArray constructor.

Once the data is in the MaskedArray, we can use functions from its class and module to produce results that appropriately skip the missing data.  Similar to the numpy array, we write ‘MA.product(dailyReturns + 1, axis=0) – 1′ to generate a vector of total returns for every ticker.  Here we see a minor blemish in the API: the numpy package has both a ‘prod’ and ‘product’ function, but the numpy.ma package only has a ‘product’ function.

Timeseries

So far, we’ve only used the dates as a groupby key.  The ’scikits.timeseries’ package enables us to conveniently couple a list of dates to our masked array.  Doing so will help us align data for different tickers, as well as produce subsets for a given date range.  There’s some nice documentation for the timeseries module here, but let’s write some timeseries code for our example.

Let’s call the function ‘loadDataIntoTimeseries’, which uses a helper function ‘makeTimeseriesGrid’.  This version of the loading function doesn’t have the bit of logic to fill in masked values for tickers that are missing from the initial dates.  Instead, we simply keep lists of observed dates, datums, and masks for every ticker.  Then we use functions from the timeseries package to properly align the rows and fill in missing values.

Creating a list of timeseries Date objects is the first step when constructing a timeseries object.  Every Date object has a frequency.  In this case we use the ‘B’ frequency, which stands for business days, or Monday through Friday.  Then, using that list of Date objects, you must create a DateArray object, which also has a frequency that’s equal to the frequency of the dates.  The time_series constructor also takes in a MaskedArray that has the same length as the DateArray.

The Date object has some useful functionality, such as when you add 1 to a Friday Date with business frequency, you get the subsequent Monday.  One feature that’s potentially harmful is that if you construct a business frequency Date using a weekend, you’ll get back a Date for the subsequent Monday.  I think a thrown exception would be more pythonic.

Let’s look at the function ‘makeTimeseriesGrid’.  Here we build up a timeseries for every ticker, align them to a common date list, and finally stack all of the data together into a single two-dimensional grid.  Note that I call ‘fill_missing_dates’ on each ticker timeseries.  This fills in all the missing dates in the DateArray and also inserts masked rows into the timeseries data at the corresponding locations.  I wish this were the default behavior because it leverages the principle point of MaskedArrays.  I think calling it is the best practice.

Each of the timeseries are not necessarily aligned on the same date list.  The timeseries package contains a function ‘aligned’, which behaves a bit like ‘fill_missing_dates’, to create a list of timeseries objects that contain the exact same dates.  Its signature is a bit awkward because it takes its input as a variable length argument list.

Finally, we use the numpy.ma.column_stack function to build a two-dimensional array.  Note that we access the ’series’ property of each timeseries object, which returns its masked array.  The timeseries object also has a ‘data’ property which returns a plain, un-masked numpy array.  I wish ‘data’ returned the masked array because ’series’ is a confusing name and I think a masked array should be the default.

To Be Continued

This post is pushing “too long didn’t read” length.  I will defer benchmarks and additional examples of using the constructed timeseries object to a subsequent post.

Comments (6)