In the financial world, most quants analyze time series data using languages such as Matlab, R, SAS, or Stata. I’ve used those tools, but I’m much happier working with a more general purpose language. Most recently, I’ve been writing most of my code in Python. Although people have implemented interfaces from Python to R and other numerical libraries, I prefer to avoid hopping in and out of Python. Fortunately, Python, NumPy, and SciPy provide an expressive, flexible, and efficient platform for analyzing data.
Regardless of what programming platform you choose, the biggest challenge is wrangling real world data into the platform’s data structures so that you can take advantage of their high level operators. When you’re wrangling, you’re typically confronted with a Procrustean bed that forces you to crudely fit your real world inputs to the pristine shape and view of your data structure. This painful process unfortunately risks generating incorrect and misleading results.
In the case of python’s numpy, you need to fit your data into numpy arrays. Recently I’ve needed to analyze a lot of real world time series data, so I’ve been exploring two important extensions to numpy: masked arrays and the scikits timeseries. I want to share my experiences and show some code.
I’m coding in Python 2.5.2, using numpy 1.3.0.dev6370, scipy 0.7.0, and scikits.timeseries 0.67.0.dev-r1480. I’ve run my code both on Windows and OS X 10.5. A single file containing all of the example code is located on github here. Everything that I’m writing related to this post is accessible via git://github.com/nodogbite/maskedarray_timeseries.git or by browsing here.
Real World Example
To ground this discussion, let’s consider a purely random dataset that approximates some simple real world daily financial data. The function ‘generateFakeStockData’ pumps out roughly 8 years of daily return and trading volume data for a universe of 2000 tickers. Using this data, we’ll calculate:
- the volume-weighted returns for a given basket of stocks, and
- the minimum, maximum, and average daily daily returns or trading volumes.
Starting at the top of the class hierarchy, consider the basic numpy array. Fundamentally, a numpy array is a data structure that stores a multi-dimensional collection of homogeneous data. Once data is in a numpy array, it’s easy to crunch it using either regular python code or optimized functions from the numpy and scipy libraries.
For our toy example, it’s tempting to tap into those capabilities by loading the data into numpy arrays. We could put the daily returns into a two-dimensional array of floats, while putting the daily trading volume into an array of integers. In both numpy arrays, the columns contain data for a particular ticker, and the rows contain data for a particular day. Then, as the figure shows, we could compute a vector of total returns by writing ‘NP.prod(dailyReturns + 1, axis=0) – 1’.
Unfortunately, our data, and most real world data, aren’t perfectly aligned, and are interspersed with missing values. These gaps aren’t easy to represent in the numpy array. As a hack, we could insert a conspicuous flag value in place of missing values, and then derive a perfectly clean array or set of arrays every time we want to use a numpy operator. This is both inefficient and error prone. We won’t bother doing that with our example daily financial data.
The numpy.ma.MaskedArray, a subclass of the numpy array, was built for this situation. Conceptually, a masked array is a numpy array coupled with a second array of booleans that has the same shape. The first array is called the ‘data’, and the boolean array is called the ‘mask’. The MaskedArray blanks out the data array value at every position where the mask array is ‘True’.
The MaskedArray redefines most of the numpy array’s functions so that it handles missing data just as you expect. For example, sum and product simply ignore the masked entries. Since MaskedArray is a subclass of ndarray, the switch is fairly seamless. There are also a bunch of new capabilities, as well as some potential pitfalls, which I’ll point out.
Returning to our example, we write a function that loads data from ‘generateFakeStockData’ into a couple of masked arrays. Let’s name the function ‘loadDataIntoMaskedArrays’. This function is fairly representative of most functions that I’ve written to build masked arrays when I’m processing CSV files. (Usually I’m a bit more sophisticated so that I can process any number of columns, instead of hard-coding information about the columns, as I’ve done here.) As we read each datum, we record both its value and a boolean that will become its mask. Finally, we pass both the data and the mask to the MaskedArray constructor.
Once the data is in the MaskedArray, we can use functions from its class and module to produce results that appropriately skip the missing data. Similar to the numpy array, we write ‘MA.product(dailyReturns + 1, axis=0) – 1’ to generate a vector of total returns for every ticker. Here we see a minor blemish in the API: the numpy package has both a ‘prod’ and ‘product’ function, but the numpy.ma package only has a ‘product’ function.
So far, we’ve only used the dates as a groupby key. The ‘scikits.timeseries’ package enables us to conveniently couple a list of dates to our masked array. Doing so will help us align data for different tickers, as well as produce subsets for a given date range. There’s some nice documentation for the timeseries module here, but let’s write some timeseries code for our example.
Let’s call the function ‘loadDataIntoTimeseries’, which uses a helper function ‘makeTimeseriesGrid’. This version of the loading function doesn’t have the bit of logic to fill in masked values for tickers that are missing from the initial dates. Instead, we simply keep lists of observed dates, datums, and masks for every ticker. Then we use functions from the timeseries package to properly align the rows and fill in missing values.
Creating a list of timeseries Date objects is the first step when constructing a timeseries object. Every Date object has a frequency. In this case we use the ‘B’ frequency, which stands for business days, or Monday through Friday. Then, using that list of Date objects, you must create a DateArray object, which also has a frequency that’s equal to the frequency of the dates. The time_series constructor also takes in a MaskedArray that has the same length as the DateArray.
The Date object has some useful functionality, such as when you add 1 to a Friday Date with business frequency, you get the subsequent Monday. One feature that’s potentially harmful is that if you construct a business frequency Date using a weekend, you’ll get back a Date for the subsequent Monday. I think a thrown exception would be more pythonic.
Let’s look at the function ‘makeTimeseriesGrid’. Here we build up a timeseries for every ticker, align them to a common date list, and finally stack all of the data together into a single two-dimensional grid. Note that I call ‘fill_missing_dates’ on each ticker timeseries. This fills in all the missing dates in the DateArray and also inserts masked rows into the timeseries data at the corresponding locations. I wish this were the default behavior because it leverages the principle point of MaskedArrays. I think calling it is the best practice.
Each of the timeseries are not necessarily aligned on the same date list. The timeseries package contains a function ‘aligned’, which behaves a bit like ‘fill_missing_dates’, to create a list of timeseries objects that contain the exact same dates. Its signature is a bit awkward because it takes its input as a variable length argument list.
Finally, we use the numpy.ma.column_stack function to build a two-dimensional array. Note that we access the ‘series’ property of each timeseries object, which returns its masked array. The timeseries object also has a ‘data’ property which returns a plain, un-masked numpy array. I wish ‘data’ returned the masked array because ‘series’ is a confusing name and I think a masked array should be the default.
To Be Continued
This post is pushing “too long didn’t read” length. I will defer benchmarks and additional examples of using the constructed timeseries object to a subsequent post.