Posts Tagged finance

Pokerbot Lesson: Work Hard During the Good Times

As I mentioned in an earlier post, I wrote a poker bot in 2004.  I learned a lot from that project, and I’d like to share some of the lessons learned over a series of blog posts.

All poker players are familiar with the game’s swings.  Poker’s unpredictable sequence of wins and losses stems from the addictive cocktail of uncertainty and randomness that makes the game so popular.  There are many lessons to learn while riding the roller coaster of wins and losses, but I’ll focus on one that’s particular to a poker bot author (or an investor in the markets).

Imagine if you unleashed your poker bot algorithm and observed the following performance over the first 10,000 hands.

Pokerbot performance over 10,000 hands

It takes a poker bot about 22 hours to play 10,000 hands.  This assumes that the bot plays 10 tables simultaneously, averaging 45 hands per hour at each table.  For comparison, it takes a human about 28 days to play 10,000 hands, assuming he plays an intense 30 hands an hour, 12 hours a day.  Although it’s not a large enough sample to yield a statistically significant win rate, it’s a sizable sample.
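
The arithmetic behind those estimates can be sketched in a few lines of Python.  The function name is made up for illustration; the table counts and hand rates are the ones assumed above.

```python
def hours_to_play(total_hands, tables, hands_per_hour_per_table):
    """Hours needed to play total_hands at the given pace."""
    return total_hands / (tables * hands_per_hour_per_table)

# Bot: 10 tables at 45 hands per hour each.
bot_hours = hours_to_play(10_000, tables=10, hands_per_hour_per_table=45)

# Human: 1 table at 30 hands per hour, 12 hours a day.
human_days = hours_to_play(10_000, tables=1, hands_per_hour_per_table=30) / 12

print(round(bot_hours, 1))   # 22.2
print(round(human_days, 1))  # 27.8
```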

The first couple of times that this happened to me, I couldn’t help leaning back and thinking that my troubles were over.  Instead of analyzing more data or writing more code, I’d relax.  Then I’d see this turn of fortune.

Fictional graph of big bets won or lost over time.

The downturn would give me a kick in the pants, and I’d dive back into my work with vigor.

After this pattern happened a couple times, I realized how much time I was wasting.  Frustrated by this inefficient cycle, I made a point of staying vigilant even when the results looked promising.  I learned that you have to work hard even when everything’s going smoothly.

This lesson is applicable to both investing and software development.  When your PnL is positive, you need to make an extra effort to stay sharp.  When you’re writing software, you can’t assume that your program is correct just because it runs without throwing errors.  Be suspicious and stay vigilant, even when things seem okay.

Comments (2)

Navigating New York’s Attitude toward Technologists

I just read Zed Shaw’s post where he announces that he found a job in San Francisco.  Congratulations to him.  In his post, I appreciated his concluding comparison of how companies in New York and San Fran perceive and approach technologists.  Zed says:

NYC prospects were looking for a badass/ninja/rockstar beta-male “techie” employee to make them rich creating lame applications for giant Finance/Fashion/Marketing companies.  SF prospects were looking for a partner to get rich with them creating great products for customers.

To a large extent, I agree.  New York is filled with companies that have non-technical core competencies.  These businesses treat technology as a necessary cost.  When they search for a rockstar ninja, their goal is to minimize cost.  The thinking is “that last tech we had was too slow and too sloppy, and his mistakes cost us money.  We need a superstar tech guy to come in here and do things right (not screw up).”

The dynamic is top-down.  The business wants a superstar who will take orders and work efficiently.  On the whole, this makes perfect sense.  As Judge Smails says, “the world needs ditch diggers, too.”

This system fails for superstar techies like Zed Shaw.  Techies in this class are often prima donnas in the same vein as star wide receivers in the NFL.  Instead of taking orders, they want to provide advice.  Instead of minimizing cost, they want to generate value.  Indeed, they want to be partners, not employees (even if the company pays well, as many New York firms do).

San Francisco, on the other hand, is a town filled with ninjas looking to band together.  Each one is a Steve Jobs looking for a Steve Wozniak to join them in the garage.  This is a natural and attractive situation, as evidenced by Zed’s move from his beloved LES.

New Yorkers do have hope.  The key is to pursue and promote every opportunity to increase revenue instead of decreasing costs.  When you see these opportunities, you need to explicitly point them out.  Then, if you’re given the green light, work extra hard to execute.  If you score a couple of wins, then you can take a partnership role (even in New York).

Comments (1)

Book Review: Nerds on Wall Street by David Leinweber

This weekend I read David Leinweber’s book Nerds on Wall Street: Math, Machines and Wired Markets.  While sitting at Radiance Tea, I read a book review in the Wall Street Journal.  Since I had previously enjoyed David’s chapter in How I Became a Quant, I paid $24 and instantly downloaded it to my Kindle.

David has a clear, concise, and humorous writing style.  And since I’m a (significantly more junior) nerd on Wall Street, I’m in his target audience.

My big takeaway is the sense of pragmatism and passion that oozes out of his stories.  It’s clear that technology makes him tick, and that he’s always right on top of the newest techniques.  And he’s been at it for years, so his stories all carry true gravitas.

Data mining while developing quantitative strategies is one principal theme that’s addressed in most of the book’s chapters.  Personally, I wish he had addressed this in greater detail.  It’s easy to dismiss egregious data mining that links butter production to S&P returns.  It’s also easy to say “withhold data from your training data.”  I also felt that when he discussed a specific example of using genetic programming, he essentially admitted that the strategy was data-mined, yet restrained by common sense.

I recommend this book.  It’s a quick and enjoyable read about an interesting topic.

SPY Closing Price Update

Just to round out my quick post on dirty financial data, I came into work today and saw that Thursday’s closing price for SPY is now correctly stated as 94.15.  Sometimes I half-wonder if somebody intentionally causes these mistakes just to put a stick in the wheel of quantitative backtests.

spy_hp_20090720

Dirty Financial Data: SPY Closing Price

Before joining finance, my naive assumption was that the market’s high stakes would necessitate accurate, high-quality data.  In particular, I expected frequently traded stocks and ETFs on public exchanges to have accurately quoted end-of-day prices.  In reality, even those data are noisy.

Yesterday’s (7/16/2009) closing price for SPY is one example.  The NYSE reported a composite closing price of 93.11 for SPY, but the intraday price graph makes it clear that 94.15 is much more accurate.  Here are two Bloomberg screens showing both the tabulated closing price and the intraday price plot.

spy_hp_20090717

spy_gip2_20090717

Over 200 million shares of SPY trade every business day, yet bad closing prices like this still pop up.

Pythonic Data Analysis with MaskedArray and Timeseries

In the financial world, most quants analyze time series data using languages such as Matlab, R, SAS, or Stata.  I’ve used those tools, but I’m much happier working with a more general purpose language.  Lately, I’ve been writing most of my code in Python.  Although people have implemented interfaces from Python to R and other numerical libraries, I prefer to avoid hopping in and out of Python.  Fortunately, Python, NumPy, and SciPy provide an expressive, flexible, and efficient platform for analyzing data.

Regardless of what programming platform you choose, the biggest challenge is wrangling real world data into the platform’s data structures so that you can take advantage of their high level operators.  When you’re wrangling, you’re typically confronted with a Procrustean bed that forces you to crudely fit your real world inputs to the pristine shape and view of your data structure.  This painful process unfortunately risks generating incorrect and misleading results.

In the case of python’s numpy, you need to fit your data into numpy arrays.  Recently I’ve needed to analyze a lot of real world time series data, so I’ve been exploring two important extensions to numpy: masked arrays and the scikits timeseries.  I want to share my experiences and show some code.

I’m coding in Python 2.5.2, using numpy 1.3.0.dev6370, scipy 0.7.0, and scikits.timeseries 0.67.0.dev-r1480.  I’ve run my code both on Windows and OS X 10.5.  A single file containing all of the example code is located on github here.  Everything that I’m writing related to this post is accessible via git://github.com/nodogbite/maskedarray_timeseries.git or by browsing here.

Real World Example

To ground this discussion, let’s consider a purely random dataset that approximates some simple real world daily financial data.  The function ‘generateFakeStockData’ pumps out roughly 8 years of daily return and trading volume data for a universe of 2000 tickers.  Using this data, we’ll calculate:

  • the volume-weighted returns for a given basket of stocks, and
  • the minimum, maximum, and average daily returns or trading volumes.

Starting at the top of the class hierarchy, consider the basic numpy array.  Fundamentally, a numpy array is a data structure that stores a multi-dimensional collection of homogeneous data.  Once data is in a numpy array, it’s easy to crunch it using either regular python code or optimized functions from the numpy and scipy libraries.

For our toy example, it’s tempting to tap into those capabilities by loading the data into numpy arrays.  We could put the daily returns into a two-dimensional array of floats, while putting the daily trading volume into an array of integers.  In both numpy arrays, the columns contain data for a particular ticker, and the rows contain data for a particular day.  Then, as the figure shows, we could compute a vector of total returns by writing ‘NP.prod(dailyReturns + 1, axis=0) - 1’.
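
Here is a minimal sketch of that layout and calculation.  The numbers are made up for illustration, and I import numpy as ‘np’ rather than the ‘NP’ alias used in the text.

```python
import numpy as np

# Rows are days, columns are tickers (illustrative values, not real data).
daily_returns = np.array([
    [0.01, -0.02],
    [0.03,  0.00],
    [-0.01, 0.05],
])

# Compound the daily returns down each column to get one total per ticker.
total_returns = np.prod(daily_returns + 1, axis=0) - 1
print(total_returns)  # roughly [0.0299, 0.029]
```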

numpyarray_stockdata1
Unfortunately, our data, and most real world data, aren’t perfectly aligned, and are interspersed with missing values.  These gaps aren’t easy to represent in the numpy array.  As a hack, we could insert a conspicuous flag value in place of missing values, and then derive a perfectly clean array or set of arrays every time we want to use a numpy operator.  This is both inefficient and error prone.  We won’t bother doing that with our example daily financial data.

Masked Arrays

The numpy.ma.MaskedArray, a subclass of the numpy array, was built for this situation.  Conceptually, a masked array is a numpy array coupled with a second array of booleans that has the same shape.  The first array is called the ‘data’, and the boolean array is called the ‘mask’.  The MaskedArray blanks out the data array value at every position where the mask array is ‘True’.

maskedarray_stockdata1
The MaskedArray redefines most of the numpy array’s functions so that it handles missing data just as you expect.  For example, sum and product simply ignore the masked entries.  Since MaskedArray is a subclass of ndarray, the switch is fairly seamless.  There are also a bunch of new capabilities, as well as some potential pitfalls, which I’ll point out.
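
A small sketch of the data-plus-mask idea, using only numpy.ma.  The ‘-999.0’ placeholder is an arbitrary value chosen for illustration; because it sits under the mask, it never affects the results.

```python
import numpy as np

data = np.array([1.0, 2.0, -999.0, 4.0])       # -999.0 stands in for a missing value
mask = np.array([False, False, True, False])   # True means "ignore this entry"
arr = np.ma.MaskedArray(data, mask=mask)

print(arr.sum())    # 7.0, the masked entry is skipped
print(arr.prod())   # 8.0
print(arr.count())  # 3, the number of unmasked entries
```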

Returning to our example, we write a function that loads data from ‘generateFakeStockData’ into a couple of masked arrays.  Let’s name the function ‘loadDataIntoMaskedArrays’.  This function is fairly representative of most functions that I’ve written to build masked arrays when I’m processing CSV files.  (Usually I’m a bit more sophisticated so that I can process any number of columns, instead of hard-coding information about the columns, as I’ve done here.)  As we read each datum, we record both its value and a boolean that will become its mask.  Finally, we pass both the data and the mask to the MaskedArray constructor.
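
The real ‘loadDataIntoMaskedArrays’ is in the github repository; what follows is only a minimal sketch of the same pattern.  The function name, the row layout (date, ticker, return, volume), and the use of None for a missing value are all assumptions made for illustration.

```python
import numpy as np

def load_rows_into_masked_arrays(rows, dates, tickers):
    """Build (returns, volumes) masked arrays: rows are days, columns are tickers.

    rows is an iterable of (date, ticker, daily_return, volume) tuples,
    where daily_return or volume may be None when the value is missing.
    """
    date_index = {d: i for i, d in enumerate(dates)}
    ticker_index = {t: j for j, t in enumerate(tickers)}
    shape = (len(dates), len(tickers))

    returns = np.zeros(shape, dtype=float)
    volumes = np.zeros(shape, dtype=int)
    # Start fully masked, then unmask each cell as its value is read.
    returns_mask = np.ones(shape, dtype=bool)
    volumes_mask = np.ones(shape, dtype=bool)

    for date, ticker, daily_return, volume in rows:
        i, j = date_index[date], ticker_index[ticker]
        if daily_return is not None:
            returns[i, j] = daily_return
            returns_mask[i, j] = False
        if volume is not None:
            volumes[i, j] = volume
            volumes_mask[i, j] = False

    return (np.ma.MaskedArray(returns, mask=returns_mask),
            np.ma.MaskedArray(volumes, mask=volumes_mask))

# Made-up rows: ticker BBB has no data on the 16th, AAA none on the 17th.
rows = [
    ("2009-07-15", "AAA", 0.01, 100),
    ("2009-07-15", "BBB", -0.02, 200),
    ("2009-07-16", "AAA", None, 150),
    ("2009-07-17", "BBB", 0.03, None),
]
dates = ["2009-07-15", "2009-07-16", "2009-07-17"]
tickers = ["AAA", "BBB"]
daily_returns, daily_volume = load_rows_into_masked_arrays(rows, dates, tickers)
print(daily_returns)
print(daily_volume)
```

Starting from a fully masked grid means that any date and ticker pair that never appears in the input stays masked without extra bookkeeping.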

Once the data is in the MaskedArray, we can use functions from its class and module to produce results that appropriately skip the missing data.  Similar to the numpy array, we write ‘MA.product(dailyReturns + 1, axis=0) - 1’ to generate a vector of total returns for every ticker.  Here we see a minor blemish in the API: the numpy package has both a ‘prod’ and ‘product’ function, but the numpy.ma package only has a ‘product’ function.
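
As a sketch of the calculations we set out to do, here are total returns, per-ticker summary statistics, and a volume-weighted basket return on a tiny made-up grid.  I use the array’s ‘prod’ method to sidestep the naming difference between modules, and the joint-mask step is my own assumption about how to keep returns and volumes consistent.

```python
import numpy as np

# Rows are days, columns are tickers; the second ticker is missing on day 2.
missing = [[False, False], [False, True], [False, False]]
daily_returns = np.ma.MaskedArray(
    [[0.01, -0.02], [0.03, 0.0], [-0.01, 0.05]], mask=missing)
daily_volume = np.ma.MaskedArray(
    [[100, 300], [200, 0], [100, 100]], mask=missing)

# Total return per ticker, skipping masked days.
total_returns = (daily_returns + 1).prod(axis=0) - 1

# Minimum, maximum, and average daily return per ticker.
lowest = daily_returns.min(axis=0)
highest = daily_returns.max(axis=0)
average = daily_returns.mean(axis=0)

# Volume-weighted return of the basket for each day.  Mask any cell that is
# missing in either array so numerator and denominator use the same cells.
joint_mask = np.ma.getmaskarray(daily_returns) | np.ma.getmaskarray(daily_volume)
r = np.ma.masked_where(joint_mask, daily_returns)
w = np.ma.masked_where(joint_mask, daily_volume)
volume_weighted = (r * w).sum(axis=1) / w.sum(axis=1)

print(total_returns)
print(lowest, highest, average)
print(volume_weighted)
```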

Timeseries

So far, we’ve only used the dates as a groupby key.  The ‘scikits.timeseries’ package enables us to conveniently couple a list of dates to our masked array.  Doing so will help us align data for different tickers, as well as produce subsets for a given date range.  There’s some nice documentation for the timeseries module here, but let’s write some timeseries code for our example.

Let’s call the function ‘loadDataIntoTimeseries’, which uses a helper function ‘makeTimeseriesGrid’.  This version of the loading function doesn’t have the bit of logic to fill in masked values for tickers that are missing from the initial dates.  Instead, we simply keep lists of observed dates, datums, and masks for every ticker.  Then we use functions from the timeseries package to properly align the rows and fill in missing values.

Creating a list of timeseries Date objects is the first step when constructing a timeseries object.  Every Date object has a frequency.  In this case we use the ‘B’ frequency, which stands for business days, or Monday through Friday.  Then, using that list of Date objects, you must create a DateArray object, which also has a frequency that’s equal to the frequency of the dates.  The time_series constructor also takes in a MaskedArray that has the same length as the DateArray.

The Date object has some useful functionality.  For example, when you add 1 to a Friday Date with business frequency, you get the subsequent Monday.  One feature that’s potentially harmful is that if you construct a business frequency Date using a weekend, you’ll get back a Date for the subsequent Monday.  I think a thrown exception would be more pythonic.
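
To make the two behaviors concrete without depending on scikits.timeseries, here is a sketch using numpy’s own business-day helpers.  This is a different API from the Date object described above; it just happens to offer both the silent roll-forward and the exception.

```python
import numpy as np

friday = np.datetime64("2009-07-17")
saturday = np.datetime64("2009-07-18")

# One business day after a Friday is the following Monday.
print(np.busday_offset(friday, 1))  # 2009-07-20

# Rolling a weekend date forward silently lands on Monday...
rolled = np.busday_offset(saturday, 0, roll="forward")
print(rolled)  # 2009-07-20

# ...while the default (roll="raise") rejects the weekend date.
try:
    np.busday_offset(saturday, 0)
    raised = False
except ValueError as exc:
    raised = True
    print("rejected:", exc)
```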

Let’s look at the function ‘makeTimeseriesGrid’.  Here we build up a timeseries for every ticker, align them to a common date list, and finally stack all of the data together into a single two-dimensional grid.  Note that I call ‘fill_missing_dates’ on each ticker timeseries.  This fills in all the missing dates in the DateArray and also inserts masked rows into the timeseries data at the corresponding locations.  I wish this were the default behavior because it leverages the principal point of MaskedArrays.  I think calling it is the best practice.
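
The idea behind filling in missing dates can be sketched in plain numpy.  This is not the scikits.timeseries implementation; the function name is hypothetical, and the sketch assumes that every observed date is itself a business day.

```python
import numpy as np

def fill_missing_business_days(dates, values):
    """Return (grid, series) covering every business day from first to last date.

    series is a masked array that is masked wherever no observation exists.
    """
    dates = np.asarray(dates, dtype="datetime64[D]")
    values = np.asarray(values, dtype=float)

    # Every calendar day in the span, then keep Monday through Friday only.
    all_days = np.arange(dates.min(), dates.max() + np.timedelta64(1, "D"))
    grid = all_days[np.is_busday(all_days)]

    # Start fully masked and unmask the positions we actually observed.
    data = np.zeros(len(grid))
    mask = np.ones(len(grid), dtype=bool)
    positions = np.searchsorted(grid, dates)
    data[positions] = values
    mask[positions] = False
    return grid, np.ma.MaskedArray(data, mask=mask)

# Wednesday, Friday, and the following Tuesday (made-up returns).
grid, series = fill_missing_business_days(
    ["2009-07-15", "2009-07-17", "2009-07-21"], [0.01, -0.02, 0.03])
print(grid)
print(series)
```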

The individual timeseries are not necessarily aligned on the same date list.  The timeseries package contains a function ‘aligned’, which behaves a bit like ‘fill_missing_dates’, to create a list of timeseries objects that contain the exact same dates.  Its signature is a bit awkward because it takes its input as a variable length argument list.
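
Here is a rough numpy sketch of what aligning several series to a common date list involves.  It is not the package’s ‘aligned’ function; the name ‘align’, the (dates, values) pair representation, and the choice to use the union of observed dates as the common list are all assumptions.  I kept the variable length argument list to mirror the signature described above.

```python
import numpy as np

def align(*series):
    """Align (dates, values) pairs on the union of their dates.

    Returns (grid, aligned), where aligned is a list of masked arrays that
    all have the same length as grid.
    """
    all_dates = [np.asarray(dates, dtype="datetime64[D]") for dates, _ in series]
    grid = np.unique(np.concatenate(all_dates))  # sorted union of dates

    aligned = []
    for dates, (_, values) in zip(all_dates, series):
        data = np.zeros(len(grid))
        mask = np.ones(len(grid), dtype=bool)
        positions = np.searchsorted(grid, dates)
        data[positions] = values
        mask[positions] = False
        aligned.append(np.ma.MaskedArray(data, mask=mask))
    return grid, aligned

# Two made-up tickers that only overlap on the 16th.
aaa = (["2009-07-15", "2009-07-16"], [0.01, 0.02])
bbb = (["2009-07-16", "2009-07-17"], [0.03, 0.04])
grid, (aaa_aligned, bbb_aligned) = align(aaa, bbb)
print(grid)
print(aaa_aligned)
print(bbb_aligned)
```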

Finally, we use the numpy.ma.column_stack function to build a two-dimensional array.  Note that we access the ‘series’ property of each timeseries object, which returns its masked array.  The timeseries object also has a ‘data’ property which returns a plain, un-masked numpy array.  I wish ‘data’ returned the masked array because ‘series’ is a confusing name and I think a masked array should be the default.
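
The stacking step, and the reason the masked view matters, can be shown with plain masked arrays (no timeseries objects involved).  The values are made up, and the 0.0 under the mask is just a filler.

```python
import numpy as np

# Two aligned series of equal length; the second is missing its first day.
aaa = np.ma.MaskedArray([0.01, 0.02, -0.01], mask=[False, False, False])
bbb = np.ma.MaskedArray([0.0, 0.03, 0.04], mask=[True, False, False])

# One column per ticker, one row per day, masks preserved.
grid = np.ma.column_stack([aaa, bbb])
print(grid.shape)  # (3, 2)

# The masked array skips the missing day when averaging...
masked_mean = grid.mean(axis=0)
# ...while the plain underlying data quietly includes the filler value.
plain_mean = grid.data.mean(axis=0)
print(masked_mean)
print(plain_mean)
```

The second ticker’s average differs between the two views, which is the kind of quiet error that reaching for the un-masked data can cause.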

To Be Continued

This post is pushing “too long didn’t read” length.  I will defer benchmarks and additional examples of using the constructed timeseries object to a subsequent post.

Comments (6)

Scatter Map of Madoff’s Victims

When I learned that the court had made a list of Madoff’s victims available, I immediately wanted to graph it.  I first forwarded the PDF to one of my best friends.  Within a couple of hours he had converted it to text, which he streamed through Yahoo Pipes to generate KML suitable for viewing with Google Maps.  The pipes filtered out duplicates and did address discovery and geolocation.  Unfortunately, Google Maps can’t display more than 80 markers at once, which is far less than the thousands of people whom Madoff ripped off.  He eventually did pull off a single view visualization, but it ran like a dog.

So last night I took a different approach.  I extracted all of the zip codes from the list (ignoring the international investors robbed by Madoff).  Then I used http://geocoder.us to map each zip code to a latitude and longitude.  Next I wrote a short script in NodeBox to draw a projection of the zip codes on a canvas.  Each point is color coded to indicate the number of times that zip code occurs in the court document.  I was helped a lot by code and ideas in Ben Fry’s Visualizing Data book, which I have access to via Safari Bookshelf.  In particular, I stole his function for projecting out the latitude and longitude numbers.
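
The counting and projection steps can be sketched in plain Python.  This is not the NodeBox script and not Ben Fry’s projection function; it is a simple linear mapping that I’m assuming for illustration, and the zip labels, coordinates, and bounding box are placeholders rather than data from the court document.

```python
from collections import Counter

def project(lat, lon, bounds, width, height):
    """Linearly map a latitude/longitude pair onto canvas pixel coordinates."""
    min_lat, max_lat, min_lon, max_lon = bounds
    x = (lon - min_lon) / (max_lon - min_lon) * width
    # Canvas y grows downward, so northern latitudes go near the top.
    y = (max_lat - lat) / (max_lat - min_lat) * height
    return x, y

# Placeholder zip labels as they might be extracted from the list.
extracted_zips = ["A", "A", "B", "C", "A"]
counts = Counter(extracted_zips)

# Placeholder geocoding results: label -> (latitude, longitude).
zip_to_latlon = {"A": (45.0, -85.0), "B": (35.0, -75.0), "C": (25.0, -65.0)}

bounds = (25.0, 45.0, -85.0, -65.0)  # min_lat, max_lat, min_lon, max_lon
points = {}
for label, (lat, lon) in zip_to_latlon.items():
    x, y = project(lat, lon, bounds, width=400, height=400)
    # Scale the count to a 0..1 intensity that a drawing layer could color by.
    intensity = counts[label] / max(counts.values())
    points[label] = (x, y, intensity)

print(points)
```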

Here are the resulting images.  The perfectionist in me would love to tinker with this for hours.

scatter map of Madoff's victims in the NYC Metro area

scatter map of Madoff's victims in NYC Metro area

scatter map of Madoff's victims in New England

scatter map of Madoff's victims in Florida

Comments (6)

Open Source CDS Pricing

I just read an ISDA press release saying that the J.P. Morgan CDS Analytics Engine will soon be open source.  I hope it’s true because I’d love to feast my eyes on that code base.

For the uninitiated, CDS stands for Credit Default Swap.  Put simply, a CDS is an insurance contract between two parties that references some financial contract, such as a bond.  The party selling insurance receives payments from the buyer every quarter.  If the referenced entity (bond) defaults, the quarterly payments immediately stop, and the insurance seller pays out a large lump sum to cover any losses.
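
To make those mechanics concrete, here is a deliberately simplified sketch of the cash flows from the insurance seller’s point of view.  It is not a pricing model: it ignores discounting and accrued premium, and the function name, the 2% annual spread, the 40% recovery rate, and the five-year term are all assumptions chosen for illustration.

```python
def seller_cash_flows(notional, annual_spread, default_quarter=None,
                      total_quarters=20, recovery_rate=0.4):
    """List the seller's quarterly cash flows on a simplified CDS.

    The seller collects a premium each quarter.  If the referenced entity
    defaults in default_quarter, premiums stop and the seller pays the loss.
    """
    quarterly_premium = notional * annual_spread / 4
    flows = []
    for quarter in range(1, total_quarters + 1):
        if default_quarter is not None and quarter == default_quarter:
            # Lump-sum payout covering the loss; no further premiums.
            flows.append(-notional * (1 - recovery_rate))
            break
        flows.append(quarterly_premium)
    return flows

# $10 million notional with a default in the third quarter.
with_default = seller_cash_flows(10_000_000, 0.02, default_quarter=3)
no_default = seller_cash_flows(10_000_000, 0.02)
print(with_default)     # two premiums, then one large negative payout
print(sum(no_default))  # total premium collected if nothing defaults
```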

CDS contracts have a value that fluctuates with variables such as time, interest rates, and the perceived risk of default on the referenced entity.  As a result, there is a large, liquid market for these derivative contracts.  A single contract is typically written to insure a notional of $10, $20, $50, or $100 million.  Despite dealing with sums that large (or larger), there is no reference, open source pricing mechanism.

Said another way, if a bank or hedge fund wants to value one of its CDS positions, there’s no reference, open source pricing algorithm.  Instead, each desk might use an expensive, commercial library, or the CDS pricing screen on their (expensive) Bloomberg Terminal, or their own proprietary pricing algorithm.  The standard pricing model within the Bloomberg Terminal is their “J.P. Morgan model,” the details of which are only described at a high level in a one page document.  You can’t access the code.

At a previous job I wrote a proprietary CDS pricing algorithm.  I was shocked at how subtle variations in the algorithm produce significantly different results.  Furthermore, it’s impossible to tie out your results exactly with a standard such as Bloomberg’s pricer.  And I wasn’t alone.  I spoke to quants at other banks and to vendors who sell their own pricers, and none of them were able to exactly match Bloomberg’s pricer!  I had a conspiracy theory that Bloomberg made their pricing output cryptographically secure so that traders would be forced to pay for their CDS pricer.

So if this J.P. Morgan code has any relation to the elusive code within the Bloomberg CDS screen, I can’t wait to see it.

WSJ: David Einhorn

Yesterday I read a friendly feature article about David Einhorn in the WSJ. The article, titled “A New Face Of Hedge Funds Isn’t Shy”, runs alongside a color photo of Einhorn wearing a lucky sweatshirt while he stands among poker tables during a break at the 2006 WSOP. The authors touch upon statements that he’s made as an activist investor, some of his charitable actions, and his avant-garde working style.

Einhorn has recently ruffled some feathers by criticizing regulatory organizations such as the SEC. He makes the point that regulators don’t have sufficient incentives to properly scrutinize corporations. For example, he’s quoted as saying “The SEC is run by a corporate advocate, not an investor advocate, so investors are getting a false sense of security.”

I learned a little (emphasis on little) about corporate oversight when I prepared for the Series 7 and Series 63. My studies left me with the impression that regulations are largely motivated by periodic financial crises, when Congress is pushed to take legislative action by their screaming constituents. This methodology seems flawed and error-prone. We’re seeing it right now with calls to take action in this credit crisis. I’d like to learn more about the details of corporate oversight because it touches on the intersection of finance and politics.

I also liked the article’s description of Einhorn’s working style. It says that he takes afternoon naps and is usually home in time to have dinner with his family. I envy that working style because I think it engenders a healthy, relaxed style of thinking. I personally know that winning poker players bring that attitude to the table. My guess is that Einhorn’s working style has played an important role in his success on Wall Street and at the poker table.
