Posts Tagged software

Review of Programming Pearls by Jon Bentley

I just finished reading Programming Pearls by Jon Bentley.  This is a collection of his “Programming Pearls” columns for the Communications of the ACM magazine.  Bentley shares stories about implementing fundamental data structures and algorithms.

My biggest takeaway was Bentley’s authoritative yet humble tone.  This stands in contrast to the high bravado and chest thumping that defines so much technical communication in the current realm of blogs and tweets.  These days it seems that every writer is a gladiator who’s foremost goal is to demonstrate technical superiority.  It’s king of the cage for nerds.

Bentley, a researcher at Xerox PARC, and a CMU professor who advised Charles Leiserson, is certainly a king of the technical cage.  Even so, he makes you feel like he’s a coworker discussing his current project during lunch.  My mind  was generating imaginary suggestions for implementation techniques instead of simply digesting his word as gospel.

The columns typically highlight personalized implementation stories which often involve now antiquated machines such as the PDP-11.  One personal anecdote that I really appreciated was Ed Reingold’s solution to a toy problem of searching for a missing integer.  Bentley mentions that Reingold uses the question on one of his quizzes.  Prof Reingold was my undergraduate advisor, and these are exactly the type of questions that he asked on his quizzes, which I always thoroughly enjoyed.

Comments (1)

HP48G Emulator for OSX

When I was in high school I had a HP48GX calculator.  I cherished that calculator.  I wrote programs for it.  I shared those programs with a couple fellow nerds at school via the IrDA transport.  I sent away for programs that arrived by mail on a floppy disk, which I downloaded to the calculator via a serial connection.

I’ve heard veteran engineers reminisce about old machines like the PDP-11.  The HP48G was my PDP-11.

From time to time I get nostalgic about that calculator, and I search for an emulator.  I’m not sure why.

I just did one of those searches and was delighted to find an emulator with a codebase updated as recently as last month!  If you download the app and then load the JEMAC.KML file you’re presented with a full color, functional HP48GX!  The memories are thick.

Leave a Comment

Efficient Calculations on Named Columns of Realworld Data

In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name.  Every named column contains values of the same type and every row corresponds to related datums.  I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type.  Database tables have named and typed columns, but they typically only support a limited set of operations.  Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.

Stata is one commercial program that supports a variety of operations on data rectangles.  In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk.  Each column has a name and a data type.  The rectangles also support missing values, which is essential when working with real world data.

Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool.  Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.

I want to perform high level rectangular operations within a more flexible language like Python.  I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package.  Numpy supports named, typed columns via recarray, but recarray is generally incomplete.  For example, there’s no built in way to merge (join) two recarrays.  More importantly, recarray does not support masked values!

I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations.  My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays.  I can’t share the code because I wrote it at work.

I’m happy with my implementation, but I wish it were faster.  Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward.  Both of these do too much work outside of optimized space of numpy.  I’m thinking of rewriting these functions in C or Cython.

I’d love to hear comments about this general data structure.  How are other people dealing with rectangles of real world data?

Comments (7)

Simple Backtesting of Football Pickem Strategy

I was curious to see how my simple strategy for the weighted fantasy football pickem would do in previous years.  I found an archive of NFL point spreads, over under lines, and game results on goldsheet.com.  I don’t know how accurate the data are, but I wrote a little Python script to parse it.  The formatting of the data is pretty rough, so I’ve put a CSV version up on Google Docs.

Here’s a tabular summary of the results.  The key column is Percent, which shows the percentage of available points (MaxPosib) that my strategy got in each year.

    Year      Min      Max      Avg    Total   MaxPosib    Percent
==================================================================
    1993       27      103    60.94     1097       1466      74.83
    1994       46       95    64.82     1102       1500      73.47
    1995       50      114    81.18     1380       1747      78.99
    1996       51      119    82.82     1408       1748      80.55
    1997       52      114    82.24     1398       1745      80.11
    1998       56      120    86.94     1478       1734      85.24
    1999       65      115    91.06     1548       1959      79.02
    2000       62      113    90.53     1539       1959      78.56
    2001       43      117    90.29     1535       1960      78.32
    2002       62      136    95.59     1625       2088      77.83
    2003       72      134    98.53     1675       2088      80.22
    2004       67      129    95.47     1623       2088      77.73
    2005       77      127   101.47     1725       2088      82.61
    2006       65      134    89.59     1523       2091      72.84
    2007       56      136   103.41     1758       2091      84.07
------------------------------------------------------------------
 average    56.73   120.40    87.66  1494.27    1890.13      78.96
std.dev.    12.20    11.84    11.67   189.42     211.33       3.39
     min    27.00    95.00    60.94  1097.00    1466.00      72.84
     max    77.00   136.00   103.41  1758.00    2091.00      85.24

So I expect to score 79% of the total points available this year if I use my strategy of assigning weights to the favored team, ordered by point spread.

The next step is obviously to try some other strategies and see what the results are.  First I want to see how poorly a completely random strategy does.

Comments (1)

Automated Fantasy Pro Football Pickem Weights

I’m participating in the Yahoo! Sports Fantasy Pro Football Pickem with some family and friends.   I don’t regularly play fantasy sports, but I was coaxed into donating money to the pool during a family trip this summer.  I blame the beer and port.

For the uninitiated, each week you must pick a winner in all 16 games, assigning a confidence weight from 1 to 16 to each game.  If your pick wins, then you are awarded points equal to the confidence weight that you assigned to that game.

Now that my money’s in, I want to make a solid showing.  My strategy is to leverage the gambling lines, mapping the point spreads and over under lines to my picks and weights.

I threw together a little Python script to parse the Yahoo odds web page.  For each game, the script averages the point spreads and over under lines.  The favored team is selected, and the picks are ranked according to point spread, with wider spreads getting a larger weight.  Ties on point spread are broken by the over under line, with a larger over under line mapping to a larger weight.

Now I’m wondering how to improve this simple algorithm.  I might break ties by giving a smaller over under a higher weight.  The reasoning would be that smaller over under lines will have less variance.  Surely there are other ways to improve the rankings.

Here’s the table that my little script prints out.

weight | winner                         loser                     spread    o/u
================================================================================
    16 | Minnesota Vikings         over Detroit Lions              -9.83  45.17
    15 | Washington Redskins       over St. Louis Rams             -9.58  37.00
    14 | Green Bay Packers         over Cincinnati Bengals         -9.08  42.00
    13 | Tennessee Titans          over Houston Texans             -6.50  40.92
    12 | Atlanta Falcons           over Carolina Panthers          -6.08  42.50
    11 | Buffalo Bills             over Tampa Bay Buccaneers       -4.42  42.00
    10 | New England Patriots      over New York Jets              -3.42  46.17
     9 | Jacksonville Jaguars      over Arizona Cardinals          -3.08  42.50
     8 | Indianapolis Colts        over Miami Dolphins             -3.08  42.17
     7 | San Diego Chargers        over Baltimore Ravens           -3.00  40.33
     6 | Denver Broncos            over Cleveland Browns           -3.00  38.83
     5 | Kansas City Chiefs        over Oakland Raiders            -3.00  38.50
     4 | Pittsburgh Steelers       over Chicago Bears              -2.92  37.50
     3 | Dallas Cowboys            over New York Giants            -2.75  45.00
     2 | San Francisco 49ers       over Seattle Seahawks           -1.10  39.58
     1 | New Orleans Saints        over Philadelphia Eagles        -1.00  46.08

Leave a Comment

Analysis of Google Code Jam Qualifier Results

I wrote a quick Python script to pull down all of the results from the qualifying rounds of Google Code Jam from 2008 and 2009.  I have put the rank and time of each submitted solution for all participants in these two Google Spreadsheets:

I also put a collection of summary statistics in this Google Spreadsheet.

Here are some conclusions drawn from these data:

  1. The number of participants increased by 1,449 people in 2009, a 20% increase.
  2. There were 2,572 people who participated both years.  This means 36% of the 2008 participants came back for more.
  3. The 2008 Qualifier had more difficult problems.  After normalizing the point totals (2008 had a max total of 75 points but 2009 had a max of 99 points), the average score was 15% higher in 2009.
  4. Problem C in 2008 was extremely hard.  Only 14% of the participants solved Problem C with the small data set, and only 9% solved Problem C with the large data set.
  5. The large data set for Problem C in 2009 was hard for many people.  Only 36% of people solved it.  Furthermore, of the people who solved the small data set, only 57% of those were able to solve the large data set.
  6. People worked roughly the same amount of time both years.  The 2009 times are all slightly larger, but the round was also extended by 2 hours because of technical problems early in the round.
  7. On average, for each participant, there is about a 3.5 hour gap between the submission time of the first solution and the submission time of the last solution.  Of course some people (like me) probably choose to tackle the problems whenever they found free moments during their day.

If people are interested, I’ll post my code and all the data somewhere.

Leave a Comment

Google Code Jam 2009 Qualifier

I finished all 3 problem in this year’s Google Code Jam.  I found them to be much easier than last year’s qualifier.  (Although last year a friend told me about the qualifier less than two hours before it ended, so I was rushed).

Last year I wrote my solutions in Haskell because I was learning the language.  This year I wasn’t so adventurous; I wrote my solutions using Python.

For the Alien Language challenge, I encoded the lexicon using a tree built with python dictionaries.  I imagine this could be done using a regex, but I haven’t thought that through.

For the Watersheds challenge, I simply built two 2d arrays: one to hold the topology and the other to hold the labels.  From each coordinate on the map, I followed the flow down hill until I reached a sink or reached a labeled coordinate.  This memoization ensures that you only visit each node once.

For the Welcome to Code Jam challenge, I memoized a mapping from (string_index, substring) to the number of times that substring can be completed starting at the index.  Again, this was straightforward.

I’m looking forward to seeing how other people solved each challenge.  I wonder how 16 year old Neal Wu made such quick work of the problems, solving all three in under 26 minutes.  That’s amazing.

Comments (6)

My Pokerbot’s Architecture

Back in 2004, I spent most of my time building a poker bot to play low stakes limit texas holdem.  (A poker bot is a program that automatically plays poker against humans on one of the major internet poker sites).  The project consumed me.  It was both challenging and exciting.  I successfully ran the bot for a few years before shutting it down when I took my first job on Wall Street.

I worked alone on the poker bot.  I designed and implemented the poker client interface, the AI, and monitoring framework.  Since then, I’ve enjoyed reading a few blog posts detailing the efforts and celebrations of other poker bot creators.  Just today I read this one.  I’d like to join the party and share some details about how I built my bot.

PokerBotArchitecture

When I started the project, I wasn’t sure how to interface with the poker client.  The poker sites don’t provide an API, so I either had to reverse engineer the network protocol or build a screen scraper.  I knew a lot about hacking network protocols, and I had never built a screen scraper.  (I also noticed that one of the sites used an open source SSL library that I could swap out).

Despite these advantages, I wanted to build a screen scraper because:

  1. It’s harder for the poker client to detect a screen scraper.  I didn’t want exhibit an identifying communication fingerprint.
  2. I didn’t want to reverse engineer multiple network protocols.  It seemed easier to get a screen scraper working on different poker clients.

I wrote the scraper using C#.  One weakness of my scraper was that the windows all had to be a certain size, and they could not overlap.  I worked around this by building a simple tiling window manager with an AutoHotKey script.

I routed information from the scraper through a piece of software that I called the poker gateway.  This gateway directed information to a hand history database, poker AI, lobby AI, and control center.  All of these components ran on a separate linux box, and I used Ruby to develop them all.

This architecture was robust and flexible.  As a result, I could focus most of my energy and thought on building the AI.

I’d enjoy hearing from other poker bot creators.  It would be fun to swap war stories.

Leave a Comment

I don’t use debuggers

I just read this post, which says “Just about every developer uses a debugger, at least occasionally.”  I don’t know if I’m inadequate or working on a peculiar class of problems, but I haven’t used a debugger in roughly 10 years.

I have used a debugger twice.  Once I was chasing down a bug in a boutique embedded systems compiler.  The compiler had a bug, and staring at the C code didn’t cut it.  I also used the Perl debugger while writing my MS thesis (back when I wrote oodles of Perl code).  I used it because a friend showed it to me, and it seemed cool.

My primary debugging technique is code examination.  I simply look carefully at my code and think about it.  I also write unit tests.

I’m not fundementally opposed to debuggers.  It’s just that I haven’t used one in the past 10 years.

Leave a Comment

Automate Everything for Peace of Mind

Anybody who has written a computer program appreciates the warm glow of having a machine do your work.  That pleasure often masks the burden of holding the program’s hand — kicking it off and checking that it ran correctly.

At least for me, the natural tendency is to stop short of complete automation.  Instead, if I’m being lazy, I’ll start the program manually and catch errors by noticing odd behavior.  Perhaps I’ll see a glaring failure message.  Or maybe I’ll just notice that the program ran too quickly.

I do this because I’ve seen trusty programs fail in myriad ways.  I’ve seen malformed or truncated inputs.  I’ve seen choppy or slow network connections.  I’ve seen disks fill up.  I’ve seen memory hogs.

When an error like that pops up when you’re sitting right there at the command prompt, it’s typically easy to diagnose and move on.  When run naively from a scheduler, you might not notice a problem until disaster strikes, sparking a late night emergency call, or worse.  Therefore, I have to fight my natural tendency to hand-hold my programs.

Through experience, I’ve learned to go the extra mile and properly automate my programs.  These are some of my best practices — I hope other people can point out ones that I’ve overlooked.

  1. Every program must use a proper logger.  The first thing I do when I write *any* program is to make sure it’s set up to log using that language’s preferred logging utility.  This allows you to properly investigate what happened every time the program runs.
  2. Save the logs from each run.  It’s possible that an error will go unnoticed for a while, and there’s nothing worse than saying “unfortunately the log file was deleted.”
  3. Validate the program’s inputs.  I’ve been burned before by people who will “always save their Excel spreadsheet in the same way,” “websites that never change,” and data providers that “guarantee their data are validated before given to you.”
  4. Sanity check the outputs.  If you always expect your program to generate some new data, then compare the new results to the old results, and verify that there are actually new results.
  5. Catch errors and retry.  Computers are deterministic, but real world resources are unpredictable. I’ve seen one-off errors too often.  Perhaps I call an external program that fails for an unknown reason.  Or a file copy fails, but when I retry, it works fine.
  6. Have an extremely robust email notification system.  When there’s an error, you should be notified.  One trick is that a failure in the email system should never cause the program to fail.  Perfect this library and reuse it.
  7. Try your program every time you change it.  This is easy to say, but it’s tempting to say “oh, but I’m just adding one line” and not run your unit tests.  If you change it, at the very least, give it a whirl.
  8. Run the program from multiple machines.  Computers fail at all the wrong times.  At the very least, design your program so that only one instance publishes a final result.  Then if that computer fails, you can run a quick command on the other machine to publish whatever that program creates.

Leave a Comment

Older Posts »
Follow

Get every new post delivered to your Inbox.