Review of Programming Pearls by Jon Bentley

I just finished reading Programming Pearls by Jon Bentley.  This is a collection of his “Programming Pearls” columns for the Communications of the ACM magazine.  Bentley shares stories about implementing fundamental data structures and algorithms.

My biggest takeaway was Bentley’s authoritative yet humble tone.  This stands in contrast to the high bravado and chest thumping that defines so much technical communication in the current realm of blogs and tweets.  These days it seems that every writer is a gladiator who’s foremost goal is to demonstrate technical superiority.  It’s king of the cage for nerds.

Bentley, a researcher at Xerox PARC, and a CMU professor who advised Charles Leiserson, is certainly a king of the technical cage.  Even so, he makes you feel like he’s a coworker discussing his current project during lunch.  My mind  was generating imaginary suggestions for implementation techniques instead of simply digesting his word as gospel.

The columns typically highlight personalized implementation stories which often involve now antiquated machines such as the PDP-11.  One personal anecdote that I really appreciated was Ed Reingold’s solution to a toy problem of searching for a missing integer.  Bentley mentions that Reingold uses the question on one of his quizzes.  Prof Reingold was my undergraduate advisor, and these are exactly the type of questions that he asked on his quizzes, which I always thoroughly enjoyed.

Leave a Comment

HP48G Emulator for OSX

When I was in high school I had a HP48GX calculator.  I cherished that calculator.  I wrote programs for it.  I shared those programs with a couple fellow nerds at school via the IrDA transport.  I sent away for programs that arrived by mail on a floppy disk, which I downloaded to the calculator via a serial connection.

I’ve heard veteran engineers reminisce about old machines like the PDP-11.  The HP48G was my PDP-11.

From time to time I get nostalgic about that calculator, and I search for an emulator.  I’m not sure why.

I just did one of those searches and was delighted to find an emulator with a codebase updated as recently as last month!  If you download the app and then load the JEMAC.KML file you’re presented with a full color, functional HP48GX!  The memories are thick.

Leave a Comment

Efficient Calculations on Named Columns of Realworld Data

In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name.  Every named column contains values of the same type and every row corresponds to related datums.  I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type.  Database tables have named and typed columns, but they typically only support a limited set of operations.  Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.

Stata is one commercial program that supports a variety of operations on data rectangles.  In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk.  Each column has a name and a data type.  The rectangles also support missing values, which is essential when working with real world data.

Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool.  Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.

I want to perform high level rectangular operations within a more flexible language like Python.  I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package.  Numpy supports named, typed columns via recarray, but recarray is generally incomplete.  For example, there’s no built in way to merge (join) two recarrays.  More importantly, recarray does not support masked values!

I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations.  My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays.  I can’t share the code because I wrote it at work.

I’m happy with my implementation, but I wish it were faster.  Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward.  Both of these do too much work outside of optimized space of numpy.  I’m thinking of rewriting these functions in C or Cython.

I’d love to hear comments about this general data structure.  How are other people dealing with rectangles of real world data?

Comments (5)

Fantasy Football Pickem Strategy Results after Week 4

After Brett Favre and the Vikings polished off the Packers last night, I found myself in 7th place out of 50 in my NFL Pickem pool.  As the following box plot shows, choosing weights using Vegas point spreads put me above the median in 3/4 weeks and in the top quartile 2/4 weeks.  In the second week, when I was below the median, I was still within the interquartile range.

weeklyperformance_boxplot

Here’s some of the data in the above box plot.

Week  |    Mean  StdDev     IQR     Min      lq  Median      uq     Max
=======================================================================
   1  |   114.5    8.22      11      95     109     115     120     133
   2  |    77.1   10.38      13      53      70      78      83     107
   3  |   104.6   15.76      14      15      99     108     113     126
   4  |    91.0    7.05       9      71      86      91      95     104

Leave a Comment

Hari Balakrishnan on Congestion Control

I was fortunate enough to work in Hari Balakrishnan’s group when I was working toward a PhD at MIT (I never finished it — I’m ABD).  Anybody who’s ever worked with Hari will confirm that he’s brilliant.  Fine, he’s a Professor at MIT.  Big surprise.  (Fact is, even in that crowd, he excels).

What’s cool is anybody can learn from him.  Today I watched a video of him lecturing on congestion control, an area that he’s both a pioneer and an expert.  It’s a brilliant lecture.  It’s not punctuated by “ummms” and “so yeahs.”  It’s crisp and crystal clear.  He not only explains the core concepts, but he also hints at open problems.  I highly recommend checking out the lecture.

Leave a Comment

Probability NFL Favorite Wins, Given Point Spread

I’m continuing my investigation of what we can infer from NFL point spreads.  I wrote a little code to compute the conditional probability of the favorite team winning, given the size of the point spread.  Here are the results.

probwin_exact_spread

As you can see, the game is essentially a tossup until the point spread is at least a fieldgoal.  Once the spread exceeds a touchdown, the favorite wins nearly 4/5 times.  In the rare instance that the point spread exceeds two touchdowns, then the favorite is a lock to win.

Here’s the raw data used to generate this plot.

    spread       wins      games    probwin
============================================
      0.00         21         47      44.68
      1.00         70        138      50.72
      1.50         60        123      48.78
      2.00         87        159      54.72
      2.50        126        249      50.60
      3.00        351        587      59.80
      3.50        225        360      62.50
      4.00        101        155      65.16
      4.50         71        120      59.17
      5.00         97        133      72.93
      5.50         99        142      69.72
      6.00        108        168      64.29
      6.50        154        226      68.14
      7.00        177        238      74.37
      7.50         85        123      69.11
      8.00         75         95      78.95
      8.50         70         85      82.35
      9.00         71         91      78.02
      9.50         79         98      80.61
     10.00         66         95      69.47
     10.50         55         67      82.09
     11.00         40         48      83.33
     11.50         23         27      85.19
     12.00         20         25      80.00
     12.50         28         34      82.35
     13.00         20         24      83.33
     13.50         34         44      77.27
     14.00         26         33      78.79
     14.50         14         16      87.50
     15.00          9          9     100.00
     15.50          9          9     100.00
     16.00         14         14     100.00
     16.50          3          3     100.00
     17.00          5          5     100.00
     17.50          6          7      85.71
     18.00          2          2     100.00
     18.50          1          1     100.00
     19.00          1          1     100.00
     19.50          2          2     100.00
     20.00          1          1     100.00
     21.00          1          1     100.00
     22.00          1          1     100.00
     24.00          2          2     100.00

Leave a Comment

Simple Backtesting of Football Pickem Strategy

I was curious to see how my simple strategy for the weighted fantasy football pickem would do in previous years.  I found an archive of NFL point spreads, over under lines, and game results on goldsheet.com.  I don’t know how accurate the data are, but I wrote a little Python script to parse it.  The formatting of the data is pretty rough, so I’ve put a CSV version up on Google Docs.

Here’s a tabular summary of the results.  The key column is Percent, which shows the percentage of available points (MaxPosib) that my strategy got in each year.

    Year      Min      Max      Avg    Total   MaxPosib    Percent
==================================================================
    1993       27      103    60.94     1097       1466      74.83
    1994       46       95    64.82     1102       1500      73.47
    1995       50      114    81.18     1380       1747      78.99
    1996       51      119    82.82     1408       1748      80.55
    1997       52      114    82.24     1398       1745      80.11
    1998       56      120    86.94     1478       1734      85.24
    1999       65      115    91.06     1548       1959      79.02
    2000       62      113    90.53     1539       1959      78.56
    2001       43      117    90.29     1535       1960      78.32
    2002       62      136    95.59     1625       2088      77.83
    2003       72      134    98.53     1675       2088      80.22
    2004       67      129    95.47     1623       2088      77.73
    2005       77      127   101.47     1725       2088      82.61
    2006       65      134    89.59     1523       2091      72.84
    2007       56      136   103.41     1758       2091      84.07
------------------------------------------------------------------
 average    56.73   120.40    87.66  1494.27    1890.13      78.96
std.dev.    12.20    11.84    11.67   189.42     211.33       3.39
     min    27.00    95.00    60.94  1097.00    1466.00      72.84
     max    77.00   136.00   103.41  1758.00    2091.00      85.24

So I expect to score 79% of the total points available this year if I use my strategy of assigning weights to the favored team, ordered by point spread.

The next step is obviously to try some other strategies and see what the results are.  First I want to see how poorly a completely random strategy does.

Comments (1)

Automated Fantasy Pro Football Pickem Weights

I’m participating in the Yahoo! Sports Fantasy Pro Football Pickem with some family and friends.   I don’t regularly play fantasy sports, but I was coaxed into donating money to the pool during a family trip this summer.  I blame the beer and port.

For the uninitiated, each week you must pick a winner in all 16 games, assigning a confidence weight from 1 to 16 to each game.  If your pick wins, then you are awarded points equal to the confidence weight that you assigned to that game.

Now that my money’s in, I want to make a solid showing.  My strategy is to leverage the gambling lines, mapping the point spreads and over under lines to my picks and weights.

I threw together a little Python script to parse the Yahoo odds web page.  For each game, the script averages the point spreads and over under lines.  The favored team is selected, and the picks are ranked according to point spread, with wider spreads getting a larger weight.  Ties on point spread are broken by the over under line, with a larger over under line mapping to a larger weight.

Now I’m wondering how to improve this simple algorithm.  I might break ties by giving a smaller over under a higher weight.  The reasoning would be that smaller over under lines will have less variance.  Surely there are other ways to improve the rankings.

Here’s the table that my little script prints out.

weight | winner                         loser                     spread    o/u
================================================================================
    16 | Minnesota Vikings         over Detroit Lions              -9.83  45.17
    15 | Washington Redskins       over St. Louis Rams             -9.58  37.00
    14 | Green Bay Packers         over Cincinnati Bengals         -9.08  42.00
    13 | Tennessee Titans          over Houston Texans             -6.50  40.92
    12 | Atlanta Falcons           over Carolina Panthers          -6.08  42.50
    11 | Buffalo Bills             over Tampa Bay Buccaneers       -4.42  42.00
    10 | New England Patriots      over New York Jets              -3.42  46.17
     9 | Jacksonville Jaguars      over Arizona Cardinals          -3.08  42.50
     8 | Indianapolis Colts        over Miami Dolphins             -3.08  42.17
     7 | San Diego Chargers        over Baltimore Ravens           -3.00  40.33
     6 | Denver Broncos            over Cleveland Browns           -3.00  38.83
     5 | Kansas City Chiefs        over Oakland Raiders            -3.00  38.50
     4 | Pittsburgh Steelers       over Chicago Bears              -2.92  37.50
     3 | Dallas Cowboys            over New York Giants            -2.75  45.00
     2 | San Francisco 49ers       over Seattle Seahawks           -1.10  39.58
     1 | New Orleans Saints        over Philadelphia Eagles        -1.00  46.08

Leave a Comment

Google Code Jam: I advanced to Round 2!

Last night I participated in Round 1 of the 2009 Google Code Jam.  I thought all of the questions were difficult, but I squeaked by and advanced to Round 2.

I first worked on the Multi-base happiness problem.  I had never heard of happy numbers before, and wasn’t sure how to determine that a number was unhappy.  How many times should I iterate summing the squares before stopping?  Fortunately Wikipedia told me the remarkable property that the iterations either reduce to 1 or enter a cycle.  The other key to this problem is memoization.

I skipped Crossing the Road and tackled the Collecting Cards problem.  I first coded up a simulation.  This failed to generate results with sufficient precision.  I actually succeeded in coding up an alternate solution, but I didn’t submit it in time.  My solution walks down the expectation calculation, passing along the conditional probability of reaching each point.  This works fine, but I feel there must exist a super clean solution for this problem.

In any case, I was fortunate enough to pass on to the next round, where I’ll surely be destroyed.

Leave a Comment

Climbing Two Fourteeners in Colorado

Yesterday I hiked to the 14,271 foot peak of Mt Quandary.  Quandary Peak was my second fourteener.  A couple months ago, on July 4th, I summited Gray’s Peak.

Hiking reference guides declare that both hikes are easy by fourteener standards.  The key phrase there is “fourteener standards,” because they’re difficult.  Granted, I did both hikes just one day after flying in from New York, so I wasn’t acclimated to the altitude.  Even so, with a pack on your back and loose rocks under your feet, it’s always tough going past 13,000 feet.

Catching my breath on Mt Quandary's peak

Catching my breath on Mt Quandary's peak

The hike provides a lot of metaphors for business and software development.

  1. Ultimately, you only reach the top by repeatedly putting one foot in front of the other.  To ensure success, break down the journey into a series of short term, easy to attain goals.  That’s how you get things done.
  2. Experience helps.  On my first climb, even though I had read some trip reports, I was surprised at how difficult it was to keep moving in high altitudes.  I had to take breaks every 10 feet.  I wondered if the summit was worth the effort.  On the second hike, I expected to go slow, and I knew the summit was an ample reward.
  3. Although the ascent and summit contain all the glory, the descent is both more difficult and equally important.  It’s exhilarating to release a new product or major feature, but you have to be prepared to maintain it and complete the life cycle.
  4. If you need to remember part of your journey, then you have to document it.  On both descents there were several points where I had egregious recall of key landmarks.  For example, I’d think “wow, I really thought the tree line was much closer to this big white rock.  I can’t even see the tree line from here!”  Fortunately trails were well marked, so this wasn’t a danger.  In business, you must document key problems and decisions.  Otherwise your mind will surely blur them out over time.  If the trail isn’t well marked, you’ll get lost.
  5. You can find excitement in every stage in the journey, from shopping for supplies, to standing at the peak, to tasting a celebratory beer.

Comments (1)

Older Posts »