Archive for October, 2009

Review of Programming Pearls by Jon Bentley

I just finished reading Programming Pearls by Jon Bentley.  This is a collection of his “Programming Pearls” columns for the Communications of the ACM magazine.  Bentley shares stories about implementing fundamental data structures and algorithms.

My biggest takeaway was Bentley’s authoritative yet humble tone.  This stands in contrast to the high bravado and chest thumping that defines so much technical communication in the current realm of blogs and tweets.  These days it seems that every writer is a gladiator who’s foremost goal is to demonstrate technical superiority.  It’s king of the cage for nerds.

Bentley, a researcher at Xerox PARC, and a CMU professor who advised Charles Leiserson, is certainly a king of the technical cage.  Even so, he makes you feel like he’s a coworker discussing his current project during lunch.  My mind  was generating imaginary suggestions for implementation techniques instead of simply digesting his word as gospel.

The columns typically highlight personalized implementation stories which often involve now antiquated machines such as the PDP-11.  One personal anecdote that I really appreciated was Ed Reingold’s solution to a toy problem of searching for a missing integer.  Bentley mentions that Reingold uses the question on one of his quizzes.  Prof Reingold was my undergraduate advisor, and these are exactly the type of questions that he asked on his quizzes, which I always thoroughly enjoyed.

Comments (1)

HP48G Emulator for OSX

When I was in high school I had a HP48GX calculator.  I cherished that calculator.  I wrote programs for it.  I shared those programs with a couple fellow nerds at school via the IrDA transport.  I sent away for programs that arrived by mail on a floppy disk, which I downloaded to the calculator via a serial connection.

I’ve heard veteran engineers reminisce about old machines like the PDP-11.  The HP48G was my PDP-11.

From time to time I get nostalgic about that calculator, and I search for an emulator.  I’m not sure why.

I just did one of those searches and was delighted to find an emulator with a codebase updated as recently as last month!  If you download the app and then load the JEMAC.KML file you’re presented with a full color, functional HP48GX!  The memories are thick.

Leave a Comment

Efficient Calculations on Named Columns of Realworld Data

In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name.  Every named column contains values of the same type and every row corresponds to related datums.  I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type.  Database tables have named and typed columns, but they typically only support a limited set of operations.  Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.

Stata is one commercial program that supports a variety of operations on data rectangles.  In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk.  Each column has a name and a data type.  The rectangles also support missing values, which is essential when working with real world data.

Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool.  Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.

I want to perform high level rectangular operations within a more flexible language like Python.  I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package.  Numpy supports named, typed columns via recarray, but recarray is generally incomplete.  For example, there’s no built in way to merge (join) two recarrays.  More importantly, recarray does not support masked values!

I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations.  My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays.  I can’t share the code because I wrote it at work.

I’m happy with my implementation, but I wish it were faster.  Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward.  Both of these do too much work outside of optimized space of numpy.  I’m thinking of rewriting these functions in C or Cython.

I’d love to hear comments about this general data structure.  How are other people dealing with rectangles of real world data?

Comments (7)

Fantasy Football Pickem Strategy Results after Week 4

After Brett Favre and the Vikings polished off the Packers last night, I found myself in 7th place out of 50 in my NFL Pickem pool.  As the following box plot shows, choosing weights using Vegas point spreads put me above the median in 3/4 weeks and in the top quartile 2/4 weeks.  In the second week, when I was below the median, I was still within the interquartile range.

weeklyperformance_boxplot

Here’s some of the data in the above box plot.

Week  |    Mean  StdDev     IQR     Min      lq  Median      uq     Max
=======================================================================
   1  |   114.5    8.22      11      95     109     115     120     133
   2  |    77.1   10.38      13      53      70      78      83     107
   3  |   104.6   15.76      14      15      99     108     113     126
   4  |    91.0    7.05       9      71      86      91      95     104

Leave a Comment

Hari Balakrishnan on Congestion Control

I was fortunate enough to work in Hari Balakrishnan’s group when I was working toward a PhD at MIT (I never finished it — I’m ABD).  Anybody who’s ever worked with Hari will confirm that he’s brilliant.  Fine, he’s a Professor at MIT.  Big surprise.  (Fact is, even in that crowd, he excels).

What’s cool is anybody can learn from him.  Today I watched a video of him lecturing on congestion control, an area that he’s both a pioneer and an expert.  It’s a brilliant lecture.  It’s not punctuated by “ummms” and “so yeahs.”  It’s crisp and crystal clear.  He not only explains the core concepts, but he also hints at open problems.  I highly recommend checking out the lecture.

Comments (1)

Follow

Get every new post delivered to your Inbox.