Posts Tagged software
October 22, 2009 at 10:21 pm
· Filed under Book Review ·Tagged book, review, software
I just finished reading Programming Pearls by Jon Bentley. This is a collection of his “Programming Pearls” columns for the Communications of the ACM magazine. Bentley shares stories about implementing fundamental data structures and algorithms.
My biggest takeaway was Bentley’s authoritative yet humble tone. This stands in contrast to the high bravado and chest thumping that defines so much technical communication in the current realm of blogs and tweets. These days it seems that every writer is a gladiator who’s foremost goal is to demonstrate technical superiority. It’s king of the cage for nerds.
Bentley, a researcher at Xerox PARC, and a CMU professor who advised Charles Leiserson, is certainly a king of the technical cage. Even so, he makes you feel like he’s a coworker discussing his current project during lunch. My mind was generating imaginary suggestions for implementation techniques instead of simply digesting his word as gospel.
The columns typically highlight personalized implementation stories which often involve now antiquated machines such as the PDP-11. One personal anecdote that I really appreciated was Ed Reingold’s solution to a toy problem of searching for a missing integer. Bentley mentions that Reingold uses the question on one of his quizzes. Prof Reingold was my undergraduate advisor, and these are exactly the type of questions that he asked on his quizzes, which I always thoroughly enjoyed.
Permalink
October 19, 2009 at 11:06 pm
· Filed under Random Thoughts ·Tagged hp48, Random Thoughts, software
When I was in high school I had a HP48GX calculator. I cherished that calculator. I wrote programs for it. I shared those programs with a couple fellow nerds at school via the IrDA transport. I sent away for programs that arrived by mail on a floppy disk, which I downloaded to the calculator via a serial connection.
I’ve heard veteran engineers reminisce about old machines like the PDP-11. The HP48G was my PDP-11.
From time to time I get nostalgic about that calculator, and I search for an emulator. I’m not sure why.
I just did one of those searches and was delighted to find an emulator with a codebase updated as recently as last month! If you download the app and then load the JEMAC.KML file you’re presented with a full color, functional HP48GX! The memories are thick.
Permalink
October 15, 2009 at 2:37 pm
· Filed under Python ·Tagged data, performance, Python, software
In data analysis frameworks, it’s common to represent data as a two-dimensional rectangle, where each column is given a name. Every named column contains values of the same type and every row corresponds to related datums. I say “rectangle” instead of matrix because a matrix typically implies that every value has the same type. Database tables have named and typed columns, but they typically only support a limited set of operations. Of course there are special-purpose databases, like Kdb+, but they’re prohibitively expensive for most users.
Stata is one commercial program that supports a variety of operations on data rectangles. In fact, Stata’s language is entirely built around the idea of having a single rectangle in memory, upon which you sort, filter, generate columns, or merge with rectangles that you previously saved to disk. Each column has a name and a data type. The rectangles also support missing values, which is essential when working with real world data.
Unfortunately, although Stata is a decent tool for playing with data while doing research, it’s horrible as a production tool. Also, its language is so simple that it makes operations like loops or fill_forward a nightmare.
I want to perform high level rectangular operations within a more flexible language like Python. I’m already comfortable wrangling data in numpy, and I’m a huge fan of numpy’s MaskedArray package. Numpy supports named, typed columns via recarray, but recarray is generally incomplete. For example, there’s no built in way to merge (join) two recarrays. More importantly, recarray does not support masked values!
I have built my own my own module that replicates and expands upon most of Stata’s rectangular operations. My implementation includes a class that stores data in a dictionary mapping column names to numpy’s masked arrays. I can’t share the code because I wrote it at work.
I’m happy with my implementation, but I wish it were faster. Although I’m faster than Stata, two key operations that are slower than I’d like are: 1) merge (join) and 2) fill-forward. Both of these do too much work outside of optimized space of numpy. I’m thinking of rewriting these functions in C or Cython.
I’d love to hear comments about this general data structure. How are other people dealing with rectangles of real world data?
Permalink
September 24, 2009 at 12:07 am
· Filed under software ·Tagged algorithm, backtest, data, fantasy, gambling, software, sports
I was curious to see how my simple strategy for the weighted fantasy football pickem would do in previous years. I found an archive of NFL point spreads, over under lines, and game results on goldsheet.com. I don’t know how accurate the data are, but I wrote a little Python script to parse it. The formatting of the data is pretty rough, so I’ve put a CSV version up on Google Docs.
Here’s a tabular summary of the results. The key column is Percent, which shows the percentage of available points (MaxPosib) that my strategy got in each year.
Year Min Max Avg Total MaxPosib Percent
==================================================================
1993 27 103 60.94 1097 1466 74.83
1994 46 95 64.82 1102 1500 73.47
1995 50 114 81.18 1380 1747 78.99
1996 51 119 82.82 1408 1748 80.55
1997 52 114 82.24 1398 1745 80.11
1998 56 120 86.94 1478 1734 85.24
1999 65 115 91.06 1548 1959 79.02
2000 62 113 90.53 1539 1959 78.56
2001 43 117 90.29 1535 1960 78.32
2002 62 136 95.59 1625 2088 77.83
2003 72 134 98.53 1675 2088 80.22
2004 67 129 95.47 1623 2088 77.73
2005 77 127 101.47 1725 2088 82.61
2006 65 134 89.59 1523 2091 72.84
2007 56 136 103.41 1758 2091 84.07
------------------------------------------------------------------
average 56.73 120.40 87.66 1494.27 1890.13 78.96
std.dev. 12.20 11.84 11.67 189.42 211.33 3.39
min 27.00 95.00 60.94 1097.00 1466.00 72.84
max 77.00 136.00 103.41 1758.00 2091.00 85.24
So I expect to score 79% of the total points available this year if I use my strategy of assigning weights to the favored team, ordered by point spread.
The next step is obviously to try some other strategies and see what the results are. First I want to see how poorly a completely random strategy does.
Permalink
September 18, 2009 at 8:27 pm
· Filed under Python, software ·Tagged algorithm, gambling, Python, software, sports
I’m participating in the Yahoo! Sports Fantasy Pro Football Pickem with some family and friends. I don’t regularly play fantasy sports, but I was coaxed into donating money to the pool during a family trip this summer. I blame the beer and port.
For the uninitiated, each week you must pick a winner in all 16 games, assigning a confidence weight from 1 to 16 to each game. If your pick wins, then you are awarded points equal to the confidence weight that you assigned to that game.
Now that my money’s in, I want to make a solid showing. My strategy is to leverage the gambling lines, mapping the point spreads and over under lines to my picks and weights.
I threw together a little Python script to parse the Yahoo odds web page. For each game, the script averages the point spreads and over under lines. The favored team is selected, and the picks are ranked according to point spread, with wider spreads getting a larger weight. Ties on point spread are broken by the over under line, with a larger over under line mapping to a larger weight.
Now I’m wondering how to improve this simple algorithm. I might break ties by giving a smaller over under a higher weight. The reasoning would be that smaller over under lines will have less variance. Surely there are other ways to improve the rankings.
Here’s the table that my little script prints out.
weight | winner loser spread o/u
================================================================================
16 | Minnesota Vikings over Detroit Lions -9.83 45.17
15 | Washington Redskins over St. Louis Rams -9.58 37.00
14 | Green Bay Packers over Cincinnati Bengals -9.08 42.00
13 | Tennessee Titans over Houston Texans -6.50 40.92
12 | Atlanta Falcons over Carolina Panthers -6.08 42.50
11 | Buffalo Bills over Tampa Bay Buccaneers -4.42 42.00
10 | New England Patriots over New York Jets -3.42 46.17
9 | Jacksonville Jaguars over Arizona Cardinals -3.08 42.50
8 | Indianapolis Colts over Miami Dolphins -3.08 42.17
7 | San Diego Chargers over Baltimore Ravens -3.00 40.33
6 | Denver Broncos over Cleveland Browns -3.00 38.83
5 | Kansas City Chiefs over Oakland Raiders -3.00 38.50
4 | Pittsburgh Steelers over Chicago Bears -2.92 37.50
3 | Dallas Cowboys over New York Giants -2.75 45.00
2 | San Francisco 49ers over Seattle Seahawks -1.10 39.58
1 | New Orleans Saints over Philadelphia Eagles -1.00 46.08
Permalink
September 6, 2009 at 12:03 am
· Filed under Programming Languages, software ·Tagged contest, data, Random Thoughts, software
I wrote a quick Python script to pull down all of the results from the qualifying rounds of Google Code Jam from 2008 and 2009. I have put the rank and time of each submitted solution for all participants in these two Google Spreadsheets:
I also put a collection of summary statistics in this Google Spreadsheet.
Here are some conclusions drawn from these data:
- The number of participants increased by 1,449 people in 2009, a 20% increase.
- There were 2,572 people who participated both years. This means 36% of the 2008 participants came back for more.
- The 2008 Qualifier had more difficult problems. After normalizing the point totals (2008 had a max total of 75 points but 2009 had a max of 99 points), the average score was 15% higher in 2009.
- Problem C in 2008 was extremely hard. Only 14% of the participants solved Problem C with the small data set, and only 9% solved Problem C with the large data set.
- The large data set for Problem C in 2009 was hard for many people. Only 36% of people solved it. Furthermore, of the people who solved the small data set, only 57% of those were able to solve the large data set.
- People worked roughly the same amount of time both years. The 2009 times are all slightly larger, but the round was also extended by 2 hours because of technical problems early in the round.
- On average, for each participant, there is about a 3.5 hour gap between the submission time of the first solution and the submission time of the last solution. Of course some people (like me) probably choose to tackle the problems whenever they found free moments during their day.
If people are interested, I’ll post my code and all the data somewhere.
Permalink
September 3, 2009 at 4:45 pm
· Filed under Python, software ·Tagged contest, software
I finished all 3 problem in this year’s Google Code Jam. I found them to be much easier than last year’s qualifier. (Although last year a friend told me about the qualifier less than two hours before it ended, so I was rushed).
Last year I wrote my solutions in Haskell because I was learning the language. This year I wasn’t so adventurous; I wrote my solutions using Python.
For the Alien Language challenge, I encoded the lexicon using a tree built with python dictionaries. I imagine this could be done using a regex, but I haven’t thought that through.
For the Watersheds challenge, I simply built two 2d arrays: one to hold the topology and the other to hold the labels. From each coordinate on the map, I followed the flow down hill until I reached a sink or reached a labeled coordinate. This memoization ensures that you only visit each node once.
For the Welcome to Code Jam challenge, I memoized a mapping from (string_index, substring) to the number of times that substring can be completed starting at the index. Again, this was straightforward.
I’m looking forward to seeing how other people solved each challenge. I wonder how 16 year old Neal Wu made such quick work of the problems, solving all three in under 26 minutes. That’s amazing.
Permalink
August 20, 2009 at 8:27 pm
· Filed under software ·Tagged poker, software
Back in 2004, I spent most of my time building a poker bot to play low stakes limit texas holdem. (A poker bot is a program that automatically plays poker against humans on one of the major internet poker sites). The project consumed me. It was both challenging and exciting. I successfully ran the bot for a few years before shutting it down when I took my first job on Wall Street.
I worked alone on the poker bot. I designed and implemented the poker client interface, the AI, and monitoring framework. Since then, I’ve enjoyed reading a few blog posts detailing the efforts and celebrations of other poker bot creators. Just today I read this one. I’d like to join the party and share some details about how I built my bot.

When I started the project, I wasn’t sure how to interface with the poker client. The poker sites don’t provide an API, so I either had to reverse engineer the network protocol or build a screen scraper. I knew a lot about hacking network protocols, and I had never built a screen scraper. (I also noticed that one of the sites used an open source SSL library that I could swap out).
Despite these advantages, I wanted to build a screen scraper because:
- It’s harder for the poker client to detect a screen scraper. I didn’t want exhibit an identifying communication fingerprint.
- I didn’t want to reverse engineer multiple network protocols. It seemed easier to get a screen scraper working on different poker clients.
I wrote the scraper using C#. One weakness of my scraper was that the windows all had to be a certain size, and they could not overlap. I worked around this by building a simple tiling window manager with an AutoHotKey script.
I routed information from the scraper through a piece of software that I called the poker gateway. This gateway directed information to a hand history database, poker AI, lobby AI, and control center. All of these components ran on a separate linux box, and I used Ruby to develop them all.
This architecture was robust and flexible. As a result, I could focus most of my energy and thought on building the AI.
I’d enjoy hearing from other poker bot creators. It would be fun to swap war stories.
Permalink
August 11, 2009 at 8:13 pm
· Filed under Programming Languages, Random Thoughts ·Tagged algorithm, Random Thoughts, software
I just read this post, which says “Just about every developer uses a debugger, at least occasionally.” I don’t know if I’m inadequate or working on a peculiar class of problems, but I haven’t used a debugger in roughly 10 years.
I have used a debugger twice. Once I was chasing down a bug in a boutique embedded systems compiler. The compiler had a bug, and staring at the C code didn’t cut it. I also used the Perl debugger while writing my MS thesis (back when I wrote oodles of Perl code). I used it because a friend showed it to me, and it seemed cool.
My primary debugging technique is code examination. I simply look carefully at my code and think about it. I also write unit tests.
I’m not fundementally opposed to debuggers. It’s just that I haven’t used one in the past 10 years.
Permalink
July 28, 2009 at 5:42 pm
· Filed under software ·Tagged Business, organization, software, technology
Anybody who has written a computer program appreciates the warm glow of having a machine do your work. That pleasure often masks the burden of holding the program’s hand — kicking it off and checking that it ran correctly.
At least for me, the natural tendency is to stop short of complete automation. Instead, if I’m being lazy, I’ll start the program manually and catch errors by noticing odd behavior. Perhaps I’ll see a glaring failure message. Or maybe I’ll just notice that the program ran too quickly.
I do this because I’ve seen trusty programs fail in myriad ways. I’ve seen malformed or truncated inputs. I’ve seen choppy or slow network connections. I’ve seen disks fill up. I’ve seen memory hogs.
When an error like that pops up when you’re sitting right there at the command prompt, it’s typically easy to diagnose and move on. When run naively from a scheduler, you might not notice a problem until disaster strikes, sparking a late night emergency call, or worse. Therefore, I have to fight my natural tendency to hand-hold my programs.
Through experience, I’ve learned to go the extra mile and properly automate my programs. These are some of my best practices — I hope other people can point out ones that I’ve overlooked.
- Every program must use a proper logger. The first thing I do when I write *any* program is to make sure it’s set up to log using that language’s preferred logging utility. This allows you to properly investigate what happened every time the program runs.
- Save the logs from each run. It’s possible that an error will go unnoticed for a while, and there’s nothing worse than saying “unfortunately the log file was deleted.”
- Validate the program’s inputs. I’ve been burned before by people who will “always save their Excel spreadsheet in the same way,” “websites that never change,” and data providers that “guarantee their data are validated before given to you.”
- Sanity check the outputs. If you always expect your program to generate some new data, then compare the new results to the old results, and verify that there are actually new results.
- Catch errors and retry. Computers are deterministic, but real world resources are unpredictable. I’ve seen one-off errors too often. Perhaps I call an external program that fails for an unknown reason. Or a file copy fails, but when I retry, it works fine.
- Have an extremely robust email notification system. When there’s an error, you should be notified. One trick is that a failure in the email system should never cause the program to fail. Perfect this library and reuse it.
- Try your program every time you change it. This is easy to say, but it’s tempting to say “oh, but I’m just adding one line” and not run your unit tests. If you change it, at the very least, give it a whirl.
- Run the program from multiple machines. Computers fail at all the wrong times. At the very least, design your program so that only one instance publishes a final result. Then if that computer fails, you can run a quick command on the other machine to publish whatever that program creates.
Permalink
Older Posts »