Zen and the Art of Elevators

I work in a high-rise building that secures its elevators with keycards.  In principle, the access system is simple and intuitive: employees and visitors are given keycards that they swipe past a sensor inside the elevator before pressing their floor button.  In reality, the building’s access system has a bad user interface, which regularly frustrates both employees and visitors.  During one early morning ride, the frustration was so bad that it almost boiled over into fisticuffs.

When I’m riding the elevator, I enjoy observing how various people react to the security system’s quirks.  I view it as a sort of a Rorschach test that exposes how people interact with technology.  A couple small quirks in the security system make the familiar interface of elevator floor buttons awfully cumbersome.  Only people who take the time to think about the implementation details succeed in calmly selecting their floor.  Those who take the Zen approach, avoiding any detailed thought of the elevator’s control system, typically end up spastically jabbing the buttons or missing their floor.

I drew up a simple flowchart that describes the security system’s behavior.  When the elevator doors open in the building lobby, the system is in the non-blinking state at the top left.  To keep the drawing simple, I haven’t drawn what happens when two people swipe cards or press buttons simultaneously.

Elevator Control Flowchart

Elevator Control Flowchart

The correct path for trouble-free operation is marked with bold arrows that move counter-clockwise around the drawing.  To follow this path, you must:

  1. Swipe your keycard past the sensor.  Be sure nobody else swipes their keycard at the same time.
  2. The sensor’s red light starts to blink.
  3. Wait 1 second.  Do not press the floor button too quickly because the press will be ignored.
  4. You’re now able to press the floor button.  The sensor is still blinking in the same way.
  5. Press the button for your floor.  It will light up.
  6. Everybody else must wait 5 more seconds for the sensor to stop blinking before they can swipe their card.

This description and the flowchart make a few design flaws apparent.  Most glaringly, any deviation from the correct path causes confusion or delay, and the flowchart depicts many potential detours.  The most common detours are:

  1. A person swipes their keycard and immediately presses their floor button.  The button doesn’t light up, and access is delayed.  The typically causes the person to start jabbing at the button.
  2. After somebody has selected her floor, but before the red light stops blinking, somebody else swipes his card and tries to press his floor button.  The swipe is ignored and the button press is rejected.  Everybody must wait roughly 5 seconds for the little red light to stop blinking before starting the process over.
  3. Two people swipe their card at roughly the same time.  Neither are sure which swipe, if any, was accepted.  The more aggressive of the two people starts jabbing their floor buttons while the other person steps back.  Typically all the button presses are ignored until the red light stops blinking and somebody tries again.

With such a poorly designed user interface, there’s no surprise that temporary visitors usually get confused.  I’m somewhat surprised that long-time employees remain confused.  In the terminology of Robert Pirsig’s Zen and the Art of Motorcycle Maintenance (which I’m re-reading now), these people are taking the Zen viewpoint toward technology.  They don’t want to trouble themselves with taking the time to think about the system’s details, which causes frustration when they encounter a problem.

If you visit my building, beware of the elevator.

Leave a Comment

Odd crashes lead to easy Linode upgrade

I’m hosting RankBuzzard on a virtual server hosted at Linode.  I initially signed up with their least expensive offering of $20/month.

This weekend I started observing some mysterious crashes on the server.  First I saw some failed queries into the Postgres database which stores RankBuzzard’s Django models.  Then, as I was investigating the cause of these errors, my Emacs quit with a simple “Killed” message.  Odd for sure.

I wondered if RankBuzzard’s database had outgrown the minimal 360MB allocated to my Linode 360 VPS.  After some basic resource checks that confirmed my suspicion, I decided to upgrade to a Linode 720.  The entire upgrade process, from filing a support ticket to clicking reload on RankBuzzard’s main page, took under 30 minutes.

All I had to do was:

  1. Submit a ticket requesting the upgrade,
  2. Shutdown my virtual server,
  3. Click a button that kicked off the migration process,
  4. Boot up the new virtual server.

The entire process was flawless.

Comments (1)

First version of RankBuzzard.com

I spent the holiday weekend implementing a website using the Django web framework.  In just a few days, I managed to put together a fully functional website that displays Google Hot Trends data and collects user comments.  The general point might be described as “Why are these searches popular?”  You can see the results at RankBuzzard.com.

I last built a public website roughly 4 years ago, when I wrote imwatching.net (which is no longer up).  At a broad level, the two websites are similar.  Both collect time series data, put it in a database, and present it via HTML.  I built imwatching.net using a collection of perl scripts that used the CGI module.  I had no ORM.  I had to build my own user management logic.  I had no templating system.  As a result, it took me at least 3 or 4 times as long to develop imwatching.net.  The end result also wasn’t nearly as tidy and well structured as my implementation of RankBuzzard.com.

What’s more, I hosted imwatching.net on a dedicated server that I rented from serverbeach.com for roughly $100/month.  Although I was happy with the quality of the server, the $100/month cost ultimately caused me to close the site.  Now I’m paying $20/month for a virtual server at linode.com.  Although it’s not a direct comparison, I’m been very happy so far with the service at linode.com.

Hopefully some people will find RankBuzzard interesting, informative, and fun.  I have several ideas for how to improve it, and I plan on adding to it when I find the free time over the next few weeks.

Leave a Comment

Review of OmniGraphSketcher

Last night I downloaded OmniGraphSketcher and gave it a test drive.  I’m a huge fan of OmniGraffle, OmniGroup’s diagramming program, so I was curious to see what their new graphing program had to offer.  In short, I like the usability that it inherited from OmniGraffle, but they need to make several improvements before I would even consider paying them $29.95.

For years I’ve been searching for a graphing platform that helps me quickly create Tufte-quality graphics.  I’ve had the best luck with programming-based environments such as NodeBox, Processing, or PSTricks (within LaTeX).  Back in 2000 I even wrote my own framework called TexLogo, which was a language and interactive system for illustrating Latex documents with logo turtle graphics. These frameworks are all drawing programs with an API.  As such, they give you the finest level of control with the unfortunate downside of being labor intensive.  Whenever you use them you risk diving down the rabbit hole.

The other end of the spectrum are programs that easily produce bad or misleading graphs.  Excel and gnuplot easily crank out ugly graphics.

So I like OmniGraphSketcher’s approach.  The program mixes hand-guided drawing with x-y data coordinate locations.  The default result is clean and sharp, just like diagrams that come out of OmniGraffle.  Unfortunately it’s not a finished product.  Specifically, I need it to include the following:

  1. Units other than foating point numbers.  Most importantly, they need to add support for dates.  You can’t draw a time series graph with the program.
  2. Titles.  Yes, you can double click to add a label, but there’s no built in support for title or sub-titles.
  3. Legends.  This can also be done manually, but they could facilitate it so that you can link data points or series with a legend.
  4. More options for the axes.  For example I’d like the option to only show tick marks.  And I’d like the ability to only show specific tick marks.  I don’t want to be forced to use evenly spaced tick marks.

If the OmniGroup developers make these improvements, then I’m likely to send them thirty bucks.

Here’s a quick example graph that I threw together.

omnigraphsketcher_visits1

Here’s something closer to what I’d like that I made with OmniGraffle.

omnigraffle_visits

Leave a Comment

Randomized optimization algorithm

I thought of a simple randomized optimization algorithm today while I was working on a problem. The algorithm’s working well for me, and while I doubt it’s truly novel, it’s new to me.

Broadly speaking, I’m searching for for a vector of floating point numbers that optimizes my objective function. My objective function is complex enough to be thought of as a black box, and therefore there’s no closed-form gradient function. Although the function has no local minima, the parameters are subject to a number of boundary conditions.

I’m coding in python, so at first I tried using scipy.optimize.fmin. In the function passed to fmin, I returned a large number if the given vector violated the boundary conditions. Unfortunately those large numbers created singularities that caused fmin to get stuck. There are other functions within scipy.optimize that might work, but it wasn’t clear how to use them. This motivated me to code up my own algorithm.

The general approach is simple: at each step generate a random neighbor. If that neighbor is superior to the current point, move to it. I generate a neighbor by perturbing each element of my vector by some fixed delta. Specifically, I multiple a vector numpy.random.random_integers(low=-1, high=1, size=mysize) by delta. Then I normalize the result. Lastly, I find the closest vector point that satisfies the boundary conditions.

This has the succinctness that makes many randomized algorithm appealing. In addition, on my problem, its performance was better
than I expected. Two enhancements made it exceptional.

First, if I generate too many neighbors without finding a superior one, then I decrease delta by a factor of 10. The algorithm stops when delta drops below some threshold. This provides a natural transition between large, coarse steps for speed and small, fine steps for precision.

Second, when I’m at point x0, and I generate a superior point at x1, then I search for increasingly superior points along the trajectory from x0 to x1. Doing so yields the benefits of a poor man’s gradient.

example_search

The figure gives a rough illustration of the search technique. We start at a point with value 8, and randomly pick a neighbor using delta 4. Next we search along that trajectory and find a better point with value 4. Once again using delta 4 we pick a random superior neighbor. Note that we hop over a boundary point when we move there. Unfortunately we don’t randomly generate a better neighbor with delta 4 from that point (even though one exists). So we switch delta to 2, and continue the algorithm.

My code is currently embedded with logic that’s specific to the problem I’m solving.  If I abstract it out, I’ll post the code here.  I’m curious to hear from anybody else who’s implemented a randomized optimization algorithm like this one.

Comments (1)

Pythonic Data Analysis with MaskedArray and Timeseries

In the financial world, most quants analyze time series data using languages such as Matlab, R, SAS, or Stata.  I’ve used those tools, but I’m much happier working with a more general purpose language.  Most recently, I’ve been writing most of my code in Python.  Although people have implemented interfaces from Python to R and other numerical libraries, I prefer to avoid hopping in and out of Python.  Fortunately, Python, NumPy, and SciPy provide an expressive, flexible, and efficient platform for analyzing data.

Regardless of what programming platform you choose, the biggest challenge is wrangling real world data into the platform’s data structures so that you can take advantage of their high level operators.  When you’re wrangling, you’re typically confronted with a Procrustean bed that forces you to crudely fit your real world inputs to the pristine shape and view of your data structure.  This painful process unfortunately risks generating incorrect and misleading results.

In the case of python’s numpy, you need to fit your data into numpy arrays.  Recently I’ve needed to analyze a lot of real world time series data, so I’ve been exploring two important extensions to numpy: masked arrays and the scikits timeseries.  I want to share my experiences and show some code.

I’m coding in Python 2.5.2, using numpy 1.3.0.dev6370, scipy 0.7.0, and scikits.timeseries 0.67.0.dev-r1480.  I’ve run my code both on Windows and OS X 10.5.  A single file containing all of the example code is located on github here.  Everything that I’m writing related to this post is accessible via git://github.com/nodogbite/maskedarray_timeseries.git or by browsing here.

Real World Example

To ground this discussion, let’s consider a purely random dataset that approximates some simple real world daily financial data.  The function ‘generateFakeStockData’ pumps out roughly 8 years of daily return and trading volume data for a universe of 2000 tickers.  Using this data, we’ll calculate:

  • the volume-weighted returns for a given basket of stocks, and
  • the minimum, maximum, and average daily daily returns or trading volumes.

Starting at the top of the class hierarchy, consider the basic numpy array.  Fundamentally, a numpy array is a data structure that stores a multi-dimensional collection of homogeneous data.  Once data is in a numpy array, it’s easy to crunch it using either regular python code or optimized functions from the numpy and scipy libraries.

For our toy example, it’s tempting to tap into those capabilities by loading the data into numpy arrays.  We could put the daily returns into a two-dimensional array of floats, while putting the daily trading volume into an array of integers.  In both numpy arrays, the columns contain data for a particular ticker, and the rows contain data for a particular day.  Then, as the figure shows, we could compute a vector of total returns by writing ‘NP.prod(dailyReturns + 1, axis=0) – 1′.

numpyarray_stockdata1
Unfortunately, our data, and most real world data, aren’t perfectly aligned, and are interspersed with missing values.  These gaps aren’t easy to represent in the numpy array.  As a hack, we could insert a conspicuous flag value in place of missing values, and then derive a perfectly clean array or set of arrays every time we want to use a numpy operator.  This is both inefficient and error prone.  We won’t bother doing that with our example daily financial data.

Masked Arrays

The numpy.ma.MaskedArray, a subclass of the numpy array, was built for this situation.  Conceptually, a masked array is a numpy array coupled with a second array of booleans that has the same shape.  The first array is called the ‘data’, and the boolean array is called the ‘mask’.  The MaskedArray blanks out the data array value at every position where the mask array is ‘True’.

maskedarray_stockdata1
The MaskedArray redefines most of the numpy array’s functions so that it handles missing data just as you expect.  For example, sum and product simply ignore the masked entries.  Since MaskedArray is a subclass of ndarray, the switch is fairly seamless.  There are also a bunch of new capabilities, as well as some potential pitfalls, which I’ll point out.

Returning to our example, we write a function that loads data from ‘generateFakeStockData’ into a couple of masked arrays.  Let’s name the function ‘loadDataIntoMaskedArrays’.  This function is fairly representative of most functions that I’ve written to build masked arrays when I’m processing CSV files.  (Usually I’m a bit more sophisticated so that I can process any number of columns, instead of hard-coding information about the columns, as I’ve done here.)  As we read each datum, we record both its value and a boolean that will become its mask.  Finally, we pass both the data and the mask to the MaskedArray constructor.

Once the data is in the MaskedArray, we can use functions from its class and module to produce results that appropriately skip the missing data.  Similar to the numpy array, we write ‘MA.product(dailyReturns + 1, axis=0) – 1′ to generate a vector of total returns for every ticker.  Here we see a minor blemish in the API: the numpy package has both a ‘prod’ and ‘product’ function, but the numpy.ma package only has a ‘product’ function.

Timeseries

So far, we’ve only used the dates as a groupby key.  The ’scikits.timeseries’ package enables us to conveniently couple a list of dates to our masked array.  Doing so will help us align data for different tickers, as well as produce subsets for a given date range.  There’s some nice documentation for the timeseries module here, but let’s write some timeseries code for our example.

Let’s call the function ‘loadDataIntoTimeseries’, which uses a helper function ‘makeTimeseriesGrid’.  This version of the loading function doesn’t have the bit of logic to fill in masked values for tickers that are missing from the initial dates.  Instead, we simply keep lists of observed dates, datums, and masks for every ticker.  Then we use functions from the timeseries package to properly align the rows and fill in missing values.

Creating a list of timeseries Date objects is the first step when constructing a timeseries object.  Every Date object has a frequency.  In this case we use the ‘B’ frequency, which stands for business days, or Monday through Friday.  Then, using that list of Date objects, you must create a DateArray object, which also has a frequency that’s equal to the frequency of the dates.  The time_series constructor also takes in a MaskedArray that has the same length as the DateArray.

The Date object has some useful functionality, such as when you add 1 to a Friday Date with business frequency, you get the subsequent Monday.  One feature that’s potentially harmful is that if you construct a business frequency Date using a weekend, you’ll get back a Date for the subsequent Monday.  I think a thrown exception would be more pythonic.

Let’s look at the function ‘makeTimeseriesGrid’.  Here we build up a timeseries for every ticker, align them to a common date list, and finally stack all of the data together into a single two-dimensional grid.  Note that I call ‘fill_missing_dates’ on each ticker timeseries.  This fills in all the missing dates in the DateArray and also inserts masked rows into the timeseries data at the corresponding locations.  I wish this were the default behavior because it leverages the principle point of MaskedArrays.  I think calling it is the best practice.

Each of the timeseries are not necessarily aligned on the same date list.  The timeseries package contains a function ‘aligned’, which behaves a bit like ‘fill_missing_dates’, to create a list of timeseries objects that contain the exact same dates.  Its signature is a bit awkward because it takes its input as a variable length argument list.

Finally, we use the numpy.ma.column_stack function to build a two-dimensional array.  Note that we access the ’series’ property of each timeseries object, which returns its masked array.  The timeseries object also has a ‘data’ property which returns a plain, un-masked numpy array.  I wish ‘data’ returned the masked array because ’series’ is a confusing name and I think a masked array should be the default.

To Be Continued

This post is pushing “too long didn’t read” length.  I will defer benchmarks and additional examples of using the constructed timeseries object to a subsequent post.

Comments (2)

Git is it

I’ve used RCS, CVS, ClearCase, Perforce, and SVN.  For the past 8 years or so, Subversion has been my source control system of choice.  For the past few weeks I’ve been dabbling with git, and I’m hooked.  I’m kicking SVN to the curb.

The big wins are:

  1. lightweight branching.  It’s no big deal to make a branch to try out a simple change.  Similarly, I like the convenience of commands like “git stash”.
  2. no pollution of directories with “.svn”.  This is essentially what brought me to git from svn.  I wanted to version control a directory without confusing some non-technical users who would have been perturbed by tons of “.svn” folders in Windows Explorer.  Even as a technical user, I’ve never liked subversion’s use of the “.svn” directories.
  3. no server required.  I like that I don’t have to install or configure anything other than git.  I installed git, typed “git init” in a directory, and was rolling.

Comments (1)

Scatter Map of Madoff’s Victims

When I learned that the court had made a list of Madoff’s victims available, I immediately wanted to graph it.  I first forwarded the PDF one of my best friends.  Within a couple hours he had converted it to text, which he streamed through Yahoo Pipes to generate KML suitable for viewing with Google Maps.  The pipes filtered out duplicates, did address discovery and geolocation.  Unfortunately, Google Maps can’t display more than 80 markers at once, which is far less than the thousands of people whom Madoff ripped off.  He eventually did pull off a single view visualization, but it ran like a dog.

So last night I took a different approach.  I extracted out all of the zip codes from the list.  (Ignoring all of the international investors robbed by Madoff.)  Then I used http://geocoder.us to map each zip code to a latitude and longitude.  Next I wrote a short script in NodeBox to draw a projection of the zip codes on a canvas.  Each point is color coded to indicate the number of times that zip code occurs in the court document.  I was helped a lot by code and ideas in Ben Fry’s Visualizing Data book, which I have access to via Safari Bookshelf.  In particular, I stole his function for projecting out the latitude and longitude numbers.

Here are the resulting images.  The perfectionist in me would love to tinker with this for hours.

scatter map of Madoff's victims in the NYC Metro area

scatter map of Madoff's victims in the NYC Metro area

scatter map of Madoff's victims in NYC Metro area

scatter map of Madoff's victims in NYC Metro area

scatter map of Madoff's victims in New England

scatter map of Madoff's victims in New England

scatter map of Madoff's victims in Florida

scatter map of Madoff's victims in Florida

Comments (5)

Researching Ant Colonies

Last night I watched Dr. Deborah Gordon’s tech talk about How Ant Colonies Get Things Done.  She’s been studying a collection of ant colonies in New Mexico for the past 25 years.  That impressive feat of longevity was my biggest takeaway from the talk.  25 years devoted to ants.

As a kid growing up in Illinois, I toyed with ants.  I mixed ants from different colonies.  I figured out that you could confuse an ant by erasing its scent trail.  I sprayed ants with paint.  I roasted ants by focusing sunlight with a magnifying glass.  I slowed ants by putting them in the fridge.  I had a couple ant farms.  I submerged ants under water.  I dug up colonies outside.

I see Deborah taking that childish curiosity to a professional level.  As a computer scientist who also first played with computers at an early age, I find her longevity inspirational.

I’ve also been thinking about how her results might be applied to distributed computing problems.  Ant colonies are extremely fault tolorant and highly decentralized.  When I design distributed systems I’ll be sure to think about how ant colonies might apply.  And if I get tired, I’ll think about how Deborah Gordon keeps on keepin on.

Leave a Comment

Open Source CDS Pricing

I just read an ISDA press release saying that the J.P. Morgan CDS Analytics Engine will soon be open source.  I hope it’s true because I’d love to feast my eyes on that code base.

For the uninitiated, CDS stands for Credit Default Swap.  Put simply, a CDS is an insurance contract between two parties that references some financial contract, such as a bond.  The party selling insurance receives payments from the buyer every quarter.  If the referenced entity (bond) defaults, the quarterly payments immediately stop, and the insurance seller pays out a large lump sum to cover any losses.

CDS contracts have a value that fluctuates with variables such as time, interest rates, and the perceived risk of default on the referenced entity.  As a result, there is a large, liquid market for these derivative contracts.  A single contract is typically written to insure a notional of $10, $20, $50, or $100 million.  Despite dealing with sums that large (or larger), there is no reference, open source pricing mechanism.

Said another way, if a bank or hedge fund wants to value one of its CDS positions, there’s no reference, open source pricing algorithm.  Instead, each desk might use an expensive, commercial library, or the CDS pricing screen on their (expensive) Bloomberg Terminal, or their own proprietary pricing algorithm.  The standard pricing model within the Bloomberg Terminal is their “J.P. Morgan model,” the details of which are only described at a high level in a one page document.  You can’t access the code.

At a previous job I wrote a proprietary CDS pricing algorithm.  I was shocked at how subtle variations in the algorithm produce signficantly different results.  Furthermore, it’s impossible to tie out your results exactly with a standard such as Bloomberg’s pricer.  And I’m wasn’t alone.  I spoke to quants at other banks and to vendors who sell their own pricers, and none of them were able to exactly match Bloomberg’s pricer!  I had a conspiracy theory that Bloomberg made their pricing output cryptographically secure so that traders would be forced to pay for their CDS pricer.

So if this J.P. Morgan code has any relation to the elusive code within the Bloomberg CDS screen, I can’t wait to see it.

Leave a Comment

Older Posts »