CSV Parsing: Haskell versus Python

I recently wrote a fairly simple program to parse a CSV file containing some daily intraday prices for a number of tickers. The CSV file had roughly 160,000 lines, with each line containing a ticker, bid, ask, and timestamp. The goal is to produce a daily open, high, low, and close for each ticker.

First I wrote the program in Haskell using the Text.CSV library. It performed horrendously (I didn’t even bother waiting for the program to finish executing as I heard the CPU fan spin up to max). After chatting with ‘dons’ and others on #haskell, I rewrote the program using the Data.ByteString library. Unfortunately, it took about 7.5 seconds to process the input and produce a result.

Curious, I wrote an equivalent program in python. The python program finishes in about 1.7 seconds on my MacBook. That’s 1.7 seconds for python and 7.5 seconds for Haskell — not so good.

None of the superstars on #haskell had any immediate suggestions for what’s making Haskell perform poorly, but most suggestions pointed to my unpacking of the ByteStrings to Strings. I’ll investigate further if I have the time, and post the findings here.

Here is my Haskell code (http://hpaste.org/8928):

> import qualified Data.ByteString.Char8   as BS
> import Data.Maybe
> import Data.List
> import List
> import System.Environment
> import System.IO

TICKER1,33,35,NULL,2007-09-28 00:00:00,2007-09-28 16:32:00
TICKER2,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
TICKER3,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00

> data Row = Row {ticker :: String,
>                 bid :: Double,
>                 ask :: Double,
>                 day :: String,
>                 time :: String} deriving (Read, Show)

> data Result = Result {tickerR :: String,
>                       openR :: Double,
>                       highR :: Double,
>                       lowR :: Double,
>                       closeR :: Double,
>                       dayR :: String} deriving (Read, Show)

> main :: IO ()
> main = do
>   (csvFile:outputFile:_) <- getArgs
>   rows <- doReadData csvFile
>   let tickerGroups = groupBy (\a b -> (ticker a) == (ticker b)) rows
>   let oneTicker rs = map oneDay $ groupBy (\a b -> (day a) == (day b)) rs
>   let results = concat $ map oneTicker tickerGroups
>   let pprow r = concat $ intersperse "," $ [tickerR r,
>                                             show $ openR r,
>                                             show $ highR r,
>                                             show $ lowR r,
>                                             show $ closeR r,
>                                             dayR r]
>   let txt = unlines $ map pprow results
>   writeFile outputFile txt
>   putStrLn "done"

create a Result row that encapsulates the open/high/low/close for
a single ticker on a single day

> oneDay rows =
>   let price r  = ((bid r) + (ask r)) / 2
>       openRow  = head rows
>       closeRow = last rows
>       highRow  = maximumBy (\a b -> compare (price a) (price b)) rows
>       lowRow   = minimumBy (\a b -> compare (price a) (price b)) rows
>   in (Result (ticker openRow)
>              (price openRow)
>              (price highRow)
>              (price lowRow)
>              (price closeRow)
>              (day openRow))

> maybeRead s = case reads s of
>                 [ (x, "")] -> Just (x::Double)
>                 _          -> Nothing

Convert a list of strings into a [Row]
We need to convert two of the strings to Doubles.

> makeRow r =
>   case (maybeRead (r !! 1), maybeRead (r !! 2)) of
>     ((Just a), (Just b)) -> Just (Row (r !! 0) a b (r !! 4) (r !! 5))
>     _                    -> Nothing

Read in the CSV file, returning a [Row], sorted by ticker and day

> doReadData file = do
>   bytes <- BS.readFile file
>   let byteRows = map (BS.split ',') $ BS.lines bytes
>   let sortByTicker   = sortBy (\a b -> case compare (head a) (head b) of
>                                          EQ -> compare (a !! 5) (b !! 5)
>                                          o  -> o)
>   let sortedByteRows = sortByTicker $ filter (\r -> (length r) > 0) byteRows
>   let strRows = map (\r -> map BS.unpack r) sortedByteRows
>   return $ catMaybes $ map makeRow strRows

And here is the python code:

import itertools as IT

def rowCompare(x,y):
    """comparison used for sorting our rows.
    Sort by ticker and then by time."""
    if cmp(x[0], y[0]) == 0:
        return cmp(x[5], y[5])
    else:
        return cmp(x[0], y[0])

def price(r):
    """Take the average of the bid and ask prices."""
    return (float(r[1]) + float(r[2])) / 2.0

# read in the data from a CSV file
f = open('IG_US.csv', 'r')
lines = sorted([line.strip().split(',') for line in f], rowCompare)

# create daily open/high/low/close for each ticker
results = []
for ticker, tickerIter in IT.groupby(lines, lambda x: x[0]):
    for day, dayIter in IT.groupby(tickerIter, lambda x: x[4]):
        dayRows = [d for d in dayIter if d[1] != "NULL" and d[2] != "NULL"]
        if not dayRows:
            continue
        highRow = dayRows[0]
        lowRow = dayRows[0]
        for row in dayRows:
            highRow = row if price(row) > highRow else highRow
            lowRow = row if price(row) < lowRow else lowRow
        results.append([ticker,
                        "%0.2f" % price(dayRows[0]),
                        "%0.2f" % price(highRow),
                        "%0.2f" % price(lowRow),
                        "%0.2f" % price(dayRows[-1]),
                        day])

# write out the result to a file
fw = open("./IG_US_python_output.csv", 'w')
fw.write("\n".join([",".join([str(c) for c in l]) for l in results]))

Please note that I’ve posted a followup here.

17 Comments »

  1. Python performance golf!

    I’m betting the Python version would run even faster if you remove the rowCompare() function and replace the ‘lines = …’ line with (note the use of the ‘key’ argument to sorted()):

    lines = sorted([line.strip().split(',') for line in f], key=lambda x: (x[0], x[5]))

    And you could also change the ‘fw.write(…’ line to:

    fw.write(”\n”.join(”,”.join(str(c) for c in l) for l in results))

    which utilizes generator expressions rather than list comprehensions.

    Do these changes improve the run-time for you?

  2. Scott said

    While this code suits your needs, it’s not true CSV parsing, which would be MUCH slower. For robust CSV, you need to handle fields with embedded quotes and commas.

  3. Please don’t call this CSV parsing. CSV is a little bit more complicated than that.

  4. Paddy3118 said

    I’m interested in what comparable timings you got for the Python CSV module and any Haskell equivalent standard library?

    Thanks, Paddy.

  5. Evan said

    Hey pretty interesting comparison. Keep up the good work.

  6. Greg said

    Hi, interesting issue, do you try to profile the haskell code?

  7. Greg said

    Author’s response.

    Fair point about it not being true CSV parsing. I have some ideas of how I might optimize by unpacking fewer of the ByteStrings, but I have no idea of how I’d make Haskell fast with a true, nasty CSV file.

    I haven’t profiled the code because after doing a configure with:

    cabal configure –enable-executable-profiling

    I receive this error when I build:

    Could not find module `Data.ByteString.Char8′:
    Perhaps you haven’t installed the profiling libraries for package bytestring-0.9.1.0?

  8. i don’t think Haskell is designed for parsing CSV files … its not a language with emphasis on I/O …

    i’m not sure why you’re using it for that purpose … in the financial space even :)

  9. Reid said

    You realize that Python has a csv module as part of the standard library. All you really need is (I think roughly) this:

    import csv

    for row in csv.reader(’myfile.csv’):
    # do stuff with row.

    To make it even better, you can go so far as to use a DictReader which lets you use the row like a dictionary with real string names. And then if you wanted to go further, you could wrap that in a class that lets you use regular attribute lookup. But if you want it to go really fast, you name you indices like so:

    _ticker = 0
    _bid = 1

    And your code would still be readable.

  10. Greg said

    @author: Ok, as a temporary workaround, you may add timestamps before each “class” of action in order to find what is taking to much time. I think it’s important to find and focus on the code that mainly slow all your program.

  11. Stan Domeshok said

    Real World Haskell wrote a full CSV parser with Parsec, whole thing ended up being about 20 lines of code, you might want to take a look at it.
    http://book.realworldhaskell.org/beta/parsec.html

  12. Note there’s a ByteString-based CSV parser on hackage now,

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-csv

    which is significantly faster than Text.CSV. That, along with the bytestring-lex package for parsing floating point values from ByteStrings should mean your program is able to be both concise and fast.

  13. Never, *ever*, **ever** unpack ByteStrings to Strings. That’s the whole point of ByteStrings! If you unpack, you may as well just use `[Char]`. Stick to ByteStrings only and you get [this kind of result](http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all).

  14. Kelila said

    Good words.

  15. Josh said

    160,000 records, well over over 50-bytes a peice, in 1.7 seconds?

    That’s almost 5GB/sec.

    Either I’m misunderstanding something, or the MacBook’s hard drive is apparently over 5 times faster than $50,000.00 RAID arrays currently on the market. That’s amazing!

  16. Greg said

    Josh,

    Well, the file had 157562 lines and is 9779275 bytes in size. So yeah, that’s about 62 bytes per line. But processing that number of lines in 1.7 seconds is only 5.5MBps. So, by that math, it’s megas, not gigas.

    Greg

  17. Josh said

    Did I really forget to carry the zeroes? D’oh!

    Thanks for the correction…

RSS feed for comments on this post · TrackBack URI

Leave a Comment