I recently wrote a fairly simple program to parse a CSV file containing some daily intraday prices for a number of tickers. The CSV file had roughly 160,000 lines, with each line containing a ticker, bid, ask, and timestamp. The goal is to produce a daily open, high, low, and close for each ticker.
First I wrote the program in Haskell using the Text.CSV library. It performed horrendously (I didn’t even bother waiting for the program to finish executing as I heard the CPU fan spin up to max). After chatting with ‘dons’ and others on #haskell, I rewrote the program using the Data.ByteString library. Unfortunately, it took about 7.5 seconds to process the input and produce a result.
Curious, I wrote an equivalent program in python. The python program finishes in about 1.7 seconds on my MacBook. That’s 1.7 seconds for python and 7.5 seconds for Haskell — not so good.
None of the superstars on #haskell had any immediate suggestions for what’s making Haskell perform poorly, but most suggestions pointed to my unpacking of the ByteStrings to Strings. I’ll investigate further if I have the time, and post the findings here.
Here is my Haskell code (http://hpaste.org/8928):
> import qualified Data.ByteString.Char8 as BS
> import Data.Maybe
> import Data.List
> import List
> import System.Environment
> import System.IO
TICKER1,33,35,NULL,2007-09-28 00:00:00,2007-09-28 16:32:00
TICKER2,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
TICKER3,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
> data Row = Row {ticker :: String,
> bid :: Double,
> ask :: Double,
> day :: String,
> time :: String} deriving (Read, Show)
> data Result = Result {tickerR :: String,
> openR :: Double,
> highR :: Double,
> lowR :: Double,
> closeR :: Double,
> dayR :: String} deriving (Read, Show)
> main :: IO ()
> main = do
> (csvFile:outputFile:_) <- getArgs
> rows <- doReadData csvFile
> let tickerGroups = groupBy (\a b -> (ticker a) == (ticker b)) rows
> let oneTicker rs = map oneDay $ groupBy (\a b -> (day a) == (day b)) rs
> let results = concat $ map oneTicker tickerGroups
> let pprow r = concat $ intersperse "," $ [tickerR r,
> show $ openR r,
> show $ highR r,
> show $ lowR r,
> show $ closeR r,
> dayR r]
> let txt = unlines $ map pprow results
> writeFile outputFile txt
> putStrLn "done"
create a Result row that encapsulates the open/high/low/close for
a single ticker on a single day
> oneDay rows =
> let price r = ((bid r) + (ask r)) / 2
> openRow = head rows
> closeRow = last rows
> highRow = maximumBy (\a b -> compare (price a) (price b)) rows
> lowRow = minimumBy (\a b -> compare (price a) (price b)) rows
> in (Result (ticker openRow)
> (price openRow)
> (price highRow)
> (price lowRow)
> (price closeRow)
> (day openRow))
> maybeRead s = case reads s of
> [ (x, "")] -> Just (x::Double)
> _ -> Nothing
Convert a list of strings into a [Row]
We need to convert two of the strings to Doubles.
> makeRow r =
> case (maybeRead (r !! 1), maybeRead (r !! 2)) of
> ((Just a), (Just b)) -> Just (Row (r !! 0) a b (r !! 4) (r !! 5))
> _ -> Nothing
Read in the CSV file, returning a [Row], sorted by ticker and day
> doReadData file = do
> bytes <- BS.readFile file
> let byteRows = map (BS.split ',') $ BS.lines bytes
> let sortByTicker = sortBy (\a b -> case compare (head a) (head b) of
> EQ -> compare (a !! 5) (b !! 5)
> o -> o)
> let sortedByteRows = sortByTicker $ filter (\r -> (length r) > 0) byteRows
> let strRows = map (\r -> map BS.unpack r) sortedByteRows
> return $ catMaybes $ map makeRow strRows
And here is the python code:
import itertools as IT
def rowCompare(x,y):
"""comparison used for sorting our rows.
Sort by ticker and then by time."""
if cmp(x[0], y[0]) == 0:
return cmp(x[5], y[5])
else:
return cmp(x[0], y[0])
def price(r):
"""Take the average of the bid and ask prices."""
return (float(r[1]) + float(r[2])) / 2.0
# read in the data from a CSV file
f = open('IG_US.csv', 'r')
lines = sorted([line.strip().split(',') for line in f], rowCompare)
# create daily open/high/low/close for each ticker
results = []
for ticker, tickerIter in IT.groupby(lines, lambda x: x[0]):
for day, dayIter in IT.groupby(tickerIter, lambda x: x[4]):
dayRows = [d for d in dayIter if d[1] != "NULL" and d[2] != "NULL"]
if not dayRows:
continue
highRow = dayRows[0]
lowRow = dayRows[0]
for row in dayRows:
highRow = row if price(row) > highRow else highRow
lowRow = row if price(row) < lowRow else lowRow
results.append([ticker,
"%0.2f" % price(dayRows[0]),
"%0.2f" % price(highRow),
"%0.2f" % price(lowRow),
"%0.2f" % price(dayRows[-1]),
day])
# write out the result to a file
fw = open("./IG_US_python_output.csv", 'w')
fw.write("\n".join([",".join([str(c) for c in l]) for l in results]))
Please note that I’ve posted a followup here.
Chris Rebert said,
July 15, 2008 @ 2:13 am
Python performance golf!
I’m betting the Python version would run even faster if you remove the rowCompare() function and replace the ‘lines = …’ line with (note the use of the ‘key’ argument to sorted()):
lines = sorted([line.strip().split(',') for line in f], key=lambda x: (x[0], x[5]))
And you could also change the ‘fw.write(…’ line to:
fw.write(”\n”.join(”,”.join(str(c) for c in l) for l in results))
which utilizes generator expressions rather than list comprehensions.
Do these changes improve the run-time for you?
Scott said,
July 15, 2008 @ 2:32 am
While this code suits your needs, it’s not true CSV parsing, which would be MUCH slower. For robust CSV, you need to handle fields with embedded quotes and commas.
Stefan Scholl said,
July 15, 2008 @ 2:52 am
Please don’t call this CSV parsing. CSV is a little bit more complicated than that.
Paddy3118 said,
July 15, 2008 @ 3:11 am
I’m interested in what comparable timings you got for the Python CSV module and any Haskell equivalent standard library?
Thanks, Paddy.
Evan said,
July 15, 2008 @ 4:09 am
Hey pretty interesting comparison. Keep up the good work.
Greg said,
July 15, 2008 @ 5:50 am
Hi, interesting issue, do you try to profile the haskell code?
Greg said,
July 15, 2008 @ 7:31 am
Author’s response.
Fair point about it not being true CSV parsing. I have some ideas of how I might optimize by unpacking fewer of the ByteStrings, but I have no idea of how I’d make Haskell fast with a true, nasty CSV file.
I haven’t profiled the code because after doing a configure with:
cabal configure –enable-executable-profiling
I receive this error when I build:
Could not find module `Data.ByteString.Char8′:
Perhaps you haven’t installed the profiling libraries for package bytestring-0.9.1.0?
Erhan Hosca said,
July 15, 2008 @ 8:08 am
i don’t think Haskell is designed for parsing CSV files … its not a language with emphasis on I/O …
i’m not sure why you’re using it for that purpose … in the financial space even :)
Reid said,
July 15, 2008 @ 11:28 am
You realize that Python has a csv module as part of the standard library. All you really need is (I think roughly) this:
import csv
for row in csv.reader(’myfile.csv’):
# do stuff with row.
To make it even better, you can go so far as to use a DictReader which lets you use the row like a dictionary with real string names. And then if you wanted to go further, you could wrap that in a class that lets you use regular attribute lookup. But if you want it to go really fast, you name you indices like so:
_ticker = 0
_bid = 1
…
And your code would still be readable.
Greg said,
July 15, 2008 @ 12:32 pm
@author: Ok, as a temporary workaround, you may add timestamps before each “class” of action in order to find what is taking to much time. I think it’s important to find and focus on the code that mainly slow all your program.
Stan Domeshok said,
July 15, 2008 @ 1:16 pm
Real World Haskell wrote a full CSV parser with Parsec, whole thing ended up being about 20 lines of code, you might want to take a look at it.
http://book.realworldhaskell.org/beta/parsec.html
Don Stewart said,
July 19, 2008 @ 8:33 pm
Note there’s a ByteString-based CSV parser on hackage now,
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-csv
which is significantly faster than Text.CSV. That, along with the bytestring-lex package for parsing floating point values from ByteStrings should mean your program is able to be both concise and fast.
Christine Fenton said,
July 25, 2008 @ 12:22 pm
Never, *ever*, **ever** unpack ByteStrings to Strings. That’s the whole point of ByteStrings! If you unpack, you may as well just use `[Char]`. Stick to ByteStrings only and you get [this kind of result](http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all).
Kelila said,
October 28, 2008 @ 1:46 pm
Good words.