Tonight I tried improving the Haskell version of my CSV data analysis program. None of the changes made the Haskell program perform as well as the Python program. Still, I thought I’d share the results.
I first tried using the Parsec code discussed in the Real World Haskell book. The authors boil the CSV parsing down to 5 lines, which I incorporated into my program as:
> doReadDataParsec file = do > chars <- readFile file > let charRows = case parse csvFile "(unknown)" chars of > Right r -> r > otherwise -> [[]] > let sortByTicker = sortBy $ comparing head `mappend` comparing (!!5) > let strRows = sortByTicker $ filter (\r -> (length r) > 0) charRows > return $ catMaybes $ map makeRowFromStrs strRows > where csvFile = endBy line eol > line = sepBy cell (char ',') > cell = many (noneOf ",\n") > eol = char '\n'
I use the standard readFile function here, which reads the entire file in as a String, instead of a ByteString. The String is then parsed using Parsec. Using this function, the program executes in about 10 seconds. That’s up from about 7.5 seconds with my first implementation.
Next I tried to avoid the String type. In this version, I only unpack the ByteString when I need to convert the bid and ask to a Double. I then pack the Double back to a ByteString when the result is written to file. Everything else is left as a ByteString (except for the data types that I create — I’m not sure if that’s a problem). This approach shaves off a little over 3 seconds, executing in about 4.3 seconds.
Here is the complete program:
> import qualified Data.ByteString.Char8 as BS
> import Data.Ord
> import Data.Maybe
> import Data.Monoid
> import List
> import System.Environment
TICKER1,33,35,NULL,2007-09-28 00:00:00,2007-09-28 16:32:00
TICKER2,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
TICKER3,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
> data RowBytes = RowBytes {tickerB :: BS.ByteString,
> bidB :: Double,
> askB :: Double,
> dayB :: BS.ByteString,
> timeB :: BS.ByteString} deriving (Read, Show)
> data Result = Result {tickerR :: BS.ByteString,
> openR :: Double,
> highR :: Double,
> lowR :: Double,
> closeR :: Double,
> dayR :: BS.ByteString} deriving (Read, Show)
> main :: IO ()
> main = do
> (csvFile:outputFile:_) <- getArgs
> rows <- doReadData csvFile
> let tickerGroups = groupBy (\a b -> (tickerB a) == (tickerB b)) rows
> let oneTicker rs = map oneDay $ groupBy (\a b -> (dayB a) == (dayB b)) rs
> let results = concat $ map oneTicker tickerGroups
> let bytes r = [tickerR r,
> BS.pack $ show $ openR r,
> BS.pack $ show $ highR r,
> BS.pack $ show $ lowR r,
> BS.pack $ show $ closeR r,
> dayR r]
> let pprow r = BS.concat $ intersperse (BS.singleton ',') $ bytes r
> let txt = BS.unlines $ map pprow results
> BS.writeFile outputFile txt
> putStrLn "done"
create a Result row that encapsulates the open/high/low/close for
a single ticker on a single day
> oneDay rows =
> let price r = ((bidB r) + (askB r)) / 2
> openRow = head rows
> closeRow = last rows
> highRow = maximumBy (\a b -> compare (price a) (price b)) rows
> lowRow = minimumBy (\a b -> compare (price a) (price b)) rows
> in (Result (tickerB openRow)
> (price openRow)
> (price highRow)
> (price lowRow)
> (price closeRow)
> (dayB openRow))
> maybeRead s = case reads s of
> [ (x, "")] -> Just (x::Double)
> _ -> Nothing
Convert a list of strings into a [Row]
We need to convert two of the strings to Doubles.
> makeRow r =
> case (maybeRead (BS.unpack $ r !! 1), maybeRead (BS.unpack $ r !! 2)) of
> ((Just a), (Just b)) -> Just (RowBytes (r !! 0) a b (r !! 4) (r !! 5))
> _ -> Nothing
Read in the CSV file, returning a [Row], sorted by ticker and day
> doReadData file = do
> bytes <- BS.readFile file
> let byteRows = map (BS.split ',') $ BS.lines bytes
> let sortByTicker = sortBy $ comparing head `mappend` comparing (!!5)
> let sortedByteRows = sortByTicker $ filter (\r -> (length r) > 0) byteRows
> return $ catMaybes $ map makeRow sortedByteRows
Finally, I tried adding the -O2 ghc compile flag, but that did not yield a speedup.
I’m not sure what to try next. Thanks to everybody who commented to my last post, both directly on this blog and on the Reddit thread.
Note that in the comments below, I’ve linked to a version of the Haskell code that runs in 1 second. Thanks to Slarba for posting the solution, which I only modified slightly.
CSV Parsing: Haskell versus Python « tech guy in midtown said
[...] note that I’ve posted a followup here. Possibly related posts: (automatically generated)Baby Hat PatternMagnetic Field of a [...]
I don't believe you said
Haskell’s performance couldn’t be as bad as you claim. You must be imagining it, because according to the small group of evangelists who have turned programming.reddit.com into The Haskell Blog, Haskell is super super fast! As fast as C, even–No, faster than C! Because Haskell has monads and lazy evaluation and and…
I like dynamic languages. I prefer to use Lisp, Ruby, and even Javascript. These languages have their place. But contrary to the BS you read on the blogs of vacuous language advocates, there are many tasks for which they are just not suitable. C, on the other hand, is perfectly suited for something like this (sequential file IO, light text-parsing, large date set). In fact, in the time you wasted tuning that Haskell code, you could have easily written it in C, and it would have been solid, portable, and blazingly fast. I should know, because I have written many such programs in C.
Hopefully the delusional Haskellphiles will read your posts, realize their language isn’t quite as hot as they think it is, and stop relentlessly pushing it on reddit like it’s the Second Coming, Ron Paul or marijuana legalization.
Jonathan Allen said
Just out of curiosity I wonder how long a language like C# would take using your setup.
augustss said
Have you profiled your code? That’s always the most important step to speeding it up, and should be done before anything else.
If you profile it, I wouldn’t be surprised if it spends a lot of time in string-to-double conversion.
Slarba said
Here’s an optimized version.
http://hpaste.org/8959
DiD said
“I don’t believe you”: If I could give you an award for that post I would!
Slarba said
I don’t believe you: what do you think, was your post helpful for the blogger?
First and only rule of optimization (applies to C, Python, Haskell…): profile to see where the time is REALLY wasted. It was really easy to compile the program to produce time/allocation profile and pinpoint ByteString to String conversion being the problem, and fix it. Another 10 minutes for refactoring a bit.
Slarba said
You can make this even run faster with quicksort:
http://hpaste.org/8965
Less than 600 milliseconds on a 2GHz MacBook.
Again, profiling revealed that the sorting is slow.
Greg said
Slarba,
Thank you for posting your solution. Unfortunately, it does not perform well for me. For example, when I run it on the first 16,000 lines of my CSV file (not all 157,000 lines), it takes 7.5 seconds. It was running for longer than 30 seconds on the full file, so I stopped it.
I do not know how to profile the code. For example, when I run:
$ ghc –make -O2 -prof Dailyquotes.lhs
Could not find module `Data.ByteString.Char8′:
Perhaps you haven’t installed the profiling libraries for package bytestring-0.9.1.0?
Where do I get the profiling libraries for the bytestring package?
What sort of input file did you run your program on? Perhaps I’m doing something horribly wrong?
Greg
Slarba said
I used quite small amount of distinct tickers… the difference gets smaller when there are more of them.
You need to build the profiling versions of the libraries you intend to use:
runhaskell Setup.hs configure -p
runhaskell Setup.hs build
install…
Greg said
Slarba,
Okay, assume we only have around 10 tickers over a span of 5 years of data. So each day we have 10s of quotes for each of the 10 tickers.
I installed the bytestring profiling library, and the profile report told me that it was spending 80% of its time in qsort. So I replaced ‘qsort’ with ‘sort’ and the program ran over all 160,000 CSV lines in about 1 second!
I also added a check for a zero return from strtod, as a hacked check for when the string cannot be parsed. I never have zero bids or asks, so that’s sufficient for me. This change didn’t effect running time.
So that leaves open the question why qsort performed so horribly for me. (my intuition is that the default Haskell sort is probably the best thing to use).
I’ve annotated your hpaste code here:
http://hpaste.org/8965#a2
Now the Haskell program is faster than Python (although it certainly took plenty of effort, and using strtod feels like cheating).
Greg
Top Posts « WordPress.com said
[...] Followup: CSV Parsing in Haskell and Python Tonight I tried improving the Haskell version of my CSV data analysis program. None of the changes made the Haskell [...] [...]
Slarba said
Well then, the default Haskell mergesort suits your data better :)
GalIvory said
The party’s over now. I reworked the original with all the `String`s, but got something that was even slower (~33%). (Though maybe the HOFs are part of this?) My last shot is an accumulating parameter and then I am out of ideas. import qualified Data.ByteString.Char8 as B import Control.Arrow import Data.Maybe import Data.Ord import Data.List import System.Environment import System.IO data Row = Row { symbol :: B.ByteString , avg :: Double , day :: B.ByteString , time :: B.ByteString } deriving (Read, Show) main :: IO () main = B.interact $ B.unlines . map recomma . analyze . rowify . B.lines recomma = B.intercalate (B.pack “, “) analyze = map analyze’ . groupBy g . sortBy c where g r0 r1 = symbol r0 == symbol r1 c = comparing symbol analyze’ rows = [symbol', open, low, high, close, day'] where (day’, symbol’) = (day &&& symbol) . head $ rows (open, close) = clipper . sortBy (comparing time) $ rows (low, high) = clipper . sortBy (comparing avg) $ rows clipper = (avgB *** avgB) . (head &&& last) where avgB = B.pack . show . avg rowify = catMaybes . map processRow processRow bytes = case (maybeNum l’, maybeNum h’) of (Just n’, Just m’) -> Just $ Row symbol ((n’ + m’) / 2) day time _ -> Nothing where symbol:l’:h’:_:day:time:_ = B.split ‘,’ bytes — match or program dies! maybeNum b = case reads $ B.unpack b of [(n, "")] -> Just n _ -> Nothing