Followup: CSV Parsing in Haskell and Python

Tonight I tried improving the Haskell version of my CSV data analysis program. None of the changes made the Haskell program perform as well as the Python program. Still, I thought I’d share the results.

I first tried using the Parsec code discussed in the Real World Haskell book. The authors boil the CSV parsing down to 5 lines, which I incorporated into my program as:

> doReadDataParsec file = do
>     chars <- readFile file
>     let charRows = case parse csvFile "(unknown)" chars of
>                      Right r   -> r
>                      otherwise -> [[]]
>     let sortByTicker = sortBy $ comparing head `mappend` comparing (!!5)
>     let strRows = sortByTicker $ filter (\r -> (length r) > 0) charRows
>     return $ catMaybes $ map makeRowFromStrs strRows
>   where csvFile = endBy line eol
>         line    = sepBy cell (char ',')
>         cell    = many (noneOf ",\n")
>         eol     = char '\n'

I use the standard readFile function here, which reads the entire file in as a String, instead of a ByteString. The String is then parsed using Parsec. Using this function, the program executes in about 10 seconds. That’s up from about 7.5 seconds with my first implementation.

Next I tried to avoid the String type. In this version, I only unpack the ByteString when I need to convert the bid and ask to a Double. I then pack the Double back to a ByteString when the result is written to file. Everything else is left as a ByteString (except for the data types that I create — I’m not sure if that’s a problem). This approach shaves off a little over 3 seconds, executing in about 4.3 seconds.

Here is the complete program:

> import qualified Data.ByteString.Char8   as BS
> import Data.Ord
> import Data.Maybe
> import Data.Monoid
> import List
> import System.Environment

TICKER1,33,35,NULL,2007-09-28 00:00:00,2007-09-28 16:32:00
TICKER2,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
TICKER3,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00

> data RowBytes = RowBytes {tickerB :: BS.ByteString,
>                           bidB :: Double,
>                           askB :: Double,
>                           dayB :: BS.ByteString,
>                           timeB :: BS.ByteString} deriving (Read, Show)
> data Result = Result {tickerR :: BS.ByteString,
>                       openR :: Double,
>                       highR :: Double,
>                       lowR :: Double,
>                       closeR :: Double,
>                       dayR :: BS.ByteString} deriving (Read, Show)

> main :: IO ()
> main = do
>   (csvFile:outputFile:_) <- getArgs
>   rows <- doReadData csvFile
>   let tickerGroups = groupBy (\a b -> (tickerB a) == (tickerB b)) rows
>   let oneTicker rs = map oneDay $ groupBy (\a b -> (dayB a) == (dayB b)) rs
>   let results = concat $ map oneTicker tickerGroups
>   let bytes r = [tickerR r,
>                  BS.pack $ show $ openR r,
>                  BS.pack $ show $ highR r,
>                  BS.pack $ show $ lowR r,
>                  BS.pack $ show $ closeR r,
>                  dayR r]
>   let pprow r = BS.concat $ intersperse (BS.singleton ',') $ bytes r
>   let txt = BS.unlines $ map pprow results
>   BS.writeFile outputFile txt
>   putStrLn "done"

create a Result row that encapsulates the open/high/low/close for
a single ticker on a single day

> oneDay rows =
>   let price r  = ((bidB r) + (askB r)) / 2
>       openRow  = head rows
>       closeRow = last rows
>       highRow  = maximumBy (\a b -> compare (price a) (price b)) rows
>       lowRow   = minimumBy (\a b -> compare (price a) (price b)) rows
>   in (Result (tickerB openRow)
>              (price openRow)
>              (price highRow)
>              (price lowRow)
>              (price closeRow)
>              (dayB openRow))

> maybeRead s = case reads s of
>                 [ (x, "")] -> Just (x::Double)
>                 _          -> Nothing

Convert a list of strings into a [Row]
We need to convert two of the strings to Doubles.

> makeRow r =
>   case (maybeRead (BS.unpack $ r !! 1), maybeRead (BS.unpack $ r !! 2)) of
>     ((Just a), (Just b)) -> Just (RowBytes (r !! 0) a b (r !! 4) (r !! 5))
>     _                    -> Nothing

Read in the CSV file, returning a [Row], sorted by ticker and day

> doReadData file = do
>   bytes <- BS.readFile file
>   let byteRows = map (BS.split ',') $ BS.lines bytes
>   let sortByTicker   = sortBy $ comparing head `mappend` comparing (!!5)
>   let sortedByteRows = sortByTicker $ filter (\r -> (length r) > 0) byteRows
>   return $ catMaybes $ map makeRow sortedByteRows

Finally, I tried adding the -O2 ghc compile flag, but that did not yield a speedup.

I’m not sure what to try next. Thanks to everybody who commented to my last post, both directly on this blog and on the Reddit thread.

Note that in the comments below, I’ve linked to a version of the Haskell code that runs in 1 second.  Thanks to Slarba for posting the solution, which I only modified slightly.

About these ads

14 Comments »

  1. […] note that I’ve posted a followup here. Possibly related posts: (automatically generated)Baby Hat PatternMagnetic Field of a […]

  2. I don't believe you said

    Haskell’s performance couldn’t be as bad as you claim. You must be imagining it, because according to the small group of evangelists who have turned programming.reddit.com into The Haskell Blog, Haskell is super super fast! As fast as C, even–No, faster than C! Because Haskell has monads and lazy evaluation and and…

    I like dynamic languages. I prefer to use Lisp, Ruby, and even Javascript. These languages have their place. But contrary to the BS you read on the blogs of vacuous language advocates, there are many tasks for which they are just not suitable. C, on the other hand, is perfectly suited for something like this (sequential file IO, light text-parsing, large date set). In fact, in the time you wasted tuning that Haskell code, you could have easily written it in C, and it would have been solid, portable, and blazingly fast. I should know, because I have written many such programs in C.

    Hopefully the delusional Haskellphiles will read your posts, realize their language isn’t quite as hot as they think it is, and stop relentlessly pushing it on reddit like it’s the Second Coming, Ron Paul or marijuana legalization.

  3. Jonathan Allen said

    Just out of curiosity I wonder how long a language like C# would take using your setup.

  4. augustss said

    Have you profiled your code? That’s always the most important step to speeding it up, and should be done before anything else.

    If you profile it, I wouldn’t be surprised if it spends a lot of time in string-to-double conversion.

  5. Slarba said

    Here’s an optimized version.

    http://hpaste.org/8959

  6. DiD said

    “I don’t believe you”: If I could give you an award for that post I would!

  7. Slarba said

    I don’t believe you: what do you think, was your post helpful for the blogger?

    First and only rule of optimization (applies to C, Python, Haskell…): profile to see where the time is REALLY wasted. It was really easy to compile the program to produce time/allocation profile and pinpoint ByteString to String conversion being the problem, and fix it. Another 10 minutes for refactoring a bit.

  8. Slarba said

    You can make this even run faster with quicksort:

    http://hpaste.org/8965

    Less than 600 milliseconds on a 2GHz MacBook.

    Again, profiling revealed that the sorting is slow.

  9. Greg said

    Slarba,

    Thank you for posting your solution. Unfortunately, it does not perform well for me. For example, when I run it on the first 16,000 lines of my CSV file (not all 157,000 lines), it takes 7.5 seconds. It was running for longer than 30 seconds on the full file, so I stopped it.

    I do not know how to profile the code. For example, when I run:

    $ ghc –make -O2 -prof Dailyquotes.lhs

    Could not find module `Data.ByteString.Char8′:
    Perhaps you haven’t installed the profiling libraries for package bytestring-0.9.1.0?

    Where do I get the profiling libraries for the bytestring package?

    What sort of input file did you run your program on? Perhaps I’m doing something horribly wrong?

    Greg

  10. Slarba said

    I used quite small amount of distinct tickers… the difference gets smaller when there are more of them.

    You need to build the profiling versions of the libraries you intend to use:

    runhaskell Setup.hs configure -p
    runhaskell Setup.hs build
    install…

  11. Greg said

    Slarba,

    Okay, assume we only have around 10 tickers over a span of 5 years of data. So each day we have 10s of quotes for each of the 10 tickers.

    I installed the bytestring profiling library, and the profile report told me that it was spending 80% of its time in qsort. So I replaced ‘qsort’ with ‘sort’ and the program ran over all 160,000 CSV lines in about 1 second!

    I also added a check for a zero return from strtod, as a hacked check for when the string cannot be parsed. I never have zero bids or asks, so that’s sufficient for me. This change didn’t effect running time.

    So that leaves open the question why qsort performed so horribly for me. (my intuition is that the default Haskell sort is probably the best thing to use).

    I’ve annotated your hpaste code here:

    http://hpaste.org/8965#a2

    Now the Haskell program is faster than Python (although it certainly took plenty of effort, and using strtod feels like cheating).

    Greg

  12. […] Followup: CSV Parsing in Haskell and Python Tonight I tried improving the Haskell version of my CSV data analysis program. None of the changes made the Haskell […] […]

  13. Slarba said

    Well then, the default Haskell mergesort suits your data better :)

  14. GalIvory said

    The party’s over now. I reworked the original with all the `String`s, but got something that was even slower (~33%). (Though maybe the HOFs are part of this?) My last shot is an accumulating parameter and then I am out of ideas. import qualified Data.ByteString.Char8 as B import Control.Arrow import Data.Maybe import Data.Ord import Data.List import System.Environment import System.IO data Row = Row { symbol :: B.ByteString , avg :: Double , day :: B.ByteString , time :: B.ByteString } deriving (Read, Show) main :: IO () main = B.interact $ B.unlines . map recomma . analyze . rowify . B.lines recomma = B.intercalate (B.pack “, “) analyze = map analyze’ . groupBy g . sortBy c where g r0 r1 = symbol r0 == symbol r1 c = comparing symbol analyze’ rows = [symbol’, open, low, high, close, day’] where (day’, symbol’) = (day &&& symbol) . head $ rows (open, close) = clipper . sortBy (comparing time) $ rows (low, high) = clipper . sortBy (comparing avg) $ rows clipper = (avgB *** avgB) . (head &&& last) where avgB = B.pack . show . avg rowify = catMaybes . map processRow processRow bytes = case (maybeNum l’, maybeNum h’) of (Just n’, Just m’) -> Just $ Row symbol ((n’ + m’) / 2) day time _ -> Nothing where symbol:l':h':_:day:time:_ = B.split ‘,’ bytes — match or program dies! maybeNum b = case reads $ B.unpack b of [(n, “”)] -> Just n _ -> Nothing

RSS feed for comments on this post · TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: