Lessons Learned: Parsing CSV in Haskell and Python

One of my best friends read my CSV post and said “So, you wrote a program that was a little slow, and then you wrote another one that was faster. That’s great.” Hard to disagree.

Over the past couple days, thousands of people have read my last two posts, and tens of people have made comments and offered advice. Thanks to Slarba, I now have a Haskell program that runs in about 1 second (which happens to be about 0.7 seconds faster than my Python program, although I haven’t tried to optimize the Python program at all).

Here’s what I learned in this odyssey:

  1. If you’re reading in a large text file, use Haskell’s ByteString instead of String. (thanks to Don Stewart, aka dons, for writing that module).
  2. To clean up your code that uses sorts, make your custom data type an instance of Ord instead of using the sortBy function.
  3. I had never used Haskell’s foreign function interface before. It’s amazing how simple it is to pull in a C function. It seems like you just need to abide by a few idioms, such as the NOINLINE pragma and the unsafePerformIO function.
  4. Clearly this is all a lot of effort and optimization simply to read in some basic CSV lines in Haskell. Hopefully the ASCII string parsing libraries will improve in time.
  5. The common optimization compiler flag ‘-O2′ is not the default. Several people suggested adding that flag, although I never noticed a speedup from it for this program.
  6. The Parsec module is very neat and clever, but its performance is hampered by Haskell’s String type.

For completeness, here’s the final version of the code.

> {-# OPTIONS -fffi -funbox-strict-fields #-}
> import qualified Data.ByteString.Char8 as BS
> import Data.ByteString.Unsafe (unsafeUseAsCString)
> import System
> import Data.List
> import Maybe
> import CForeign
> import Foreign.Ptr
> import System.IO.Unsafe

> data Row = Row {
>       ticker :: !BS.ByteString,
>       bid    :: !Double,
>       ask    :: !Double,
>       day    :: !BS.ByteString,
>       time   :: !BS.ByteString
>       }
> instance Eq Row where
>     a == b  = ticker a == ticker b && time a == time b
> instance Ord Row where
>     compare a b = case compare (ticker a) (ticker b) of
>                     EQ -> compare (time a) (time b)
>                     e  -> e

since ByteString library doesn't seem to have strtod-like function, we can
peek into the ByteString using useAsCString and calling C strtod on it.

> foreign import ccall "stdlib.h strtod" cstrtod :: CString->CString->IO CDouble
> {-# NOINLINE strtod #-}
> strtod :: BS.ByteString -> Double
> strtod bs = realToFrac . unsafePerformIO .
>             unsafeUseAsCString bs $ \cs -> cstrtod cs nullPtr

> data Result = Result {
>       tickerR :: !BS.ByteString,
>       openR   :: !Double,
>       highR   :: !Double,
>       lowR    :: !Double,
>       closeR  :: !Double,
>       dayR    :: !BS.ByteString
>       }
> instance Show Result where
>     show (Result ticker open high low close day) =
>         intercalate "," [BS.unpack ticker, show open, show high,
>                          show low, show close, BS.unpack day] ++ "\n"

> main :: IO ()
> main = do
>   (csvFile:outputFile:_) <- getArgs
>   rows <- doReadData csvFile
>   writeFile outputFile . concatMap show . computeResults $ groupToDays rows

> computeResults :: [[Row]] -> [Result]
> computeResults = map makeResult
>  where makeResult d = let openRow  = head d
>                           closeRow = last d
>                           highRow  = maximumBy comparePrice d
>                           lowRow   = minimumBy comparePrice d
>                       in Result (ticker openRow) (price openRow) (price highRow)
>                              (price lowRow) (price closeRow) (day openRow)
>        price r = (bid r + ask r) / 2
>        comparePrice a b = compare (price a) (price b)

> groupToDays :: [Row] -> [[Row]]
> groupToDays = concatMap (groupBy tickerDay) . groupBy tickerName
>  where tickerName a b = ticker a == ticker b
>        tickerDay a b  = day a == day b

> doReadData :: FilePath -> IO [Row]
> doReadData file = do
>     contents <- BS.readFile file
>     return . sort . mapMaybe parseRow . BS.lines $ contents

> parseRow :: BS.ByteString -> Maybe Row
> parseRow row = case BS.split ',' row of
>                  [ticker_,bid_,ask_,_,day_,time_] ->
>                      let (bidD, askD) = (strtod bid_, strtod ask_)
>                      in if (bidD == 0 || askD == 0)
>                         then Nothing
>                         else Just $ Row ticker_ bidD askD day_ time_
>                  _ -> Nothing

Thanks again to all the people who helped out. It’s humbling to think about other hackers spending their time commenting and optimizing my code.

11 Comments »

  1. Don Stewart said

    Note you’re still using `useAsCString` which copies the entire input data to C on every call. Try replacing `useAsCString` with `unsafeUseAsCString`, which avoids the copy and parses the Double directly from the input buffer. That’s probably the main inefficiency remaining.

    Similarly, you’re using strict fields, but without the `-funbox-strict-fields` flag, which may enable some of the atomic types to be stored in registers in tight loops.

  2. augustss said

    I don’t think you need that NOINLINE.

  3. Greg said

    Don:

    Thanks. I made those changes, and although they didn’t effect the execution time, I think they’re logical.

    Augustss:

    I wondered that myself because the documentation for NOINLINE says something like, “you probably don’t need this.” But then the documentation for unsafePerformIO suggests using that pragma.

  4. atp said

    Regarding the FFI and unsafePerformIO: by default the FFI wraps everything in the IO monad, because foreign functions are not guaranteed to be side-effect free. Using unsafePerformIO in this context is only a good idea if you know the function you’re calling doesn’t have any side effects. This doesn’t just mean producing I/O. In general, any function that is not reentrant is not pure and could conceivably cause problems. Unfortunately, some C library functions use internal static buffers as a relic of older, simpler times. It’s important to be aware of this when you use unsafePerformIO. Although, if your application is not and will never be threaded, perhaps this isn’t an issue.

  5. augustss said

    The documentation suggest NOINLINE because there’s a theoretical chance the call could get duplicated and the IO would happen twice. In this case that wouldn’t matter for correctness since strtod really is a pure function; it’s just the mechanics of calling it that is impure.

  6. Greg said

    atp: Thanks for clarifying that. This program is single threaded, but I might write concurrent programs in the future.

    Can you suggest an alternate approach? What can I use instead of unsafePerformIO?

    augustss: Got it. Thanks.

  7. atp said

    Greg,

    Reading my comment now it seems as though I was implying that in this particular function unsafePerformIO might be inappropriate: it’s not. You are completely correct to use it here, as strtod is a pure function. I was commenting in a more general way.

    Haskell’s type system separates pure functions from impure ones for practical reasons: you can make assumptions about pure functions that you can’t make about impure ones, allowing optimizations and such. I guess all I really wanted to say was that unsafePerformIO discards the information that marks the function as impure, and that’s why it’s called “unsafe”. This doesn’t mean you shouldn’t use it, just that you should be aware that the function you’re calling really does need to be pure, in the sense that its behavior can be completely characterized by its arguments (it doesn’t access or depend on values stored in global variables or internal static buffers) and that it doesn’t produce I/O side-effects or similar.

    I guess my point was more to add a word of caution — when using the FFI, make sure that you only use unsafePerformIO in situations when the function truly is pure. If it’s not pure, just don’t use unsafePerformIO — leave the value encapsulated in the IO monad, and Haskell’s type system will take care of making sure that everything happens as it should.

  8. ADEpt said

    Reading the whole series of posts, I cant help but wonder – did you do a profiling to find out what is actually the biggest time consumer in your code?

  9. Greg said

    ADEpt:

    Not initially, because I was getting an error about not having a debug library for bytestring. But once I installed that library, I profiled it, and most of the time was being spent going ByteString -> String -> Double.

    Greg

  10. Don Stewart said

    Efficient bytestring-based Double parsing was added this weekend.

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-lexing

    I’d love to know if that solves your performance issues. (I was able to parse a 50M file of Doubles in around 5seconds, so a 160k CSV parser should be easy).

  11. I suggest getting familiar with beautiful functions “on” (from Data.Function) and (&&&) (Control.Arrow instance for functions) then you can turn this: instance Eq Row where a == b = ticker a == ticker b && time a == time b instance Ord Row where compare a b = case compare (ticker a) (ticker b) of EQ -> compare (time a) (time b) e -> e comparePrice a b = compare (price a) (price b) to this: instance Eq Row where (==) = (==) `on` ticker &&& time instance Ord Row where compare = compare `on` ticker &&& time comparePrice = compare `on` price and the most interesting lesson is if you have some slow code, learn basis of haskell and join reddit, then you’ll have phds like dons working their asses of, to tune&pimp your code:> (and it’s really great to be part of haskell community, any other programming language community will write like 10 versions of code to help you?)

RSS feed for comments on this post · TrackBack URI

Leave a Comment