<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Followup: CSV Parsing in Haskell and Python</title>
	<atom:link href="http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/feed/" rel="self" type="application/rss+xml" />
	<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/</link>
	<description>the notebook of a computer scientist living in midtown manhattan</description>
	<lastBuildDate>Sun, 11 Dec 2011 20:21:06 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: GalIvory</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-76</link>
		<dc:creator><![CDATA[GalIvory]]></dc:creator>
		<pubDate>Thu, 24 Jul 2008 21:45:45 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-76</guid>
		<description><![CDATA[The party&#039;s over now. I reworked the original with all the `String`s, but got something that was even slower (~33%). (Though maybe the HOFs are part of this?) My last shot is an accumulating parameter and then I am out of ideas.

import qualified Data.ByteString.Char8 as B
import Control.Arrow
import Data.Maybe
import Data.Ord
import Data.List
import System.Environment
import System.IO

data Row = Row { symbol :: B.ByteString
               , avg    :: Double
               , day    :: B.ByteString
               , time   :: B.ByteString
               } deriving (Read, Show)

main :: IO ()
main = B.interact $ B.unlines . map recomma . analyze . rowify . B.lines

recomma = B.intercalate (B.pack &quot;, &quot;)

analyze = map analyze&#039; . groupBy g . sortBy c
  where
    g r0 r1 = symbol r0 == symbol r1
    c = comparing symbol

analyze&#039; rows = [symbol&#039;, open, low, high, close, day&#039;]
  where
    (day&#039;, symbol&#039;) = (day &amp;&amp;&amp; symbol) . head $ rows
    (open, close) = clipper . sortBy (comparing time) $ rows
    (low, high) = clipper . sortBy (comparing avg) $ rows
    clipper = (avgB *** avgB) . (head &amp;&amp;&amp; last)
      where avgB = B.pack . show . avg

rowify = catMaybes . map processRow

processRow bytes = case (maybeNum l&#039;, maybeNum h&#039;) of
    (Just n&#039;, Just m&#039;) -&gt; Just $ Row symbol ((n&#039; + m&#039;) / 2) day time
    _ -&gt; Nothing
  where
    symbol:l&#039;:h&#039;:_:day:time:_ = B.split &#039;,&#039; bytes -- match or program dies!
    maybeNum b = case reads $ B.unpack b of
      [(n, &quot;&quot;)] -&gt; Just n
      _ -&gt; Nothing]]></description>
		<content:encoded><![CDATA[<p>The party&#8217;s over now. I reworked the original with all the `String`s, but got something that was even slower (~33%). (Though maybe the HOFs are part of this?) My last shot is an accumulating parameter and then I am out of ideas.</p>
<pre>import qualified Data.ByteString.Char8 as B
import Control.Arrow
import Data.Maybe
import Data.Ord
import Data.List
import System.Environment
import System.IO

data Row = Row { symbol :: B.ByteString
               , avg    :: Double
               , day    :: B.ByteString
               , time   :: B.ByteString
               } deriving (Read, Show)

main :: IO ()
main = B.interact $ B.unlines . map recomma . analyze . rowify . B.lines

recomma = B.intercalate (B.pack &quot;, &quot;)

analyze = map analyze&#039; . groupBy g . sortBy c
  where
    g r0 r1 = symbol r0 == symbol r1
    c = comparing symbol

analyze&#039; rows = [symbol&#039;, open, low, high, close, day&#039;]
  where
    (day&#039;, symbol&#039;) = (day &amp;&amp;&amp; symbol) . head $ rows
    (open, close) = clipper . sortBy (comparing time) $ rows
    (low, high) = clipper . sortBy (comparing avg) $ rows
    clipper = (avgB *** avgB) . (head &amp;&amp;&amp; last)
      where avgB = B.pack . show . avg

rowify = catMaybes . map processRow

processRow bytes = case (maybeNum l&#039;, maybeNum h&#039;) of
    (Just n&#039;, Just m&#039;) -&gt; Just $ Row symbol ((n&#039; + m&#039;) / 2) day time
    _ -&gt; Nothing
  where
    symbol:l&#039;:h&#039;:_:day:time:_ = B.split &#039;,&#039; bytes -- match or program dies!
    maybeNum b = case reads $ B.unpack b of
      [(n, &quot;&quot;)] -&gt; Just n
      _ -&gt; Nothing</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: Slarba</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-62</link>
		<dc:creator><![CDATA[Slarba]]></dc:creator>
		<pubDate>Thu, 17 Jul 2008 07:34:51 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-62</guid>
		<description><![CDATA[Well then, the default Haskell mergesort suits your data better :)]]></description>
		<content:encoded><![CDATA[<p>Well then, the default Haskell mergesort suits your data better :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-59</link>
		<dc:creator><![CDATA[Greg]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 18:41:18 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-59</guid>
		<description><![CDATA[Slarba,

Okay, assume we only have around 10 tickers over a span of 5 years of data.  So each day we have 10s of quotes for each of the 10 tickers.

I installed the bytestring profiling library, and the profile report told me that it was spending 80% of its time in qsort.  So I replaced &#039;qsort&#039; with &#039;sort&#039; and the program ran over all 160,000 CSV lines in about 1 second!

I also added a check for a zero return from strtod, as a hacky way to detect strings that cannot be parsed.  I never have zero bids or asks, so that&#039;s sufficient for me.  This change didn&#039;t affect running time.
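
For anyone following along, the zero check could look roughly like the sketch below. This is not the hpaste code: parsePrice and c_strtod are names made up for illustration, and it assumes strtod is called through the FFI with a null end pointer.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString.Char8 as B
import Foreign.C.String (CString)
import Foreign.C.Types (CDouble)
import Foreign.Ptr (Ptr, nullPtr)
import System.IO.Unsafe (unsafePerformIO)

-- Binding to C strtod; the second argument (end pointer) may be null.
foreign import ccall unsafe "stdlib.h strtod"
  c_strtod :: CString -> Ptr CString -> IO CDouble

-- Hypothetical helper: treat a zero result as "could not parse".
-- This is only safe when a genuine zero never occurs in the data.
parsePrice :: B.ByteString -> Maybe Double
parsePrice bs
  | x == 0    = Nothing
  | otherwise = Just x
  where
    -- useAsCString copies the bytes and appends the terminating NUL.
    x = realToFrac (unsafePerformIO (B.useAsCString bs (\s -> c_strtod s nullPtr)))
```

One caveat with this shortcut: strtod stops at the first character it cannot parse, so a field with trailing junk after a valid number is still accepted.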

That leaves open the question of why qsort performed so horribly for me.  (My intuition is that the default Haskell sort is probably the best thing to use.)
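
One possible explanation, offered as a guess rather than a diagnosis: if the pasted qsort is the usual first-element-pivot list quicksort, input with only about 10 distinct symbols is close to its worst case, because every row equal to the pivot lands in the same partition. The sketch below is a hypothetical stand-in (naiveQsort and qsortDepth are made-up names, not the hpaste code) that measures how deep the recursion goes.

```haskell
import Data.List (partition, sort)

-- Hypothetical stand-in for a list quicksort that uses the head as pivot.
naiveQsort :: Ord a => [a] -> [a]
naiveQsort []     = []
naiveQsort (p:xs) = naiveQsort lt ++ [p] ++ naiveQsort ge
  where
    -- Elements equal to the pivot all go to the "ge" side.
    (ge, lt) = partition (>= p) xs

-- Recursion depth of naiveQsort on a given input.
-- A depth close to the input length means quadratic work.
qsortDepth :: Ord a => [a] -> Int
qsortDepth []     = 0
qsortDepth (p:xs) = 1 + max (qsortDepth lt) (qsortDepth ge)
  where
    (ge, lt) = partition (>= p) xs
```

On 1000 identical keys qsortDepth returns 1000, so the work grows quadratically with the number of rows per symbol, while the merge sort behind Data.List.sort stays at O(n log n).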

I&#039;ve annotated your hpaste code here:

http://hpaste.org/8965#a2

Now the Haskell program is faster than Python (although it certainly took plenty of effort, and using strtod feels like cheating).  

Greg]]></description>
		<content:encoded><![CDATA[<p>Slarba,</p>
<p>Okay, assume we only have around 10 tickers over a span of 5 years of data.  So each day we have 10s of quotes for each of the 10 tickers.</p>
<p>I installed the bytestring profiling library, and the profile report told me that it was spending 80% of its time in qsort.  So I replaced &#8216;qsort&#8217; with &#8216;sort&#8217; and the program ran over all 160,000 CSV lines in about 1 second!</p>
<p>I also added a check for a zero return from strtod, as a hacky way to detect strings that cannot be parsed.  I never have zero bids or asks, so that&#8217;s sufficient for me.  This change didn&#8217;t affect running time.</p>
<p>That leaves open the question of why qsort performed so horribly for me.  (My intuition is that the default Haskell sort is probably the best thing to use.)</p>
<p>I&#8217;ve annotated your hpaste code here:</p>
<p><a href="http://hpaste.org/8965#a2" rel="nofollow">http://hpaste.org/8965#a2</a></p>
<p>Now the Haskell program is faster than Python (although it certainly took plenty of effort, and using strtod feels like cheating).  </p>
<p>Greg</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Slarba</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-58</link>
		<dc:creator><![CDATA[Slarba]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 17:51:22 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-58</guid>
		<description><![CDATA[I used quite a small number of distinct tickers... the difference gets smaller when there are more of them.

You need to build the profiling versions of the libraries you intend to use:

runhaskell Setup.hs configure -p
runhaskell Setup.hs build
install...]]></description>
		<content:encoded><![CDATA[<p>I used quite a small number of distinct tickers&#8230; the difference gets smaller when there are more of them.</p>
<p>You need to build the profiling versions of the libraries you intend to use:</p>
<p>runhaskell Setup.hs configure -p<br />
runhaskell Setup.hs build<br />
install&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-57</link>
		<dc:creator><![CDATA[Greg]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 17:23:58 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-57</guid>
		<description><![CDATA[Slarba,

Thank you for posting your solution.  Unfortunately, it does not perform well for me.  For example, when I run it on the first 16,000 lines of my CSV file (not all 157,000 lines), it takes 7.5 seconds.  It was running for longer than 30 seconds on the full file, so I stopped it.

I do not know how to profile the code.  For example, when I run:

$ ghc --make -O2 -prof Dailyquotes.lhs

Could not find module `Data.ByteString.Char8&#039;:
      Perhaps you haven&#039;t installed the profiling libraries for package bytestring-0.9.1.0?


Where do I get the profiling libraries for the bytestring package?

What sort of input file did you run your program on?  Perhaps I&#039;m doing something horribly wrong?

Greg]]></description>
		<content:encoded><![CDATA[<p>Slarba,</p>
<p>Thank you for posting your solution.  Unfortunately, it does not perform well for me.  For example, when I run it on the first 16,000 lines of my CSV file (not all 157,000 lines), it takes 7.5 seconds.  It was running for longer than 30 seconds on the full file, so I stopped it.</p>
<p>I do not know how to profile the code.  For example, when I run:</p>
<p>$ ghc --make -O2 -prof Dailyquotes.lhs</p>
<p>Could not find module `Data.ByteString.Char8&#039;:<br />
      Perhaps you haven&#8217;t installed the profiling libraries for package bytestring-0.9.1.0?</p>
<p>Where do I get the profiling libraries for the bytestring package?</p>
<p>What sort of input file did you run your program on?  Perhaps I&#8217;m doing something horribly wrong?</p>
<p>Greg</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Slarba</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-56</link>
		<dc:creator><![CDATA[Slarba]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 16:15:14 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-56</guid>
		<description><![CDATA[You can make this run even faster with quicksort:

http://hpaste.org/8965

Less than 600 milliseconds on a 2GHz MacBook.

Again, profiling revealed that the sorting is slow.]]></description>
		<content:encoded><![CDATA[<p>You can make this run even faster with quicksort:</p>
<p><a href="http://hpaste.org/8965" rel="nofollow">http://hpaste.org/8965</a></p>
<p>Less than 600 milliseconds on a 2GHz MacBook.</p>
<p>Again, profiling revealed that the sorting is slow.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Slarba</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-55</link>
		<dc:creator><![CDATA[Slarba]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 15:21:45 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-55</guid>
		<description><![CDATA[I don&#039;t believe you: what do you think, was your post helpful for the blogger?

First and only rule of optimization (applies to C, Python, Haskell...): profile to see where the time is REALLY wasted. It was really easy to compile the program to produce a time/allocation profile, pinpoint the ByteString-to-String conversion as the problem, and fix it. Refactoring a bit took another 10 minutes.]]></description>
		<content:encoded><![CDATA[<p>I don&#8217;t believe you: what do you think, was your post helpful for the blogger?</p>
<p>First and only rule of optimization (applies to C, Python, Haskell&#8230;): profile to see where the time is REALLY wasted. It was really easy to compile the program to produce a time/allocation profile, pinpoint the ByteString-to-String conversion as the problem, and fix it. Refactoring a bit took another 10 minutes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: DiD</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-54</link>
		<dc:creator><![CDATA[DiD]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 12:19:03 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-54</guid>
		<description><![CDATA[&quot;I don&#039;t believe you&quot;: If I could give you an award for that post I would!]]></description>
		<content:encoded><![CDATA[<p>&#8220;I don&#8217;t believe you&#8221;: If I could give you an award for that post I would!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Slarba</title>
		<link>http://techguyinmidtown.com/2008/07/15/followup-csv-parsing-in-haskell-and-python/#comment-53</link>
		<dc:creator><![CDATA[Slarba]]></dc:creator>
		<pubDate>Wed, 16 Jul 2008 10:11:52 +0000</pubDate>
		<guid isPermaLink="false">http://techguyinmidtown.wordpress.com/?p=35#comment-53</guid>
		<description><![CDATA[Here&#039;s an optimized version.

http://hpaste.org/8959]]></description>
		<content:encoded><![CDATA[<p>Here&#8217;s an optimized version.</p>
<p><a href="http://hpaste.org/8959" rel="nofollow">http://hpaste.org/8959</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>

