Archive for July, 2008

Lessons Learned: Parsing CSV in Haskell and Python

One of my best friends read my CSV post and said “So, you wrote a program that was a little slow, and then you wrote another one that was faster. That’s great.” Hard to disagree.

Over the past couple days, thousands of people have read my last two posts, and tens of people have made comments and offered advice. Thanks to Slarba, I now have a Haskell program that runs in about 1 second (which happens to be about 0.7 seconds faster than my Python program, although I haven’t tried to optimize the Python program at all).

Here’s what I learned in this odyssey:

  1. If you’re reading in a large text file, use Haskell’s ByteString instead of String. (thanks to Don Stewart, aka dons, for writing that module).
  2. To clean up your code that uses sorts, make your custom data type an instance of Ord instead of using the sortBy function.
  3. I had never used Haskell’s foreign function interface before. It’s amazing how simple it is to pull in a C function. It seems like you just need to abide by a few idioms, such as the NOINLINE pragma and the unsafePerformIO function.
  4. Clearly this is all a lot of effort and optimization simply to read in some basic CSV lines in Haskell. Hopefully the ASCII string parsing libraries will improve in time.
  5. The common optimization compiler flag ‘-O2′ is not the default. Several people suggested adding that flag, although I never noticed a speedup from it for this program.
  6. The Parsec module is very neat and clever, but its performance is hampered by Haskell’s String type.

For completeness, here’s the final version of the code.

> {-# OPTIONS -fffi -funbox-strict-fields #-}
> import qualified Data.ByteString.Char8 as BS
> import Data.ByteString.Unsafe (unsafeUseAsCString)
> import System
> import Data.List
> import Maybe
> import CForeign
> import Foreign.Ptr
> import System.IO.Unsafe

> data Row = Row {
>       ticker :: !BS.ByteString,
>       bid    :: !Double,
>       ask    :: !Double,
>       day    :: !BS.ByteString,
>       time   :: !BS.ByteString
>       }
> instance Eq Row where
>     a == b  = ticker a == ticker b && time a == time b
> instance Ord Row where
>     compare a b = case compare (ticker a) (ticker b) of
>                     EQ -> compare (time a) (time b)
>                     e  -> e

since ByteString library doesn't seem to have strtod-like function, we can
peek into the ByteString using useAsCString and calling C strtod on it.

> foreign import ccall "stdlib.h strtod" cstrtod :: CString->CString->IO CDouble
> {-# NOINLINE strtod #-}
> strtod :: BS.ByteString -> Double
> strtod bs = realToFrac . unsafePerformIO .
>             unsafeUseAsCString bs $ \cs -> cstrtod cs nullPtr

> data Result = Result {
>       tickerR :: !BS.ByteString,
>       openR   :: !Double,
>       highR   :: !Double,
>       lowR    :: !Double,
>       closeR  :: !Double,
>       dayR    :: !BS.ByteString
>       }
> instance Show Result where
>     show (Result ticker open high low close day) =
>         intercalate "," [BS.unpack ticker, show open, show high,
>                          show low, show close, BS.unpack day] ++ "\n"

> main :: IO ()
> main = do
>   (csvFile:outputFile:_) <- getArgs
>   rows <- doReadData csvFile
>   writeFile outputFile . concatMap show . computeResults $ groupToDays rows

> computeResults :: [[Row]] -> [Result]
> computeResults = map makeResult
>  where makeResult d = let openRow  = head d
>                           closeRow = last d
>                           highRow  = maximumBy comparePrice d
>                           lowRow   = minimumBy comparePrice d
>                       in Result (ticker openRow) (price openRow) (price highRow)
>                              (price lowRow) (price closeRow) (day openRow)
>        price r = (bid r + ask r) / 2
>        comparePrice a b = compare (price a) (price b)

> groupToDays :: [Row] -> [[Row]]
> groupToDays = concatMap (groupBy tickerDay) . groupBy tickerName
>  where tickerName a b = ticker a == ticker b
>        tickerDay a b  = day a == day b

> doReadData :: FilePath -> IO [Row]
> doReadData file = do
>     contents <- BS.readFile file
>     return . sort . mapMaybe parseRow . BS.lines $ contents

> parseRow :: BS.ByteString -> Maybe Row
> parseRow row = case BS.split ',' row of
>                  [ticker_,bid_,ask_,_,day_,time_] ->
>                      let (bidD, askD) = (strtod bid_, strtod ask_)
>                      in if (bidD == 0 || askD == 0)
>                         then Nothing
>                         else Just $ Row ticker_ bidD askD day_ time_
>                  _ -> Nothing

Thanks again to all the people who helped out. It’s humbling to think about other hackers spending their time commenting and optimizing my code.

Comments (11)

Followup: CSV Parsing in Haskell and Python

Tonight I tried improving the Haskell version of my CSV data analysis program. None of the changes made the Haskell program perform as well as the Python program. Still, I thought I’d share the results.

I first tried using the Parsec code discussed in the Real World Haskell book. The authors boil the CSV parsing down to 5 lines, which I incorporated into my program as:

> doReadDataParsec file = do
>     chars <- readFile file
>     let charRows = case parse csvFile "(unknown)" chars of
>                      Right r   -> r
>                      otherwise -> [[]]
>     let sortByTicker = sortBy $ comparing head `mappend` comparing (!!5)
>     let strRows = sortByTicker $ filter (\r -> (length r) > 0) charRows
>     return $ catMaybes $ map makeRowFromStrs strRows
>   where csvFile = endBy line eol
>         line    = sepBy cell (char ',')
>         cell    = many (noneOf ",\n")
>         eol     = char '\n'

I use the standard readFile function here, which reads the entire file in as a String, instead of a ByteString. The String is then parsed using Parsec. Using this function, the program executes in about 10 seconds. That’s up from about 7.5 seconds with my first implementation.

Next I tried to avoid the String type. In this version, I only unpack the ByteString when I need to convert the bid and ask to a Double. I then pack the Double back to a ByteString when the result is written to file. Everything else is left as a ByteString (except for the data types that I create — I’m not sure if that’s a problem). This approach shaves off a little over 3 seconds, executing in about 4.3 seconds.

Here is the complete program:

> import qualified Data.ByteString.Char8   as BS
> import Data.Ord
> import Data.Maybe
> import Data.Monoid
> import List
> import System.Environment

TICKER1,33,35,NULL,2007-09-28 00:00:00,2007-09-28 16:32:00
TICKER2,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
TICKER3,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00

> data RowBytes = RowBytes {tickerB :: BS.ByteString,
>                           bidB :: Double,
>                           askB :: Double,
>                           dayB :: BS.ByteString,
>                           timeB :: BS.ByteString} deriving (Read, Show)
> data Result = Result {tickerR :: BS.ByteString,
>                       openR :: Double,
>                       highR :: Double,
>                       lowR :: Double,
>                       closeR :: Double,
>                       dayR :: BS.ByteString} deriving (Read, Show)

> main :: IO ()
> main = do
>   (csvFile:outputFile:_) <- getArgs
>   rows <- doReadData csvFile
>   let tickerGroups = groupBy (\a b -> (tickerB a) == (tickerB b)) rows
>   let oneTicker rs = map oneDay $ groupBy (\a b -> (dayB a) == (dayB b)) rs
>   let results = concat $ map oneTicker tickerGroups
>   let bytes r = [tickerR r,
>                  BS.pack $ show $ openR r,
>                  BS.pack $ show $ highR r,
>                  BS.pack $ show $ lowR r,
>                  BS.pack $ show $ closeR r,
>                  dayR r]
>   let pprow r = BS.concat $ intersperse (BS.singleton ',') $ bytes r
>   let txt = BS.unlines $ map pprow results
>   BS.writeFile outputFile txt
>   putStrLn "done"

create a Result row that encapsulates the open/high/low/close for
a single ticker on a single day

> oneDay rows =
>   let price r  = ((bidB r) + (askB r)) / 2
>       openRow  = head rows
>       closeRow = last rows
>       highRow  = maximumBy (\a b -> compare (price a) (price b)) rows
>       lowRow   = minimumBy (\a b -> compare (price a) (price b)) rows
>   in (Result (tickerB openRow)
>              (price openRow)
>              (price highRow)
>              (price lowRow)
>              (price closeRow)
>              (dayB openRow))

> maybeRead s = case reads s of
>                 [ (x, "")] -> Just (x::Double)
>                 _          -> Nothing

Convert a list of strings into a [Row]
We need to convert two of the strings to Doubles.

> makeRow r =
>   case (maybeRead (BS.unpack $ r !! 1), maybeRead (BS.unpack $ r !! 2)) of
>     ((Just a), (Just b)) -> Just (RowBytes (r !! 0) a b (r !! 4) (r !! 5))
>     _                    -> Nothing

Read in the CSV file, returning a [Row], sorted by ticker and day

> doReadData file = do
>   bytes <- BS.readFile file
>   let byteRows = map (BS.split ',') $ BS.lines bytes
>   let sortByTicker   = sortBy $ comparing head `mappend` comparing (!!5)
>   let sortedByteRows = sortByTicker $ filter (\r -> (length r) > 0) byteRows
>   return $ catMaybes $ map makeRow sortedByteRows

Finally, I tried adding the -O2 ghc compile flag, but that did not yield a speedup.

I’m not sure what to try next. Thanks to everybody who commented to my last post, both directly on this blog and on the Reddit thread.

Note that in the comments below, I’ve linked to a version of the Haskell code that runs in 1 second.  Thanks to Slarba for posting the solution, which I only modified slightly.

Comments (14)

CSV Parsing: Haskell versus Python

I recently wrote a fairly simple program to parse a CSV file containing some daily intraday prices for a number of tickers. The CSV file had roughly 160,000 lines, with each line containing a ticker, bid, ask, and timestamp. The goal is to produce a daily open, high, low, and close for each ticker.

First I wrote the program in Haskell using the Text.CSV library. It performed horrendously (I didn’t even bother waiting for the program to finish executing as I heard the CPU fan spin up to max). After chatting with ‘dons’ and others on #haskell, I rewrote the program using the Data.ByteString library. Unfortunately, it took about 7.5 seconds to process the input and produce a result.

Curious, I wrote an equivalent program in python. The python program finishes in about 1.7 seconds on my MacBook. That’s 1.7 seconds for python and 7.5 seconds for Haskell — not so good.

None of the superstars on #haskell had any immediate suggestions for what’s making Haskell perform poorly, but most suggestions pointed to my unpacking of the ByteStrings to Strings. I’ll investigate further if I have the time, and post the findings here.

Here is my Haskell code (http://hpaste.org/8928):

> import qualified Data.ByteString.Char8   as BS
> import Data.Maybe
> import Data.List
> import List
> import System.Environment
> import System.IO

TICKER1,33,35,NULL,2007-09-28 00:00:00,2007-09-28 16:32:00
TICKER2,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00
TICKER3,29.5,31,NULL,2007-03-07 00:00:00,2007-03-07 07:33:00

> data Row = Row {ticker :: String,
>                 bid :: Double,
>                 ask :: Double,
>                 day :: String,
>                 time :: String} deriving (Read, Show)

> data Result = Result {tickerR :: String,
>                       openR :: Double,
>                       highR :: Double,
>                       lowR :: Double,
>                       closeR :: Double,
>                       dayR :: String} deriving (Read, Show)

> main :: IO ()
> main = do
>   (csvFile:outputFile:_) <- getArgs
>   rows <- doReadData csvFile
>   let tickerGroups = groupBy (\a b -> (ticker a) == (ticker b)) rows
>   let oneTicker rs = map oneDay $ groupBy (\a b -> (day a) == (day b)) rs
>   let results = concat $ map oneTicker tickerGroups
>   let pprow r = concat $ intersperse "," $ [tickerR r,
>                                             show $ openR r,
>                                             show $ highR r,
>                                             show $ lowR r,
>                                             show $ closeR r,
>                                             dayR r]
>   let txt = unlines $ map pprow results
>   writeFile outputFile txt
>   putStrLn "done"

create a Result row that encapsulates the open/high/low/close for
a single ticker on a single day

> oneDay rows =
>   let price r  = ((bid r) + (ask r)) / 2
>       openRow  = head rows
>       closeRow = last rows
>       highRow  = maximumBy (\a b -> compare (price a) (price b)) rows
>       lowRow   = minimumBy (\a b -> compare (price a) (price b)) rows
>   in (Result (ticker openRow)
>              (price openRow)
>              (price highRow)
>              (price lowRow)
>              (price closeRow)
>              (day openRow))

> maybeRead s = case reads s of
>                 [ (x, "")] -> Just (x::Double)
>                 _          -> Nothing

Convert a list of strings into a [Row]
We need to convert two of the strings to Doubles.

> makeRow r =
>   case (maybeRead (r !! 1), maybeRead (r !! 2)) of
>     ((Just a), (Just b)) -> Just (Row (r !! 0) a b (r !! 4) (r !! 5))
>     _                    -> Nothing

Read in the CSV file, returning a [Row], sorted by ticker and day

> doReadData file = do
>   bytes <- BS.readFile file
>   let byteRows = map (BS.split ',') $ BS.lines bytes
>   let sortByTicker   = sortBy (\a b -> case compare (head a) (head b) of
>                                          EQ -> compare (a !! 5) (b !! 5)
>                                          o  -> o)
>   let sortedByteRows = sortByTicker $ filter (\r -> (length r) > 0) byteRows
>   let strRows = map (\r -> map BS.unpack r) sortedByteRows
>   return $ catMaybes $ map makeRow strRows

And here is the python code:

import itertools as IT

def rowCompare(x,y):
    """comparison used for sorting our rows.
    Sort by ticker and then by time."""
    if cmp(x[0], y[0]) == 0:
        return cmp(x[5], y[5])
    else:
        return cmp(x[0], y[0])

def price(r):
    """Take the average of the bid and ask prices."""
    return (float(r[1]) + float(r[2])) / 2.0

# read in the data from a CSV file
f = open('IG_US.csv', 'r')
lines = sorted([line.strip().split(',') for line in f], rowCompare)

# create daily open/high/low/close for each ticker
results = []
for ticker, tickerIter in IT.groupby(lines, lambda x: x[0]):
    for day, dayIter in IT.groupby(tickerIter, lambda x: x[4]):
        dayRows = [d for d in dayIter if d[1] != "NULL" and d[2] != "NULL"]
        if not dayRows:
            continue
        highRow = dayRows[0]
        lowRow = dayRows[0]
        for row in dayRows:
            highRow = row if price(row) > highRow else highRow
            lowRow = row if price(row) < lowRow else lowRow
        results.append([ticker,
                        "%0.2f" % price(dayRows[0]),
                        "%0.2f" % price(highRow),
                        "%0.2f" % price(lowRow),
                        "%0.2f" % price(dayRows[-1]),
                        day])

# write out the result to a file
fw = open("./IG_US_python_output.csv", 'w')
fw.write("\n".join([",".join([str(c) for c in l]) for l in results]))

Please note that I’ve posted a followup here.

Comments (17)

My Answers to the Microsoft Interview Questions

I thought it would be fun to write up solutions to the Microsoft interview questions that Andrew Wagner (chessguy) posted to the Haskell Cafe mailing list. I read about his post in the Haskel Sequence weekly news of June 25, 2008. You can read his post here.

> module Questions
> where

> import Data.Array
> import Data.List
> import Random

1.) A “rotated array” is an array of integers in ascending order, after which for every element i, it has been moved to element (i + n) mod sizeOfList. Write a function that takes a rotated array and, in less-than-linear time, returns n (the amount of rotation).

First we’ll write a simple function to rotate a given array (xs) by a given amount (n).

> rotate n xs = b ++ a
>     where n'     = n `mod` (length xs)
>           (a, b) = splitAt ((length xs) - n') xs

Now we turn to the function that computes how much the array was rotated. The key to the solution is realizing that you need to find an inflection point where consecutive numbers in the rotated list decrease. To do this, we find the first, largest sub-sequence that monotonically increases. We can find that subsequence using binary search, testing the endpoints of subsequences. The running time is O(log(n)), where n is the length of the input list (ignoring the O(n) time to construct our array).

> rotateAmount xs = _ra 0 ((length xs) - 1) (listArray (0, ((length xs) - 1)) xs)
>     where _ra s e ys = if (e - s) == 1
>                        then (if ((ys ! s) &lt (ys ! e)) then s else e)  -- base case
>                        else let h  = ys ! s                  -- first item
>                                 l  = ys ! e                  -- last item
>                                 mi = s + ((e - s) `div` 2)   -- middle index
>                                 m  = ys ! mi                 -- middle item
>                             in if (h &lt l)
>                                then s                        -- return start index
>                                else if (h &gt m)
>                                     then _ra s  mi ys
>                                     else _ra mi e  ys

2.) You are given a list of Ball objects. Each Ball is either Red or Blue. Write a function that partitions these balls so that all of the balls of each color are contiguous. Return the index of the first ball of the second color (your result can be Red balls, then Blue balls, or the other way around). In haskell, you’ll probably want to return a ([Ball],Int).

This one is dead-simple in Haskell.

> data Ball = Red | Blue deriving (Eq, Ord, Show, Read)

> groupBalls bs = let gs = (group . sort) bs
>                 in (concat gs, length $ head gs)

3.) Live Search is a search engine. Suppose it was to be tied into an online store. Now you’re given two lists. One is a [(SessionId, NormalizedQuery)]. That is, when a particular user performs a query, it is turned into some consistent format, based on their apparent intent, and stored in this logfile. The second list is a [(SessionId, ProductId)]. This indicates the product bought by a particular user. Now, you want to take these two (potentially very long) lists, and return some structure that will make it easy to take a query and return a list of the most popular resulting purchases. That is, of people who have run this query in the past, what are the most common products they’ve wound up buying? The interviewer said that this is an instance of a well-known problem, but I didn’t catch the name of it.

I’m skipping this for now because its more of a rambling thought experiment. I’d love to chat about it, but I’m not going to immediately dive into writing some code.

4.) You’re given an array which contains the numbers from 1 to n, in random order, except there is one missing. Write a function to return the missing number.

Obviously sorting and walking through would work, but we can do better because the sum of the numbers 1 through n is (n * (n+1) / 2). So we can get it in one pass. This takes O(n) time. Neat.

> missingInt xs = let n = length xs
>                 in ((n + 1) * (n + 2) `div` 2) - (sum xs)


5.) Write a function to reconstruct a binary tree from its preorder traversal and inorder traversal. Take into account that the traversals could be invalid.

First define a tree data structure.

> data Tree a = Tree a (Tree a) (Tree a) | Leaf a | Empty
>               deriving (Eq, Ord, Show, Read)

For the pre-order list we need to split it up by what elements are less than or greater than the head of the list (thinking recursively here). So in the initial list, the first element is the root, and then the subsequent items that are less than the first element are in the left branch, and the other items are in the right branch. I don’t catch invalid traversals, although they would show up if ‘rest2′ in the below function was not empty.

> preorderTree xs = f xs
>     where f []          = Empty
>           f (h:[])      = Leaf h
>           f (h:tl)      = let (left,  rest1) = span (&lt h) tl
>                               (right, rest2) = span (&gt h) rest1
>                           in Tree h (f left) (f right)

> po1 = [4, 2, 1, 3, 5, 6]   -- test

Now, for the in-order traversal, every input will simply be an in-order list of numbers. So we create a balanced binary tree using that traversal by recursively splitting the list in half.

> inorderTree xs = f xs
>     where f [] = Empty
>           f (h:[]) = Leaf h
>           f (h1:h2:[]) = Tree h2 (Leaf h1) Empty
>           f ys = let (left, right) = splitAt ((length ys) `div` 2) ys
>                  in Tree (head right) (f left) (f $ tail right)

> pi1 = [1..6]

6.) You have a [(WeatherStationId, Latitude, Longitude)]. Similar to #3, write a function which will, off-line, turn this into a data structure from which you can easily determine the nearest Weather Station, given an arbitrary Latitude and Longitude.

I’m fairly stumped by this one. I can think of good ways to to solve this problem conceptualy, but I can’t think of a simple data structure to return. I’d like to return a data structure of polygons, where each polygon encapsulates what points are closest to each weather station. I don’t feel like coding that up though.

7.) Write a function for scoring a mastermind guess. That is, in a game of mastermind (http://en.wikipedia.org/wiki/Mastermind_(board_game)), given a guess, and the actual answer, determine the number correct, the number wrong,and the number out of place.

For this solution, we zip up the answer and the guess so we can examine the elements pair-wise. We then consider each pair. If they are equal, we can increment a correct counter. Otherwise, we must accumulate a list of incorrect values. As we build up the incorrect values, we check if each guess and answer has already been accumulated in the list of incorrect values. When we find it in the accumulated list, we know we’ve found a guess that was in the wrong position.


> data ColoredPeg = W | G | B | R | P | O deriving (Eq, Ord, Show, Read)

> scoreMM as gs = let (crct, wrg, _) = foldl f (0, [], []) (zip as gs)
>                     wrongPosition = (length as) - crct - (length wrg)
>                 in (crct, (length wrg), wrongPosition)
>     where f (c, oA, oG)
>                 (a, g) = case g == a of  -- update an accumulator after examining each pair
>                          True -> (c+1, oA, oG) -- match, so simply update the count of correct
>                          False -> case ((a `elem` oG), (g `elem` oA)) of
>                               (True,True)->(c, (rem g oA), (rem a oG))
>                               (False,True)->(c, a:(rem g oA), (rem a oG))
>                               (True,False)->(c, (rem g oA), g:(rem a oG))
>                               (False,False)->(c, a:(rem g oA), g:(rem a oG))
>           rem a [] = []
>           rem a (b:bs) = if (a == b) then bs else b:(rem a bs)

> testAnswer1 = [W, G, B, R]
> testGuess1 = [W, G, R, B]
> testGuess2 = [W, G, R, P]

8.) Implement a trie (http://en.wikipedia.org/wiki/Trie) data structure. Write a function add, which takes a word, and a trie, and adds the word to the trie. Write another function lookup, which takes a prefix and a trie, and returns all the words in the trie that start with that prefix.

This is a fairly straightforward problem. We create the Trie data structure and define the recursive functions. I wish that we didn’t have a special data value for the Trie root, but I couldn’t think of an alternate solution besides having a special null character as the root.


> data Trie = TrieRoot { tries :: [Trie] }
>     | Trie { char :: Char, tries :: [Trie] }
>             deriving (Eq, Ord, Show, Read)

> trieAdd t [] = t
> trieAdd (TrieRoot ts) ws = TrieRoot{tries = (_tahelp ts ws)}
> trieAdd (Trie c ts) ws   = Trie{char = c, tries = (_tahelp ts ws)}
> _tahelp ts (w:ws) = case find (\t -> (char t) == w) ts of
>                     Nothing -> let newTrie = trieAdd Trie{char=w,
>                                                           tries=[]} ws
>                                in newTrie:ts
>                     Just mtch -> let otherTries = filter (\t -> (char t) /= w) ts
>                                      newTrie    = trieAdd mtch ws
>                                  in newTrie:otherTries

> trieLookup (TrieRoot ts) ws = concatMap (\t -> trieLookup t ws) ts
> trieLookup (Trie c ts) [] = case ts of
>                             []        -> [[c]]
>                             otherwise -> recurse
>     where recurse = map (\str -> c:str) $ concatMap (\t -> trieLookup t []) ts
> trieLookup (Trie c ts) (w:ws) = if (c == w)
>                                 then case ts of -- we matched the character
>                                      []        -> [[c]]
>                                      otherwise -> recurse
>                                 else [] -- stop the search down this trie branch
>     where recurse = map (\str -> c:str) $ concatMap (\t -> trieLookup t ws) ts

> testTrie = trieAdd (trieAdd (trieAdd (TrieRoot []) "hi") "hot") "gr"
> testTrie2 = trieAdd (TrieRoot []) "h"

9.) Write an algorithm to shuffle a deck of cards. Now write a function to perform some kind of evaluation of “how shuffled” a deck of cards is.

To shuffle I swap element i with a random element at position [i, (n-1)].

> shuffle gen xs = let n = length xs
>                      a = listArray (0, (n - 1)) xs                 -- create an Array
>                      swapPairs = (zip [0..(n - 1)] (swaps gen n))  -- compute a sequence of swaps to perform
>                  in elems $ foldl f a swapPairs
>     where f a' (i1, i2) = a' // [(i1, a'!i2), (i2, a'!i1)]         -- swap element at i1 with element at i2

Here’s a function which generates a sequence of swaps. We swap element at index ‘i’ into the range [i, (n-1)].

> swaps gen n = unfoldr f (gen, 0)
>     where f (g, i) = if (i &lt n)
>                      then let (randInt, g') = next g
>                               swapIndex = (randInt `mod` (n - i)) + i  -- produce a number in [i, (n-1)]
>                           in Just (swapIndex, (g', (i+1)))         -- add 'swapIndex' to the generated list
>                      else Nothing                                  -- stop generating elements

Here are functions that generates a sequence of random numbers or random generators. Can be used to call shuffle multiple times.

> randomGems = map mkStdGen randomNums
> randomNums = unfoldr (Just . next) (mkStdGen 1)

One way we might measure how good the suffle is (although for a single shuffle, the question is a bit preposterous), is to compute the min, max, and average distance that each item moved in the list.


> shuffleQuality xs ys = let xsPos = sortBySnd $ zip [1..(length xs)] xs
>                            ysPos = sortBySnd $ zip [1..(length ys)] ys
>                            diffFun = (\ (xp, yp) -> abs ((fst xp) - (fst yp)))
>                            posDiffs = map diffFun (zip xsPos ysPos)
>                            maxDiff = maximum posDiffs
>                            minDiff = minimum posDiffs
>                            avgDiff = (sum posDiffs) `div` (length posDiffs)
>                        in (maxDiff, minDiff, avgDiff)
>     where sortBySnd ps = sortBy (\ p1 p2 -> compare (snd p1) (snd p2)) ps

Comments (2)