Problem description:

Idea. Read several files line by line, concatenate them, and process the combined list of lines from all files.

Implementation. It can be written like this:

import qualified Data.ByteString.Char8 as B

readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

...

main = do
    files <- getArgs
    allLines <- readFiles files

Problem. This runs unbearably slowly. Notably, the real and user times are far higher than the system time (measured with the UNIX time command), so I suppose the program is spending too much time in IO.

I have not managed to find a simple and effective way to solve this problem in Haskell.

For instance, processing two files (30,000 lines and 1.2 MB each) takes

 20.98 real 18.52 user 0.25 sys

This is the output when running +RTS -s:

     157,972,000 bytes allocated in the heap
       6,153,848 bytes copied during GC
       5,716,824 bytes maximum residency (4 sample(s))
       1,740,768 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       295 colls,     0 par    0.01s    0.01s     0.0000s    0.0006s
  Gen  1         4 colls,     0 par    0.00s    0.00s     0.0010s    0.0019s

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time   16.09s  ( 16.38s elapsed)
  GC      time    0.01s  (  0.02s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   16.11s  ( 16.41s elapsed)

  %GC     time       0.1%  (0.1% elapsed)

  Alloc rate    9,815,312 bytes per MUT second

  Productivity  99.9% of total user, 98.1% of total elapsed

       16.41 real        16.10 user         0.12 sys

Why is concatenating files using the code above so slow?

How should I write the readFiles function in Haskell to make it faster?

Answer:

You should show us exactly what your processing steps are.

The following program runs quickly even on multiple input files of the kind you are using (1.2 MB, 30k lines each):

import Data.List (foldl')
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- Read every file strictly and join the contents into one ByteString.
readFiles :: [FilePath] -> IO B.ByteString
readFiles = fmap B.concat . mapM B.readFile

main :: IO ()
main = do
    files <- getArgs
    allLines <- readFiles files
    -- Count the whitespace-separated words with a strict left fold.
    print $ foldl' (\s _ -> s+1) 0 (B.words allLines)
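
The program above counts words, while the question talks about lines. As a minimal sketch of a line-based variant (the names splitEach and readLines are hypothetical, and what the real processing needs is an assumption), each file can be split with B.lines before the lists are joined. This avoids one subtle effect of B.concat: when a file does not end in a newline, its last line is glued to the first line of the next file.

```haskell
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper: split each chunk into lines on its own, then join
-- the per-chunk lists, so a chunk without a trailing newline does not
-- merge its last line with the first line of the next chunk.
splitEach :: [B.ByteString] -> [B.ByteString]
splitEach = concatMap B.lines

-- Hypothetical helper: read every file strictly and return the lines of
-- all files, in argument order.
readLines :: [FilePath] -> IO [B.ByteString]
readLines = fmap splitEach . mapM B.readFile

main :: IO ()
main = do
    files <- getArgs
    allLines <- readLines files
    print (length allLines)
```

Whether this or the single concatenated ByteString is preferable depends on the processing step, which the question does not show.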

Here is how I created the input file:

import Control.Monad

main = do
  forM_ [1..30000] $ \i -> do
    putStrLn $ unwords ["line", show i, "this is a test of the emergency"]

Run times:

time ./program input               -- 27 milliseconds
time ./program input input         -- 49 milliseconds
time ./program input input input   -- 69 milliseconds
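
To check where the time goes in the original program, the read step and the processing step can be timed separately. The following is only a sketch under assumptions: timed is a hypothetical helper, and counting lines stands in for the real processing, which the question does not show.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- Hypothetical helper: run an action and return its result together with
-- the CPU time it consumed, in seconds (getCPUTime reports picoseconds).
timed :: IO a -> IO (a, Double)
timed action = do
    start <- getCPUTime
    result <- action
    end <- getCPUTime
    return (result, fromIntegral (end - start) / 1e12)

main :: IO ()
main = do
    files <- getArgs
    -- evaluate forces the strict ByteString, so reading and concatenation
    -- are both charged to this step.
    (contents, tRead) <- timed (mapM B.readFile files >>= evaluate . B.concat)
    -- Placeholder processing: count the lines, forced with evaluate.
    (n, tProc) <- timed (evaluate (length (B.lines contents)))
    putStrLn ("read:    " ++ show tRead ++ " s")
    putStrLn ("process: " ++ show tProc ++ " s (" ++ show n ++ " lines)")
```

Note that getCPUTime measures CPU time, not wall-clock time, so time spent blocked on disk is not included. If the read step turns out to be cheap, as the run times above suggest, the slowdown is more likely in the processing code.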