Question:

I wanted to try out the Julia language, so I wrote a small script to import a dataset I'm working with. When I run and profile the script, it turns out to be much slower than a similar script in R.

Profiling tells me that all the cat commands perform badly.

The files look like this:

#
#Metadata
#
Identifier1 data_string1
Identifier2 data_string2
Identifier3 data_string3
Identifier4 data_string4
//

I primarily want to get the data_strings and split them up into a matrix of single characters.

Here is a more or less minimal code example:

function loadfile()
  f = open("/file1")
  first=true
  m = Array(Any, 1,0)
  for ln in eachline(f)
    if ln[1] != '#' && ln[1] != '\n' && ln[1] != '/'
      s = split(ln[1:end-1])
      s = split(s[2],"")
      if first
        m = reshape(s,1,length(s))
        first = false
      else
        s = reshape(s,1,length(s))
        println(size(m))
        println(size(s))
        m = vcat(m, s)
      end
    end
  end
end

Any idea why Julia might be slow with the cat command, or how I can do it differently?

Thanks for any suggestions!

Answer:

Using cat like that is slow because it requires a lot of memory allocation. Every call to vcat allocates a whole new array m that is mostly a copy of the old m, so the total amount of copying grows quadratically with the number of lines. Here is how I'd rewrite your code in a more Julian way, where m is only created once, at the end:

function loadfile2()
  f = open("./sotest.txt","r")
  lines = Any[]

  for ln in eachline(f)
    # Skip metadata ('#'), blank lines and the '//' terminator
    if ln[1] == '#' || ln[1] == '\n' || ln[1] == '/'
      continue
    end

    # Drop the trailing newline, then take the second space-separated field
    data_str = split(ln[1:end-1]," ")[2]
    data_chars = split(data_str,"")
    # Can make even faster (2x in my tests) with
    # data_chars = [data_str[i] for i in 1:length(data_str)]
    # But this inherently assumes ASCII data
    push!(lines, data_chars)
  end
  close(f)
  m = hcat(lines...)'  # Stick column vectors together then transpose
end
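
The code above targets an early Julia release (the question's Array(Any, 1, 0) is pre-1.0 syntax). For readers on Julia 1.x, here is a rough sketch of the same collect-then-concatenate idea. The function name load_chars and the sample file contents are made up for illustration, and it assumes every data string has the same length. In Julia 1.x, eachline strips the trailing newline by default, so the ln[1:end-1] step is not needed, and permutedims is used in place of ' because the data is not numeric.

```julia
# Sketch for Julia 1.x; `load_chars` is a hypothetical name, not part of the original answer.
function load_chars(path::AbstractString)
    rows = Vector{Vector{Char}}()
    open(path) do f                       # the do-block closes the file automatically
        for ln in eachline(f)             # newline is already stripped in Julia 1.x
            # Skip blank lines, '#' metadata lines and the '//' terminator
            (isempty(ln) || ln[1] == '#' || ln[1] == '/') && continue
            fields = split(ln)            # split on whitespace
            length(fields) < 2 && continue
            push!(rows, collect(fields[2]))   # Vector{Char}; also fine for non-ASCII text
        end
    end
    isempty(rows) && return Matrix{Char}(undef, 0, 0)
    # hcat makes each data string one column; permutedims turns columns into rows.
    # Assumes all data strings have equal length, otherwise hcat throws.
    return permutedims(reduce(hcat, rows))
end

# Usage on a small temporary file with made-up contents
path, io = mktemp()
write(io, "#\n#Metadata\n#\nId1 ACGT\nId2 TGCA\n//\n")
close(io)
m = load_chars(path)
println(size(m))
```

reduce(hcat, rows) is used rather than hcat(rows...) because splatting thousands of arguments into one call can be slow to compile; both produce the same matrix.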

I made a 10,000 line version of your example data and found the following performance:

Old version:
elapsed time: 3.937826405 seconds (3900659448 bytes allocated, 43.81% gc time)
elapsed time: 3.581752309 seconds (3900645648 bytes allocated, 36.02% gc time)
elapsed time: 3.57753696 seconds (3900645648 bytes allocated, 37.52% gc time)
New version:
elapsed time: 0.010351067 seconds (11568448 bytes allocated)
elapsed time: 0.011136188 seconds (11568448 bytes allocated)
elapsed time: 0.010654002 seconds (11568448 bytes allocated)
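
To see where a gap like this comes from without the original data file, the two strategies can be compared on synthetic rows. This is only a sketch for Julia 1.x: the function names, row count, and row length are arbitrary choices, and the absolute numbers will differ from the timings above.

```julia
# Grow a matrix one row at a time: each vcat copies every row accumulated so far.
function grow_with_vcat(rows)
    m = permutedims(rows[1])              # 1 x n matrix from the first row
    for r in rows[2:end]
        m = vcat(m, permutedims(r))       # allocates a brand-new, larger matrix
    end
    return m
end

# Collect first, concatenate once at the end.
build_once(rows) = permutedims(reduce(hcat, rows))

# Synthetic data: 1000 rows of 50 random lowercase letters (arbitrary sizes)
rows = [rand('a':'z', 50) for _ in 1:1000]

# Run both once on a tiny input so compilation is not included in the measurements
grow_with_vcat(rows[1:2]); build_once(rows[1:2])

t_vcat = @elapsed grow_with_vcat(rows)
t_once = @elapsed build_once(rows)
b_vcat = @allocated grow_with_vcat(rows)
b_once = @allocated build_once(rows)

println("repeated vcat: ", t_vcat, " s, ", b_vcat, " bytes allocated")
println("build once:    ", t_once, " s, ", b_once, " bytes allocated")
```

Doubling the number of rows should roughly quadruple the bytes allocated by the vcat version, while the build-once version grows roughly linearly.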