Matthias Pospiech

Reputation: 3488

matlab: speed up for loop with string analysis

I have a very large CSV file containing three columns. I want to load these columns into a MATLAB matrix as fast as possible.

This is what I currently do:

    fid = fopen(inputfile, 'rt');
    g = textscan(fid,'%s','delimiter','\r\n');
    tdata = g{1};
    fclose(fid);
    
    results = zeros([numel(tdata)-4], 3);
    tic
    display('start reading data...');
    for r = 4:numel(tdata)
        if ~mod(r, 100) 
            display(['data row: ' num2str(r) ' / ' num2str(numel(tdata))]);
        end
        entries = strsplit(tdata{r}, ',');
        results(r-3,1) = str2double(strrep(entries{1},',', '.'));
        results(r-3,2) = str2double(strrep(entries{2},',', '.'));
        results(r-3,3) = str2double(strrep(entries{3},',', '.'));
    end

However, this takes ~30 seconds for 200 000 lines, i.e. 150 µs per line, which is really slow. The code is also not accepted by parfor.

Now I would like to know what causes the bottleneck in the for loop and how I can speed it up.

Here are the measured times:

str2double 578253 calls 29.631s

strsplit 192750 calls 13.388s

EDIT: The content of the file has this structure:

      0.000000,  -0.00271,   5394147
      0.000667,  -0.00271,   5394148
      0.001333,  -0.00271,   5394149
      0.002000,  -0.00271,   5394150

Upvotes: 0

Views: 98

Answers (2)

Gelliant

Reputation: 1845

I think a lot can be improved by calling textscan differently.

You do this:

    g = textscan(fid,'%s','delimiter','\r\n');

and then take tdata = g{1}, so every line comes back as an unparsed string.

If textscan is called with a numeric format, it already splits the data and returns it as numbers.

Try this:

    g = textscan(fid, '%f,%f,%f', 'delimiter', '\r\n')

It should return a 1-by-3 cell array whose cells hold the three columns of values. To convert it to a matrix you can use:

    g = cell2mat(g)

I imported 200k lines in 0.12 seconds.

Your code also seems to contain some other workarounds. You start at r=4, so apparently there are 3 lines that you don't want to read. After fopen you can call

    fgetl(fid);

three times to get to the interesting part of your file.
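As an alternative to calling fgetl three times, textscan has a 'HeaderLines' option. The following is only a sketch under assumptions: it writes a small temporary demo file (the header text and the file name are made up, the data rows are the ones from the question) and uses ',' as the delimiter with a '%f%f%f' format instead of the literal commas shown above.

```matlab
% Build a small demo file: 3 header lines (invented content) + 4 data rows.
demofile = [tempname '.csv'];
fid = fopen(demofile, 'wt');
fprintf(fid, 'header line %d\n', 1:3);
fprintf(fid, '%s\n', '  0.000000,  -0.00271,   5394147', ...
                     '  0.000667,  -0.00271,   5394148', ...
                     '  0.001333,  -0.00271,   5394149', ...
                     '  0.002000,  -0.00271,   5394150');
fclose(fid);

% Read it back: skip the 3 header lines, parse three doubles per row.
fid = fopen(demofile, 'rt');
g = textscan(fid, '%f%f%f', 'Delimiter', ',', 'HeaderLines', 3);
fclose(fid);

% g is a 1-by-3 cell array of column vectors; join them into an N-by-3 matrix.
results = cell2mat(g);

delete(demofile);   % remove the temporary demo file
```

For the real data, only the second half is needed, with inputfile in place of demofile.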

You also first split the line with ',' as the separator, but then replace all ',' by '.'. That will not do anything: all the ',' are already gone since they were used as separators.
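A quick way to see that the strrep calls have no effect is to run them on one of the sample lines from the question. This is just an illustrative check; the variable names are made up here.

```matlab
% One data line from the question.
line = '  0.000667,  -0.00271,   5394148';

% Splitting on ',' removes every comma from the pieces.
entries = strsplit(line, ',');

% Replacing ',' by '.' therefore leaves the pieces unchanged.
replaced = strrep(entries, ',', '.');
unchanged = isequal(replaced, entries);   % true: strrep did nothing

% str2double can be applied to the whole cell array at once.
values = str2double(entries);
```

So the three strrep calls in the loop can simply be dropped, although the larger gain comes from avoiding the loop altogether.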

Upvotes: 1

Wolfie

Reputation: 30045

If you used csvread you wouldn't need str2double or strsplit, which you say are the slow lines. It's likely much quicker for a CSV file.

You would be able to replace all the above code with:

    results = csvread(inputfile);
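If the first three lines of the file are headers, as the loop starting at r = 4 in the question suggests (an assumption), csvread can skip them through its zero-based row and column offsets. A minimal sketch on a temporary demo file with invented header text:

```matlab
% Build a small demo file: 3 header lines (invented content) + 4 data rows.
demofile = [tempname '.csv'];
fid = fopen(demofile, 'wt');
fprintf(fid, 'header line %d\n', 1:3);
fprintf(fid, '%s\n', '  0.000000,  -0.00271,   5394147', ...
                     '  0.000667,  -0.00271,   5394148', ...
                     '  0.001333,  -0.00271,   5394149', ...
                     '  0.002000,  -0.00271,   5394150');
fclose(fid);

% Offsets are zero-based: start reading at row 3 (the 4th line), column 0.
results = csvread(demofile, 3, 0);

% In R2019a and later, readmatrix is the recommended replacement:
% results = readmatrix(demofile, 'NumHeaderLines', 3);

delete(demofile);   % remove the temporary demo file
```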

Upvotes: 1
