tamkrit

Reputation: 27

How to split a large text file on every blank line into smaller text files using MATLAB?

I have a large text file that looks like this:

PMID- 123456123
OWN - NLM
DA  - 20160930

PMID- 27689094
OWN - NLM
VI  - 2016
DP  - 2016

PMID- 27688828
OWN - NLM
STAT- Publisher
DA  - 20160930
LR  - 20160930

and so on. I would like to split the text file into smaller text files at every blank line, and name each smaller file after the PMID number it contains, so the result looks like this:

filename '123456123.txt' contains:

PMID- 123456123
OWN - NLM
DA  - 20160930

filename '27689094.txt' contains:

PMID- 27689094
OWN - NLM
VI  - 2016
DP  - 2016

filename '27688828.txt' contains:

PMID- 27688828
OWN - NLM
STAT- Publisher
DA  - 20160930
LR  - 20160930

This is my attempt. I think I know how to identify blank lines, but I don't know how to split the text and save each part as a smaller text file:

fid = fopen(filename);
text = fgets(fid);
blankline = sprintf('\r\n');

while ischar(text)
    if strcmp(blankline, text)
        %split the text
    else
        %write the text to the smaller file
    end
    text = fgets(fid);  % read the next line; fgets returns -1 at end of file
end
fclose(fid);

Upvotes: 0

Views: 1041

Answers (1)

Suever

Reputation: 65460

You can read in the entire file and use regexp to split the contents at empty lines. A second regexp call extracts the PMID of each piece, and a loop then saves every piece to its own file. Processing the file as one large string like this is likely to be faster than reading it line by line with fgets, provided the file fits in memory.

% Tell it what folder you want to put the files in
outdir = '/my/folder';

% Read the initial file in all at once
fid = fopen(filename, 'r');
data = fread(fid, '*char').';
fclose(fid);

% Break it into pieces based upon empty lines
pieces = regexp(data, '\n\s*\n', 'split');

% For each piece get the PMID
pmids = regexp(pieces, '(?<=PMID-\s*)\d*', 'match', 'once');

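% Now loop through and save each one
for k = 1:numel(pieces)
    % Skip pieces without a PMID (e.g. an empty piece after trailing blank lines)
    if isempty(pmids{k})
        continue
    end

    % Use the PMID of this piece to construct the output filename
    % (a separate variable, so the input filename is not overwritten)
    outfile = fullfile(outdir, [pmids{k}, '.txt']);

    % Now write the piece to the file
    fid = fopen(outfile, 'w');
    fwrite(fid, pieces{k});
    fclose(fid);
end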
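
If the file is too large to hold in memory, a line-by-line version in the spirit of the loop attempted in the question is also possible. The following is only a sketch under some assumptions: the function name splitByBlankLines and its signature are hypothetical, every record is assumed to start with its PMID line (as in the sample data), and outdir is assumed to exist already. It uses fgetl rather than fgets so that the line terminator is stripped before the blank-line check.

```matlab
function n = splitByBlankLines(infile, outdir)
% Hypothetical helper (name and signature are illustrative only).
% Writes each blank-line-separated record of INFILE to OUTDIR/<PMID>.txt
% and returns the number of files written.
    fin = fopen(infile, 'r');
    assert(fin ~= -1, 'Could not open %s', infile);
    closeIn = onCleanup(@() fclose(fin));   % close the input even on error

    n = 0;
    fout = -1;                    % -1 means no output file is open
    line = fgetl(fin);            % fgetl drops the trailing newline
    while ischar(line)            % fgetl returns -1 at end of file
        if isempty(strtrim(line))
            % Blank line: finish the current record, if there is one
            if fout ~= -1
                fclose(fout);
                fout = -1;
            end
        else
            if fout == -1
                % First line of a new record: assumed to be the PMID line
                pmid = regexp(line, '^PMID-\s*(\d+)', 'tokens', 'once');
                assert(~isempty(pmid), 'Record does not start with a PMID line');
                fout = fopen(fullfile(outdir, [pmid{1}, '.txt']), 'w');
                assert(fout ~= -1, 'Could not create output file in %s', outdir);
                n = n + 1;
            end
            fprintf(fout, '%s\n', line);
        end
        line = fgetl(fin);
    end
    if fout ~= -1
        fclose(fout);             % last record has no trailing blank line
    end
end
```

It could be called as n = splitByBlankLines(filename, '/my/folder'); the return value is the number of records written. Only one line is held in memory at a time, at the cost of many small read and write calls.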

Upvotes: 2
