sameer oak

Reputation: 53

awk or sed command to split large text file matching a regex into smaller files each containing n records

I want to split a text file into multiple files based on a matched regex. This is straightforward using awk. For instance:

tmp_file_prefix="f-" ; awk '/^ID:/{x="'"$tmp_file_prefix"'" ++i;} {print > x;}' file.txt

The catch is that the input text file "file.txt" is huge: 2.6 GB, to be precise. I'm sure I'll quickly hit the limit on the number of files in a directory.

The above awk command does the job and splits the file so that each output file contains one entire record matching the regex. I've run it on a smaller file with 25 such records of varying sizes. On the full file, though, one output file per record will overrun the maximum number of files in a directory.

I tried the following pattern:

tmp_file_prefix="f-" ; awk -v i=0 '/^ID:/{x="'"$tmp_file_prefix"'" ++i;} i % 20 == 0 {print > x;}' file.txt

and realized that it emits only every 20th record and saves that to a file. This solution is incorrect.

I want a way, in the awk command above, to split the source file into smaller files, each containing 25000 (or n, for that matter) occurrences of the regex.

Upvotes: 1

Views: 2106

Answers (3)

glenn jackman

Reputation: 246764

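awk -v prefix="$tmp_file_prefix" -v max=25000 '
    # Build the next output file name, e.g. f-000001
    function filename() { return sprintf("%s%06d", prefix, ++i) }
    # Pick a new output file whenever none is currently open
    !x { x = filename() }
    # Only lines matching /^ID:/ are written and counted
    /^ID:/ {
        print > x
        n++
        if (n == max) {
            close(x)
            x = ""
            n = 0
        }
    }
' file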

This should not run out of open file handles, as it takes care to close the file when done.
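Note that the block above writes only the lines matching /^ID:/. If each output file should hold the complete records (the ID: line plus whatever follows it up to the next ID: line), a variation along these lines might work. This is a minimal sketch, not a tested solution for the 2.6 GB file: the sample input, the temporary directory, the variable name out and max=2 are assumptions made for the demonstration; the question would use max=25000.

```shell
# Build a tiny sample input: 5 records, each starting with an "ID:" line.
workdir=$(mktemp -d)
for k in 1 2 3 4 5; do
    printf 'ID: %s\nbody of record %s\n' "$k" "$k"
done > "$workdir/file.txt"

tmp_file_prefix="$workdir/f-"

# max=2 for the demo only.
awk -v prefix="$tmp_file_prefix" -v max=2 '
    # A new record starts at each ID: line.
    /^ID:/ {
        if (n % max == 0) {              # current file is full, or none is open yet
            if (out != "") close(out)    # release the file handle before moving on
            out = sprintf("%s%06d", prefix, ++i)
        }
        n++
    }
    # Write every line of the current record; lines before the first ID: are skipped.
    out != "" { print > out }
' "$workdir/file.txt"

ls "$workdir"
```

With five records and max=2, this should leave three output files next to file.txt, the last one holding the single remaining record. Only one output file is open at any time, so the open-file limit is not a concern here either.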

Upvotes: 2

Nitzan Shaked

Reputation: 13598

grep '^ID:' file.txt | split -l 25000
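As a small illustration of what this pipeline produces: it keeps only the ID: lines (the rest of each record is dropped) and writes them in chunks to files named by split. The sketch below assumes a three-record sample file, a chunk size of 2 and a hypothetical output prefix ids-; without a prefix argument, split names its output xaa, xab and so on in the current directory.

```shell
# Sample input: 3 records of two lines each.
demo=$(mktemp -d)
printf 'ID: 1\nbody\nID: 2\nbody\nID: 3\nbody\n' > "$demo/file.txt"

# "-" makes split read standard input; -a 3 asks for three-letter suffixes.
grep '^ID:' "$demo/file.txt" | split -l 2 -a 3 - "$demo/ids-"

ls "$demo"
```

Here ids-aaa should hold the first two ID: lines and ids-aab the third.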

Upvotes: 2

Max

Reputation: 2131

You can split the source file into smaller pieces first using split(1), then run your awk script on each piece. Obviously you will need to append to the output files, not overwrite them!

split -l 25000 -a 3 file.txt

will generate files xaaa, xaab, xaac, etc., each no more than 25000 lines long, which you can then process with your awk script.

Upvotes: 0
