Newbie

Reputation: 421

Split big file in unix based on size and pattern

I have a huge file (45 GB) that I want to split into 4 parts. I can do this with: split --bytes=12G inputfile

The problem is that this breaks the structure of the file. split cuts purely by size, so a record can be cut in the middle and the format is not preserved. My input file looks like this:

Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag
Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag

I want to split the file by size, but only cut at a line containing Inspecting, so that each of the resulting files looks like this:

Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag
 V$ARNT_Q6_01           |      390 (+) |  1.000 |  0.998 | tACGTGgc

and this:

Inspecting sequence ID   chr1:11873-13873

 V$ARID3A_04            |     1981 (-) |  0.899 |  0.774 | tttctatAATAActaaa
 V$ARID3A_04            |     1982 (+) |  0.899 |  0.767 | ttctaTAATAactaaag
 V$ARNT_Q6_01           |      390 (+) |  1.000 |  0.998 | tACGTGgc

NOTE: The pattern match should be the second criterion; the first should be the size. For example, split the file into chunks of roughly 12 GB, but make each cut at a line matching Inspecting. If I split on the pattern Inspecting alone, I will get thousands of files, because this pattern repeats again and again.

Upvotes: 1

Views: 1669

Answers (1)

Michael Vehrs

Reputation: 3363

Doing it with sed would be pretty difficult, since you have no easy way of keeping track of the number of characters read so far. It would be easier with awk:

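BEGIN {
    fileno = 1
    limit = 100000    # approximate chunk size in characters
}
# Once the current chunk has reached the limit, start a new output
# file at the next line that contains "Inspecting".
size >= limit && /Inspecting/ {
    close("out" fileno)    # release the finished output file
    fileno++
    size = 0
}
{
    size += length() + 1    # +1 for the newline that length() omits
    print > ("out" fileno)
}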

Adjust the size according to your needs. Note that length() counts characters, which only equals bytes in a single-byte locale. awk might have problems handling very large numbers. For this reason it might be better to keep track of the number of lines read so far.
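The line-counting variant mentioned above could look roughly like the following. This is only a sketch under a few assumptions: the function name split_on_header and the variables limit and prefix are made up for illustration, and the limit is a minimum number of lines per chunk, so each output file runs on until the next Inspecting line after the limit has been reached.

```shell
# Hypothetical helper: split_on_header FILE MIN_LINES PREFIX
# Writes PREFIX1, PREFIX2, ... and only cuts in front of a line
# containing "Inspecting", once MIN_LINES lines have been written.
split_on_header() {
    awk -v limit="$2" -v prefix="$3" '
        BEGIN { fileno = 1 }
        # Enough lines in the current chunk and a new record starts here:
        # close the current output file and move on to the next one.
        count >= limit && /Inspecting/ {
            close(prefix fileno)
            fileno++
            count = 0
        }
        {
            count++
            print > (prefix fileno)
        }
    ' "$1"
}
```

To get about 4 parts, one option is to count the lines first with wc -l inputfile, divide by 4, and pass the result as the limit, for example split_on_header inputfile 50000000 out (the number here is just a placeholder). The parts will only be similar in size if line lengths are fairly uniform across the file.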

Upvotes: 5
