Reputation: 421
I have a huge file, 45 GB, and I want to split it into 4 parts. I can do this with: split --bytes=12G inputfile
The problem is that this breaks the structure of the file. split cuts purely by size, so a record can end up divided between two output files. My input file looks like this:
Inspecting sequence ID chr1:11873-13873
V$ARID3A_04 | 1981 (-) | 0.899 | 0.774 | tttctatAATAActaaa
V$ARID3A_04 | 1982 (+) | 0.899 | 0.767 | ttctaTAATAactaaag
Inspecting sequence ID chr1:11873-13873
V$ARID3A_04 | 1981 (-) | 0.899 | 0.774 | tttctatAATAActaaa
V$ARID3A_04 | 1982 (+) | 0.899 | 0.767 | ttctaTAATAactaaag
I want to split the file, but only at lines that start a new record, i.e. at the pattern Inspecting, so that each of the resulting files looks like this:
Inspecting sequence ID chr1:11873-13873
V$ARID3A_04 | 1981 (-) | 0.899 | 0.774 | tttctatAATAActaaa
V$ARID3A_04 | 1982 (+) | 0.899 | 0.767 | ttctaTAATAactaaag
V$ARNT_Q6_01 | 390 (+) | 1.000 | 0.998 | tACGTGgc
and this:
Inspecting sequence ID chr1:11873-13873
V$ARID3A_04 | 1981 (-) | 0.899 | 0.774 | tttctatAATAActaaa
V$ARID3A_04 | 1982 (+) | 0.899 | 0.767 | ttctaTAATAactaaag
V$ARNT_Q6_01 | 390 (+) | 1.000 | 0.998 | tACGTGgc
NOTE:
Size should be the first criterion and the pattern match the second. For example, split the file into chunks of about 12 GB, but make each cut at a line matching Inspecting. If I split on the pattern Inspecting alone, I will get thousands of files, because this pattern repeats again and again.
Upvotes: 1
Views: 1669
Reputation: 3363
Doing it with sed would be pretty difficult, since you have no easy way of keeping track of the characters read so far. It would be easier with awk:
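BEGIN {
    fileno = 1
    limit = 100000    # chunk size threshold; adjust as needed
}
# Once the threshold is reached, start a new output file at the next
# "Inspecting" line, so that no record is cut in half.
size >= limit && /Inspecting/ {
    close("out" fileno)    # release the finished file
    fileno++
    size = 0
}
{
    size += length($0) + 1    # + 1 for the newline, which length() omits
    print > ("out" fileno)    # parentheses keep the file name unambiguous
}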
Adjust the size threshold to suit your needs; to match split --bytes=12G it would be 12 * 1024 * 1024 * 1024. Some awk implementations might have problems handling very large numbers. If yours does, it might be better to keep track of the number of lines read so far instead.
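A rough sketch of that line-counting variant, in case it helps. The function name split_on_header, its arguments, and the default "out" prefix are made up for illustration, and it assumes every record header begins its line with Inspecting:

```shell
# Hypothetical helper: split_on_header INPUT LINES_PER_CHUNK [PREFIX]
# Writes PREFIX1, PREFIX2, ... into the current directory.
split_on_header() {
    awk -v limit="$2" -v prefix="${3:-out}" '
        BEGIN { fileno = 1 }
        # Once the current chunk holds at least limit lines, start a new
        # file at the next header line so that no record is cut in half.
        count >= limit && /^Inspecting/ {
            close(prefix fileno)
            fileno++
            count = 0
        }
        {
            count++
            print > (prefix fileno)
        }
    ' "$1"
}
```

You would call it as split_on_header inputfile N part, where N is roughly the total line count (from wc -l) divided by the number of parts you want. Each chunk can run a little over N lines, because the cut waits for the next Inspecting line.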
Upvotes: 5