Reputation: 25
I have a 2 TB tab-separated text table file, and one column is further separated by ";". Yeah, it is in fact a very large VCF file.
Using the tab delimiter we have 8 columns, and using the ";" delimiter we can split the 8th column into another 12 columns.
For easier statistical analysis, I need to split the file into 19 files, each containing one column. Preferably I would go through the file only once (since the file is large and I have 100 of these large files, the IO cost is really high) and write the 19 columns into 19 separate files.
I have solved the problem in a straightforward way, basically
cut -f1-2 file.txt > column12.txt
but to get all 19 columns this way I need to go through the file 19 times, and that is not efficient.
I am wondering if there is an efficient way to go through the file once and write the output to 19 files?
Thanks very much indeed for your help.
An example of the file is below
a b c d e f g;h;i;j;k
m n o p q l x;y;z;o;p
a b c d e f g;h;i;j;k
a b c d e f g;h;i;j;k
then I want the first output file to contain
a
m
a
a
Upvotes: 2
Views: 940
Reputation: 88563
With awk:
awk -F '[\t;]' '{for(i=1; i<=NF; i++) print $i >> "column" i ".txt"}' file
Use tab and semicolon as field separators. NF
holds the number of fields (columns) in the current row, $i
contains the content of column i, and i
is the current column number.
For the example above this creates 11 files (6 tab-separated columns plus 5 semicolon-separated subcolumns; with the real data of 8 tab columns whose 8th splits into 12, it creates the desired 19). column11.txt contains:
k
p
k
k
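Since the question mentions roughly 100 such files, here is a minimal wrapper sketch for running the same one-pass awk over each of them. The glob *.vcf and the per-file output directory naming are assumptions for illustration, not part of the original setup:
# hypothetical driver loop: one pass per input file,
# every column written into that input's own output directory
for f in *.vcf; do
    out="${f%.vcf}_columns"    # assumed naming scheme for the output directory
    mkdir -p "$out"
    awk -F '[\t;]' -v dir="$out" \
        '{for(i=1; i<=NF; i++) print $i >> (dir "/column" i ".txt")}' "$f"
done
Note that >> appends in awk, so remove any old column*.txt files before re-running, otherwise new output is added to the end of the existing files.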
Upvotes: 2