Reputation: 316
I have a huge CSV file of about 20 GB, with 5,000 columns and 2,500,000 rows. I want to write each column into its own file. I already tried a for loop, but it is pretty slow. My code is below:
Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
for i in `seq 1 $Columns`; do
    echo $i
    tail -n +2 train.csv | cut -d',' -f$i > cols/col_$i.txt
done
I would appreciate any suggestions to speed this up.
Upvotes: 3
Views: 479
Reputation: 37394
In awk, using , as the field separator since the input is CSV:
$ awk -F, '{for(i=1;i<=NF;i++) print $i > i}' train.csv
A test version that produces 5000 files:
$ cat > foo
1
2
3
$ awk 'BEGIN {for(i=1;i<=5000;i++) a=a i (i<5000? OFS:"")} {$0=a; for(i=1;i<=NF; i++) print $i > i}' foo
$ ls -l | wc -l
5002    # files 1-5000, plus foo, plus the "total 20004" line from ls -l
$ cat 5000
5000
5000
5000
For 250 rows it took this long on my laptop:
real 1m4.691s
user 1m4.456s
sys 0m0.180s
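If you want the header skipped and the output files named as in the question (cols/col_1.txt and so on), a variant along these lines should work; this is an untested sketch, and with 5,000 output files you will probably want gawk, which can juggle more open output files than the descriptor limit by closing and reopening them as needed:
$ mkdir -p cols
$ awk -F, 'NR > 1 { for (i = 1; i <= NF; i++) print $i > ("cols/col_" i ".txt") }' train.csv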
Upvotes: 3
Reputation: 23850
Here's a bash script that does that in a single pass:
Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
tail -n +2 train.csv | \
    while IFS=, read -ra row; do
        for i in `seq 1 $Columns`; do
            echo "${row[$(($i-1))]}" >> cols/col_$i.txt
        done
    done
The disadvantage of this script is that it will open and close the column files millions of times. The following perl script avoids that issue by keeping all files open:
#!/usr/bin/perl
use strict;
use warnings;

my @handles;

open my $fh, '<', 'train.csv' or die;
<$fh>;    # skip the header

while (<$fh>) {
    chomp;
    my @values = split /,/;
    for (my $i = 0; $i < @values; $i++) {
        # Open each column file once, on first use, and keep it open.
        if (!defined $handles[$i]) {
            open $handles[$i], '>', 'cols/col_'.($i+1).'.txt' or die;
        }
        print {$handles[$i]} "$values[$i]\n";
    }
}

close $fh;
close $_ for @handles;
Since you have 5,000 columns and this script keeps 5,001 files open (the input plus one per column), you will need to increase the number of open file descriptors that your system allows a process to have.
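As a rough sketch (the exact values, and whether you are allowed to raise the limit, depend on your system), you can check and raise the per-process limit in the shell before running the script:
$ ulimit -n          # show the current soft limit on open file descriptors
$ ulimit -n 6000     # raise it for this shell; fails if it exceeds the hard limit
Run the Perl script from that same shell so it inherits the raised limit.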
Upvotes: 3
Reputation: 241768
A Perl solution. It opens 1,000 files at a time, so it will pass over your input 5 times. Run it with the input filename as a parameter.
#!/usr/bin/perl
use warnings;
use strict;

my $inputfile = shift;
open my $input, '<', $inputfile or die $!;

mkdir 'cols';

my @headers = split /,/, <$input>;
chomp $headers[-1];

my $pos = tell $input;  # Remember where the first data line starts.

my $step = 1000;
for (my $from = 0; $from <= $#headers; $from += $step) {
    my $to = $from + $step - 1;
    $to = $#headers if $#headers < $to;
    warn "$from .. $to";

    # Open the files and print the headers in range.
    my @fhs;
    for ($from .. $to) {
        open $fhs[ $_ - $from ], '>', "cols/col-$_" or die $!;
        print { $fhs[ $_ - $from ] } $headers[$_], "\n";
    }

    # Print the columns in range.
    while (<$input>) {
        chomp;
        my $i = 0;
        print { $fhs[$i++] } $_, "\n" for (split /,/)[ $from .. $to ];
    }
    close for @fhs;

    seek $input, $pos, 0;  # Go back to the first data line.
}
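For example, assuming you save the script as cols.pl (a name chosen here just for illustration), you would invoke it as:
$ perl cols.pl train.csv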
Upvotes: 2