Mojtaba

Reputation: 316

Splitting huge CSV file

I have a huge CSV file, about 20 GB, with 5,000 columns and 2,500,000 rows. I want to write each column of it into a separate file. I already tried a for loop, but it is pretty slow. My code is below:

Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)   # number of header fields
mkdir cols
for i in $(seq 1 $Columns)
do
    echo $i
    # each iteration re-reads the whole 20 GB file
    tail -n +2 train.csv | cut -d',' -f$i > cols/col_$i.txt
done

I would appreciate any suggestions to accelerate this.

Upvotes: 3

Views: 479

Answers (3)

James Brown

Reputation: 37394

In awk, setting the field separator to , so the CSV columns are split correctly:

$ awk -F, '{for(i=1;i<=NF;i++) print $i > i}' train.csv

A test version that produces 5000 files:

$ cat > foo
1
2
3
$ awk 'BEGIN {for(i=1;i<=5000;i++) a=a i (i<5000? OFS:"")} {$0=a; for(i=1;i<=NF; i++) print $i > i}' foo
$ ls -l | wc -l
5002 # files 1-5000, plus foo, plus the "total 20004" line printed by ls -l
$ cat 5000
5000
5000
5000

For 250 rows it took on my laptop:

real    1m4.691s
user    1m4.456s
sys     0m0.180s
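
If you want the output to match the layout in the question (header skipped, files named cols/col_N.txt), a variant along these lines should work; the directory and file names here are only assumptions copied from the original loop:

$ mkdir -p cols
$ awk -F, 'NR>1 {for(i=1;i<=NF;i++) print $i > ("cols/col_" i ".txt")}' train.csv

Note that some awk implementations limit how many files can be open at once; GNU awk is generally the safest choice when writing 5,000 files in one pass.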

Upvotes: 3

redneb

Reputation: 23850

Here's a bash script that does that in a single pass:

Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
tail -n +2 train.csv |
    while IFS=, read -ra row; do
        for i in $(seq 1 $Columns); do
            echo "${row[$((i-1))]}" >> cols/col_$i.txt
        done
    done

The disadvantage of this script is that it will open and close the column files millions of times. The following perl script avoids that issue by keeping all files open:

#!/usr/bin/perl
use strict;
use warnings;

my @handles;
open my $fh,'<','train.csv' or die;
<$fh>; # skip the header
while (<$fh>) {
    chomp;
    my @values=split /,/;
    for (my $i=0; $i<@values; $i++) {
        if (!defined $handles[$i]) {
            # open each column file once, on first use, and keep it open
            # (assumes the cols directory already exists)
            open $handles[$i],'>','cols/col_'.($i+1).'.txt' or die;
        }
        print {$handles[$i]} "$values[$i]\n";
    }
}
close $fh;
close $_ for @handles;

Since you have 5,000 columns and this script keeps 5,001 files open at once, you will probably need to increase the number of open file descriptors that your system allows.
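
On Linux the per-process limit can usually be checked and raised for the current shell with ulimit before running the script (how far you can raise the soft limit depends on the hard limit and your system configuration):

$ ulimit -n          # show the current soft limit, commonly 1024
$ ulimit -n 8192     # raise it for this shell so 5,001 handles fit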

Upvotes: 3

choroba

Reputation: 241768

A Perl solution. It opens 1,000 files at a time, so it will pass over your input 5 times. Run it with the input filename as a parameter.

#!/usr/bin/perl
use warnings;
use strict;

my $inputfile = shift;
open my $input, '<', $inputfile or die $!;

mkdir 'cols';

my @headers = split /,/, <$input>;
chomp $headers[-1];
my $pos = tell $input;  # Remember where the first data line starts.

my $step = 1000;
for (my $from = 0; $from <= $#headers; $from += $step) {
    my $to = $from + $step - 1;
    $to = $#headers if $#headers < $to;
    warn "$from .. $to";

    # Open the files and print the headers in range.    
    my @fhs;
    for ($from .. $to) {
        open $fhs[ $_ - $from ], '>', "cols/col-$_" or die $!;
        print { $fhs[ $_ - $from ] } $headers[$_], "\n";
    }

    # Print the columns in range.
    while (<$input>) {
        chomp;
        my $i = 0;
        print { $fhs[$i++] } $_, "\n" for (split /,/)[ $from .. $to ];
    }
    close $_ for @fhs;
    seek $input, $pos, 0;  # Go back to the first data line.
}
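
Assuming you save it as split_cols.pl (the file name is only for illustration), the invocation would be:

$ perl split_cols.pl train.csv

Each of the 5 passes re-reads all the data lines, so expect roughly five full reads of the 20 GB file in exchange for never holding more than 1,000 output files open at a time.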

Upvotes: 2
