Sean

Reputation: 81

Split a large, compressed file into multiple outputs using AWK and BASH

I have a large (3GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files - if field one is john_smith, I want the string to be placed in john_smith.gz. NOTE: the string field can and does contain special characters.

I can do this easily in a for loop over the domains using BASH, but I'd much prefer the efficiency of reading the file in once using AWK.

I have tried using the system function within awk, with escaped single quotes around the string:

zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'

and it works perfectly on most of the lines, however some of them are printed to STDERR and give an error that the shell cannot execute a command (the shell thinks that part of the string is a command). It looks like special characters might be breaking it.

Any thoughts on how to fix this, or any alternate implementations that would help?

Thanks!

-Sean

Upvotes: 4

Views: 3038

Answers (4)

shellter

Reputation: 37288

You're facing a big trade-off in time vs. disk space. I assume you're trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.

In any case, your time is more valuable than 3 GB of disk space. Why not try

 zcat large_file.gz \
 | awk -F'\t' '{
    name=$1; string=$2; outFile=name".txt"
    print name "\t" string >> outFile
    # close( outFile)
   }'

 echo *.txt | xargs gzip -9

You may need to uncomment the #close(outFile) if awk runs out of open file handles. The xargs is included because I'm assuming you're going to have more than 1000 filenames created. Even if you don't, it won't hurt to use that technique.
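If any of the names contain spaces, echo | xargs will split them apart. Here is a rough sketch of a variant that hands the names to gzip in batches without going through word splitting. compress_txt is just a name I made up, and -maxdepth is a GNU/BSD find option rather than strict POSIX, so treat that as an assumption about your system.

```shell
# Sketch only: compress_txt is a hypothetical helper name.
# Compresses every regular *.txt file in the current directory.
# "-exec ... {} +" batches file names the way xargs does, but passes
# each name as a single argument, so spaces in names are safe.
# -maxdepth is a GNU/BSD find extension (assumption about the platform).
compress_txt() {
    find . -maxdepth 1 -type f -name '*.txt' -exec gzip -9 {} +
}
```

Run compress_txt in the directory where awk wrote the .txt files.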

Note this code assumes tab-delimited data. Change the value of the -F argument as needed, and the "\t" in the print statement, to give the field separator you need.
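If you'd rather skip the intermediate .txt files, awk can also print straight into a gzip pipe. This is only a sketch: split_gz and its max argument are names I invented, it assumes tab-delimited input, and it assumes the NAME field is safe to use as a file name (no double quotes, backticks, dollar signs, or slashes). The point is that only NAME ends up in the shell command. The STRING field goes through the pipe as data, so special characters in it are never interpreted by the shell. I use gzip -dc, which behaves like zcat.

```shell
# Sketch only: split_gz and its "max" argument are hypothetical names.
# Usage: split_gz input.gz [max_open_pipes]
# Assumes tab-delimited input and NAME values that are safe as file names.
split_gz() {
    gzip -dc "$1" | awk -F'\t' -v max="${2:-200}" '
    {
        # Only the NAME field is part of the command line.
        cmd = "gzip >> \"" $1 ".gz\""
        if (!(cmd in opened)) {
            if (n >= max) {
                # Too many pipes open: close them all and start over.
                for (c in opened) close(c)
                split("", opened)   # portable way to empty the array
                n = 0
            }
            opened[cmd] = 1
            n++
        }
        # The whole record is sent to gzip as data, not as shell text.
        print $0 | cmd
    }
    END { for (c in opened) close(c) }'
}
```

Because the command appends with >>, a pipe that was closed and reopened adds another gzip member to the same file, and gzip -dc reads those back as one stream. It also means you should remove old output files before re-running.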

Don't have time to test this. If you like this idea and get stuck, please post small sample data, expected output, and error messages that you're getting.

I hope this helps.

Upvotes: 2

sehe

Reputation: 393114

This little perl script does the job nicely

  • keeping all destination files open for performance
  • doing elementary error handling
  • Edit: now also pipes output through gzip on the fly

There is a bit of a kludge with $fh because apparently using the hash entry directly doesn't work

#!/usr/bin/perl
use strict;
use warnings;

my $suffix = ".txt.gz";

my %pipes;
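while (defined(my $record = <>))
{
    # split each record into the id and the rest of the line
    my ($id, $line) = split /\t/, $record, 2;
    next unless defined $line;   # skip lines without a tab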
    exists $pipes{$id} 
        or open ($pipes{$id}, "|gzip -9 > '$id$suffix'") 
        or die "can't open/create $id$suffix, or cannot spawn gzip";

    my $fh = $pipes{$id};
    print $fh $line;
}

print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n";

Oh, use it like

zcat input.gz | ./myscript.pl

Upvotes: 0

Jason

Reputation: 193

Maybe try something along the lines of:

zcat large_file.gz | echo $("awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'")

I haven't tried it myself, as I don't have any large files to play with.

Upvotes: 0

wallyk

Reputation: 57774

Save this program as, say, largesplitter.c, compile it, and run it with the command

zcat large_file.gz | largesplitter

The unadorned program is:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>   // for system()
#include <string.h>

int main (void)
{
        char    buf [32000];  // todo:  resize this if the second field is larger than 
        char    cmd [120];
        long    linenum = 0;
        while (fgets (buf, sizeof buf, stdin))
        {
                ++linenum;
                char *cp = strchr (buf, '\t');   // identify first field delimited by tab
                if (!cp)
                {
                        fprintf (stderr, "line %ld missing delimiter\n", linenum);
                        continue;
                }
                *cp = '\000';  // split line
                FILE *out = fopen (buf, "w");
                if (!out)
                {
                        fprintf (stderr, "error creating '%s': %s\n", buf, strerror(errno));
                        continue;
                }
                fprintf (out, "%s", cp+1);
                fclose (out);
                snprintf (cmd, sizeof cmd, "gzip %s", buf);
                system (cmd);
        }
        return 0;
}

This compiles without error on my system, but I have not tested its functionality.

Upvotes: 0
