Reputation: 81
I have a large (3GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files - if field one is john_smith, I want the string to be placed in john_smith.gz. NOTE: the string field can and does contain special characters.
I can do this easily in a for loop over the domains using BASH, but I'd much prefer the efficiency of reading the file in once using AWK.
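For reference, the bash version is roughly this (a simplified sketch; it rereads the file once per name, which is exactly the inefficiency I want to avoid, and it assumes the names contain no shell metacharacters):

for name in $(zcat large_file.gz | cut -f1 | sort -u); do
    zcat large_file.gz | awk -F'\t' -v n="$name" '$1 == n' | gzip > "$name".gz
done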
I have tried using the system function within awk with escaped single quotes around the string
zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'
and it works perfectly on most of the lines; however, some of them are printed to STDERR with an error that the shell cannot execute a command (the shell thinks part of the string is a command). It looks like special characters might be breaking it.
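For example (a made-up line): if the string field contains a single quote, as in

john_smith	it's broken

then the command handed to system() is effectively

echo -e 'john_smith	it's broken' | gzip >> john_smith.gz

and the embedded apostrophe closes the quoting early, so the shell tries to parse the rest of the string as a command.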
Any thoughts on how to fix this, or any alternate implementations that would help?
Thanks!
-Sean
Upvotes: 4
Views: 3038
Reputation: 37288
You're facing a big trade-off in time vs. disk space. I assume you're trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.
In any case, your time is more valuable than 3 GB of disk space. Why not try
zcat large_file.gz \
| awk -F'\t' '{
    name=$1; string=$2; outFile=name".txt"
    print name "\t" string >> outFile
    # close(outFile)
}'
echo *.txt | xargs gzip -9
You may need to uncomment the # close(outFile) call if you hit awk's limit on the number of simultaneously open files. The xargs is included because I'm assuming you're going to have more than 1000 filenames created; even if you don't, it won't hurt to use that technique.
Note this code assumes tab-delimited data; change the argument to -F as needed, and the "\t" in the print statement, to give the field separator you need.
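For example, a comma-separated variant would be (untested, same caveats as above):

zcat large_file.gz \
| awk -F',' '{
    name=$1; string=$2; outFile=name".txt"
    print name "," string >> outFile
}'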
Don't have time to test this. If you like this idea and get stuck, please post small sample data, expected output, and error messages that you're getting.
I hope this helps.
Upvotes: 2
Reputation: 393114
This little perl script does the job nicely, gzipping on the fly. There is a bit of a kludge with $fh because apparently using the hash entry directly doesn't work:
#!/usr/bin/perl
use strict;
use warnings;

my $suffix = ".txt.gz";

my %pipes;
while (my $row = <>)
{
    my ($id, $line) = split /\t/, $row, 2;
    next unless defined $line;  # skip lines without a tab delimiter

    # open a gzip pipe per distinct name, the first time we see it
    exists $pipes{$id}
        or open ($pipes{$id}, "|gzip -9 > '$id$suffix'")
        or die "can't open/create $id$suffix, or cannot spawn gzip";

    my $fh = $pipes{$id};  # kludge: printing to the hash entry directly doesn't parse
    print $fh $line;
}

print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n";
Oh, use it like
zcat input.gz | ./myscript.pl
Upvotes: 0
Reputation: 193
Maybe try something along the lines of:
zcat large_file.gz | echo $(awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}')
I haven't tried it myself, as I don't have any large files to play with.
Upvotes: 0
Reputation: 57774
Create this program as, say, largesplitter.c, compile it, and use the command
zcat large_file.gz | ./largesplitter
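To compile it (assuming gcc; any C compiler should work):

gcc -o largesplitter largesplitter.c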
The unadorned program is:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>     // for system()
#include <string.h>
int main (void)
{
    char buf [32000];   // todo: resize this if the second field is larger than this
    char cmd [3 * sizeof buf + 64];     // room for three copies of the name
    long linenum = 0;
    while (fgets (buf, sizeof buf, stdin))
    {
        ++linenum;
        char *cp = strchr (buf, '\t');  // identify first field delimited by tab
        if (!cp)
        {
            fprintf (stderr, "line %ld missing delimiter\n", linenum);
            continue;
        }
        *cp = '\000';   // split line
        FILE *out = fopen (buf, "w");
        if (!out)
        {
            fprintf (stderr, "error creating '%s': %s\n", buf, strerror(errno));
            continue;
        }
        fprintf (out, "%s", cp+1);
        fclose (out);
        // append a gzip member so repeated names accumulate; zcat reads
        // concatenated gzip members transparently. The single quotes guard
        // most (not all) special characters in the name.
        snprintf (cmd, sizeof cmd, "gzip -c '%s' >> '%s.gz' && rm '%s'", buf, buf, buf);
        system (cmd);
    }
    return 0;
}
This compiles without error on my system, but I have not tested its functionality.
Upvotes: 0