Reputation: 21
Assuming this is my file:
$ cat file.txt
A:1:i
B:2:ii
X:9:iv
With a for loop like this I can print each field separately and redirect it to its own sub-file:
$ for i in $(seq 1 3); do echo $i; awk -F ":" -v FL=$i '{print $FL}' file.txt > $i.out; done
So that:
$ cat 1.out
A
B
X
$ cat 2.out
1
2
9
$ cat 3.out
i
ii
iv
Question: I have to perform this on nearly 70 columns and on a file of nearly 10 GB. It works, but it is slow, since every iteration re-reads the whole file. Can anyone suggest a better/more efficient way to split this big data set? Thanks.
$ for i in $(seq 1 70); do echo $i; awk -F ":" -v FL=$i '{print $FL}' *.data > $i.out; done
Upvotes: 1
Views: 133
Reputation: 531718
Here's a bash script that uses a feature I don't see often: asking bash to allocate a file descriptor for a file and store the descriptor in a variable:
# Read the first line to get a count of the columns
IFS=: read -ra columns < file.txt
# Open an output file for each column, saving its file descriptor in an array
for c in "${columns[@]}"; do
    exec {a}>"$((++i)).txt"
    fds[i]=$a
done
# Iterate through the input, writing each column to the file opened for it
while IFS=: read -ra fields; do
    i=0
    for f in "${fields[@]}"; do
        printf '%s\n' "$f" >&${fds[++i]}
    done
done < file.txt
# Close the file descriptors
for fd in "${fds[@]}"; do
    exec {fd}>&-
done
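In isolation, the feature looks like this (a minimal sketch; demo.log is just a placeholder name):
exec {fd}>demo.log   # bash allocates a free descriptor and stores its number in $fd
echo "hello" >&$fd   # write through the saved descriptor
exec {fd}>&-         # close it again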
Upvotes: 0
Reputation: 47169
With coreutils, if you know that there are three columns:
< file.txt tee >(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out) > /dev/null
To make it more generic, here's one way to automate the command-line generation:
# Determine the number of fields and generate the tee arguments
arg=""
i=1
while read; do
    arg="$arg >(cut -d: -f$i > $((i++)).out)"
done < <(head -n1 file.txt | tr ':' '\n')
arg is now:
>(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out)
Save to a script file:
echo "< file.txt tee $arg > /dev/null" > script
And execute:
. ./script
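If you'd rather not create a script file, the same generated command can be handed to eval instead (assuming you trust the generated string, which is built only from field positions here):
eval "< file.txt tee $arg > /dev/null"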
Upvotes: 0
Reputation: 30230
Python version
#!/bin/env python
with open('file.txt', 'r') as ih:
    while True:
        line = ih.readline()
        if line == '':
            break
        for i, element in enumerate(line.strip().split(':')):
            outfile = "%d.out" % (i + 1)
            with open(outfile, 'a') as oh:
                oh.write("%s\n" % element)
This might be a bit faster, as it only goes through the original file once. Note that it could be further optimized by leaving the output files open (as it is, I close each of them and re-open them for each write).
EDIT
For example, something like:
#!/bin/env python
handles = dict()
with open('file.txt', 'r') as ih:
    while True:
        line = ih.readline()
        if line == '':
            break
        for i, element in enumerate(line.strip().split(':')):
            outfile = "%d.out" % (i + 1)
            if outfile not in handles:
                handles[outfile] = open(outfile, 'a')
            handles[outfile].write("%s\n" % element)
for k in handles:
    handles[k].close()
This leaves the handles open for the duration of the run, then closes them all at the end.
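Here is a sketch of the same idea with a handle list sized from the first line (it assumes every line has the same number of fields, and opens in 'w' mode so stale .out files are not appended to):
#!/bin/env python
with open('file.txt', 'r') as ih:
    outs = []
    for line in ih:
        fields = line.rstrip('\n').split(':')
        if not outs:
            # Open one output file per column, based on the first line
            outs = [open("%d.out" % (i + 1), 'w') for i in range(len(fields))]
        for field, oh in zip(fields, outs):
            oh.write("%s\n" % field)
    for oh in outs:
        oh.close()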
Upvotes: 2
Reputation: 2294
In Perl you can do:
#!/usr/bin/perl -w
my $n = 3;
my @FILES;
for my $i (1..$n) {
    open (my $f, '>', "$i.out") or die;
    push @FILES, $f;
}
while (<>) {
    chomp;
    my @a = split /:/;
    for my $i (0..$#a) {
        print { $FILES[$i] } $a[$i], "\n";
    }
}
close($_) for @FILES;
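Since it reads from <>, you pass the data file on the command line. Saved as, say, split.pl (the name is just an example):
perl split.pl file.txt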
Upvotes: 1
Reputation: 54502
This should be fairly quick considering what you are trying to do:
awk -F: '{ for (i=1; i<=NF; i++) print $i > (i ".out") }' file.txt
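Note that this keeps one output file open per column, which gawk handles fine for 70 columns. If your awk hits a "too many open files" limit, a sketch of a workaround is to close each file after writing, at the cost of constant reopening (uses >> since the file is reopened each time, so remove any stale *.out files first):
awk -F: '{ for (i=1; i<=NF; i++) { print $i >> (i ".out"); close(i ".out") } }' file.txt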
Upvotes: 6