Reputation: 139
How can I split a file by grouping identical lines using a shell script or awk?
For example, I have one file with the following content:
1,1,1,1
2,2,2,2
3,3,3,3
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
y,y,y,y
y,y,y,y
y,y,y,y
4,4,4,4
5,5,5,5
What I want is: every set of identical lines forms a group and must go into its own separate file, while the remaining distinct lines are split into files of up to a given limit. For example, with a limit of 10, the original file must be split so that each split file holds at most 10 of the distinct number lines (<= 10); if there are more distinct lines than the limit, another split file is created, and so on.
The identical lines containing letters each need their own separate file: one file only for the x,x,x,x lines, another for the y,y,y,y lines, and so on (essentially grouping the file's contents by a field, say the 3rd field).
The content shown is just an example; the real case is a CSV with different values in every column, which I need to group by a specific column's value (I'm using sort and uniq for this). Either way, I need to split this CSV into one file per group of identical lines, plus files of distinct lines of at most the limit, using a shell script or awk (awk seems to give better performance). I also need the header (the very first line) in each output file, without duplicating it inside any output file.
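So for the sample above with a limit of 10 I would expect (output names here are just for illustration): one file holding the five number lines (5 distinct lines <= 10, so a single split file), one file holding only the x,x,x,x lines, and one file holding only the y,y,y,y lines; in the real CSV each of these files would start with the header.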
Do you have any idea?
My current code is below (it keeps the first line aside because I'm assuming the CSV has a header):
#!/bin/bash
COLUMN=$1
FILE=$2
LIMIT=$3
FILELENGTH=$(wc -l < "$FILE")
COUNTER=$LIMIT
NUMS=""
SORTED="sorted_$(basename "$FILE")"

# Sort numerically on the requested column so identical lines end up adjacent.
sort -t, -k "$COLUMN" -n "$FILE" > "$SORTED"

# Collect the line numbers at which to cut, never splitting a run of
# identical lines across two output files.
while [ "$COUNTER" -le "$FILELENGTH" ]; do
    NUMS+=$(uniq -c "$SORTED" | awk -v val="$COUNTER" '($1+prev)<=val {prev+=$1} END{print prev}')
    NUMS+=" "
    ((COUNTER+=LIMIT))
    echo "$NUMS | $COUNTER | $FILELENGTH | $SORTED"   # debug output
done

awk -v nums="$NUMS" -v fname="$(basename "$2")" -v dname="$(dirname "$2")" '
NR==1 { header=$0; next }            # keep the header aside
(NR-1)==1 {                          # first data line: load the cut points
    c=split(nums,b)
    for(i=1; i<=c; i++) a[b[i]]
    j=1; out = dname "/" "splited" j "_" fname
    print header > out
    system("touch " out ".fin")
}
{ print > out }                      # write the line to the current split file
NR in a {                            # cut point reached: open the next split file
    close(out)
    out = dname "/" "splited" ++j "_" fname
    print header > out
    system("touch " out ".fin")
}' "$SORTED"
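I invoke it like this (assuming the script is saved as splitter.sh; the name and paths are just for illustration), splitting on the 3rd column with a limit of 10:
./splitter.sh 3 data/input.csv 10
This writes sorted_input.csv into the current directory, and splited1_input.csv, splited2_input.csv, ... (plus empty .fin marker files) next to the input file.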
Upvotes: 1
Views: 385
Reputation: 133538
With GNU awk you could try the following code, written as per your shown samples. It makes two passes over Input_file. Lines that occur more than once in Input_file get an output file named after their first field, e.g. firstfieldValue.outFile, while unique lines (only one occurrence in Input_file) go into files named 1.singleOccurrence.outFile, 2.singleOccurrence.outFile, and so on. To keep the header (the very first line of Input_file) in each output file, the following awk code adds a little tweak to the basic version:
awk '
BEGIN{
  count1 = "1"                          # index of the current singleOccurrence file
  FS = OFS = ","
}
NR==1{ headers = $0; next }             # 1st pass, 1st line: remember the header
FNR==NR && FNR>1{                       # 1st pass: count occurrences of each line
  arr[$0]++
  next
}
$0 == headers{ next }                   # 2nd pass: skip the header line
arr[$0]>1{                              # duplicated line: file named after field 1
  if(!arr1[$1".outFile"]++){ print headers > ($1".outFile") }
  print > ($1".outFile")
  next
}
{                                       # unique line: chunked files, 10 lines each
  if(++count2%10==0){ close(count1".singleOccurrence.outFile") }
  count1+=(count2%10==0?1:0)            # start a new chunk after every 10th line
  if(prev!=count1){ print headers > (count1".singleOccurrence.outFile") }
  print > (count1".singleOccurrence.outFile")
  prev=count1
}
' Input_file Input_file
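For your shown samples (assuming the very first line of Input_file is a real header), this creates x.outFile and y.outFile for the repeated lines and 1.singleOccurrence.outFile for the five unique number lines, each file starting with the header. Note that the chunk size of 10 is hard-coded in the two %10 expressions; change both to use a different limit.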
Upvotes: 4
Reputation: 26481
Here is my take on the problem, without any sorting needed. On top of that, we preserve the header and assume, for this example, a maximum of 42 lines per file. In the end, the files will follow the naming convention "filename.ddddd.csv" and each will contain the original header:
awk -v l=42 -v fname="/path/to/output/filename" '
(NR==1){ header=$0; next }
{ key = ($0 ~ /[^0-9,]/ ? $0 : "num") }
{ c = a[key]++ }
(c%l==0) {
    close(filename[key])
    filename[key] = sprintf("%s.%0.5d.csv", fname, ++file_count)
    print header > filename[key]
}
{ print > filename[key] }' file.csv
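As a rough check against the question's sample (assuming a header as its first line): with l=42 every group fits into a single chunk, so this would produce filename.00001.csv for the number lines and one numbered file per group of identical letter lines (filename.00002.csv, filename.00003.csv), each beginning with the header; with a smaller limit such as -v l=3, the modulo rollover creates additional chunks.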
This works in the following way:
- (NR==1){header=$0;next}: if the record/line is the first line, save that line as the header.
- { key = ($0 ~ /[^0-9,]/ ? $0 : "num") }: define a key. If the current line contains only numbers and commas, the key is "num"; otherwise the key is the line itself.
- {c=a[key]++}: keep track of how many times we have encountered the current line. We do this by storing the count in the associative array a indexed by key. The value c always returns the currently processed count minus 1.
- (c%l==0){...}: every time we have written l=42 records/lines for a given key, we need to start writing to a new file. This happens whenever the per-key count modulo l hits 0. When such a line is found we do:
  - close(filename[key]): close the file the current line would have been written to. This filename is tracked in the associative array filename, indexed by key.
  - filename[key]=sprintf("%s.%0.5d.csv",fname,++file_count): define the new filename as FNAME.00XXX.csv.
  - print header > filename[key]: open the new file and write the header to it.
- {print > filename[key]}: write the record to the file.
Upvotes: 2
Reputation: 3985
awk -F, -v limit=3 '
BEGIN{ i=1 }
NR==1{
    header=$0                            # save the header
    next                                 # go to next line
}
FNR==NR{                                 # process letters-lines
    if(f!=$0) print header > ("tmp/file_" $1)   # print initial header
    f=$0                                 # save line
    print $0 > ("tmp/file_" $1)          # print line to file
    next                                 # go to next line
}
{                                        # process numbers-lines
    if (x!=i) print header > ("tmp/file_" i)    # print initial header
    x=i                                  # save number
    print $0 > ("tmp/file_" i)           # print line to file
}
FNR % limit == 0{                        # check limit
    i++
}
' <(head -n 1 split.csv;                 # getting the header
    grep "^[[:alpha:]]" <(sed '1d' split.csv) | sort   # getting sorted letters-lines
  ) \
  <(grep "^[^[:alpha:]]" <(sed '1d' split.csv))        # getting numbers-lines
$ head tmp/*
==> tmp/file_1 <==
header
1,1,1,1
2,2,2,2
3,3,3,3
==> tmp/file_2 <==
header
4,4,4,4
5,5,5,5
==> tmp/file_x <==
header
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
==> tmp/file_y <==
header
y,y,y,y
y,y,y,y
y,y,y,y
Upvotes: 1