bmelo

Reputation: 139

How can I split files by grouping the same lines?

How can I split files by grouping the same lines using shell script or awk?

For example, I have one file with the following content:

1,1,1,1
2,2,2,2
3,3,3,3
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x
y,y,y,y
y,y,y,y
y,y,y,y
4,4,4,4
5,5,5,5

What I want is: all equal lines form a group and must go into their own separate file, while the remaining distinct lines must be split into files of up to a given limit. For example, if the limit is 10, the original file must be split so that each output file holds at most 10 (<= 10) of the distinct lines; if there are more distinct lines than the limit, another split file is created, and so on.

The equal lines containing letters each need their own separate file: one file only for the x,x,x,x lines, another for the y,y,y,y lines, and so on (basically grouping the file's contents by a field, say the 3rd field for example).

The line content above is just an example; the real case is a CSV with different values in every column, where I need to group by a specific column value (I'm using sort and uniq for this). Either way, I need to split this CSV into one file per group of equal lines, plus files of at most <limit> distinct lines, using shell script or awk (I see awk provides better performance). I also need the header (the very first line) in each output file, without duplicating that header inside any output file.
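To make the expectation concrete: with a limit of 3 applied to the sample above, the result should be something like the listing below (the file names are only hypothetical, since I don't care about the naming, and in the real CSV case each file would additionally start with the header):

splited1_file:  1,1,1,1  2,2,2,2  3,3,3,3
splited2_file:  4,4,4,4  5,5,5,5
x_group_file:   the four x,x,x,x lines
y_group_file:   the three y,y,y,y lines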

Do you have any idea?

My current code is below (it keeps the first line because I'm assuming the CSV has a header):

#!/bin/bash
COLUMN=$1
FILE=$2
LIMIT=$3
FILELENGTH=$(wc -l < "$FILE")
COUNTER=$LIMIT
NUMS=""
SORTED="sorted_$(basename "$FILE")"

sort -t, -k "$COLUMN" -n "$FILE" > "$SORTED"
while [ "$COUNTER" -le "$FILELENGTH" ]; do
        NUMS+=$(uniq -c "$SORTED" | awk -v val="$COUNTER" '($1+prev)<=val {prev+=$1} END{print prev}')
        NUMS+=" "
        ((COUNTER+=LIMIT))
        echo $NUMS "|" $COUNTER "|" $FILELENGTH "|" $SORTED
done

awk -v nums="$NUMS" -v fname="$(basename "$FILE")" -v dname="$(dirname "$FILE")" '
   NR==1 { header=$0; next}
   (NR-1)==1 {
        c=split(nums,b)
        for(i=1; i<=c; i++) a[b[i]]
        j=1; out = dname"/" "splited" j "_"fname
        print header > out
        system("touch "out".fin")
    }
    { print > out }
    NR in a {
        close(out)
        out = dname "/" "splited" ++j "_"fname
        print header > out
        system("touch "out".fin")
    }' "$SORTED"

Upvotes: 1

Views: 385

Answers (3)

RavinderSingh13

Reputation: 133538

With GNU awk you could try the following code, written as per your shown samples. It makes 2 passes over Input_file. For lines occurring more than once in Input_file, their output file is created with a name like firstFieldValue.outFile, and files for unique lines (having only 1 occurrence in your Input_file) are created with names like 1.singleOccurrence.outFile, 2.singleOccurrence.outFile and so on.

To keep the header (the very first line of your Input_file) in each output file, try the following awk code:

awk '
BEGIN{
  count1="1"                        # index of the current singleOccurrence output file
  FS=OFS=","
}
NR==1{ headers = $0; next }         # pass 1, line 1: remember the header
FNR==NR && FNR>1{                   # pass 1: count how often each full line occurs
  arr[$0]++
  next
}
$0 == headers{ next }               # pass 2: skip the header line
arr[$0]>1{                          # pass 2, duplicated lines: group them by their 1st field
  if(!arr1[$1".outFile"]++){ print headers > ($1".outFile") }
  print > ($1".outFile")
  next
}
{                                   # pass 2, unique lines: rotate the output file at every 10th unique line
  if(++count2%10==0){ close(count1".singleOccurrence.outFile") }
  count1+=(count2%10==0?1:0)
  if(prev!=count1){ print headers > (count1".singleOccurrence.outFile") }
  print > (count1".singleOccurrence.outFile")
  prev=count1
}
' Input_file Input_file
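For reference, with the sample from the question (assuming it additionally has a header line on top, since the code expects one), this should produce roughly:

x.outFile                   header plus the four x,x,x,x lines
y.outFile                   header plus the three y,y,y,y lines
1.singleOccurrence.outFile  header plus the five unique lines 1,1,1,1 to 5,5,5,5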

Upvotes: 4

kvantour

Reputation: 26481

Here is my take on the problem, without any sorting needed. On top of that, we preserve the header and assume, in this example, a maximum of 42 lines per file. At the end, the files will follow the naming convention "filename.ddddd.csv" and each will contain the original header:

awk -v l=42 -v fname="/path/to/output/filename"                \
    '(NR==1){header=$0;next}
     { key = ($0 ~ /[^0-9,]/ ? $0 : "num") }
     {c=a[key]++}
     (c%l==0) {
        close(filename[key]); 
        filename[key]=sprintf("%s.%0.5d.csv",fname,++file_count)
        print header > filename[key]
     }
     {print > filename[key]}' file.csv

This works in the following way:

  • (NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
  • { key = ($0 ~ /[^0-9,]/ ? $0 : "num") }: define a key. If the current line contains only numbers and commas, define the key to be num, otherwise it is the line itself.
  • {c=a[key]++}: keep track of how many times we have encountered the current key. We do this by storing the count in the associative array a indexed by key. Because of the post-increment, c holds the count before this line was added, i.e. the current count minus 1.
  • (c%l==0){...}: Every time we have written l=42 records/lines for a given key, we need to start writing that key to a new file. This happens whenever the per-key count c modulo l hits 0 (including the very first line of each key). When such a line is found we do:
    • close(filename[key]): close the file you write the current line to. This filename is tracked in the associative array filename indexed by key.
    • filename[key]=sprintf("%s.%0.5d.csv",fname,++file_count): define the new filename as FNAME.00XXX.csv
    • print header > filename[key]: open the file and write the header to that file.
  • {print > filename[key]}: write the entries to the file.
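As a quick illustration (assuming, as above, that the question's sample carries a header line, and using -v l=3 instead of 42 so the rotation becomes visible), the run should produce roughly:

filename.00001.csv   header + 1,1,1,1  2,2,2,2  3,3,3,3   (first chunk of numeric lines)
filename.00002.csv   header + the four x,x,x,x lines
filename.00003.csv   header + the three y,y,y,y lines
filename.00004.csv   header + 4,4,4,4  5,5,5,5            (next chunk of numeric lines)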

Upvotes: 2

ufopilot

Reputation: 3985

awk -F, -v limit=3 '
    BEGIN{i=1}
    NR==1{
        header=$0                                       # save the header
        next                                            # go to next line
    }
    FNR==NR{                                            # process letters-lines
        if(f!=$0) print header > ("tmp/file_" $1)       # print initial header
        f=$0                                            # save line
        print $0 > ("tmp/file_" $1)                     # print line to file
        next                                            # go to next line
    }
    {                                                   # process numbers-lines
        if (x!=i) print header > ("tmp/file_" i)        # print initial header
        x=i                                             # save number
        print $0 > ("tmp/file_" i)                      # print line to file
    }
    FNR % limit == 0{                                   # check limit
        i++
    }
' <(head -n 1 split.csv;                                # getting the header
    grep "^[a-zA-Z]" <(sed '1d' split.csv)|sort         # getting sorted letters-lines
   ) \
  <(grep "^[^a-zA-Z]" <(sed '1d' split.csv))            # getting numbers-lines


$ head tmp/*
==> tmp/file_1 <==
header
1,1,1,1
2,2,2,2
3,3,3,3

==> tmp/file_2 <==
header
4,4,4,4
5,5,5,5

==> tmp/file_x <==
header
x,x,x,x
x,x,x,x
x,x,x,x
x,x,x,x

==> tmp/file_y <==
header
y,y,y,y
y,y,y,y
y,y,y,y

Upvotes: 1
