lukascbossert
lukascbossert

Reputation: 411

Getting all values of various rows which have the same value in one column with awk

I have a data set (test-file.csv) with tree columns:

node,contact,mail
AAAA,Peter,[email protected]
BBBB,Hans,[email protected]
CCCC,Dieter,[email protected]
ABABA,Peter,[email protected]
CCDDA,Hans,[email protected]

I like to extend the header by the column count and rename node to nodes. Furthermore all entries should be sorted after the second column (mail). In the column count I like to get the number of occurences of the column mail, in nodes all the entries having the same value in the column mail should be printed (space separated and alphabetically sorted).

This is what I try to achieve:

contact,mail,count,nodes
Dieter,dieter@anything,com,1,CCCC
Hans,[email protected],2,BBBB CCDDA
Peter,peter@anything,com,2,AAAA ABABA

I have this awk-command:

awk -F"," '
BEGIN{
  FS=OFS=",";
printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR>1{
    counts[$3]++;     # Increment count of lines.
    contact[$2];      # contact
}
END {
    # Iterate over all third-column values.
    for (x in counts) {
    printf "%s,%s,%s,%s\n", contact[x],x,counts[x],"nodes"
    }
}
' test-file.csv | sort --field-separator="," --key=2 -n

However this is my result :-( Nothing but the amount of occurences work.

,[email protected],1,nodes
,[email protected],2,nodes
,[email protected],2,nodes
contact,mail,count,nodes

Any help appreciated!

Upvotes: 3

Views: 171

Answers (2)

RavinderSingh13
RavinderSingh13

Reputation: 133428

With your shown samples please try following. Written and tested in GNU awk.

awk '
BEGIN{ FS=OFS="," }
FNR==1{
  sub(/^[^,]*,/,"")
  $1=$1
  print $0,"count,nodes"
}
FNR>1{
  nf=$2
  mail[nf]=$NF
  NF--
  arr[nf]++
  val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
  for(i in arr){
    print i,mail[i],arr[i],val[i] | "sort -t, -k1"
  }
}
' Input_file

Explanation: Adding detailed explanation for above.

awk '                                 ##Starting awk program from here.
BEGIN{ FS=OFS="," }                   ##In BEGIN section setting FS, OFS as comma here.   
FNR==1{                               ##if this is first line then do following.
  sub(/^[^,]*,/,"")                   ##Substituting everything till 1st comma here with NULL in current line.
  $1=$1                               ##Reassigning 1st field to itself.
  print $0,"count,nodes"            ##Printing headers as per need to terminal.
}
FNR>1{                                ##If line is Greater than 1st line then do following.
  nf=$2                               ##Creating nf with 2nd field value here.
  mail[nf]=$NF                        ##Creating mail with nf as index and value is last field value.
  NF--                                ##Decreasing value of current number of fields by 1 here.
  arr[nf]++                           ##Creating arr with index of nf and keep increasing its value with 1 here.
  val[nf]=(val[nf]?val[nf] " ":"")$1  ##Creating val with index of nf and keep adding $1 value in it.
}
END{                                  ##Starting END block of this program from here.
  for(i in arr){                      ##Traversing through arr in here.
    print i,mail[i],arr[i],val[i] | "sort -t, -k1"  ##printing values to get expected output and sorting it also by pipe here as per requirement.
  }
}
' Input_file                          ##Mentioning Input_file name here.


2nd solution: In case you want to sort by 2nd and 3rd fields then try following.

awk '
BEGIN{ FS=OFS="," }
FNR==1{
  sub(/^[^,]*,/,"")
  $1=$1
    print $0,"count,nodes"
}
FNR>1{
  nf=$2 OFS $3
  NF--
  arr[nf]++
  val[nf]=(val[nf]?val[nf] " ":"")$1
}
END{
  for(i in arr){
    print i,arr[i],val[i] | "sort -t, -k1"
  }
}
'  Input_file

Upvotes: 3

anubhava
anubhava

Reputation: 784938

You may use this gnu awk:

awk '
BEGIN {
   FS = OFS = ","
   printf "%s,%s,%s,%s\n", "contact","mail","count","nodes"
}
NR > 1 {
   ++counts[$3]    # Increment count of lines.
   name[$3] = $2
   map[$3] = ($3 in map ? map[$3] " " : "") $1
}
END {
   # Iterate over all third-column values.
   PROCINFO["sorted_in"]="@ind_str_asc";
   for (k in counts)
       print name[k], k, counts[k], map[k]
}
' test-file.csv

Output:

contact,mail,count,nodes
Dieter,[email protected],1,CCCC
Hans,[email protected],2,BBBB CCDDA
Peter,[email protected],2,AAAA ABABA

Upvotes: 4

Related Questions