awk to print incremental count of occurrences of unique values in each column

Question

Would like print to incrementally count and then print the counts of the unique values in column 1 & column 2 & column 3 ...Column NF and Column $0 and if the word is appeared only one time of column 1, would like to print remarks as "No" as duplicated flag and if the word is appeared more than one time of column 1, would like to print remarks as "Yes" as duplicated flag

Looking something like this

awk -F"," '{OFS=","; if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 ",", "Yes/No", counter }'

For example, I am trying to check is there any duplicate information in the field $1 (Fruits Name) . Under Name field, "Apple" appears three times , "Orange" appears two times,"Mango" appear one time. So if any word is not repeated more than one time is consieder as "Name_Dup=No" duplicate and count of appears is "Name_Counter=1" (i.e Mango)

where "Apple" appears 3 times , so it is repeated/duplicated -remarks as "Yes" when it appears first time count is "Name_Dup=Yes" and Name_Counter=1" , when it appears second time "Name_Dup=Yes" and Name_Counter=2, when it appears 3rd time "Name_Dup=Yes" and Name_Counter=3

Then need to check each column $2, $3 .. till $NF and $0 ..

My actual input file is not sorted on any order. No of fields used to be vary like 10 fields, 12 fields and 15 fields etc

Input.csv

Name,Amount,Dept
Apple,10,eee
Orange,20,csc
Apple,30,mec
Mango,40,sss
Apple,10,eee
Orange,10,csc

Desired Output

Name,Amount,Dept,Name_Dup,Name_Counter,Amount_Dup,Amount_Counter,Dept_Dup,Dept_Counter,EntireLine_Dup,EntireLine_Counter
Apple,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,Yes,1,No,1,Yes,1,No,1
Apple,30,mec,Yes,2,No,1,No,1,No,1
Mango,40,sss,No,1,No,1,No,1,No,1
Apple,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
Orange,10,csc,Yes,2,Yes,3,Yes,2,No,1

For example , Please find below steps for reference.

Step#1 - Field $1 check and Output

Name,Name_Dup,Name_Counter
Apple,Yes,1
Orange,Yes,1
Apple,Yes,2
Mango,No,1
Apple,Yes,3
Orange,Yes,2

Step#2 - Field $2 check and Output

Amount,Amount_Dup,Amount_Counter
10,Yes,1
20,No,1
30,No,1
40,No,1
10,Yes,2
10,Yes,3

Step#3 - Field $3 check and Output

Dept,Dept_Dup,Dept_Counter
eee,Yes,1
csc,Yes,1
mec,No,1
sss,No,1
eee,Yes,2
csc,Yes,2

Step#4-Field $0 check, combination of $1 & $2 & $3 and Output

"Name,Amount,Dept",EntireLine_Dup,EntireLine_Counter
"Apple,10,eee",Yes,1
"Orange,20,csc",No,1
"Apple,30,mec",No,1
"Mango,40,sss",No,1
"Apple,10,eee",Yes,2
"Orange,10,csc",No,1

Marc Lambrichs · Accepted Answer

awk solution:

OP asks for, as I understand it, to show per line, per column, if a column value shows up more than once and give an occurrence count of this particular column so far.

$ cat tst.awk
BEGIN{ FS=OFS="," }
NR==1{
   header=$0
   n=split("Dup,Counter",h)
   for (i=1; i<=NF; i++)
      for (j=1; j<=n; j++) header=header OFS $i"_"h[j]
   printf("%s,EntireLine_Dup,EntireLine_Counter
", header)
   next
}
{
   r[++lines]=$0
   for (col=1; col<=NF; col++) v[col][$col]++
   v[col][$0]++
}
END {
   for (l=1; l<=lines; l++){
      n=split(r[l], s)
      res=""
      for (c=1; c<=n; c++)
         res=res OFS output(v,c,s[c])
      res=res OFS output(v,c,r[l])
      print r[l] res
   }
}
function output(arr, col, val){
    return sprintf("%s,%s", (arr[col][val] > 1? "Yes" : "No"), ++count[col][val])
}

with input:

$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee

this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):

$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1

awk to print incremental count of occurrences of unique values in each column

Answers (2)

Related Questions