Reputation: 625
Would like print to incrementally count and then print the counts of the unique values in column 1 & column 2 & column 3 ...Column NF and Column $0 and if the word is appeared only one time of column 1, would like to print remarks as "No" as duplicated flag and if the word is appeared more than one time of column 1, would like to print remarks as "Yes" as duplicated flag
Looking something like this
awk -F"," '{OFS=","; if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 ",", "Yes/No", counter }'
For example, I am trying to check is there any duplicate information in the field $1 (Fruits Name) . Under Name field, "Apple" appears three times , "Orange" appears two times,"Mango" appear one time. So if any word is not repeated more than one time is consieder as "Name_Dup=No" duplicate and count of appears is "Name_Counter=1" (i.e Mango)
where "Apple" appears 3 times , so it is repeated/duplicated -remarks as "Yes" when it appears first time count is "Name_Dup=Yes" and Name_Counter=1" , when it appears second time "Name_Dup=Yes" and Name_Counter=2, when it appears 3rd time "Name_Dup=Yes" and Name_Counter=3
Then need to check each column $2, $3 .. till $NF and $0 ..
My actual input file is not sorted on any order. No of fields used to be vary like 10 fields, 12 fields and 15 fields etc
Input.csv
Name,Amount,Dept
Apple,10,eee
Orange,20,csc
Apple,30,mec
Mango,40,sss
Apple,10,eee
Orange,10,csc
Desired Output
Name,Amount,Dept,Name_Dup,Name_Counter,Amount_Dup,Amount_Counter,Dept_Dup,Dept_Counter,EntireLine_Dup,EntireLine_Counter
Apple,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,Yes,1,No,1,Yes,1,No,1
Apple,30,mec,Yes,2,No,1,No,1,No,1
Mango,40,sss,No,1,No,1,No,1,No,1
Apple,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
Orange,10,csc,Yes,2,Yes,3,Yes,2,No,1
For example , Please find below steps for reference.
Step#1 - Field $1 check and Output
Name,Name_Dup,Name_Counter
Apple,Yes,1
Orange,Yes,1
Apple,Yes,2
Mango,No,1
Apple,Yes,3
Orange,Yes,2
Step#2 - Field $2 check and Output
Amount,Amount_Dup,Amount_Counter
10,Yes,1
20,No,1
30,No,1
40,No,1
10,Yes,2
10,Yes,3
Step#3 - Field $3 check and Output
Dept,Dept_Dup,Dept_Counter
eee,Yes,1
csc,Yes,1
mec,No,1
sss,No,1
eee,Yes,2
csc,Yes,2
Step#4-Field $0 check, combination of $1 & $2 & $3 and Output
"Name,Amount,Dept",EntireLine_Dup,EntireLine_Counter
"Apple,10,eee",Yes,1
"Orange,20,csc",No,1
"Apple,30,mec",No,1
"Mango,40,sss",No,1
"Apple,10,eee",Yes,2
"Orange,10,csc",No,1
Upvotes: 0
Views: 689
Reputation: 2892
awk solution:
OP asks for, as I understand it, to show per line, per column, if a column value shows up more than once and give an occurrence count of this particular column so far.
$ cat tst.awk
BEGIN{ FS=OFS="," }
NR==1{
header=$0
n=split("Dup,Counter",h)
for (i=1; i<=NF; i++)
for (j=1; j<=n; j++) header=header OFS $i"_"h[j]
printf("%s,EntireLine_Dup,EntireLine_Counter\n", header)
next
}
{
r[++lines]=$0
for (col=1; col<=NF; col++) v[col][$col]++
v[col][$0]++
}
END {
for (l=1; l<=lines; l++){
n=split(r[l], s)
res=""
for (c=1; c<=n; c++)
res=res OFS output(v,c,s[c])
res=res OFS output(v,c,r[l])
print r[l] res
}
}
function output(arr, col, val){
return sprintf("%s,%s", (arr[col][val] > 1? "Yes" : "No"), ++count[col][val])
}
with input:
$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee
this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):
$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1
Upvotes: 1
Reputation: 192
you are not providing what efforts you placed so far. Here is a hint where I would start. I guess since awk is the tool to use, start with shell command sort Input.csv and pipe it to awk. Populate an array when reading the input as well as an associative array with index the first field. I n the END section go over the array and see if you find the first field more than once. It takes a bit of time however that sounds like a homework. Not a production problem.
Upvotes: 0