Avenger

Reputation: 877

Script count unique records of file

I have a file script.sh:

script.sh

cd /folder
mv a.csv result.csv

The a.csv file will contain many records (several GB) in the form:

id,name
1,"platinum"
2,"joe"
1,"platinum"
...

What I want is for the script to create a file called records.txt, containing the total number of records and the number of records with unique IDs.

records.txt

Total Records: 3
Unique Records: 2

The totals should exclude the id,name header line.

I want to do this in the script, after the mv. How can I do that?

Upvotes: 1

Views: 811

Answers (3)

costaparas

Reputation: 5237

Your script.sh could look like this:

cd /folder
mv a.csv result.csv

total=$((`wc -l < result.csv` - 1))
unique=$((`sort result.csv | uniq | wc -l` - 1))

cat > records.txt <<eof
Total Records: $total
Unique Records: $unique
eof

This uses a pair of simple pipelines that count lines with wc.

Note, we subtract 1 in each case because of the header line.

For the unique count, we sort before running uniq, because uniq only removes adjacent duplicate lines.

The counts are then written to records.txt using a here-document.

(Note: I've used backticks here purely to avoid having too many parentheses. You can use the $(...) command substitution syntax instead, but it's not essential here since no nesting is required.)
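One caveat: sort | uniq counts distinct whole lines, not distinct IDs. If two rows share an id but differ in name, they would both be counted. A minimal sketch of counting on the id column only, assuming the id is always the first comma-separated field and never contains a comma itself (the temporary directory and sample rows are made up just to keep the example self-contained):

```shell
#!/bin/sh
# Hypothetical sample data in a temporary directory (not the real /folder).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'id,name' '1,"platinum"' '2,"joe"' '1,"platinum"' > result.csv

# Total data rows: everything after the header line.
total=$(tail -n +2 result.csv | wc -l)
# Unique IDs: keep only the first field, then sort and drop duplicates.
unique=$(tail -n +2 result.csv | cut -d, -f1 | sort -u | wc -l)

# $((...)) strips the padding that some wc implementations add.
printf 'Total Records: %s\nUnique Records: %s\n' "$((total))" "$((unique))" > records.txt
cat records.txt
```

For the sample above this writes 3 total records and 2 unique IDs. Here sort -u does the same job as sort | uniq in a single command.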

Upvotes: 0

Raman Sailopal

Reputation: 12867

Awk alternative:

awk -F, 'NR > 1 { tot++; if (!seen[$1]++) utot++ } END { print "Total Records: " (tot + 0); print "Unique Records: " (utot + 0) }' result.csv > records.txt

Set the field separator to , and skip the header line (NR > 1). Every remaining line increments the running total (tot). The seen array, indexed by id, ensures that each id increments the unique total (utot) only the first time it appears. At the end, print both values.
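It is worth being explicit about which "unique" is meant, since "distinct IDs" and "IDs that occur exactly once" give different numbers. A rough sketch that reports both in one pass, using made-up sample rows in a temporary directory and a hypothetical output file summary.txt:

```shell
#!/bin/sh
# Hypothetical sample data; summary.txt is an illustrative output name.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'id,name' '1,"platinum"' '2,"joe"' '1,"platinum"' > result.csv

awk -F, '
NR > 1 { total++; count[$1]++ }        # skip the header, tally each id
END {
    for (id in count) {
        distinct++                      # every id that appears at all
        if (count[id] == 1) once++      # ids that appear exactly one time
    }
    print "Total Records: " (total + 0)
    print "Distinct IDs: " (distinct + 0)
    print "IDs seen once: " (once + 0)
}' result.csv > summary.txt
cat summary.txt
```

For the sample rows there are 3 records and 2 distinct IDs, but only one id (2) occurs exactly once. Since awk reads the file once and keeps only the IDs in memory, this avoids sorting several GB of data, at the cost of holding one array entry per distinct id.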

Upvotes: 0

0stone0

Reputation: 43954

Use sort with uniq to get the unique lines (both are needed because uniq only collapses adjacent duplicates, so the input must be sorted first), then use wc -l to count them:

#!/bin/bash

total=$(tail -n +2 tst.csv | wc -l)
unique=$(tail -n +2 tst.csv | sort | uniq | wc -l)

echo "Total Records: ${total}"
echo "Unique Records: ${unique}"

Total Records: 3
Unique Records: 2


Note: tail -n +2 skips the first line of the CSV, since you don't want to count the header.
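To see why sort is needed before uniq, here is a small sketch on a made-up three-line input where the duplicate values are not adjacent:

```shell
#!/bin/sh
# Hypothetical input: the two 1s are separated by a 2.
unsorted=$(printf '%s\n' 1 2 1 | uniq | wc -l)
sorted=$(printf '%s\n' 1 2 1 | sort | uniq | wc -l)

# $((...)) strips the padding that some wc implementations add.
echo "uniq alone: $((unsorted))"
echo "sort | uniq: $((sorted))"
```

uniq alone leaves all 3 lines in place because no two equal lines are next to each other. After sorting, the two 1s become adjacent and the count drops to 2.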

Upvotes: 1
