Mark McWhirter

Reputation: 1196

Count of unique lines based on first field in file

I am trying to get a count of unique lines output to a file based on the first field, where the input lines look like:

Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
Forms.js     /forms/Forms1.js    http://www.gumby.com/test.htm   404
Forms.js     /forms/Forms2.js    http://www.gumby.com/test.htm   404
Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404    
Interpret.js     /forms/Interpret2.js    http://www.gumby.com/test.htm   404
Interpret.js     /forms/Interpret3.js    http://www.gumby.com/test.htm   404

To something like this:

3    Forms.js    /forms/Forms.js     http://www.gumby.com.mx/test.htm 404
3    Interpret.js    /forms/Interpret.js    http://www.gumby.com.mx/test.htm  404

I have been trying various combinations of sort and uniq, but haven't hit on it yet. I can get distinct lines using the whole line, but I only want to compare on the first field. I am currently using Cygwin. I am not awk literate, but I suspect that is the route to go. Does anyone have a handy solution?

Upvotes: 2

Views: 1283

Answers (6)

Ed Morton

Reputation: 203899

$ awk '!c[$1]++{v[$1]=$0} END{for (i in c) print c[i],v[i]}' file
3 Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
3 Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404

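The above uses the common awk idiom '!array[key]++' to tell whether a key (which can be $0, $1, $4,$5, and so on) has been seen before. The expression is true only the first time a key appears, so v[$1] keeps the first line for each key while c[$1] ends up holding the count. The order of a for (i in c) loop is unspecified, so the output lines may come out in any order.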
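To see the idiom in isolation, here is a minimal sketch that prints only the first line seen for each value of the first field. The function name first_per_key and the sample data are made up for illustration; they are not part of the answer above.

```shell
# first_per_key is a hypothetical helper name used only for this sketch.
# seen[$1]++ evaluates to 0 (false) the first time a key appears, so
# !seen[$1]++ is true exactly once per distinct first field, and awk's
# default action prints the line.
first_per_key() {
    awk '!seen[$1]++' "$@"
}

printf '%s\n' 'a 1' 'b 2' 'a 3' 'b 4' 'c 5' | first_per_key
# prints:
# a 1
# b 2
# c 5
```

Because the pattern prints as it goes, this variant keeps the input order, unlike the END loop.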

Upvotes: 2

Thor

Reputation: 47149

This:

<infile awk '{ h[$1]++ } END { for(k in h) print h[k], k }'

Will get you:

3 Forms.js
3 Interpret.js

If you also want to keep the first hit use:

<infile awk '!h[$1] { g[$1]=$0 } { h[$1]++ } END { for(k in g) print h[k], g[k] }'

Output:

3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404

Tested with GNU awk.

Note that this does not require input to be sorted. Also note that the results are unordered.
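If the unordered output is a problem, one option is to sort the result on the key afterwards. This is only a sketch: the wrapper name count_by_first and the sample data are assumptions for illustration.

```shell
# count_by_first is a hypothetical name. It wraps the awk command above and
# sorts the result on the second output field (the original first field),
# so repeated runs print the groups in the same order.
count_by_first() {
    awk '!h[$1] { g[$1] = $0 } { h[$1]++ } END { for (k in g) print h[k], g[k] }' "$@" |
        sort -k2,2
}

printf '%s\n' 'b x' 'a y' 'b z' | count_by_first
# prints:
# 1 a y
# 2 b x
```

The input still does not need to be sorted; only the handful of summary lines is.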

Upvotes: 4

nullrevolution

Reputation: 4137

Assuming file.txt contains your sample input:

sort file.txt | awk -f counts.awk

returns:

3:Forms.js     /forms/Forms.js     http://www.gumby.com/test.htm   404
3:Interpret.js     /forms/Interpret1.js    http://www.gumby.com/test.htm   404

awk script file:

cat counts.awk

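# Output format is:
#   TimesFirstFieldIsRepeated:FirstMatchingLineContents
# Input must already be sorted (grouped) on the first field.

# The first field changed: print the group that just ended.
NR > 1 && $1 != prev {
  print n ":" first;
  n = 0;
}

# Start of a new group: remember its first line.
# This also covers keys that occur only once.
n == 0 {
  first = $0;
}

# Count every line and remember its key for the next comparison.
{
  n++;
  prev = $1;
}

# Print the last group, unless the input was empty.
END {
  if (NR > 0)
    print n ":" first;
}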

Upvotes: 1

Chris Seymour

Reputation: 85835

Awk is the tool for this but if you want to be clever with uniq:

$ column -t file | uniq -w12 -c
      3 Forms.js      /forms/Forms.js       http://www.gumby.com/test.htm  404
      3 Interpret.js  /forms/Interpret1.js  http://www.gumby.com/test.htm  404

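column -t aligns all the columns so we get a fixed width for column one. The 12 passed to -w is the length of the longest first field here (Interpret.js), so uniq compares only that column. Note that -w is a GNU uniq option.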


Or, if column isn't available, a hack is to append the first field to the end of each line with awk, use uniq -c -f4 to skip the four original fields and compare on the appended copy only, and then use awk again to drop that last field.

$ awk '{print $0, $1}' file | uniq -c -f4 | awk '{$NF=""; NF--; print}'
3 Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404
3 Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404

It would be nice if uniq -f worked like sort's -k, so you could write -f4,4 or -f1,1.
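Decrementing NF to drop the last field is not rebuilt into $0 by every awk, so here is a hedged variant of the same hack that strips the appended field with sub() instead. The function name count_runs is made up for this sketch, and like the original it assumes every input line has exactly four fields and that equal keys are adjacent.

```shell
# count_runs is a hypothetical helper name.
# Step 1: append a copy of field 1 to each line.
# Step 2: uniq -c -f4 skips the four original fields and compares the copy.
# Step 3: sub() removes the appended copy again; the count that uniq
#         prepended (with its padding) is left untouched.
count_runs() {
    awk '{ print $0, $1 }' "$@" |
        uniq -c -f4 |
        awk '{ sub(/[ \t]+[^ \t]+[ \t]*$/, ""); print }'
}

printf '%s\n' 'a p u 404' 'a q u 404' 'b p u 404' | count_runs
```

The output keeps whatever left padding uniq -c puts in front of the count.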


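Or you could use rev to reverse each line so that uniq -c -f3 compares on what was the first field, and then rev back. The count ends up at the end of the line, and because it gets reversed too, a count of 12 would print as 21. Also, if you don't have column you probably don't have rev either.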

$ rev file | uniq -c -f3 | rev
Forms.js /forms/Forms.js http://www.gumby.com/test.htm 404 3      
Interpret.js /forms/Interpret1.js http://www.gumby.com/test.htm 404 3

Upvotes: 2

Pilou

Reputation: 1478

You can count the occurrences of the first field with cut, but what do you want to print after that field?

cut -d " " -f 1 file | uniq -c

Upvotes: 0

exic

Reputation: 2678

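I'd just use cut -f 1 | uniq -c (add -d ' ' to cut if the fields are separated by spaces rather than tabs, which is cut's default). That won't give you the whole line, but if the lines differ, printing any one of them won't make much sense anyway. It depends on what you want to achieve.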
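As a minimal sketch of that pipeline, assuming space-separated fields and input that may not be grouped yet (the function name count_keys and the sample data are made up for illustration):

```shell
# count_keys is a hypothetical helper name.
# cut defaults to tab as the delimiter, so -d ' ' is needed for
# space-separated input; sort runs first because uniq only merges
# adjacent duplicate lines.
count_keys() {
    cut -d ' ' -f 1 "$@" | sort | uniq -c
}

printf '%s\n' 'b x' 'a y' 'b z' | count_keys
```

For this input it reports a count of 1 for a and 2 for b, each count left-padded by uniq.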

Upvotes: 0
