Hell

Reputation: 35

How to get the unique count of strings

These are some of the lines in my file. I want to count the unique blades for each date in the file.

sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade2-3.mon.demandware.net-0-appserver-20201105.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade5-0.mon.demandware.net-0-appserver-20201105.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade3-9.mon.demandware.net-0-appserver-20201105.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade4-5.mon.demandware.net-0-appserver-20201104.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade4-6.mon.demandware.net-0-appserver-20201104.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade4-5.mon.demandware.net-0-appserver-20201103.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade4-2.mon.demandware.net-0-appserver-20201104.log'
sS'/on/demandware.servlet/webdav/Sites/Logs/service-ACI_Preauth_Card-blade3-9.mon.demandware.net-0-appserver-20201104.log'

This is the script I've tried.

#!/bin/bash +x
pwd
cat *.p >> test.txt
awk '{ match($0,/[0-9]{8}/);arr[substr($0,RSTART,RLENGTH)]+=1;match($0,/blade/);spoint=RSTART+RLENGTH;match($0,/\.demandware/) } END { for (i in arr) { print i" - "arr[i]} } ' test.txt >> gen_output.txt
grep  "2020" gen_output.txt

The output I get is:

20201105 - 3
20201104 - 4
20201103 - 1

This counts every blade entry for a given day, including repeats of the same blade.

The desired output is:

20201105 - 3
20201104 - 2
20201103 - 1

On 20201104 only blade4 and blade3 appear; blade4 is repeated three times, so it should be counted once. Please suggest some ideas.

Upvotes: 1

Views: 88

Answers (3)

glenn jackman

Reputation: 246764

grep -Eo 'blade[0-9]+|[0-9]{8}' file | paste - - | sort -u | cut -f2 | sort | uniq -c

outputs

      1 20201103
      2 20201104
      3 20201105
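
To see what each stage contributes, and to get the `date - count` layout the question asks for, the same pipeline can be wrapped in a small function. This is only a sketch: `count_blades_per_day` is a made-up name, and it assumes every line contains exactly one `bladeN` token followed by exactly one 8-digit date, as in the sample.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption): reads log paths on stdin
# and prints "YYYYMMDD - N", where N is the number of distinct blades that day.
count_blades_per_day() {
  grep -Eo 'blade[0-9]+|[0-9]{8}' |  # print the blade name, then the date, one match per line
    paste - - |                       # rejoin each pair as "bladeN<TAB>YYYYMMDD"
    sort -u |                         # keep each (blade, date) pair only once
    cut -f2 |                         # keep just the date column
    sort -r |                         # group equal dates together, newest first
    uniq -c |                         # count the distinct blades per date
    awk '{ print $2 " - " $1 }'       # turn "count date" into "date - count"
}
```

It could be called as `count_blades_per_day < test.txt`. If a path ever contains a second 8-digit run, `paste - -` would pair the wrong tokens, so check that assumption against real data first.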

Upvotes: 2

anubhava

Reputation: 785008

You can get this done in a single awk:

awk 'match($0, /-blade[0-9]+/) {
   b = substr($0, RSTART, RLENGTH)
}
match($0, /[0-9]{8}/) {
   d = substr($0, RSTART, RLENGTH)
   if (!seen[d,b]++)
      freq[d]++
}
END {
   for (i in freq)
      print i, freq[i]
}' file

Output:

20201103 1
20201104 2
20201105 3
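
One caveat: awk does not guarantee any particular order for `for (i in freq)`, so the sorted listing above is not promised on every implementation. A minimal sketch that makes the order explicit by piping through `sort -r`, and prints the `date - count` format from the question, could look like this. The function name `blade_days` is hypothetical, and the date pattern is spelled out as eight `[0-9]` classes on the assumption that some awk builds lack `{8}` interval support.

```shell
#!/bin/bash
# Hypothetical wrapper (name is an assumption): reads the named files,
# or stdin when no file is given, and prints "YYYYMMDD - N" newest first.
blade_days() {
  awk '
    # Remember the blade name found on this line, e.g. "-blade4".
    match($0, /-blade[0-9]+/) { b = substr($0, RSTART, RLENGTH) }
    # Find the first 8-digit run and treat it as the date.
    match($0, /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/) {
      d = substr($0, RSTART, RLENGTH)
      # Count each (date, blade) pair only the first time it is seen.
      if (!seen[d, b]++) freq[d]++
    }
    END { for (d in freq) print d " - " freq[d] }
  ' "$@" | sort -r   # YYYYMMDD sorts correctly as plain text
}
```

Usage would be `blade_days test.txt`, or `cat *.p | blade_days`.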

Upvotes: 2

RavinderSingh13

Reputation: 133458

1st solution (without sorting): Please try the following, written and tested only against the shown samples, in GNU awk.

awk -F"[-.]" '
match($0,/ACI_Preauth_Card-blade[0-9]+/){
  val=substr($0,RSTART,RLENGTH)
  if(!arr[val,$(NF-1)]++){
     arr1[$(NF-1)]++
  }
  val=""
}
END{
  for(key in arr1){
    print key" - "arr1[key]
  }
}' Input_file

Output will be as follows.

20201103 - 1
20201104 - 2
20201105 - 3


2nd solution (with gawk's sorting option): If you have GNU awk and need the output in descending YYYYMMDD order, try the following.

awk -F"[-.]" '
match($0,/ACI_Preauth_Card-blade[0-9]+/){
  val=substr($0,RSTART,RLENGTH)
  if(!arr[val,$(NF-1)]++){
    arr1[$(NF-1)]++
  }
  val=""
}
END{
  PROCINFO["sorted_in"] = "@ind_num_desc"
  for(key in arr1){
     print key" - "arr1[key]
  }
}' Input_file

Output will be as follows.

20201105 - 3
20201104 - 2
20201103 - 1

Upvotes: 2
