oneindelijk

Reputation: 636

awk 'uniq' on a range of columns

I'm trying to filter out all duplicates of a list, ignoring the first n columns, preferably using awk (but I'm open to other implementations).

I've found a solution for a fixed number of columns, but since I don't know how many columns there will be, I need a range. I found that solution here.

For clarity: what I'm trying to achieve is an alias for history that filters out duplicates but leaves the history ID intact, preferably without messing with the order. The history is in this form:

ID    DATE       HOUR     command
 5612  2019-07-25 11:58:30 ls /var/log/schaubroeck/audit/2019/May/
 5613  2019-07-25 12:00:22 ls /var/log/schaubroeck/
 5614  2019-07-25 12:11:30 ls /etc/logrotate.d/
 5615  2019-07-25 12:11:35 cat /etc/logrotate.d/samba
 5616  2019-07-25 12:11:49 cat /etc/logrotate.d/named

So this command works for commands up to four words long, but I need to replace the fixed columns with a range to account for all cases:

history | awk -F "[ ]" '!keep[$4 $5 $6 $7]++'

I feel @kvantour is getting me on the right path, so I tried:

history | awk '{t=$0;$1=$2=$3=$4="";k=$0;$0=t}_[k]++' | grep cd

But this still yields duplicate lines:

 1102  2017-10-27 09:05:07 cd /tmp/
 1109  2017-10-27 09:07:03 cd /tmp/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1127  2017-12-29 11:13:26 cd /tmp/
 1144  2018-06-21 13:04:26 cd /etc/init.d/
 1161  2018-06-28 09:53:21 cd /etc/init.d/
 1169  2018-07-09 16:33:52 cd /var/log/
 1179  2018-07-10 15:54:32 cd /etc/init.d/

Upvotes: 0

Views: 316

Answers (2)

Chris Maes

Reputation: 37842

You can use sort:

history | sort -u -k4
  • -u for unique
  • -k4 to sort on everything from the fourth field through the end of the line.

Running this on

 1102  2017-10-27 09:05:07 cd /tmp/
 1109  2017-10-27 09:07:03 cd /tmp/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1127  2017-12-29 11:13:26 cd /tmp/
 1144  2018-06-21 13:04:26 cd /etc/init.d/
 1161  2018-06-28 09:53:21 cd /etc/init.d/
 1169  2018-07-09 16:33:52 cd /var/log/
 1179  2018-07-10 15:54:32 cd /etc/init.d/

yields:

 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1102  2017-10-27 09:05:07 cd /tmp/
 1169  2018-07-09 16:33:52 cd /var/log/

EDIT: if you want to keep the original order, you can apply a second sort:

history | sort -u -k4 | sort -n
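
Applied to the sample output above, the second, numeric sort puts the surviving lines back in history order, so this should yield:

 1102  2017-10-27 09:05:07 cd /tmp/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1169  2018-07-09 16:33:52 cd /var/log/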

Upvotes: 2

kvantour

Reputation: 26581

The command you propose will not work as you expect. Imagine you have two lines like:

a b c d 12 13 1
x y z d 1 21 31

Both lines will be considered duplicates, because the key used in the array _ is d12131 for both.
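
If you do want a fixed number of fields, a common way to avoid such collisions is to put a separator between them. awk's comma syntax for array subscripts joins the values with SUBSEP (a character unlikely to appear in the data), so the two lines above get distinct keys:

history | awk '!keep[$4,$5,$6,$7]++'

This still hardcodes four fields, though, so it does not solve the variable-column problem by itself.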

This is probably what you are interested in:

$ history | awk '{t=$0;$1=$2=$3="";k=$0;$0=t}!_[k]++'

Here we store the original record in the variable t, then remove the first three fields by assigning them empty values. This rebuilds the record $0, which we store in the key k. We then restore $0 from t. The check !_[k]++ uses the key k, which now holds all fields except the first three.
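
An equivalent sketch, assuming the default whitespace field splitting, builds the key with an explicit loop from the fourth field onward instead of modifying $0:

$ history | awk '{k=""; for (i=4; i<=NF; i++) k = k SUBSEP $i} !_[k]++'

Because $0 is never touched, there is no need to save and restore it through t, and it works for any number of columns.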

Note: setting the field separator with -F" " will not set it to a single space, but to any sequence of blanks (spaces and tabs). This is also the default behaviour. If you want a single space, use -F"[ ]".
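
A quick way to see the difference on a line containing two consecutive spaces:

$ echo 'a  b' | awk -F' ' '{print NF}'
2
$ echo 'a  b' | awk -F'[ ]' '{print NF}'
3

With -F'[ ]' the empty field between the two spaces counts as a field, which matters when a key is built from fixed field positions.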

Upvotes: 2
