Split a field and then remove duplicates

Question

Sample file:

# cat test1 
-rw-r--r-- 1 root root   19460 Feb 10 03:56 catalina.2015-02-10.log
-rw-r--r-- 1 root root  206868 May  4 15:05 catalina.2015-05-04.log
-rw-r--r-- 1 root root  922121 Jun 24 09:26 catalina.out
-rw-r--r-- 1 root root       0 Feb 10 02:27 host-manager.2015-02-10.log
-rw-r--r-- 1 root root       0 May  4 04:17 host-manager.2015-05-04.log
-rw-r--r-- 1 root root    2025 Feb 10 03:56 localhost.2015-02-10.log
-rw-r--r-- 1 root root    8323 May  4 15:05 localhost.2015-05-04.log
-rw-r--r-- 1 root root     873 Feb 10 03:56 localhost_access_log.2015-02-10.txt
-rw-r--r-- 1 root root  458600 May  4 23:59 localhost_access_log.2015-05-04.txt
-rw-r--r-- 1 root root       0 Feb 10 02:27 manager.2015-02-10.log
-rw-r--r-- 1 root root       0 May  4 04:17 manager.2015-05-04.log

Expected Output:

catalina
host-manager
localhost
localhost_access_log
manager

Attempt 1 (works):

# awk '{split($9,a,"."); print a[1]}' test1 | awk '!z[$i]++'
catalina
host-manager
localhost
localhost_access_log
manager

Attempt 2 (works):

# awk '{split($9,a,"."); print a[1]}' test1 | uniq
catalina
host-manager
localhost
localhost_access_log
manager

Attempt 3 (fails):

# awk '{split($9,a,"."); a[1]++} {for (i in a){print a[i]}}' test1
1
2015-02-10
log
1
2015-05-04
log
1
out
.
.
.

Question:

I want to split the 9th field and then display only the unique entries, and I want to do this in a single awk one-liner. I am seeking help with my 3rd attempt.

fedorqui · Accepted Answer

You have to use the END block to print the results:

awk '{split($NF,a,"."); b[a[1]]} END{for (i in b){print i}}' file

Notes:

I am using $NF to catch the last field. This way, if you happen to have more or fewer than 9 fields, it will still work (as long as there are no filenames with spaces, because parsing ls is evil).
We cannot loop directly through the a[] array, because it only holds the pieces of the field that was just split. For this we need another array, for example b[], keyed by the first piece. That's why we say b[a[1]]. Merely referencing b[a[1]] creates the key, so there is no need for b[a[1]]++ unless you want to keep track of how many times each item appears.
The END block is executed after the whole file has been processed. Without it, you were looping through the results once per record (that is, once per line), which is why duplicates were appearing.
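
One caveat worth sketching: awk does not guarantee the order in which for (i in b) visits the keys, so the END-block version may not print the names in the order shown in the expected output. The following is a minimal sketch of two variants, not part of the original answer: one prints each prefix the first time it is seen (so input order is kept and no END block is needed), the other counts occurrences with b[a[1]]++ and pipes the result through sort. The names sample, seen, first_seen and counts are illustrative, and the listing is a reduced copy of the one in the question.

```shell
# Reduced copy of the listing from the question (illustrative sample data)
sample() {
cat <<'EOF'
-rw-r--r-- 1 root root   19460 Feb 10 03:56 catalina.2015-02-10.log
-rw-r--r-- 1 root root  206868 May  4 15:05 catalina.2015-05-04.log
-rw-r--r-- 1 root root  922121 Jun 24 09:26 catalina.out
-rw-r--r-- 1 root root       0 Feb 10 02:27 host-manager.2015-02-10.log
-rw-r--r-- 1 root root       0 May  4 04:17 manager.2015-05-04.log
EOF
}

# Variant 1: split the last field on every line, then print the first
# piece only when it has not been seen before (keeps input order)
first_seen=$(sample | awk '{split($NF,a,".")} !seen[a[1]]++ {print a[1]}')
printf '%s\n' "$first_seen"

# Variant 2: count how often each prefix appears, print the totals in END,
# and sort the output because for-in order is unspecified
counts=$(sample | awk '{split($NF,a,"."); b[a[1]]++} END{for (i in b) print i, b[i]}' | sort)
printf '%s\n' "$counts"
```

With this reduced listing, the first command prints catalina, host-manager and manager in input order, and the second prints each prefix followed by its count. The same sort trick applies to the original END-block command if a stable order matters.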
