mikernova

Reputation: 55

Extracting matching lines from a CSV

I have a file that looks like this:

64fe12c7-b50c-4f63-b292-99f4ed74e5aa, ip, 1.2.3.4, 
64fe12c7-b50c-4f63-b292-99f4ed74e5aa, ip, 4.5.6.7, 
bacd8a9d-807f-4ae9-95d2-f7cc17222cab, ip, 0.0.0.0/0, silly string
bacd8a9d-807f-4ae9-95d2-f7cc17222cab, ip, 0.0.0.0/0, crazy town
db86d211-0b09-4a8f-b222-a21a54ad2f9c, ip, 8.9.0.1, wild wood
db86d211-0b09-4a8f-b222-a21a54ad2f9c, ip, 0.0.0.0/0, wacky tabacky
611f8cf5-f6f2-4f3a-ad24-12245652a7bd, ip, 0.0.0.0/0, cuckoo cachoo

I would like to extract a list of just the unique GUIDs where

  1. The GUID doesn't have a 0.0.0.0/0 in column 3, or
  2. Column 3 matches 0.0.0.0/0, the GUID appears on more than one line, and at least one of those lines is not 0.0.0.0/0

In this case, the desired output would be:

64fe12c7-b50c-4f63-b292-99f4ed74e5aa
db86d211-0b09-4a8f-b222-a21a54ad2f9c

Trying to think through this, I feel like I should build an array/list of the unique GUIDs, then grep the matching lines for each one and apply the two conditions above, but I just don't know the best way to go about this in a short script, or perhaps a grep/awk/sort/cut one-liner. I'd appreciate any help!

(The original file is a 4-column CSV where the 4th column is often null.)

Upvotes: 4

Views: 135

Answers (4)

melpomene

Reputation: 85837

Sounds like it could be done with a three-step pipeline:

  1. Filter out rows where column 3 is 0.0.0.0/0: grep -v '^[^,]*,[^,]*, *0\.0\.0\.0/0,'
  2. Select column 1: cut -d, -f1
  3. Only print unique elements: sort -u (alternatively, if all duplicates are adjacent, uniq)
Putting it all together:

grep -v '^[^,]*,[^,]*, *0\.0\.0\.0/0,' | cut -d, -f1 | sort -u
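For example, assuming the sample above is saved as file.csv (the filename is just for illustration), the pipeline produces the desired list:

$ grep -v '^[^,]*,[^,]*, *0\.0\.0\.0/0,' file.csv | cut -d, -f1 | sort -u
64fe12c7-b50c-4f63-b292-99f4ed74e5aa
db86d211-0b09-4a8f-b222-a21a54ad2f9c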

Upvotes: 0

Vinicius Placco
Vinicius Placco

Reputation: 1731

Just adding another possible solution, similar to (but uglier than, and using more than one command) the other proposed awk solution. If I understood the question correctly, your condition #2 is already taken into account by #1. In any case, the following awk+sort worked for me:

awk -F, '$3!~/^ 0\.0\.0\.0\/0/ {print $1}' file.csv | sort -u

Using the -u (unique) flag on sort excludes the duplicates. Not completely foolproof, but it works in this case.
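To see what -u buys you: assuming the sample data is in file.csv, the awk filter alone emits one GUID twice, and sort -u collapses the duplicate:

$ awk -F, '$3!~/^ 0\.0\.0\.0\/0/ {print $1}' file.csv
64fe12c7-b50c-4f63-b292-99f4ed74e5aa
64fe12c7-b50c-4f63-b292-99f4ed74e5aa
db86d211-0b09-4a8f-b222-a21a54ad2f9c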

Hope it helps!

Upvotes: 0

Akshay Hegde

Reputation: 16997

Using awk:

awk -F, '$3 !~/0\.0\.0\.0\/0/ && !seen[$1]++{print $1}' infile

Explanation:

  • $3 !~ /0\.0\.0\.0\/0/ means field 3 doesn't match the regexp, and (&&)
  • !seen[$1]++ means field 1 hasn't been seen before (whenever awk sees a duplicate key ($1), the array value is incremented by 1; we use logical negation so the test is true only for the first occurrence)
    • ! is the logical negation operator
    • seen is the array
    • $1 is the array key
    • ++ is the post-increment operator
  • print $1 prints field 1

Test Results:

$ cat infile
64fe12c7-b50c-4f63-b292-99f4ed74e5aa, ip, 1.2.3.4, 
64fe12c7-b50c-4f63-b292-99f4ed74e5aa, ip, 4.5.6.7, 
bacd8a9d-807f-4ae9-95d2-f7cc17222cab, ip, 0.0.0.0/0, silly string
bacd8a9d-807f-4ae9-95d2-f7cc17222cab, ip, 0.0.0.0/0, crazy town
db86d211-0b09-4a8f-b222-a21a54ad2f9c, ip, 8.9.0.1, wild wood
db86d211-0b09-4a8f-b222-a21a54ad2f9c, ip, 0.0.0.0/0, wacky tabacky
611f8cf5-f6f2-4f3a-ad24-12245652a7bd, ip, 0.0.0.0/0, cuckoo cachoo

$ awk -F, '$3 !~/0\.0\.0\.0\/0/ && !seen[$1]++{print $1}' infile
64fe12c7-b50c-4f63-b292-99f4ed74e5aa
db86d211-0b09-4a8f-b222-a21a54ad2f9c

Upvotes: 2

RomanPerekhrest

Reputation: 92884

Awk solution:

awk -F',[[:space:]]*' '$3 !~ /^(0\.){3}0\/0/{ guids[$1] }
                       END{ for(k in guids) print k }' testfile.txt

The output:

db86d211-0b09-4a8f-b222-a21a54ad2f9c
64fe12c7-b50c-4f63-b292-99f4ed74e5aa
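Note that for (k in guids) visits keys in unspecified order, which is why the GUIDs come out in a different order here. If a deterministic order matters, a minimal sketch (assuming GNU awk, whose PROCINFO["sorted_in"] controls iteration order) is:

awk -F',[[:space:]]*' '$3 !~ /^(0\.){3}0\/0/{ guids[$1] }
                       END{ # GNU awk only: iterate keys in ascending string order
                            PROCINFO["sorted_in"] = "@ind_str_asc"
                            for(k in guids) print k }' testfile.txt

Piping the original command's output through a plain sort achieves the same with any awk.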

Upvotes: 1
