Karthi
Karthi

Reputation: 708

Matching a column against multiple possible values

I am trying to use awk to filter out data from a file. As of now, for a single value I am using the command like the below

hadoop fs -text file:///a/b/filename.snappy  awk -F'|'  '$11 == 655' > filter_20180705.txt

I want to pass a list of values in comparison section, instead of passing $11 == 655, I would like to pass a list like $11 IN (list). Any thoughts on this will be helpful

Sample-data:

karthick,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,655,dev
kumar,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,611,dev
Raj,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,800,dev
John,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,823,dev

As of now I am getting the result as

karthick,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,655,dev

I will need pass a list in the comparison filter. For example, I want to add 611 and 823. So the expected result would be

karthick,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,655,dev
kumar,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,611,dev
John,bangalore,software,it,bfsi,spark,hadoop,bigdata,etl,pentaho,823,dev

Upvotes: 1

Views: 56

Answers (2)

Inian
Inian

Reputation: 85560

Define a variable in the context of Awk (using -v) to include alternates for your regex match (separated by |) and do your match as below. The ~ is your regex match operator in Awk that allows you match $11 to the string defined (see How to Use Regular Expressions).

awk -F, -v list="655|611|823" '$11 ~ list' file

With the above approach your command translates to match $11 to any of the entries defined in the variable list. To negate your regex match (i.e. opposite of what I want to match), use the negation ! operator as

awk -F, -v list="655|611|823" '$11 !~ list' file

To avoid false matches like 1182345 to be part of the match, make the regex more strict by including the start and end patterns

awk -F, -v list="655|611|823" 'BEGIN{list="^(" list ")$"} $11 ~ list' file

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203324

I'd do it as the following for efficiency:

awk -F, -v list='655,611,823' '
    BEGIN {
        split(list,tmp)
        for (i in tmp) {
            expected[tmp[i]]
        }
    }
    $11 in expected
' file

That way you're just doing a hash lookup for each input line rather than a regexp comparison.

Upvotes: 0

Related Questions