Nobita Nobi

Reputation: 31

Filtering Using GREP

The task: find the names of people whose number is greater than or equal to m but less than n. A ".csv" file is given. Solving this with grep (regex) is preferred.


This is my attempt so far:

cat abc.csv|cut -f 3,7 -d ","|grep "4[4-9][0-9]*"|head

But it does not give the desired output.

NOTE: column 3 is the person's name and column 7 is the corresponding number for that person.

Any suggestions for solving this would be very helpful.


Upvotes: 2

Views: 1267

Answers (4)

Ed Morton

Reputation: 203324

Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.

(see https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/ for reference).

This isn't a good use case for grep. It's well documented that using a regexp to do a numeric comparison is far more difficult and fragile than just comparing numbers, e.g. with awk. Likewise, using grep on a whole line when your data is in a specific field is more difficult and fragile than using a tool that understands fields, e.g. awk again.

The right way to test for the contents of a field being in a numeric range is to do a numeric comparison on just that field:

awk -F, '(440<=$7) && ($7<500){print $3}' abc.csv

I'm guessing at the values you want the range to have based on the regexp you tried in your question; if I guessed wrong, just change them.
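A quick way to check the numeric comparison is to run it on a few made-up rows. The sample data below is an assumption (the question does not show abc.csv); only the awk program comes from this answer.

```shell
# Hypothetical sample rows: name in column 3, number in column 7.
sample='1,x,Alice,a,b,c,439
2,x,Bob,a,b,c,440
3,x,Carol,a,b,c,499.5
4,x,Dave,a,b,c,500
5,x,Eve,a,b,c,1450
6,x,Frank,a,b,c,45'

# Keep rows where 440 <= column 7 < 500 and print the name.
in_range=$(printf '%s\n' "$sample" | awk -F, '(440<=$7) && ($7<500){print $3}')
printf '%s\n' "$in_range"   # prints Bob and Carol, one per line
```

Because the comparison is numeric, 1450 and 45 are rejected without any anchoring, whereas the regexp 4[4-9][0-9]* from the question would match both.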

I see from some other answers that you do not want to print lines where $7 contains a ., or maybe you only want lines where $7 is an integer. If so, that's a trivial and appropriate thing to test with a regexp:

awk -F, '($7 !~ /\./) && (440<=$7) && ($7<500){print $3}' abc.csv

or:

awk -F, '($7 ~ /^[0-9]+$/) && (440<=$7) && ($7<500){print $3}' abc.csv

Hopefully you can see how clear, simple, robust and easy to modify in future that is vs trying to do the same with a regexp across a line using grep.
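As a small sketch of the integer-only variant, here it is run on invented rows (the data is an assumption, not taken from the question's file):

```shell
# Hypothetical sample rows: name in column 3, number in column 7.
sample='1,x,Bob,a,b,c,450
2,x,Carol,a,b,c,450.25
3,x,Dave,a,b,c,499
4,x,Eve,a,b,c,500'

# The regexp only checks the shape of the field (digits only);
# the range itself is still a plain numeric comparison.
integers_only=$(printf '%s\n' "$sample" |
  awk -F, '($7 ~ /^[0-9]+$/) && (440<=$7) && ($7<500){print $3}')
printf '%s\n' "$integers_only"   # prints Bob and Dave, one per line
```

Carol is dropped by the shape test (450.25 is not all digits) and Eve by the range test, so each condition does one job.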

Upvotes: 2

The fourth bird

Reputation: 163277

This prints the values from column 3 where column 7 is in the range 400-499, using only awk instead of piping through multiple programs.

The pattern ^4[0-9][0-9]$ uses the anchors ^ and $ to prevent partial matches, and two [0-9] ranges to match 400 to 499.

awk -F, '
$7 ~ /^4[0-9][0-9]$/ {
  print $3
}
' abc.csv
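To see what the anchors do, the sketch below runs the same field test with and without ^ and $ on invented rows (the sample data is an assumption):

```shell
# Hypothetical sample rows: name in column 3, number in column 7.
sample='1,x,Bob,a,b,c,450
2,x,Carol,a,b,c,1450
3,x,Dave,a,b,c,4500
4,x,Eve,a,b,c,450.25
5,x,Frank,a,b,c,45'

# Anchored: the whole field must be exactly 4 followed by two digits.
anchored=$(printf '%s\n' "$sample" | awk -F, '$7 ~ /^4[0-9][0-9]$/ {print $3}')

# Unanchored: any field that merely contains 4 followed by two digits matches.
unanchored=$(printf '%s\n' "$sample" | awk -F, '$7 ~ /4[0-9][0-9]/ {print $3}')

printf 'anchored:   %s\n' "$(printf '%s' "$anchored" | tr '\n' ' ')"
printf 'unanchored: %s\n' "$(printf '%s' "$unanchored" | tr '\n' ' ')"
```

The anchored test keeps only Bob, while the unanchored one also lets 1450, 4500 and 450.25 through.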

If you can use GNU grep, you can match the value of the 3rd field if the 7th field is in the range 400-499, but it is a long pattern and I would recommend using awk.

^(?:[^,]*,){2}\K[^,\n]+(?=(?:,[^,\n]*){3},\s*4[0-9][0-9](?=\s*,|$))
  • ^ Start of string
  • (?:[^,]*,){2} Match the first 2 comma separated fields
  • \K Forget what is matched so far
  • [^,\n]+ Match the 3rd field
  • (?= Positive lookahead assertion
    • (?:,[^,\n]*){3},\s*4[0-9][0-9](?=\s*,|$) Skip fields 4 to 6, then require the 7th field to be a value from 400 to 499, followed by either a comma or the end of the string to prevent a partial match
  • ) Close lookahead


For example

grep -oP "^(?:[^,]*,){2}\K[^,]+(?=(?:,[^,]*){3},\s*4[0-9][0-9](?=\s*,|$))" abc.csv
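A minimal sketch of that command on invented rows, assuming a GNU grep built with PCRE support (the -P option is not available in every grep); the sample data is an assumption:

```shell
# Hypothetical sample rows: name in column 3, number in column 7.
# The last row has spaces around the number and an extra 8th field.
sample='1,x,Bob,a,b,c,450
2,x,Carol,a,b,c,1450
3,x,Dave,a,b,c,4500
4,x,Eve,a,b,c, 499 ,extra'

# -o prints only the matched part; \K discards the first two fields,
# so what remains of the match is the 3rd field.
matched=$(printf '%s\n' "$sample" |
  grep -oP '^(?:[^,]*,){2}\K[^,]+(?=(?:,[^,]*){3},\s*4[0-9][0-9](?=\s*,|$))')
printf '%s\n' "$matched"   # prints Bob and Eve, one per line
```

The inner lookahead is what rejects 4500, and requiring the comma right before the optional whitespace is what rejects 1450.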

Upvotes: 1

Claudiu Cruceanu

Reputation: 337

If you need only the name then you have to add:

cut -f 1 -d ","

If you need only real numbers between 400.00 and 499.99 (as I see from your result) then grep should be:

grep "4[0-9][0-9]\.[0-9][0-9]"

If you need to allow any number of decimals as well as integers, and handle optional trailing spaces and the end of line ($), you can use:

grep -E "4[0-9][0-9](\.[0-9][0-9]*)* *$"

If you need to be sure it does not match 1400 or names that contain 400 then you should use:

grep -E " *, *4[0-9][0-9](\.[0-9][0-9]*)* *$"

We can go on, but I will stop here. My proposal is to use this:

cat Bulk.csv | cut -f 3,7 -d "," | grep -E " *, *4[0-9][0-9](\.[0-9][0-9]*)* *$" | cut -f 1 -d ","
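Here is a sketch of that pipeline on invented rows; the sample data is an assumption, and it is fed in with printf instead of cat Bulk.csv so the example is self-contained:

```shell
# Hypothetical sample rows: name in column 3, number in column 7.
# The last row has a trailing space after the number.
sample='1,x,Bob,a,b,c,450
2,x,Carol,a,b,c,450.25
3,x,Agent400,a,b,c,120
4,x,Dave,a,b,c,1400
5,x,Eve,a,b,c,499 '

# cut keeps "name,number"; grep requires the number right after the comma
# and right before the end of line; the last cut keeps only the name.
names=$(printf '%s\n' "$sample" |
  cut -f 3,7 -d "," |
  grep -E ' *, *4[0-9][0-9](\.[0-9][0-9]*)* *$' |
  cut -f 1 -d ",")
printf '%s\n' "$names"   # prints Bob, Carol and Eve, one per line
```

The leading comma in the pattern is what keeps Agent400 and 1400 out, and the trailing " *$" tolerates the space after 499.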

Upvotes: 0

Pierre François

Reputation: 6061

Try:

cut -d, -f 3,7 Bulk.csv | grep ',4[0-9][0-9][^0-9]' | cut -d, -f 1

Explanation: cat is not necessary. The expression [^0-9] means everything except a digit; using only ,4[0-9][0-9] as regex would select also lines containing numbers with more digits before the decimal point, like 4247.14, which is not what you want.
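One thing to be aware of: [^0-9] has to match a character, so a whole number such as 450 at the end of the line is not selected. The sketch below shows this on invented rows (the data is an assumption) next to a hypothetical variant that also accepts the end of line:

```shell
# Hypothetical sample rows: name in column 3, number in column 7.
sample='1,x,Bob,a,b,c,450
2,x,Carol,a,b,c,450.25
3,x,Dave,a,b,c,4247.14'

# Pattern as posted: three digits starting with 4, then one non-digit.
as_posted=$(printf '%s\n' "$sample" |
  cut -d, -f 3,7 | grep ',4[0-9][0-9][^0-9]' | cut -d, -f 1)

# Hypothetical variant: accept a non-digit OR the end of the line.
with_eol=$(printf '%s\n' "$sample" |
  cut -d, -f 3,7 | grep -E ',4[0-9][0-9]([^0-9]|$)' | cut -d, -f 1)

printf 'as posted: %s\n' "$(printf '%s' "$as_posted" | tr '\n' ' ')"
printf 'with eol:  %s\n' "$(printf '%s' "$with_eol" | tr '\n' ' ')"
```

If column 7 always carries decimals, the posted pattern is enough; the variant only matters when bare integers can appear.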

A sample of your input file Bulk.csv would help us reproduce your problem.

Upvotes: 1
