Tastybrownies
Tastybrownies

Reputation: 937

Removing a line where the first column is not numeric

I have a large dataset that is a CSV file and I've already cleaned it up quite a bit. However there are instances where I have a column which is not numeric and I want to remove those rows.

So far I thought something like this would work. I am thinking the solution is probably simple. I am not sure whether I can even do something like this. It is a CSV file so I tried to tell awk that its delimited by commas. Also an example of a good value in column one is this: 323870133825187840

awk '/,/$1 != numeric'

Are there any suggestions on this one. I initially thought about doing some kind of a one liner such as, [0-9], etc. I would greatly appreciate it if anyone could help me out with this. For someone out there this is probably like child's play :)

Upvotes: 3

Views: 3450

Answers (3)

Hermann
Hermann

Reputation: 166

Try this:

awk -F, '$1+0 != $1{next}1' csvFile

Upvotes: 1

Jonathan Leffler
Jonathan Leffler

Reputation: 753525

Assuming that you're only dealing with unsigned integer values, you can use either:

awk -F, '$1 ~ /^[[:digit:]]+$/'

or

awk -F, '$1 !~ /[^[:digit:]]/'

Both use the 'implicit print' action. The first checks that $1 only contains digits; the second checks that $1 does not contain a non-digit.

If your numbers are more general (signed, possibly with decimal point, possibly with exponential notation — 6.0221413e+23 (Avogadro's number), for example), then you need a more complex regular expression and will use positive matching, only selecting rows where field 1 matches your regex for a valid number.

Upvotes: 5

anubhava
anubhava

Reputation: 784958

You can simply grep it:

grep -P '^[+-]?\d*(\.\d+)?(?<=.),' file

This will return lines if it finds decimal/integer number in first column with optional =/- sign.

For ex following numbers will be matched in column 1:

6.72
1235.3
72
.66
2.8
+3.5
-5.9

PS: Empty column value in 1st column will not be matched.

Upvotes: 1

Related Questions