Reputation: 135
Quite simply, I have a CSV file with one column that should contain only integers. However, not all of the values are integers, and I want to scan this file (over 5 GB) and capture the line numbers and (preferably) the values that are not integers. I've tried a number of things, such as using masks, but to no avail.
For example, we have the following csv table:
ID
5342
76375
sdfg23
2342lslf
jfijfojwo
395-34425
abc-24523
afhfhue3224
I would want to know that data rows 3, 4, 5, 6, 7, and 8 are not integers. The output would look like this (as a dataframe/table equivalent):
+-------------+------+
| ID | Row |
+-------------+------+
| sdfg23 | 3 |
| 2342lslf | 4 |
| jfijfojwo | 5 |
| 395-34425 | 6 |
| abc-24523 | 7 |
| afhfhue3224 | 8 |
+-------------+------+
Or even just spilling the line numbers to standard out would be really helpful.
I've tried things like using sed, for example:
sed -n '/?![[:digit:]]=' csvfile.csv
Upvotes: 0
Views: 351
Reputation: 626699
You may use grep to match all lines that are fully numeric and invert the result:
grep -vE '^[0-9]+(\.[0-9]+)?$' file
The ^[0-9]+(\.[0-9]+)?$ pattern (POSIX ERE syntax, enabled with -E) matches lines that consist entirely of a number like 111 or 111.111111, and -v inverts the match so that only the non-numeric lines are printed.
See the online grep demo:
s="11.1111
5342
76375
sdfg23
2342lslf
jfijfojwo
395-34425
abc-24523
afhfhue3224"
grep -vE '^[0-9]+(\.[0-9]+)?$' <<< "$s"
Output:
sdfg23
2342lslf
jfijfojwo
395-34425
abc-24523
afhfhue3224
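Since the question also asks for the line numbers, grep's -n option can be added to prefix each printed line with its position in the file (note these are file line numbers, so a header line counts as line 1):
grep -nvE '^[0-9]+(\.[0-9]+)?$' file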
Upvotes: 1
Reputation: 23667
You can check if any line contains any non-digit character.
$ # -n option enables line number in output
$ grep -n '[^0-9]' ip.txt
1:ID
4:sdfg23
5:2342lslf
6:jfijfojwo
7:395-34425
8:abc-24523
9:afhfhue3224
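If you want the numbering to match the expected output in the question (where the first data row is row 1), one option is to drop the header line first, e.g. with tail, so that grep numbers only the data rows:
$ # tail -n +2 prints the file starting from line 2
$ tail -n +2 ip.txt | grep -n '[^0-9]'
3:sdfg23
4:2342lslf
5:jfijfojwo
6:395-34425
7:abc-24523
8:afhfhue3224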
If you need further processing, awk would suit. Below is just an example; you can modify it as per your needs.
$ awk 'NR==1{print "ID Row"; next} /[^0-9]/{print $0, NR-1}' ip.txt
ID Row
sdfg23 3
2342lslf 4
jfijfojwo 5
395-34425 6
abc-24523 7
afhfhue3224 8
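If the goal is a table that can later be read back as a dataframe, the same idea can emit proper CSV instead of space-separated columns (bad_rows.csv is just an illustrative output name):
$ awk 'NR==1{print "ID,Row"; next} /[^0-9]/{print $0 "," NR-1}' ip.txt > bad_rows.csv
$ cat bad_rows.csv
ID,Row
sdfg23,3
2342lslf,4
jfijfojwo,5
395-34425,6
abc-24523,7
afhfhue3224,8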
Upvotes: 3