Reputation: 31
I have a large .csv file containing inconsistent data that looks something like this:
O_T_11c,234858DEF8,OHL11_c_234858DEF8.csv,c
Ohl12und,234858DEF8,OHL12_u234858DEF8.csv,u
Ohlabx13und,234858DEF8,OHL13u_234858DEF8.csv,u
Ohl14und,234858DEF8,OHL14u_234858DEF8.csv,u
I want to remove the all letters only from only the first column and leave the rest alone. I've attempted to do this using a regex in awk, but this gives me strange output. I want my output to ultimately be this:
11,234858DEF8,OHL11_c_234858DEF8.csv,c
12,234858DEF8,OHL12_u234858DEF8.csv,u
13,234858DEF8,OHL13u_234858DEF8.csv,u
14,234858DEF8,OHL14u_234858DEF8.csv,u
I understand I can use a sed like this
sed 's/[^0-9]*//g'
To remove all non-numeric characters, but I only want to apply this to the first column.
Upvotes: 1
Views: 3596
Reputation: 6758
Now that we know the actual input is comma delimited, here's the quickest solution to your question.
sed -i 's/^[^0-9]\+//;s/[^0-9,]\+//' largefile.csv
s/^[^0-9]\+//
- will remove any non-numeric from the start of the line.
s/[^0-9,]\+//
- will remove any non-numeric trailing the numbers just before the comma.
Upvotes: 2
Reputation: 92854
Short awk approach:
awk -F, '{ gsub(/[^0-9]/,"",$1) }1' OFS=',' file
The output:
11,234858DEF8,OHL11_c_234858DEF8.csv,c
12,234858DEF8,OHL12_u234858DEF8.csv,u
13,234858DEF8,OHL13u_234858DEF8.csv,u
14,234858DEF8,OHL14u_234858DEF8.csv,u
Upvotes: 2