Reputation: 198
I have a file s.csv
a,b+ -.,c
aa,bb ().,c._c
I want to remove all special characters from 2nd column (file separated by comma)
cat s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'
The above code also removes special characters from 3rd column as well.
awk -F, '{print $2}' s.csv | tr -dc '[:alnum:]\n\r' | tr '[:upper:]' '[:lower:]'
This code only print 2nd column.
Any idea how can I remove special char from 2nd column and price all
Required output should be
a,b,c
aa,bb,c._c
Upvotes: 2
Views: 609
Reputation: 2875
You don't have to alter locale
just to do it - by using octals
instead of letters, the regex engine respects them as ASCII
instead of being overly clever - i even intentionally set it to Belgian French to illustrate :
CODE
echo 'a,b+ -.,c
aa,bb ().,c._c' | {m,g}awk '
gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$(!_+!_))^_' \
OFS=',' FS=','
OUTPUT
a,b,c
aa,bb,c._c
SHOWCASE LOCALE=C
isn't needed
LANG="fr_BE.UTF8" gawk -e '
BEGIN { for(_=8*4;_<8^4;_++) { printf("%c",_) } } ' |
LANG="fr_BE.UTF8" gawk -p- -e '
gsub("[^\\060-\\071\\101-\\132\\141-\\172]+","",$-_)^_' OFS=',' FS=','
——————————
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# profile gawk, cr'e'e Sun May 29 05:58:26 2022
# R`egle(s)
1 (gsub("[^\\060-\\071\\101-\\132\\141-\\172]+", "", $-_)) ^ _ { # 1
1 print
}
Upvotes: 0
Reputation: 12395
Use this Perl one-liner:
perl -F',' -lane '$F[1] =~ s{[\W_]+}{}g; @F = map { lc } @F; print join ",", @F; ' in_file > out_file
The Perl one-liner uses these command line flags:
-e
: Tells Perl to look for code in-line, instead of in a file.
-n
: Loop over the input one line at a time, assigning it to $_
by default.
-l
: Strip the input line separator ("\n"
on *NIX by default) before executing the code in-line, and append it when printing.
-a
: Split $_
into array @F
on whitespace or on the regex specified in -F
option.
-F','
: Split into @F
on comma, rather than on whitespace.
s{[\W_]+}{}
: Replace 1 or more occurrences of \W
(non-word character) or underscore with nothing.
The regex uses these modifiers:
/g
: Match the pattern repeatedly.
SEE ALSO:
perldoc perlrun
: how to execute the Perl interpreter: command line switches
perldoc perlre
: Perl regular expressions (regexes)
perldoc perlre
: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick
: Perl regular expressions quick start
Upvotes: 0
Reputation: 1126
You can use the [:alpha:]
character class using awk
, here for second field and remove with gsub()
function the characteres that aren't alpha:
awk 'BEGIN{OFS=FS=","} {gsub(/[^[:alpha:]]+/, "", $2)} 1' file
a,b,c
aa,bb,c._c
Upvotes: 0
Reputation: 3985
Remove all (from second field)
[^A-Z
a-z
0-9]
$2
-F ','
OFS=FS
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' s.csv
# test
$ awk -F ',' 'BEGIN{OFS=FS}{gsub(/[^A-Za-z0-9]/,"",$2); print}' <<<'aa,bb ().,c._c'
aa,bb,c._c
As @Léa Gris mentioned below
Don't forget to set the locale to
C
or[^A-Za-z0-9]
is gonna be interpreted unexpectedly in non-western European alphabets. Prepend awk invocation withLC_ALL=C
Upvotes: 3