Reputation: 237
I have an input csv file that looks something like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4090b43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,40d0ced,Central
Eric,50,one-north,aeff08d,Military aeff08d
David,60,Bedok,5d1152d,Study
And I want to write a bash shell script using awk
and gsub
to replace 6-7 alpha numeric character long strings under the ID column with "xxxxx", with the output in a separate .csv file.
Right now I've got:
#!/bin/bash
awk -F ',' -v OFS=',' '{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", $4);}1' input.csv > output.csv
But the output from I'm getting from running bash myscript.sh input.csv
doesn't make any sense. The output.csv file looks like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,4xxxxx9xxxxxb43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,4xxxxxdxxxxxced,Central
Eric,50,one-north,aeffxxxxx8d,Military aeff08d
David,60,Bedok,5d1152d,Study
but the expected output csv should look like:
Name,Index,Location,ID,Message
Alexis,10,Punggol,xxxxx,Production 4090b43
Scott,20,Bedok,xxxxx,Prevent
Ronald,30,one-north,xxxxx,Difference 86defac
Cindy,40,Punggol,xxxxx,Central
Eric,50,one-north,xxxxx,Military aeff08d
David,60,Bedok,xxxxx,Study
Upvotes: 2
Views: 1048
Reputation: 26471
The short version to your answer would be the following:
$ awk 'BEGIN{FS=OFS=","}(FNR>1){$4="xxxxxx"}1' file
This will replace all entries in column 4 by "xxxxxx".
If you only want to change the first 6 to 7 characters of column 4 (and not if there are only 5 of them, there are a couple of ways:
$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&(length($4)>5){$4="xxxxxx" substr($4,8)}1' file
$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&{sub(/.......?/,"xxxxxx",$4)}1' file
Here, we will replace 123456abcde into xxxxxxabcde
Why is your script failing:
Besides the fact that the approach is wrong, I'll try to explain what the following command does: gsub(/([a-zA-Z0-9]){6,7}/g,"xxxxx",$4)
The notation /abc/g
is valid awk syntax, but it does not do what you expect it to do. The notation /abc/
is an ERE-token (an extended regular expression). The notation g
is, at this point, nothing more than an undefined variable which defaults to an empty string or zero, depending on its usage. awk will now try to execute the operation /abc/g
by first executing /abc/
which means: if my current record ($0
) matches the regular expression "abc", return 1
otherwise return 0
. So it converts /abc/g
into 0g
which means to concatenate the content of g
to the number 0
. For this, it will convert the number 0
to a string "0"
and concatenate it with the empty string g. In the end, your gsub
command is equivalent to gsub("0","xxxxx",$4)
and means to replace all the ZERO's by "xxxxx".
Why are you getting always gsub("0","xxxxx",$4)
and never gsub("1","xxxxx",$4)
. The reason is that your initial regular expression never matches anything in the full record/line ($0
). Your reguar expression reads /^([a-zA-Z0-9]){6,7}/
, and while there are lines that start with 6 or 7 characters, it is likely that your awk does not recognize the extended regular expression notation '{m,n}' which makes it fail. If you use gnu awk, the output would be different when using -re-interval
which in old versions of GNU awk is not enabled by default.
Upvotes: 3
Reputation: 133458
With your shown sample, please try the following code:
awk -F ',[[:space:]]+' -v OFS=',\t' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
$1=$1
}
1
' Input_file | column -t -s $'\t'
Explanation: Setting field separator as comma, space(s), then setting output field separator as comma tab here. Then substituting from starting to till end of value(6 to 7 occurrences) of alphanumeric(s) with xxxxx
in 4th field. Finally printing current line. Then sending output of awk
program to column
command to make it as per shown sample of OP.
EDIT: In case your Input_file is separated by only ,
as per edited samples now, then try following.
awk -F ',' -v OFS=',' '
{
sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
}
1
' Input_file
Note: OP has installed latest version of awk from older version and these codes helped.
Upvotes: 4
Reputation: 36390
I tried to find why your code behave like that, for simplicty sake I made example concering only gsub
you have used:
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", id);print id}' emptyfile.txt
output is
4xxxxx9xxxxxb43
after removing g
in first argument
awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/, "xxxxx", id);print id}' emptyfile.txt
output is
xxxxx
So regular expression followed by g
caused malfunction. I was unable to find relevant passage in GNU AWK
manual what g
after /
is supposed to do.
(tested in gawk 4.2.1)
Upvotes: 0