baked goods
baked goods

Reputation: 237

Replacing columns of a CSV with a string using awk and gsub

I have an input csv file that looks something like:

Name,Index,Location,ID,Message
Alexis,10,Punggol,4090b43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,40d0ced,Central
Eric,50,one-north,aeff08d,Military aeff08d
David,60,Bedok,5d1152d,Study

And I want to write a bash shell script using awk and gsub to replace 6-7 alpha numeric character long strings under the ID column with "xxxxx", with the output in a separate .csv file.

Right now I've got:

#!/bin/bash
awk -F ',' -v OFS=',' '{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", $4);}1' input.csv > output.csv

But the output from I'm getting from running bash myscript.sh input.csv doesn't make any sense. The output.csv file looks like:

Name,Index,Location,ID,Message
Alexis,10,Punggol,4xxxxx9xxxxxb43,Production 4090b43
Scott,20,Bedok,bfb34d3,Prevent
Ronald,30,one-north,86defac,Difference 86defac
Cindy,40,Punggol,4xxxxxdxxxxxced,Central
Eric,50,one-north,aeffxxxxx8d,Military aeff08d
David,60,Bedok,5d1152d,Study

but the expected output csv should look like:

Name,Index,Location,ID,Message
Alexis,10,Punggol,xxxxx,Production 4090b43
Scott,20,Bedok,xxxxx,Prevent
Ronald,30,one-north,xxxxx,Difference 86defac
Cindy,40,Punggol,xxxxx,Central
Eric,50,one-north,xxxxx,Military aeff08d
David,60,Bedok,xxxxx,Study

Upvotes: 2

Views: 1048

Answers (3)

kvantour
kvantour

Reputation: 26471

The short version to your answer would be the following:

$ awk 'BEGIN{FS=OFS=","}(FNR>1){$4="xxxxxx"}1' file

This will replace all entries in column 4 by "xxxxxx".

If you only want to change the first 6 to 7 characters of column 4 (and not if there are only 5 of them, there are a couple of ways:

$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&(length($4)>5){$4="xxxxxx" substr($4,8)}1' file
$ awk 'BEGIN{FS=OFS=","}(FNR>1)&&{sub(/.......?/,"xxxxxx",$4)}1' file

Here, we will replace 123456abcde into xxxxxxabcde

Why is your script failing:

Besides the fact that the approach is wrong, I'll try to explain what the following command does: gsub(/([a-zA-Z0-9]){6,7}/g,"xxxxx",$4)

The notation /abc/g is valid awk syntax, but it does not do what you expect it to do. The notation /abc/ is an ERE-token (an extended regular expression). The notation g is, at this point, nothing more than an undefined variable which defaults to an empty string or zero, depending on its usage. awk will now try to execute the operation /abc/g by first executing /abc/ which means: if my current record ($0) matches the regular expression "abc", return 1 otherwise return 0. So it converts /abc/g into 0g which means to concatenate the content of g to the number 0. For this, it will convert the number 0 to a string "0" and concatenate it with the empty string g. In the end, your gsub command is equivalent to gsub("0","xxxxx",$4) and means to replace all the ZERO's by "xxxxx".

Why are you getting always gsub("0","xxxxx",$4) and never gsub("1","xxxxx",$4). The reason is that your initial regular expression never matches anything in the full record/line ($0). Your reguar expression reads /^([a-zA-Z0-9]){6,7}/, and while there are lines that start with 6 or 7 characters, it is likely that your awk does not recognize the extended regular expression notation '{m,n}' which makes it fail. If you use gnu awk, the output would be different when using -re-interval which in old versions of GNU awk is not enabled by default.

Upvotes: 3

RavinderSingh13
RavinderSingh13

Reputation: 133458

With your shown sample, please try the following code:

awk -F ',[[:space:]]+'  -v OFS=',\t' '
{
  sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
  $1=$1
}
1
' Input_file | column -t -s $'\t'

Explanation: Setting field separator as comma, space(s), then setting output field separator as comma tab here. Then substituting from starting to till end of value(6 to 7 occurrences) of alphanumeric(s) with xxxxx in 4th field. Finally printing current line. Then sending output of awk program to column command to make it as per shown sample of OP.

EDIT: In case your Input_file is separated by only , as per edited samples now, then try following.

awk -F ',' -v OFS=',' '
{
  sub(/^([a-zA-Z0-9]){6,7}$/, "xxxxx", $4)
}
1
' Input_file

Note: OP has installed latest version of awk from older version and these codes helped.

Upvotes: 4

Daweo
Daweo

Reputation: 36390

I tried to find why your code behave like that, for simplicty sake I made example concering only gsub you have used:

awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/g, "xxxxx", id);print id}' emptyfile.txt

output is

4xxxxx9xxxxxb43

after removing g in first argument

awk 'BEGIN{id="4090b43"}END{gsub(/^([a-zA-Z0-9]){6,7}/, "xxxxx", id);print id}' emptyfile.txt

output is

xxxxx

So regular expression followed by g caused malfunction. I was unable to find relevant passage in GNU AWK manual what g after / is supposed to do.

(tested in gawk 4.2.1)

Upvotes: 0

Related Questions