SeungCheol Han

Reputation: 125

How to use sed or something to remove strings within both square brackets and braces

We have an unusual CSV data file in which some fields contain JSON, as shown below:

"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","[{""k1"":""v1"",""k2"":""v2""}]","str5"

We want to remove everything inside the square brackets and braces that appear together (the JSON field near the end of the line), leaving the rest of the line unchanged. But when I use the sed command sed -e 's/\[.*\]//g', it returns undesired output like:

"00001","str1","","str5"

The expected output is:

"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","","str5"

We do not know how to match and replace only the part containing the JSON data, and we could not find relevant information on how to do so.

How can we achieve this?

Upvotes: 0

Views: 65

Answers (2)

Ed Morton

Reputation: 203169

You shouldn't do what you're asking for as that approach will fail given input like ...,"[{""k1"":""v1"",""foo]bar"":""v2""}]",... where the JSON just happens to contain a ]. For example using this modified input:

$ cat file
"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","[{""k1"":""v1"",""foo]bar"":""v2""}]","str5"

and the currently accepted answer, we get incorrect output that includes a field "bar"":""v2""}]" instead of just "":

$ sed 's/\[{[^]]*]//' file
"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","bar"":""v2""}]","str5"

You should instead be asking how to delete the contents of a field that exactly contains a string like "[{...}]", e.g. using GNU awk for FPAT:

$ awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '
    { for (i=1; i<=NF; i++) sub(/^"\[\{.*}]"$/,"\"\"",$i) }
1' file
"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","","str5"

See What's the most robust way to efficiently parse CSV using awk? for more info on parsing CSVs with awk.
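
If GNU awk is not available, the same field-level idea can be roughly approximated with sed. This is only a sketch under some assumptions: a sed that supports -E, fields that never span lines, and quotes inside a field always doubled as "". The variable names line and result are just for illustration.

```shell
# The modified input line from above, with a ] inside the JSON.
line='"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","[{""k1"":""v1"",""foo]bar"":""v2""}]","str5"'

# Match a whole quoted field that starts with "[{ and ends with }]",
# bounded by a comma or a line edge on each side. Inside the field,
# ([^"]|"")* allows any non-quote character or a doubled quote.
result=$(printf '%s\n' "$line" |
  sed -E 's/(^|,)"\[\{([^"]|"")*\}\]"(,|$)/\1""\3/g')

printf '%s\n' "$result"
# "00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","","str5"
```

Because each match consumes the comma after the field, two such fields directly next to each other would not both be emptied in a single pass, so the awk approach remains the more robust option.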

Upvotes: 0

sseLtaH

Reputation: 11207

Your current code matches greedily from the first [ to the last ], hence removing everything in between. It also seems to have a redundant g flag.

Try this sed command:

$ sed 's/\[{[^]]*]//' input_file
"00001","str1","[a.b.c] str3, str4",true,false,"2022-04-18T12:00:00+00:00","","str5"

This matches from [{ (an opening square bracket immediately followed by an opening curly brace) up to the next closing square bracket; [^]]* matches any run of characters other than ].
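
To see the difference between the greedy and the bounded pattern side by side, here is a quick comparison on a shortened, made-up line with the same shape as the data in the question (line, greedy and bounded are just illustrative variable names):

```shell
# Shortened, hypothetical sample line shaped like the question's data.
line='"1","[a.b.c] x","[{""k1"":""v1""}]","z"'

# Greedy: \[.*\] runs from the first [ to the last ] on the line.
greedy=$(printf '%s\n' "$line" | sed 's/\[.*\]//')

# Bounded: \[{[^]]*] starts only at "[{" and stops at the next ].
bounded=$(printf '%s\n' "$line" | sed 's/\[{[^]]*]//')

printf '%s\n' "$greedy" "$bounded"
# "1","","z"
# "1","[a.b.c] x","","z"
```

The first result has lost the "[a.b.c] x" field along with the JSON, while the second removes only the JSON text.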

Upvotes: 1
