Russell Christopher
Russell Christopher

Reputation: 1707

Regex: Deal with CSV containing nested JSON objects (comma hell!)

I have a CSV file which contains arbitrary JSON objects. Here's a simplified version of the file:

v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}},lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,{"scheduleIds":[]},lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"},lastcolumn

The commas (if any) inside the JSON wreak havok with CSV parsing.

I'm looking for a way to either...

...capture and replace all the commas outside the JSON objects with pipes (|) so I can simply key on those:

v1|2020-06-09T22:44:46.377Z|cb6deb64-d6a0-4151-ba9b-bfa54ae75180|{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff"**,**"permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}|lastcolumn
v2|2020-06-09T22:44:47.377Z|50769c0d-0a05-4028-9f0b-40ab570af31a|{"scheduleIds":[]}|lastcolumn
v3|2020-06-09T22:44:48.377Z|12345678-0a05-4028-9f0b-40ab570af31a|{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}|lastcolumn

...or wrap each JSON object with single quotes:

v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,'{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}',lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,'{"scheduleIds":[]}',lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,'{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}',lastcolumn

Alas, my regex kung-fu is too weak to create something flexible enough based on the arbitrary nature of the JSON objects that may show up.

The closest I've gotten is:

(?!\B{[^}]*),(?![^{]*}\B)

Which still captures commas (the comma directly before "permissionsToParty", below) in an object like this:

{"payload":{"assetId":"710728f9-7c13-4bcb-8b5d-ef347afe0b58","permissionsToParty":[{"partyType":0,"partyId":"32435a92-c7b3-4fc0-b722-2e88e9e839e5","permissions":["CREATE","DOWNLOAD","VIEW"]}]}}

Can anyone simplify what I've done thus far and help me with an expression that ignores ALL commas within the outermost {} symbols of the JSON?

Upvotes: 0

Views: 285

Answers (2)

user7571182
user7571182

Reputation:

Your regex is quite close to your expectations. In order to get rid of the expected comma you may try the below regex:

(?!\B{[^}]*|"),(?![^{]*}\B|")
           ^^             ^^
           Changed this part

You can find the demo of the above regex in here.

But you can use this second regex below to put the json string to quotes. This is more efficient as compared to the above regex and I'll recommend that you use this.

(\{.*\})

Explanation of the above regex:

() - Represents capturing group.

\{ - Matches the curly brace { literally.

.* - Matches everything inside of the curly braces greedily in order to capture all the nested braces.

'$1' - You can replace the given test string with 1st captured group $1 and put quotes outside of it.

You can find the demo of the above regex in here.

Pictorial Representation of the first regex:

Pictorial

Upvotes: 1

Per Ghosh
Per Ghosh

Reputation: 533

Remember that you can do alot with regex but sometimes you need create your own code for it. You can't do everything with it.

How to do this depends on what you know about the csv sent. It looks like there isn't any values within double quotes if you do not count the json part?

Some regex engines has recurson.
If that works finding json parts with this expression \{((?>[^{}]+|(?R))*)\}
Description how it works Recursion explained here.

Here is a guide how csv can be parsed if it has double quoted parts.
Expression: (?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
Guide to parse cvs

If you know that cvs do not contain any double quoted values, then it may be doable if you convert it in a two step process.

Upvotes: 1

Related Questions