ambrrrgris

Reputation: 157

Regex: Remove Commas within quotes

I'm using NiFi and I have a series of JSONs that look like this:

{
  "url": "RETURNED URL",
  "repository_url": "RETURNED URL",
  "labels_url": "RETURNED URL",
  "comments_url": "RETURNED URL",
  "events_url": "RETURNED URL",
  "html_url": "RETURNED URL",
  "id": "RETURNED_ID",
  "node_id": "RETURNED id",
  "number": 10,
    ...
  "author_association": "xxxx",
  "active_lock_reason": null,
  "body": "text text text, text text, text text text, text, text text",
  "performed_via_github_app": null
}

My focus is on the "body" attribute. Because I'm merging these into one giant JSON to convert to CSV, I need the commas within the "body" text to go away (this should also help with possible NLP later on). I know I can use NiFi's ReplaceText for the replacement itself, but matching only the commas is the part I'm struggling with. So far I have the following:

((?<="body"\s:\s").*(?=",))

Every guide I look at, though, doesn't match the commas within the quotes. Any suggestions?

Upvotes: 1

Views: 105

Answers (1)

Wiktor Stribiżew

Reputation: 626861

You can use

(\G(?!^)|\"body\"\s*:\s*\")([^\",]*),

If the string may contain escape sequences (such as \"), use

(\G(?!^)|\"body\"\s*:\s*\")([^\",\\]*(?:\\.[^\",\\]*)*),

See the regex demo (and regex demo #2). Replace each match with $1$2, which keeps everything that was matched except the comma.

Details:

  • (\G(?!^)|\"body\"\s*:\s*\") - Group 1 ($1): either the end of the previous match (\G, with (?!^) ruling out the start of the string), or "body", zero or more whitespaces, :, zero or more whitespaces, and the opening "
  • ([^\",]*) - Group 2 ($2): zero or more chars other than " and ,
  • , - a comma (to be removed/replaced).
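
To try both patterns outside NiFi, here is a minimal Java sketch (NiFi's regex support is Java's java.util.regex, which understands \G). The class and method names are made up for illustration and are not part of NiFi; the sample inputs are also invented.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of NiFi: applies the two patterns from this answer.
class BodyCommaRemover {
    // Simple variant: assumes the "body" value contains no backslash escapes.
    private static final Pattern SIMPLE =
        Pattern.compile("(\\G(?!^)|\"body\"\\s*:\\s*\")([^\",]*),");

    // Escape-aware variant: also steps over sequences such as \" inside the value.
    private static final Pattern ESCAPED =
        Pattern.compile("(\\G(?!^)|\"body\"\\s*:\\s*\")([^\",\\\\]*(?:\\\\.[^\",\\\\]*)*),");

    // Each match ends in one comma; "$1$2" writes back everything except that comma.
    static String removeSimple(String json) {
        return SIMPLE.matcher(json).replaceAll("$1$2");
    }

    static String removeEscaped(String json) {
        return ESCAPED.matcher(json).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String json = "{\"body\": \"a, b, c\", \"x\": \"p, q\"}";
        // Prints: {"body": "a b c", "x": "p, q"}
        System.out.println(removeSimple(json));
    }
}
```

Only commas inside the "body" value are removed: once the closing quote is reached, \G can no longer continue the chain, so the commas between JSON fields and inside other string values stay intact.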

Upvotes: 1
