Jordan Reed

Reputation: 596

Using sed to find inside of quotes and skip escaped quotes

I have a curl call that queries the JIRA REST API and returns a JSON string like the following (except on a single line):

 {
    "expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog",
    "id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112",
    "key":"FOO-1218",
    "fields":
        {"summary":"the \"special\" field is not returning what is expected"}
 }

I was trying to parse out the "summary" field using this sed script:

sed 's/^.*summary":"\([^"]*\)".*$/\1/'

This works fine if the "summary" doesn't have an escaped \" inside of it, but of course, with the escaped quote in the example, all I get back is:

the \

My desired output would either be:

the \"special\" field is not returning what is expected

Or even more fancily this:

the "special" field is not returning what is expected

It doesn't appear that I can do a lookbehind in sed. Is there a simple way to solve this in a bash script?

Upvotes: 0

Views: 283

Answers (4)

Jordan Reed

Reputation: 596

After serious struggling, I have figured out a method that works for this very specific use case. I convert the escaped quotes (\") into a more obscure character sequence of five underscores (_____), run the regex, and then convert the underscores back into plain quotes:

sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'

So the full test looks like this:

echo '{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog","id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}' | sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'

And the output looks like this:

the "special" field is not returning what is expected
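For comparison, here is a rough Python sketch of the same three-step idea. The function name `extract_summary` and its `placeholder` parameter are made up for illustration, and it shares the weakness of the sed version: a summary that really contains five underscores would come out mangled.

```python
import re

def extract_summary(line, placeholder="_____"):
    """Hide escaped quotes, capture the summary value, then restore the quotes."""
    # Step 1: replace every \" with the placeholder so [^"]* cannot stop early.
    hidden = line.replace('\\"', placeholder)
    # Step 2: capture everything between summary":" and the next double quote.
    match = re.search(r'summary":"([^"]*)"', hidden)
    if match is None:
        return None
    # Step 3: turn the placeholder back into a plain double quote.
    return match.group(1).replace(placeholder, '"')

line = '{"key":"FOO-1218","fields":{"summary":"the \\"special\\" field is not returning what is expected"}}'
print(extract_summary(line))
```

One difference from the sed command: `re.search` stops at the first `summary":"` it finds, while the greedy `^.*` in sed picks the last one.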

Upvotes: 0

tripleee

Reputation: 189739

For this limited case, you could use something like

vnix$ sed -n 's/.*summary":"\(\([^\\"]*\|\\.\)*\)".*/\1/p' file.json
the \"special\" field is not returning what is expected

Inside the quoted string, double quotes are disallowed, except any character is allowed immediately after a literal backslash. The character class disallows backslashes, too, to prevent a backslash from "leaking" into the wrong partial match. The repeat after the character class is just an optimization to avoid needless backtracking.
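The same rule can be tried out in Python, as a sketch rather than a drop-in replacement for the sed command. The names `SUMMARY_RE` and `summary_raw` are invented here, and the inner repeat on the character class is left out because it does not change what matches. The last line assumes the captured text is a valid JSON string body, so `json.loads` can decode the escapes.

```python
import json
import re

# Inside the quotes: any character that is neither a quote nor a backslash,
# or a backslash followed by any single character.
SUMMARY_RE = re.compile(r'summary":"((?:[^\\"]|\\.)*)"')

def summary_raw(line):
    """Return the summary value with its JSON escapes intact, or None."""
    match = SUMMARY_RE.search(line)
    return match.group(1) if match else None

line = '{"key":"FOO-1218","fields":{"summary":"the \\"special\\" field is not returning what is expected"}}'
raw = summary_raw(line)
print(raw)                          # backslashes are still present
print(json.loads('"' + raw + '"'))  # JSON string decoding removes them
```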

Any attempt at generalizing this will quickly become quite unwieldy. The Friedl book has an example which stretches over more than a page just to illustrate the futility of this.

Upvotes: 0

qwwqwwq

Reputation: 7329

You're asking for a JSON parser written in sed. Sorry, but this is insane.

Here's an example of a sane way to do this in python:

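import requests

# JIRA_API_ENDPOINT and JIRA_HEADERS are placeholders: set them to your
# issue URL and authentication headers before running this.
response = requests.get(JIRA_API_ENDPOINT, headers=JIRA_HEADERS)
obj = response.json()
print(obj['fields']['summary'])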
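If the curl call has to stay, a variant with no network access is possible using only the standard-library `json` module. This is a minimal sketch: the `payload` literal is the example from the question, standing in for whatever curl actually returns.

```python
import json

# Raw string so that \" reaches the JSON parser exactly as curl would emit it.
payload = r'{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog","id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}'

issue = json.loads(payload)           # the parser handles every escape rule
summary = issue["fields"]["summary"]  # already unescaped at this point
print(summary)
```

In a bash pipeline the string could come from `sys.stdin.read()` instead of a literal, so the curl output can be piped straight into the script.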

There's also a good JIRA API wrapper in Python, called jira-python. Just use that and you won't have to do any parsing at all. I've used it to good effect before. Link here: http://jira-python.readthedocs.org/en/latest/

Your coworkers will thank you.

Upvotes: 2

zx81

Reputation: 41838

For the inside of double quotes, you really want at least one of these facilities:

  1. lookarounds (so you can check that what precedes and follows are quotes).
  2. \K (so you can drop the opening quote)
  3. the ability to examine capture groups (so you can match the whole quote, but only capture what's inside).

Typically, you would want something like this:

(?<=(?<!\\)")(?:\\"|[^"])*(?=")

In grep -P mode, which uses PCRE, you can tap into even more features, such as the possessive quantifier I'll add here:

(?<=(?<!\\)")(?:\\"|[^"])*+(?=") 

Note that the [^"] can normally run across multiple lines, which you'd typically control with [^"\r\n], but grep only looks line by line anyway.
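To try the lookaround idea outside grep, here is a small Python sketch. Python's built-in `re` module supports fixed-width lookbehind but not `\K`. Pinning the lookbehind to the literal `"summary":"` is an assumption about what the question needs and is not part of the patterns above. The names `SUMMARY_VALUE` and `summary_between_quotes` are illustrative.

```python
import re

# Lookbehind: the match must start right after "summary":"
# Body: escaped quotes or any non-quote character
# Lookahead: the match must be followed by a closing quote
SUMMARY_VALUE = re.compile(r'(?<="summary":")(?:\\"|[^"])*(?=")')

def summary_between_quotes(line):
    """Return the text inside the summary quotes, escapes included, or None."""
    match = SUMMARY_VALUE.search(line)
    return match.group(0) if match else None

line = '{"key":"FOO-1218","fields":{"summary":"the \\"special\\" field is not returning what is expected"}}'
print(summary_between_quotes(line))
```

Because the quotes sit in the lookarounds, the whole match is the value itself and no capture group is needed.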

Upvotes: 1
