Reputation: 596
I have a curl call that queries the JIRA REST API and returns a JSON string like the following (except on a single line):
{
"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog",
"id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112",
"key":"FOO-1218",
"fields":
{"summary":"the \"special\" field is not returning what is expected"}
}
I was trying to parse out the "summary" field using this sed script:
sed 's/^.*summary":"\([^"]*\)".*$/\1/'
This works fine if the "summary" doesn't contain an escaped \", but with the escaped quote in the example, all I get back is:
the \
My desired output would either be:
the \"special\" field is not returning what is expected
Or even more fancily this:
the "special" field is not returning what is expected
It doesn't appear that I can do a lookbehind in sed. Is there a simple way to solve this in a bash script?
Upvotes: 0
Views: 283
Reputation: 596
After serious struggling, I have figured out a method that works for this very specific use case. I convert the escaped quotes (\") into a more obscure character sequence of five underscores (_____), do the regex, and then convert it back:
sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'
So the full test looks like this:
echo '{"expand":"renderedFields,names,schema,transitions,operations,editmeta,changelog","id":"36112","self":"https://jira.company.com/rest/api/2/issue/36112","key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}' | sed -e 's/\\"/_____/g' -e 's/^.*summary":"\([^"]*\)".*$/\1/' -e 's/_____/"/g'
And the output looks like this:
the "special" field is not returning what is expected
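For comparison, here is a minimal Python sketch of the same mask, match, restore idea. The function name and the shortened SAMPLE string are made up for illustration, and unlike the greedy sed pattern this one stops at the first "summary" key it finds. It also shows the main weakness of the trick: a summary that genuinely contains five underscores gets a quote it never had.

```python
import re

PLACEHOLDER = "_____"  # same five-underscore stand-in as the sed command

# Shortened sample in the shape of the JIRA response; the raw string keeps the backslashes.
SAMPLE = r'{"key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}'

def extract_summary(line):
    """Mask escaped quotes, grab the summary value, then restore the quotes."""
    masked = line.replace('\\"', PLACEHOLDER)
    match = re.search(r'summary":"([^"]*)"', masked)
    if match is None:
        return None
    return match.group(1).replace(PLACEHOLDER, '"')

print(extract_summary(SAMPLE))
# A summary that really contains the placeholder is corrupted:
print(extract_summary('{"fields":{"summary":"a_____b"}}'))
```

The second call prints a"b rather than a_____b, so the approach is only safe when the placeholder cannot occur in the data.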
Upvotes: 0
Reputation: 189739
For this limited case, you could use something like
vnix$ sed -n 's/.*summary":"\(\([^\\"]*\|\\.\)*\)".*/\1/p' file.json
the \"special\" field is not returning what is expected
Inside the quoted string, double quotes are disallowed, except any character is allowed immediately after a literal backslash. The character class disallows backslashes, too, to prevent a backslash from "leaking" into the wrong partial match. The repeat after the character class is just an optimization to avoid needless backtracking.
Any attempt at generalizing this will quickly become quite unwieldy. The Friedl book has an example which stretches over more than a page just to illustrate the futility of this.
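To illustrate the "anything except quote or backslash, or a backslash followed by any character" rule outside of sed, here is a small Python sketch using the same alternation. The function name and the shortened SAMPLE string are assumptions for the example; like the sed version, it leaves the escapes in place and is not a general JSON parser.

```python
import re

# Shortened sample in the shape of the JIRA response; the raw string keeps the backslashes.
SAMPLE = r'{"key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}'

# Inside the quotes: runs of characters that are neither quote nor backslash,
# or a backslash followed by any single character.
SUMMARY_RE = re.compile(r'"summary":"((?:[^"\\]+|\\.)*)"')

def extract_summary_raw(line):
    """Return the summary value with its backslash escapes left untouched."""
    match = SUMMARY_RE.search(line)
    return match.group(1) if match else None

print(extract_summary_raw(SAMPLE))
```

With the sample above this prints the summary with the \" sequences still present, matching the sed output.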
Upvotes: 0
Reputation: 7329
You're asking for a JSON parser written in sed. Sorry, but this is insane.
Here's an example of a sane way to do this in Python:
import requests

# JIRA_API_ENDPOINT and JIRA_HEADERS are placeholders for your issue URL and auth headers
response = requests.get(JIRA_API_ENDPOINT, headers=JIRA_HEADERS)
obj = response.json()
print(obj['fields']['summary'])
There's also a good JIRA API wrapper in Python, called jira-python. Just use that and you won't have to do any parsing at all. I've used it to good effect before. Link here: http://jira-python.readthedocs.org/en/latest/
Your coworkers will thank you.
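If you want to see the parsing step without hitting the network, here is a small sketch that uses only the standard-library json module on a shortened copy of the response from the question. The function name and SAMPLE are made up for the example. Note that a real parser also gives the "fancier" output for free, since the \" escapes are decoded.

```python
import json

# Shortened sample in the shape of the JIRA response; the raw string keeps the backslashes.
SAMPLE = r'{"key":"FOO-1218","fields":{"summary":"the \"special\" field is not returning what is expected"}}'

def summary_from_json(text):
    """Parse the response body and return fields.summary with escapes decoded."""
    return json.loads(text)["fields"]["summary"]

print(summary_from_json(SAMPLE))
```

From a bash script, the same thing can be done by piping the curl output into something like python3 -c 'import json,sys; print(json.load(sys.stdin)["fields"]["summary"])'.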
Upvotes: 2
Reputation: 41838
For the inside of double quotes, you really want at least one of these facilities:
\K (so you can drop the opening quote)
Typically, you would want something like this:
(?<=(?<!\\)")(?:\\"|[^"])*(?=")
In grep -P mode, which uses PCRE, you can tap into even more features, such as the possessive quantifier I'll add here:
(?<=(?<!\\)")(?:\\"|[^"])*+(?=")
Note that the [^"] can normally run across multiple lines, which you'd typically control with [^"\r\n], but grep only looks line by line anyway.
Upvotes: 1