Reputation: 13181
First ~ Thanks for taking the time to read this. If any further information or restatements are needed please comment so that I can improve the question. I am new to jq and appreciate any assistance provided. If there is any confusion in the topic it is due to my lack of experience with the jq tool. This seems fairly complex so even a partial answer is welcome.
Background
I have some JSON objects within a series of JSON arrays (sample at bottom). The objects have a number of elements but only the values associated with the "data" key are of interest to me. I want to output a single array of JSON objects where the values are translated into key/value pairs based on some regular expression rules.
I want to essentially combine multiple "data" values to form a key phrase (and then a value-phrase) which I need to output as the array of target objects. I believe I should be able to use a regular expression or a set of known text (for the key-phrase) to compile the text into a single key or value.
Current Logic
Using: jq-1.5, Mac OS 10.12.6, Bash terminal
One thing I have examined is the colon (:) in the value field, which indicates the end of a key-phrase. For example, the fragments below represent the key "Company Address":
"data":"Company ",
...
"data": "Address:"
...
{
"top": 333,
"left": 520,
"width": 66,
"height": 15,
"font": 5,
"data":"123 Main St. "
...
"data":"Smallville "
...
"data":"KS "
...
"data":"606101"
In this case, the colon in the value indicates that the next value attached to the following useful "data" key is the beginning of an address.
A space trailing the value indicates that the next data value found is a continuation of the key phrase or the value phrase I am attempting to combine into a new JSON object.
I have a set of values that I can use to delimit a new JSON object. Essentially the following example would allow me to create a key "Company Name":
...
"data":"Company "
...
"data":"Name"
(note that this entry does not have a colon but the pattern will be the start of each new JSON object to be generated)
Notes
I can determine when the end of a key or value is reached depending on whether or not its value ends with a space. (If there is no trailing space, I consider the value to be the end of the value phrase and begin capturing the next key phrase.)
Things I've tried
Any assistance with translating this logic into one or more useful jq filter(s) would be greatly appreciated. I've taken a look at the JQ Cookbook, the JQ Manual, this article, examined other SO questions on jq, and made an evaluation of an alternate tool (underscore_cli). I am new to jq and my naive expressions keep failing...
I've tried some simple tests to select the values of interest. (I have not been able to walk the JSON tree to get to the information under the text array. Another wrinkle is that I have multiple text arrays. Is it possible to have the same algorithm performed on each array of objects?)
jq -s '.[] | select(.data | contains(":"))'
jq: error (at :0): Cannot index array with string "data"
Sample
A sample of the header JSON
[
{
"number": 1,
"pages": 254,
"height": 1263,
"width": 892,
"fonts": [
{
"fontspec": "0",
"size": "-1",
"family": "Times",
"color": "#ffffff"
},
{
"fontspec": "1",
"size": "31",
"family": "Times",
"color": "#000000"
},
{
"fontspec": "2",
"size": "16",
"family": "Helvetica",
"color": "#000000"
},
{
"fontspec": "3",
"size": "13",
"family": "Times",
"color": "#237db8"
},
{
"fontspec": "4",
"size": "17",
"family": "Times",
"color": "#000000"
},
{
"fontspec": "5",
"size": "13",
"family": "Times",
"color": "#000000"
},
{
"fontspec": "6",
"size": "8",
"family": "Times",
"color": "#9f97a7"
},
{
"fontspec": "7",
"size": "10",
"family": "Times",
"color": "#9f97a7"
}
],
"text": [
{
"top": 83,
"left": 60,
"width": 0,
"height": 1,
"font": 0,
"data": " "
},
{
"top": 333,
"left": 68,
"width": 68,
"height": 15,
"font": 5,
"data": "Company "
},
{
"top": 333,
"left": 135,
"width": 40,
"height": 15,
"font": 5,
"data": "Name"
},
...(more of these objects with data)
]
]
I am looking to output a JSON array of objects whose keys are composed of known strings (patterns) for the key/value pair bound by a colon (:) indicating the end of a key-phrase and whose next data-value would be the start of the value-phrase. The presence of a trailing space indicates that the data-value should be appended as part of the value-phrase until the trailing space no longer appears in the data-value. At that point the next data-value represents the start of another key-phrase.
The answers below are very helpful. I've gone back to the jq manual and incorporated the advice below. I am getting a string, but I am unable to extract just the data values into a single string.
.[].text | tostring
However, the JSON is escaped, and the other keys (top, left, width, and so on) show up in the string along with their values. I'd like only the tokens associated with the data key as a string, and then to run the regular expressions over that string to parse out a set of JSON objects where the keys and values can be defined.
Upvotes: 0
Views: 2443
Reputation: 134491
From what I can tell, you're trying to get all the "data" elements and concatenate them into a single string.
That should be simple enough to do:
[.. | .data? | select(. != null) | tostring] | join("")
There's not enough example data to know where one "grouping" of data begins and ends. But assuming every item in the root array is a single phrase, select each item first before performing the search (or map over them):
map([.. | .data? | select(. != null) | tostring] | join(""))
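As a cross-check outside jq, the recursive descent above can be sketched in Python. This is only an illustration, assuming the input has already been parsed with json.loads; collect_data is a hypothetical helper name, and the two-element sample is a trimmed stand-in for the question's data.

```python
import json

def collect_data(node):
    """Yield every non-null value stored under a "data" key, depth first."""
    if isinstance(node, dict):
        value = node.get("data")
        if value is not None:
            # Mirror jq's tostring: strings pass through, other values become JSON text.
            yield value if isinstance(value, str) else json.dumps(value)
        for child in node.values():
            yield from collect_data(child)
    elif isinstance(node, list):
        for item in node:
            yield from collect_data(item)

# Trimmed stand-in for the question's sample (assumed shape: root array of pages).
sample = json.loads(
    '[{"text": [{"top": 333, "data": "Company "}, {"top": 333, "data": "Name"}]}]'
)

# One joined phrase per item of the root array, like the map(...) form.
phrases = ["".join(collect_data(item)) for item in sample]
print(phrases)  # ['Company Name']
```

Like the jq filter, this ignores every key other than "data", so top, left, and the rest never reach the output string.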
If you ultimately want to parse the data bits into a JSON object, it's not too far off:
map(
[.. | .data? | select(. != null) | tostring]
| join("")
| split(":") as [$key,$value]
| {$key,$value}
) | from_entries
Upvotes: 2
Reputation: 14685
You may want to consider using jq Streaming for this. With your sample data the following filter picks out the paths to the "data" attributes:
tostream
| select(length==2) as [$p,$v]
| select($p[-1]=="data")
| [$p,$v]
If this is in filter.jq and your sample data is in data.json, the command
$ jq -Mc -f filter.jq data.json
produces
[[0,"text",0,"data"]," "]
[[0,"text",1,"data"],"Company "]
[[0,"text",2,"data"],"Name"]
From this you can see your data has information in the paths .[0].text[0].data, .[0].text[1].data and .[0].text[2].data.
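To make the path/value pairs less abstract, here is a rough Python sketch of the same idea. It is an assumption-laden illustration, not a reimplementation of tostream: leaf_events is a hypothetical name, the closing events that tostream also emits are left out, and the sample is a trimmed stand-in for the question's data.

```python
import json

def leaf_events(node, path=()):
    """Yield [path, leaf] pairs, roughly like the length-2 events of jq's tostream."""
    if isinstance(node, dict) and node:
        for key, value in node.items():
            yield from leaf_events(value, path + (key,))
    elif isinstance(node, list) and node:
        for index, value in enumerate(node):
            yield from leaf_events(value, path + (index,))
    else:
        # Scalars (and empty containers) are treated as leaves.
        yield [list(path), node]

doc = json.loads(
    '[{"number": 1, "text": [{"top": 83, "data": " "}, {"top": 333, "data": "Company "}]}]'
)

# Keep only the events whose path ends in "data", like select($p[-1]=="data").
data_events = [e for e in leaf_events(doc) if e[0] and e[0][-1] == "data"]
for event in data_events:
    print(json.dumps(event))
```

Each printed pair is a path into the document followed by the string found there, which is the shape the jq filter above produces.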
You can build on this using reduce to collect the values into groups based on the presence of the trailing space. With your data the following filter
reduce (
tostream
| select(length==2) as [$p,$v]
| select($p[-1]=="data")
) as [$p,$v] (
[""]
; .[-1] += $v
| if $v|endswith(" ")|not then . += [""] else . end
)
| map(select(. != ""))
produces
[" Company Name"]
This example only groups data into a list. You can use a more sophisticated reduce if you need.
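For readers who find the reduce hard to follow, the grouping rule can be sketched in Python. This is only an illustration of the same logic under the question's stated rule (a trailing space means "keep appending"); gather here is a plain Python function written for this sketch, and the input list is typed in by hand.

```python
def gather(values):
    """Group consecutive strings; a value without a trailing space closes its group."""
    groups = [""]                      # state: list of grouped values
    for value in values:
        groups[-1] += value            # add value to the last group
        if not value.endswith(" "):    # no trailing space: the group is complete
            groups.append("")          # start a new, empty group
    return [g for g in groups if g != ""]

print(gather([" ", "Company ", "Name"]))  # [' Company Name']
```

The leading space in the result comes from the first "data" value in the sample, which is a single space.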
Here is a Try it online! link you can experiment with.
To take this further let's use the following sample data:
[
{ "data":"Company " },
{ "data": "Address:" },
{ "data":"123 Main St. " },
{ "data":"Smallville " },
{ "data":"KS " },
{ "data":"606101" }
]
The filter as is will generate
["Company Address:","123 Main St. Smallville KS 606101"]
To convert that into an object you could add another reduce. For example this filter
reduce (
tostream
| select(length==2) as [$p,$v]
| select($p[-1]=="data")
) as [$p,$v] (
[""]
; .[-1] += $v
| if $v|endswith(" ")|not then . += [""] else . end
)
| map(select(. != ""))
| reduce .[] as $e (
{k:"", o:{}}
; if $e|endswith(":") then .k = $e[:-1] else .o[.k] += $e end
)
| .o
produces
{"Company Address":"123 Main St. Smallville KS 606101"}
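The second reduce can be sketched in Python as well. Again this is only an illustration of the logic, assuming the groups have already been formed; combine is a plain function written for this sketch. As in the jq version, any group seen before the first key would be stored under the empty key.

```python
def combine(groups):
    """Turn ["Key:", "value", ...] groups into a dict keyed by the text before ':'."""
    result = {}
    key = ""                               # current key; empty until one is seen
    for group in groups:
        if group.endswith(":"):
            key = group[:-1]               # drop the colon and use it as the new key
        else:
            # Append to whatever is already stored under the current key.
            result[key] = result.get(key, "") + group
    return result

pairs = combine(["Company Address:", "123 Main St. Smallville KS 606101"])
print(pairs)  # {'Company Address': '123 Main St. Smallville KS 606101'}
```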
One last thing: at this point the filter is getting pretty large, so it would make sense to refactor a bit and break it down into functions so that it's easier to manage and extend. For example:
def extract:
  [ tostream
  | select(length==2) as [$p,$v]   # collect values for
  | select($p[-1]=="data")         # paths to "data"
  | $v                             # in an array
  ]
;
def gather:
  reduce .[] as $v (
    [""]                           # state: list of grouped values
  ; .[-1] += $v                    # add value to last group
  | if $v|endswith(" ")|not        # if the value does not end with " "
    then . += [""]                 # form a new group
    else .
    end
  )
  | map(select(. != ""))           # produce final result
;
def combine:
  reduce .[] as $e (
    {k:"", o:{}}                   # k: current key  o: combined object
  ; if $e|endswith(":")            # if value ends with a ":"
    then .k = $e[:-1]              # use it as a new current key
    else .o[.k] += $e              # otherwise add to current key's value
    end
  )
  | .o                             # produce the final object
;
extract                            # extract "data" values
| gather                           # gather into groups
| combine                          # combine into an object
Upvotes: 1