Reputation: 511
First of all, I’m sorry about the title. Even though English isn’t my first language, I wouldn’t even know how to call what I’m trying to accomplish in my mother tongue.
What I’m trying to do is take an input (automatically generated by downloading a page with curl
, then converted from HTML to JSON in a very crude way using pup
) and convert it into something that would be easier to work with later on. The input looks like this:
[
{
"children": [
{
"class": "label label-info",
"tag": "span",
"text": "Lesson"
},
{
"tag": "h2",
"text": "Is That So?"
},
{
"tag": "p",
"text": "Learn how to provide shortened answers with そうです and stay in the conversation with そうですか."
},
{
"class": "btn btn-primary",
"href": "https://www.nihongomaster.com/japanese/lessons/view/62/is-that-so",
"tag": "a",
"text": "Read Lesson"
}
],
"class": "row col-sm-12",
"tag": "div"
},
{
"children": [
{
"class": "label label-warning",
"tag": "span",
"text": "Drills"
},
{
"tag": "h2",
"text": "Yes, That Is So."
},
{
"tag": "p",
"text": "Practice the phrases and vocab from the lesson, Is That So?"
}
],
"class": "row col-sm-12",
"tag": "div"
}
]
And my desired output would pull various values from each object’s children
array into something like this:
[
{
"title": "Is That So?", // <-- in other words, find "tag" == "h2" and output "text" value
"perex": "Learn how to provide shortened answers with そうです and stay in the conversation with そうですか.", // "tag" == "p", "text" value
"type": "lesson", // "tag" == "span", "text" value (lowercased if possible? Not needed though)
"link": "https://www.nihongomaster.com/japanese/lessons/view/62/is-that-so" // "tag" == "a", "href" value
},
{
"title": "Yes, That Is So."
"perex": "Practice the phrases and vocab from the lesson, Is That So?",
"type": "drills",
"link": null // Can be missing!
}
]
I tried various experiments with the select
function but got nowhere near any usable result, so I’m not sure if my attempts are even worth sharing.
Upvotes: 0
Views: 138
Reputation: 117027
Here's a straightforward solution to the original problem:
[
.[]
| .children
| { title: [.[] | select(.tag == "h2") | .text][0],
perex: [.[] | select(.tag == "p") | .text][0],
type: [.[] | select(.tag == "span") | .text | ascii_downcase][0],
link: [.[] | select(.tag == "a") | .href][0] }
]
The key point here is the use of the idiom [...][0]
to handle all possibilities with respect to the number of items in ...
(including 0).
Upvotes: 1
Reputation: 511
In the process of writing the above question, I somewhat randomly stumbled upon the correct solution. Rather than keeping the knowledge for myself, I thought I’d share the answer here as well. Please feel free to delete this entire question & answer if that’s not aligned to the site rules (I’m sorry if that’s the case).
select
really is the key, but I was not using it in the correct way at the time of writing the question. Here is the full jq
command to accomplish my needs, showcasing all of the above requirements:
children
array;type
value;link
values;link
, so I added that as well).def format(link): if link | tostring | startswith("/") then "https://www.nihongomaster.com" + link else link end;
[.[] | { title: .children[] | select(.tag == "h2").text, type: .children[] | select(.tag == "span").text | ascii_downcase, perex: .children[] | select(.tag == "p").text, link: format(((.children[] | select(.tag == "a").href) // null)) }]
There really is nothing better than rubber-duck debugging.
Upvotes: 0