Cellane

Reputation: 511

jq transform JSON structure by finding output values in input nested array

First of all, I’m sorry about the title. English isn’t my first language, but I wouldn’t know what to call what I’m trying to accomplish even in my mother tongue.

What I’m trying to do is take an input (automatically generated by downloading a page with curl, then converted from HTML to JSON in a very crude way using pup) and convert it into something that would be easier to work with later on. The input looks like this:

[
 {
  "children": [
   {
    "class": "label label-info",
    "tag": "span",
    "text": "Lesson"
   },
   {
    "tag": "h2",
    "text": "Is That So?"
   },
   {
    "tag": "p",
    "text": "Learn how to provide shortened answers with そうです and stay in the conversation with そうですか."
   },
   {
    "class": "btn btn-primary",
    "href": "https://www.nihongomaster.com/japanese/lessons/view/62/is-that-so",
    "tag": "a",
    "text": "Read Lesson"
   }
  ],
  "class": "row col-sm-12",
  "tag": "div"
 },
 {
  "children": [
   {
    "class": "label label-warning",
    "tag": "span",
    "text": "Drills"
   },
   {
    "tag": "h2",
    "text": "Yes, That Is So."
   },
   {
    "tag": "p",
    "text": "Practice the phrases and vocab from the lesson, Is That So?"
   }
  ],
  "class": "row col-sm-12",
  "tag": "div"
 }
]

And my desired output would pull various values from each object’s children array into something like this:

[
  {
    "title": "Is That So?", // <-- in other words, find "tag" == "h2" and output "text" value
    "perex": "Learn how to provide shortened answers with そうです and stay in the conversation with そうですか.", // "tag" == "p", "text" value
    "type": "lesson", // "tag" == "span", "text" value (lowercased if possible? Not needed though)
    "link": "https://www.nihongomaster.com/japanese/lessons/view/62/is-that-so" // "tag" == "a", "href" value
  },
  {
    "title": "Yes, That Is So.",
    "perex": "Practice the phrases and vocab from the lesson, Is That So?",
    "type": "drills",
    "link": null // Can be missing!
  }
]

I tried various experiments with the select function but got nowhere near any usable result, so I’m not sure if my attempts are even worth sharing.

Upvotes: 0

Views: 138

Answers (2)

peak

Reputation: 117027

Here's a straightforward solution to the original problem:

[
  .[]
  | .children
  | { title: [.[] | select(.tag == "h2") | .text][0],
      perex: [.[] | select(.tag == "p") | .text][0],
      type:  [.[] | select(.tag == "span") | .text | ascii_downcase][0],
      link:  [.[] | select(.tag == "a") | .href][0] }
]

The key point here is the idiom [...][0]: collecting the matches into an array and taking index 0 yields the first match, or null when there are none, so it handles any number of items in ... (including 0).
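To see why the array wrapper matters, here is a minimal sketch, assuming jq 1.5 or later is on PATH. The one-item input is made up for illustration: it has an h2 child but no a child, so the link lookup matches nothing.

```shell
# Hypothetical one-item input: the only child is an h2, so there is no "a" tag.
input='[{"children":[{"tag":"h2","text":"Only a title"}]}]'

# Bare stream in an object constructor: an empty stream yields no object at all.
bare=$(printf '%s' "$input" | jq -c '[.[] | .children | {link: (.[] | select(.tag == "a") | .href)}]')

# Wrapped in [...][0]: the empty array indexes to null, so the object survives.
wrapped=$(printf '%s' "$input" | jq -c '[.[] | .children | {link: [.[] | select(.tag == "a") | .href][0]}]')

echo "$bare"     # []
echo "$wrapped"  # [{"link":null}]
```

Without the wrapper the whole entry disappears from the result; with it, the entry is kept and the missing field becomes null.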

Upvotes: 1

Cellane

Reputation: 511

While writing the question above, I somewhat randomly stumbled upon the correct solution. Rather than keeping the knowledge to myself, I thought I’d share the answer here as well. Please feel free to delete this entire question & answer if that’s not in line with the site rules (I’m sorry if that’s the case).

select really is the key, but I was not using it in the correct way at the time of writing the question. Here is the full jq command to accomplish my needs, showcasing all of the above requirements:

  • how to select the nested values based on searching through the children array;
  • how to lowercase the type value;
  • how to deal with sometimes missing link values;
  • (something I did not realize at the time: sometimes I also want to alter the form of the link, so I added that as well).

def format(link):
  if link | tostring | startswith("/")
  then "https://www.nihongomaster.com" + link
  else link
  end;

[
  .[]
  | { title: .children[] | select(.tag == "h2").text,
      type:  .children[] | select(.tag == "span").text | ascii_downcase,
      perex: .children[] | select(.tag == "p").text,
      link:  format(((.children[] | select(.tag == "a").href) // null)) }
]
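One thing to watch in the filter above: only link is protected by // null. If an entry lacked an h2, span, or p child, the object constructor would produce nothing for it and the entry would silently drop out of the result. Several matches for one tag would instead produce several objects. The following is a minimal sketch, not a definitive implementation, that combines the link rewriting with the [...][0] idiom from the other answer. It assumes jq 1.5 or later; the helper names absolute and child and the sample input are hypothetical.

```shell
# Hypothetical input: the first entry has a relative link,
# the second has neither a span nor a link.
input='[
  {"children":[{"tag":"span","text":"Lesson"},{"tag":"h2","text":"A"},
               {"tag":"p","text":"B"},{"tag":"a","href":"/japanese/lessons/view/62"}]},
  {"children":[{"tag":"h2","text":"C"},{"tag":"p","text":"D"}]}
]'

result=$(printf '%s' "$input" | jq -c '
  # Prefix relative links with the site root; pass anything else (including null) through.
  def absolute:
    if type == "string" and startswith("/")
    then "https://www.nihongomaster.com" + .
    else .
    end;

  # First child with the given tag, or null when there is none.
  def child($tag): [.children[] | select(.tag == $tag)][0];

  [ .[]
    | { title: child("h2").text,
        type:  (child("span").text | if . == null then null else ascii_downcase end),
        perex: child("p").text,
        link:  (child("a").href | absolute) } ]
')

echo "$result"
```

With this input the second entry is kept, with type and link both set to null, and the relative link in the first entry comes out with the site root prepended.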

There really is nothing better than rubber-duck debugging.

Upvotes: 0
