Fabian

Reputation: 127

Parse HTML with CURL in Shell Script

I'm trying to parse a specific content of a webpage in shell script.

I need to grep the content inside the <div> tag.

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>

If I use grep -E -m 1 -o '<div class="tracklistInfo">', the output is only <div class="tracklistInfo">

How can I access the artist (Diplo - Justin Bieber - Skrillex) and the title (Where Are U Now)?

Upvotes: 5

Views: 16678

Answers (5)

eMPee584

Reputation: 2054

Because this will come up in searches, here are some more CLI tools to extract data from HTML:

  • xidel: download and extract data from HTML/XML pages using CSS selectors, XPath/XQuery 3.0, as well as querying JSON
  • htmlq: Like jq, but for HTML.
  • pup: command line tool for processing HTML … using CSS selectors
  • tq: Perform a lookup by CSS selector on an HTML input
  • html-xml-utils: hxextract (extract selected elements) & hxselect (extract elements that match a (CSS) selector)
  • hq: lightweight command line HTML processor using CSS and XPath selectors
  • cascadia: CSS selector CLI tool
  • xpe: commandline xpath tool that is easy to use
  • hred: html reduce … reads HTML from standard input and outputs JSON
  • parsel: Select parts of a HTML document based on CSS selectors

And here's a popularity chart for those projects available on GitHub:

[Star History chart]

Upvotes: 0

Reino

Reputation: 3423

Your title starts with "Parse HTML with CURL", but curl is not an HTML parser. If you want to use a command-line tool, use xidel instead.

xidel -s "<url>" -e '//div[@class="tracklistInfo"]/p'
Diplo - Justin Bieber - Skrillex
Where Are U Now

xidel -s "<url>" -e '//div[@class="tracklistInfo"]/join(p," | ")'
Diplo - Justin Bieber - Skrillex | Where Are U Now

Upvotes: 2

Casimir et Hippolyte

Reputation: 89557

Using xmllint:

a='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' <<<"$a"

You obtain:

Diplo - Justin Bieber - Skrillex#Where Are U Now

That can be easily separated.
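For instance, the "#" separator makes the split trivial with read and a custom IFS (a small sketch, assuming the output above):

```shell
# Split the "Artist#Title" output from xmllint into two variables.
result='Diplo - Justin Bieber - Skrillex#Where Are U Now'
IFS='#' read -r artist title <<<"$result"
echo "$artist"   # Diplo - Justin Bieber - Skrillex
echo "$title"    # Where Are U Now
```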

Upvotes: 7

Ali ISSA

Reputation: 408

cat - > file.html << EOF
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF


tr -d '\n' < file.html | sed -e "s/<\/div>/<\/div>\n/g" | sed -n 's/^.*class="artist">\([^<]*\)<\/p> *<p>\([^<]*\)<.*$/artist : \1\ntitle : \2\n/p'
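A somewhat easier-to-read variant of the same idea: pull out every complete <p> element first, then strip the tags (a sketch that assumes each <p> fits on a single line, as in the sample file above):

```shell
# Build the same sample file as above.
cat > file.html << 'EOF'
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF

# Extract each <p>...</p> element, then strip the tags.
grep -o '<p[^>]*>[^<]*</p>' file.html | sed 's/<[^>]*>//g'
# Diplo - Justin Bieber - Skrillex
# Where Are U Now
# toto
# tata
```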

Upvotes: 1

Martin Tournoij

Reputation: 27822

Don't. Use an HTML parser. For example, BeautifulSoup for Python is easy to use and handles this sort of thing well.

That being said, remember that grep works on lines. The pattern is matched for every line, not for the entire string.

What you can use is -A to also print out lines after the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

Should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then get the last or second-last line by piping it to tail:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And strip the HTML with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex


But as said, this is fickle, likely to break, and not very pretty. Here's the same with BeautifulSoup, by the way:

html = '''<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works with multiple rows of tracklistInfo; adding that to the shell command requires more work ;-)
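For comparison, here is one way to handle multiple tracklistInfo blocks in the shell after all, using awk with "</div>" as the record separator (a sketch that assumes GNU awk or mawk, which accept a multi-character RS, and the exact markup shape from the question):

```shell
cat > tracks.html << 'EOF'
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF

# Each "</div>" ends a record; only process records with the class we want.
awk 'BEGIN { RS = "</div>" }
/tracklistInfo/ {
    # Text between class="artist"> and the next tag.
    if (match($0, /class="artist">[^<]*/))
        print "artist:", substr($0, RSTART + 15, RLENGTH - 15)
    # Text of the first plain <p> (the title line).
    if (match($0, /<p>[^<]*/))
        print "title:", substr($0, RSTART + 3, RLENGTH - 3)
}' tracks.html
# artist: Diplo - Justin Bieber - Skrillex
# title: Where Are U Now
# artist: toto
# title: tata
```

Like any regex-over-HTML approach, this stays fragile on real pages, so a proper parser is still the safer route.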

Upvotes: 1
