Reputation: 45
I need to extract the two bolded values from the HTML below:
<div class="name-ad hidden" data-count="91">
<div class="name-data-item" data-name="**I NEED TO SCRAPE THIS**" data-
count="92">
<div class="name-data-name">Washington NH</div>
<div class="name-data-location">Sullivan, Washington,
NH<br></div><div class="name-data-status">**I NEED TO
SCRAPE THIS AS WELL**</div> </div>
Can this be done with the sed command? If not, how can I do this?
Thank you in advance!
Upvotes: 1
Views: 668
Reputation: 88626
With xmlstarlet and this more valid HTML (file.html):
<html>
<body>
<div class="name-ad hidden" data-count="91">
<div class="name-data-item" data-name="**I NEED TO SCRAPE THIS**" data-count="92">
<div class="name-data-name">Washington NH</div>
<div class="name-data-location">Sullivan, Washington, NH<br /></div>
<div class="name-data-status">**I NEED TO SCRAPE THIS AS WELL**</div>
</div>
</div>
</body>
</html>
Command:
xmlstarlet sel --html -t \
-v "//html/body/div/div/@data-name" \
-v "//html/body/div/div/div[@class='name-data-status']" file.html
Output:
**I NEED TO SCRAPE THIS****I NEED TO SCRAPE THIS AS WELL**
or with a newline:
xmlstarlet sel --html -t \
-v "//html/body/div/div/@data-name" \
-n \
-v "//html/body/div/div/div[@class='name-data-status']" file.html
Output:
**I NEED TO SCRAPE THIS**
**I NEED TO SCRAPE THIS AS WELL**
Upvotes: 1
Reputation: 1419
Try it with awk:
$ cat file
<div class="name-ad hidden" data-count="91">
<div class="name-data-item" data-name="**I NEED TO SCRAPE THIS**" data-
count="92">
<div class="name-data-name">Washington NH</div>
<div class="name-data-location">Sullivan, Washington,
NH<br></div><div class="name-data-status">**I NEED TO
SCRAPE THIS AS WELL**</div> </div>
$ awk -F\" '/name-data-item/ {print $4}' file
**I NEED TO SCRAPE THIS**
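The one-liner above only prints the data-name attribute. The status text is wrapped across two lines in the sample, so a per-line match cannot see it whole. A minimal sketch that also picks it up is to join the lines first and then match both patterns. The function name scrape_names is made up for illustration. It assumes the file is small enough to buffer, that attribute values contain no embedded double quotes, that the status div holds plain text with no nested tags, and that only the first item matters. For anything less regular than this sample, a real HTML parser is the safer choice.

```shell
# Hypothetical helper: print the first data-name value, then the first
# name-data-status text, from the file given as $1.
scrape_names() {
    awk '
        # Join every input line into one buffer so that values wrapped
        # across lines can be matched as a single string.
        { buf = buf " " $0 }
        END {
            # data-name=" is 11 characters; the match also includes the
            # closing quote, hence RLENGTH - 12 for the value itself.
            if (match(buf, /data-name="[^"]*"/))
                print substr(buf, RSTART + 11, RLENGTH - 12)
            # class="name-data-status"> is 25 characters; the match also
            # includes the "<" of the closing tag, hence RLENGTH - 26.
            if (match(buf, /class="name-data-status">[^<]*</))
                print substr(buf, RSTART + 25, RLENGTH - 26)
        }
    ' "$1"
}
```

Run against the file shown above:

$ scrape_names file
**I NEED TO SCRAPE THIS**
**I NEED TO SCRAPE THIS AS WELL**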
Upvotes: 1