Reputation: 1562
I am trying to scrape data from my account on an online bookmark service. The page with the bookmarks is organised as follows:
<!DOCTYPE html>
<html lang="en">
<body>
<div id="item1" class="outer_block">
<div class="title">Bookmark 1</div>
<div class="link">
<a href="https://bookmark1.com">https://bookmark1.com</a>
</div>
<div class="tags">
<a href="http://mylink.com/tag1">tag1</a>
<a href="http://mylink.com/tag2">tag2</a>
</div>
</div>
<div id="item2" class="outer_block">
<div class="title">Bookmark 2</div>
<div class="link">
<a href="https://bookmark2.com">https://bookmark2.com</a>
</div>
<div class="tags">
<a href="http://mylink.com/tag1">tag1</a>
</div>
</div>
<div id="item3" class="outer_block">
<div class="title">Bookmark 3</div>
<div class="link">
<a href="https://bookmark3.com">https://bookmark3.com</a>
</div>
<div class="tags">
<a href="http://mylink.com/tag3">tag3</a>
</div>
</div>
</body>
</html>
For each block I would like to extract the title, the link and the tags. In Python 3.5, I do:
# Import modules
import requests
from lxml import html
# Read the html
# url = 'mylink'
# page = requests.get(url)
# tree = html.fromstring(page.content)
# This is the replicable example
tree = html.fromstring('<!DOCTYPE html><html lang="en"><body><div id="item1" class="outer_block"> <div class="title">Bookmark 1</div> <div class="link"> <a href="https://bookmark1.com">https://bookmark1.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag1">tag1</a> <a href="http://mylink.com/tag2">tag2</a> </div></div><div id="item2" class="outer_block"> <div class="title">Bookmark 2</div> <div class="link"> <a href="https://bookmark2.com">https://bookmark2.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag1">tag1</a> </div></div><div id="item3" class="outer_block"> <div class="title">Bookmark 3</div> <div class="link"> <a href="https://bookmark3.com">https://bookmark3.com</a> </div> <div class="tags"> <a href="http://mylink.com/tag3">tag3</a> </div></div></body></html>')
I use xpath to extract patterns of strings, say the titles:
titles = tree.xpath('//div[@class="title"]/text()')
print(titles)
['Bookmark 1', 'Bookmark 2', 'Bookmark 3']
In order to extract the tags, I use the same principle:
tags = tree.xpath('//div[@class="tags"]//a/text()')
print(tags)
['tag1', 'tag2', 'tag1', 'tag3']
The problem is that each bookmark has a varying number of tags, so I cannot associate the list titles with the list tags.
I thought I could extract each block and then work on them separately:
blocks = tree.xpath('//div[@class="outer_block"]')
block1 = blocks[0]
What I don't understand is that when I extract the tags from block1, I still get all of the tags of the original HTML.
tags_block1 = block1.xpath('//div[@class="tags"]//a/text()')
print(tags_block1)
['tag1', 'tag2', 'tag1', 'tag3']
How do I extract each title together with its corresponding tags? What is the best output format, and is there another package that could do the job more easily?
Upvotes: 1
Views: 1422
Reputation: 598
You can combine two attribute conditions by putting each one in its own pair of brackets:
description = tree.xpath("//div[@class='details-content'][@itemprop='description']/text()")
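To illustrate what the two brackets do, here is a minimal sketch on a made-up fragment (the `details-content` class, the `itemprop` values and the variable `snippet` are hypothetical and do not come from the question's page). Each bracket is a predicate, and chaining them keeps only the elements that satisfy both:

```python
from lxml import html

# Hypothetical fragment, only for illustration (not the question's page)
snippet = html.fromstring(
    '<div>'
    '<div class="details-content" itemprop="description">Wanted</div>'
    '<div class="details-content" itemprop="other">Skipped</div>'
    '</div>'
)

# Both predicates must hold, so only the first inner div matches
description = snippet.xpath(
    "//div[@class='details-content'][@itemprop='description']/text()"
)
print(description)
```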
Upvotes: 0
Reputation: 302
You should think about using BeautifulSoup. Consider the code below (source is a string of the HTML):
from bs4 import BeautifulSoup
soup = BeautifulSoup(source, "html.parser")
outer_blocks = soup.find_all("div", class_="outer_block")
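for block in outer_blocks:
    # Text of the title div
    title = block.find("div", class_="title").contents[0]
    # Text of the first <a> in the block, which is the bookmark link
    link = block.find("a").contents[0]
    # Text of every <a> inside the tags div
    tags = [x.contents[0] for x in block.find("div", class_="tags").find_all("a")]
    print([title, link, tags])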
The output is:
['Bookmark 1', 'https://bookmark1.com', ['tag1', 'tag2']]
['Bookmark 2', 'https://bookmark2.com', ['tag1']]
['Bookmark 3', 'https://bookmark3.com', ['tag3']]
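If you would rather stay with lxml, the same grouping should also work there. In XPath, an expression starting with `//` is evaluated from the document root even when it is called on a single element, which is why block1 returned every tag on the page. Starting the expression with `.//` makes it relative to the current block. Below is a minimal sketch under that assumption; the compact `source` string and the list-of-dicts output format are just one possible choice, not the only sensible one:

```python
from lxml import html

# Compact copy of the question's HTML, so the example is self-contained
source = (
    '<!DOCTYPE html><html lang="en"><body>'
    '<div id="item1" class="outer_block"><div class="title">Bookmark 1</div>'
    '<div class="link"><a href="https://bookmark1.com">https://bookmark1.com</a></div>'
    '<div class="tags"><a href="http://mylink.com/tag1">tag1</a>'
    '<a href="http://mylink.com/tag2">tag2</a></div></div>'
    '<div id="item2" class="outer_block"><div class="title">Bookmark 2</div>'
    '<div class="link"><a href="https://bookmark2.com">https://bookmark2.com</a></div>'
    '<div class="tags"><a href="http://mylink.com/tag1">tag1</a></div></div>'
    '<div id="item3" class="outer_block"><div class="title">Bookmark 3</div>'
    '<div class="link"><a href="https://bookmark3.com">https://bookmark3.com</a></div>'
    '<div class="tags"><a href="http://mylink.com/tag3">tag3</a></div></div>'
    '</body></html>'
)
tree = html.fromstring(source)

bookmarks = []
for block in tree.xpath('//div[@class="outer_block"]'):
    # The leading dot keeps each query inside the current block
    title = block.xpath('.//div[@class="title"]/text()')[0]
    link = block.xpath('.//div[@class="link"]/a/@href')[0]
    tags = block.xpath('.//div[@class="tags"]//a/text()')
    bookmarks.append({"title": title, "link": link, "tags": list(tags)})

for bookmark in bookmarks:
    print(bookmark)
```

A list of dictionaries keeps each title next to its own tags and can be written out with the standard `json` module if needed.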
Upvotes: 1