Python Web scraping: extract one attribute with multiple tags

Question

I am trying to scrape data from my account on an online bookmark service. The page with the bookmarks is organised as follows:

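<!-- Class names outer_block, title and tags match the XPath queries below; the "link" wrapper and href attributes are assumed. -->
<div class="outer_block">
    <div class="title">Bookmark 1</div>
    <div class="link">
        <a href="https://bookmark1.com">https://bookmark1.com</a>
    </div>
    <div class="tags">
        <a>tag1</a>
        <a>tag2</a>
    </div>
</div>
<div class="outer_block">
    <div class="title">Bookmark 2</div>
    <div class="link">
        <a href="https://bookmark2.com">https://bookmark2.com</a>
    </div>
    <div class="tags">
        <a>tag1</a>
    </div>
</div>
<div class="outer_block">
    <div class="title">Bookmark 3</div>
    <div class="link">
        <a href="https://bookmark3.com">https://bookmark3.com</a>
    </div>
    <div class="tags">
        <a>tag3</a>
    </div>
</div>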

For each block I would like to extract the title, the link and the tags. In Python 3.5, I do:

# Import modules
import requests
from lxml import html

# Read the html
# url = 'mylink'
# page = requests.get(url)
# tree = html.fromstring(page.content)
# This is the replicable example
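# Note: the enclosing <div>, the "link" wrapper and the href values are assumed
tree = html.fromstring("""
<div>
  <div class="outer_block">
    <div class="title">Bookmark 1</div>
    <div class="link"><a href="https://bookmark1.com">https://bookmark1.com</a></div>
    <div class="tags"><a>tag1</a> <a>tag2</a></div>
  </div>
  <div class="outer_block">
    <div class="title">Bookmark 2</div>
    <div class="link"><a href="https://bookmark2.com">https://bookmark2.com</a></div>
    <div class="tags"><a>tag1</a></div>
  </div>
  <div class="outer_block">
    <div class="title">Bookmark 3</div>
    <div class="link"><a href="https://bookmark3.com">https://bookmark3.com</a></div>
    <div class="tags"><a>tag3</a></div>
  </div>
</div>
""")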

I use XPath to extract the strings I need, for example the titles:

titles = tree.xpath('//div[@class="title"]/text()')
print(titles)

['Bookmark 1', 'Bookmark 2', 'Bookmark 3']

In order to extract the tags, I use the same principle:

tags = tree.xpath('//div[@class="tags"]//a/text()')
print(tags)

['tag1', 'tag2', 'tag1', 'tag3']

The problem is that each bookmark has a different number of tags, so I cannot match the entries in titles with the entries in tags. I thought I could extract each block and then work on them separately:

blocks = tree.xpath('//div[@class="outer_block"]')
block1 = blocks[0]

What I don't understand is that when I extract the tags from block1, I still get all of the tags from the original HTML.

tags_block1 = block1.xpath('//div[@class="tags"]//a/text()')
print(tags_block1)

['tag1', 'tag2', 'tag1', 'tag3']

How do I extract each title together with its corresponding tags? What is the best output format, and is there another package that could do the job more easily?

Mike R · Accepted Answer

You should think about using BeautifulSoup. Consider the code below (source is a string of the HTML):

from bs4 import BeautifulSoup 

soup = BeautifulSoup(source, "html.parser")
outer_blocks = soup.find_all("div", class_="outer_block")

for block in outer_blocks:
    title = block.find("div", class_="title").contents[0]
    link = block.find("a").contents[0]
    tags = [x.contents[0] for x in block.find("div", class_="tags").find_all("a")]
    print([title, link, tags])

The output is:

['Bookmark 1', 'https://bookmark1.com', ['tag1', 'tag2']]
['Bookmark 2', 'https://bookmark2.com', ['tag1']]
['Bookmark 3', 'https://bookmark3.com', ['tag3']]
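
As a side note on the lxml attempt in the question: an XPath expression that starts with `//` is evaluated from the document root, even when it is called on a sub-element, which is why `block1.xpath('//div[@class="tags"]//a/text()')` still returned every tag. Prefixing the expression with a dot (`.//`) makes it relative to the block. The following is only a minimal sketch under assumptions: the sample markup (in particular the `link` wrapper and the `href` values) and the helper name `extract_bookmarks` are hypothetical, and a list of dictionaries is just one reasonable output format.

```python
import json

from lxml import html

# Hypothetical sample markup. Only the outer_block, title and tags classes
# come from the question; the "link" wrapper and href values are assumed.
SOURCE = """
<div>
  <div class="outer_block">
    <div class="title">Bookmark 1</div>
    <div class="link"><a href="https://bookmark1.com">https://bookmark1.com</a></div>
    <div class="tags"><a>tag1</a><a>tag2</a></div>
  </div>
  <div class="outer_block">
    <div class="title">Bookmark 2</div>
    <div class="link"><a href="https://bookmark2.com">https://bookmark2.com</a></div>
    <div class="tags"><a>tag1</a></div>
  </div>
</div>
"""


def extract_bookmarks(source):
    """Return one dict per outer_block, keeping each title with its own tags."""
    tree = html.fromstring(source)
    bookmarks = []
    for block in tree.xpath('//div[@class="outer_block"]'):
        # The leading "." keeps every query inside the current block;
        # without it, "//" would search the whole document again.
        title = block.xpath('string(.//div[@class="title"])').strip()
        # First <a> in the block, mirroring block.find("a") in the answer above.
        link = block.xpath('string((.//a)[1])').strip()
        tags = [t.strip() for t in block.xpath('.//div[@class="tags"]//a/text()')]
        bookmarks.append({"title": title, "link": link, "tags": tags})
    return bookmarks


if __name__ == "__main__":
    print(json.dumps(extract_bookmarks(SOURCE), indent=2))
```

A list of dictionaries keeps each title next to its own tags and can be written out as JSON directly, which may be convenient if the bookmarks are to be imported elsewhere.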
