user3314418

Reputation: 3041

Beautiful Soup Using Regex to Find Tags?

I'd like Beautiful Soup to match any tag from a list of tag names, as in the example below. I know the attrs argument accepts a regex, but is there anything in Beautiful Soup that does the same for tag names?

soup.findAll("(a|div)")

Output:

<a> ASDFS
<div> asdfasdf
<a> asdfsdf

My goal is to create a scraper that can grab tables from sites. Tags are sometimes named inconsistently, so I'd like to be able to pass in a list of tag names that identify the 'data' part of a table.

Upvotes: 39

Views: 105503

Answers (3)

Manu J4

Reputation: 2859

Note that you can also use regular expressions to search in attributes of tags. For example:

import re
from bs4 import BeautifulSoup

# Assumes `soup` is an existing BeautifulSoup object.
soup.find_all('a', {'href': re.compile(r'crummy\.com/')})

This example finds all <a> tags whose href attribute contains the substring 'crummy.com/'.
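For anyone who wants to try the attribute filter end to end, here is a minimal self-contained sketch. The markup and the example.com link are made up for illustration, and the 'html.parser' backend is just one possible choice of parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration only.
html = """
<a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>
<a href="http://example.com/">Example</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> tags whose href contains "crummy.com/".
links = soup.find_all("a", {"href": re.compile(r"crummy\.com/")})
hrefs = [a["href"] for a in links]
print(hrefs)
```

Only the first link should survive the filter, since the second href does not contain the pattern.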

Upvotes: 94

hwnd

Reputation: 70732

find_all() is the most popular method in the Beautiful Soup search API.

It accepts a variety of filters. To find multiple tags, pass a list of tag names:

>>> soup.find_all(['a', 'div'])

Example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div>asdfasdf</div><p><a>foo</a></p></body></html>', 'html.parser')
>>> soup.find_all(['a', 'div'])
[<div>asdfasdf</div>, <a>foo</a>]
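As a rough sketch of how the list filter might serve the table scraper described in the question: the helper name table_rows, its cell_tags parameter, and the sample markup are all hypothetical, made up for illustration. It assumes cells sit inside <tr> rows and that tables are not nested.

```python
from bs4 import BeautifulSoup

def table_rows(html, cell_tags=("td", "th")):
    """Return each <tr> as a list of stripped cell strings.

    cell_tags is a hypothetical parameter: the tag names that count
    as the 'data' part of a row on a given site.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Passing a list matches any tag whose name is in the list.
        cells = tr.find_all(list(cell_tags))
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

# Hypothetical sample table mixing <th> and <td> cells.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Ann</td><td> 3 </td></tr>
</table>
"""
rows = table_rows(html)
print(rows)
```

If a site uses different tags for its cells, only the cell_tags argument would need to change.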

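Or you can use a regular expression. Beautiful Soup applies it with search(), so the pattern below matches every tag whose name contains "a" or "div", including tags such as <span> or <table>. Anchor it as "^(a|div)$" to match only <a> and <div>.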

>>> import re
>>> soup.find_all(re.compile("(a|div)"))
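To see what the regex filter actually matches, here is a small sketch comparing the unanchored pattern with an anchored variant. The extra <span> in the markup and the anchored pattern are additions for illustration, not part of the answer above.

```python
import re
from bs4 import BeautifulSoup

# Same markup as the example above, plus a <span> added for illustration.
html = "<html><body><div>asdfasdf</div><p><a>foo</a></p><span>bar</span></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches any tag name that contains "a" or "div".
loose = [tag.name for tag in soup.find_all(re.compile("(a|div)"))]

# Anchored: matches only tags named exactly "a" or "div".
strict = [tag.name for tag in soup.find_all(re.compile(r"^(a|div)$"))]

print(loose)
print(strict)
```

The unanchored pattern also picks up <span>, because "span" contains an "a"; the anchored one does not.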

Upvotes: 56

ZJS

Reputation: 4051

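Yes, see the docs:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

import re

# Matches tags named exactly "a", or any tag whose name contains "div".
# findAll is the Beautiful Soup 3 spelling; Beautiful Soup 4 prefers find_all.
soup.findAll(re.compile("^a$|(div)"))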

Upvotes: 7
