How to Apply a Regular Expression in BeautifulSoup with Python Using find_all()

Question

I am trying to scrape a website with multiple pages. Each page has multiple table tags with ids numbered from 19 to 29. The number of tables on each page varies.

Here is an example:

page 1 HTML

...
...

page 2 HTML

...
...
...

page 3 HTML

...
...
...

....

page n HTML

...

I am trying to isolate these tables from the HTML pages in order to scrape them. So far, I am able to loop through each page, but the regex that I wrote to extract the tables from each page doesn't seem to work. Please help me.

Here is my code:

tables = soup.find_all('table', id = re.compile('^table\d(19|2[0-9])'))

Andrej Kesely · Accepted Answer

You can use the regular expression 'table[12]\d' (regex101):

data = '''



'''

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(data, 'html.parser')

for table in soup.find_all('table', {'id':re.compile(r'table[12]\d')}):
    print(table)

Prints:

EDIT: To match only table 19 or tables 20-29, use a non-capturing group (regex101):

for table in soup.find_all('table', {'id':re.compile(r'table(?:19|2\d)')}):
    print(table)
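
To see why the pattern in the question finds nothing, here is a small self-contained sketch. The HTML and its ids (table5, table190, and so on) are made up for illustration and assume the real ids look like "table19". The question's pattern has a stray \d between "table" and the number, so it demands one extra digit. The sketch also compares the pattern above with an anchored variant (the ^ and $ are my addition, not part of the original answer): because find_all() applies a compiled regex with search(), an unanchored pattern also accepts a longer id such as "table190".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: these ids are assumed for illustration only.
html = """
<table id="table5"><tr><td>too low</td></tr></table>
<table id="table18"><tr><td>too low</td></tr></table>
<table id="table19"><tr><td>wanted</td></tr></table>
<table id="table24"><tr><td>wanted</td></tr></table>
<table id="table29"><tr><td>wanted</td></tr></table>
<table id="table30"><tr><td>too high</td></tr></table>
<table id="table190"><tr><td>too long</td></tr></table>
<div id="table20">not a table tag</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Pattern from the question: the extra "\d" requires one more digit
# before 19/2x, so an id like "table19" can never match.
original = re.compile(r"^table\d(19|2[0-9])")

# Pattern from the answer: unanchored, so it is found anywhere in the id.
unanchored = re.compile(r"table(?:19|2\d)")

# Anchored variant (an assumption about what is wanted): the whole id
# must be "table" followed by 19 or 20-29.
anchored = re.compile(r"^table(?:19|2\d)$")

original_ids = [t["id"] for t in soup.find_all("table", id=original)]
unanchored_ids = [t["id"] for t in soup.find_all("table", id=unanchored)]
anchored_ids = [t["id"] for t in soup.find_all("table", id=anchored)]

print(original_ids)
print(unanchored_ids)
print(anchored_ids)
```

If your ids never carry extra characters after the number, the unanchored pattern is enough; the anchors only matter when ids like "table190" can appear on the page.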
