Reputation: 1810
I am trying to scrape a website with multiple pages.
Each page has multiple <table> tags with ids ranging from table19 to table29. The number of tables on each page varies.
Here is an example:
page 1 HTML
<table id='table20'>...</table>
<table id='table25'>...</table>
page 2 HTML
<table id='table19'>...</table>
<table id='table21'>...</table>
<table id='table29'>...</table>
page 3 HTML
<table id='table19'>...</table>
<table id='table20'>...</table>
<table id='table21'>...</table>
....
page n HTML
<table id='table19'>...</table>
I am trying to isolate these tables from the HTML pages in order to scrape them. So far I am able to loop through each page, but the regex I wrote to extract the tables from each page doesn't seem to work. Please help me.
Here is my code:
tables = soup.find_all('table', id = re.compile('^table\d(19|2[0-9])'))
Upvotes: 0
Views: 1385
Reputation: 84465
If that id prefix is unique to the tables of interest, could you not use a CSS attribute selector with the starts-with operator (^=)?
for table in soup.select('table[id^=table]'):
    # do something with table
    pass
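If the prefix turns out not to be unique (for example, a page also contains table40), the numeric suffix can be checked after selection. The following is a minimal sketch using only the standard library; the helper name is_wanted_table_id and its low/high parameters are hypothetical and not part of BeautifulSoup.

```python
import re

def is_wanted_table_id(table_id, low=19, high=29):
    """Return True if table_id looks like 'table<N>' with low <= N <= high.

    Hypothetical helper for illustration; the bounds default to the
    19..29 range described in the question.
    """
    match = re.fullmatch(r'table(\d+)', table_id)
    return bool(match) and low <= int(match.group(1)) <= high

# Sample ids, including two that should be rejected.
ids = ['table19', 'table25', 'table29', 'table40', 'tableX']
wanted = [i for i in ids if is_wanted_table_id(i)]
print(wanted)
```

With BeautifulSoup, the same check could be applied as a filter on the selector result, e.g. [t for t in soup.select('table[id^=table]') if is_wanted_table_id(t.get('id', ''))].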
Upvotes: 0
Reputation: 195448
You can use the regular expression 'table[12]\d', which matches table10 through table29 (regex101):
data = '''<table id='table19'><tr></tr></table>
<table id='table20'><tr></tr></table>
<table id='table21'><tr></tr></table>
<table id='table40'><tr></tr></table>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'html.parser')
for table in soup.find_all('table', {'id':re.compile(r'table[12]\d')}):
    print(table)
Prints:
<table id="table19"><tr></tr></table>
<table id="table20"><tr></tr></table>
<table id="table21"><tr></tr></table>
EDIT: To match only table19 or table20 through table29, use a non-capturing group (regex101):
for table in soup.find_all('table', {'id':re.compile(r'table(?:19|2\d)')}):
    print(table)
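To see why the pattern in the question matched nothing, it can help to run the patterns against plain id strings. This is a small sketch with the standard re module only; the sample id list and the anchored variant with a trailing $ are assumptions added for illustration, not part of the answer above.

```python
import re

# Sample ids: three wanted ones plus two that should be rejected.
ids = ['table19', 'table20', 'table29', 'table40', 'table290']

# Pattern from the question: the extra "\d" consumes one digit before
# the two-digit alternatives, so a two-digit id can never match.
original = re.compile(r'^table\d(19|2[0-9])')

# Anchored variant (an assumption): "^" and "$" reject ids with extra
# leading or trailing characters, such as "table290".
anchored = re.compile(r'^table(?:19|2\d)$')

original_hits = [i for i in ids if original.search(i)]
anchored_hits = [i for i in ids if anchored.search(i)]

print(original_hits)
print(anchored_hits)
```

Because re.compile patterns passed to find_all are applied as a search, the unanchored 'table(?:19|2\d)' would also accept an id such as table290; the anchors only matter if ids like that can appear on the pages.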
Upvotes: 2