Reputation: 145
I have the HTML source below that I need to scrape. It contains a table that is not marked up with a <table> tag, and I cannot identify the rows by tag because the same tag and class are used throughout the HTML. How do I scrape the data to get the output below? The header line always stays the same; the data within the table varies.
Code
<p>
<div class="my-test-class" style="white-space: pre-wrap; font-size: small; font-family: "Courier New"">
<div class="my-test-class">Random text goes on.........</div>
<div class="my-test-class"><br></div>
<div class="my-test-class">Header1 Header2 Header3 Header4 Header5
</div>
<div class="my-test-class">--------------------------------------
</div>
<div class="my-test-class">A1 B1 C1 D1 E1</div>
<div class="my-test-class">A2 B2 C2 D2 E2</div>
<div class="my-test-class">A3 B3 C3 D3 E3</div>
<div class="my-test-class">--------------------------------------
</div>
</div>
</p>
Output:
Header1 Header2 Header3 Header4 Header5
--------------------------------------
A1 B1 C1 D1 E1
A2 B2 C2 D2 E2
A3 B3 C3 D3 E3
--------------------------------------
Scraping code so far:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''\
<p>
<div class="my-test-class" style="white-space: pre-wrap; font-size: small; font-family: "Courier New"">
<div class="my-test-class">Random text goes on.........</div>
<div class="my-test-class"><br></div>
<div class="my-test-class">Header1 Header2 Header3 Header4 Header5
</div>
<div class="my-test-class">--------------------------------------
</div>
<div class="my-test-class">A1 B1 C1 D1 E1</div>
<div class="my-test-class">A2 B2 C2 D2 E2</div>
<div class="my-test-class">A3 B3 C3 D3 E3</div>
<div class="my-test-class">--------------------------------------
</div>
</div>
</p>
''')
h = soup.find_all(text=re.compile('Header1*'))
print(h)
Upvotes: 2
Views: 120
Reputation: 195438
You can locate the header (the element just before the first dashed line); the header together with its .find_next_siblings() is your table:
from bs4 import BeautifulSoup
txt = '''<p>
<div class="my-test-class" style="white-space: pre-wrap; font-size: small; font-family: "Courier New"">
<div class="my-test-class">Random text goes on.........</div>
<div class="my-test-class"><br></div>
<div class="my-test-class">Header1 Header2 Header3 Header4 Header5
</div>
<div class="my-test-class">--------------------------------------
</div>
<div class="my-test-class">A1 B1 C1 D1 E1</div>
<div class="my-test-class">A2 B2 C2 D2 E2</div>
<div class="my-test-class">A3 B3 C3 D3 E3</div>
<div class="my-test-class">--------------------------------------
</div>
</div>
</p>'''
soup = BeautifulSoup(txt, 'html.parser')
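# Find the first dashed separator, then step back to the header <div> before it.
# string= is the current name for the older text= argument.
header = soup.find(string=lambda t: '----' in t).parent.find_previous()
# strip=True drops the header's trailing newline so no blank line is printed
print(header.get_text(strip=True))
print(*[tag.get_text(strip=True) for tag in header.find_next_siblings()], sep='\n')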
Prints:
Header1 Header2 Header3 Header4 Header5
--------------------------------------
A1 B1 C1 D1 E1
A2 B2 C2 D2 E2
A3 B3 C3 D3 E3
--------------------------------------
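If you need structured rows rather than printed lines, the extracted text can be split on whitespace. The following is only a minimal sketch: parse_table is a hypothetical helper name, and it assumes that no cell contains a space and that separator rows consist solely of dashes. It works on plain strings, so it can be fed the header text and sibling texts collected above.

```python
def parse_table(lines):
    """Hypothetical helper: turn header/separator/data lines into a list of dicts."""
    # Drop blank lines and separator rows made up only of dashes.
    rows = [ln.split() for ln in lines if ln.strip() and set(ln.strip()) != {"-"}]
    if not rows:
        return []
    header, *data = rows
    # Pair each cell with its column name (assumes cells contain no spaces).
    return [dict(zip(header, cells)) for cells in data]


lines = [
    "Header1 Header2 Header3 Header4 Header5",
    "--------------------------------------",
    "A1 B1 C1 D1 E1",
    "A2 B2 C2 D2 E2",
    "A3 B3 C3 D3 E3",
    "--------------------------------------",
]
records = parse_table(lines)
print(records[0])
```

If the real columns can contain spaces, splitting on whitespace will not be enough and fixed column offsets would be needed instead.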
Upvotes: 1
Reputation: 5531
This should help you:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''\
<p>
<div class="my-test-class" style="white-space: pre-wrap; font-size: small; font-family: "Courier New"">
<div class="my-test-class">Random text goes on.........</div>
<div class="my-test-class"><br></div>
<div class="my-test-class">Header1 Header2 Header3 Header4 Header5
</div>
<div class="my-test-class">--------------------------------------
</div>
<div class="my-test-class">A1 B1 C1 D1 E1</div>
<div class="my-test-class">A2 B2 C2 D2 E2</div>
<div class="my-test-class">A3 B3 C3 D3 E3</div>
<div class="my-test-class">--------------------------------------
</div>
</div>
</p>
''', 'html5lib')
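# string=True keeps only <div>s whose .string is set (skips the outer wrapper and the <br> row)
divs = soup.find_all('div', class_="my-test-class", string=True)
lines = [elem.text.strip() for elem in divs]
pattern = re.compile('[A-Z][0-9]')
# Keep the header, the dashed separators, and rows that start like "A1"
for line in lines:
    if 'Header' in line or '-' in line or pattern.match(line):
        print(line)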
Output:
Header1 Header2 Header3 Header4 Header5
--------------------------------------
A1 B1 C1 D1 E1
A2 B2 C2 D2 E2
A3 B3 C3 D3 E3
--------------------------------------
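Since the question says the data varies, a filter that depends on what the rows look like may miss some of them. A possible alternative is to select by position: start at the known header line and stop after the second dashed line. This is a sketch under assumptions, not a tested drop-in: slice_table is a hypothetical name, the header text is taken from the question, and the table is assumed to end at the second separator. It takes the list of div texts, so it does not depend on the parser used.

```python
def slice_table(texts, header="Header1 Header2 Header3 Header4 Header5"):
    """Hypothetical helper: return the lines from the header through the second dashed line."""
    lines = [t.strip() for t in texts]
    try:
        start = lines.index(header)
    except ValueError:
        return []  # header not present
    table, separators = [], 0
    for line in lines[start:]:
        table.append(line)
        # A separator is a non-empty line made up only of dashes.
        if line and set(line) == {"-"}:
            separators += 1
            if separators == 2:  # assumed end of the table
                break
    return table


texts = [
    "Random text goes on.........",
    "Header1 Header2 Header3 Header4 Header5\n",
    "--------------------------------------\n",
    "A1 B1 C1 D1 E1",
    "x-ray 42",
    "--------------------------------------\n",
    "Trailing text after the table",
]
table = slice_table(texts)
print(*table, sep="\n")
```

Rows such as "x-ray 42" are kept because only their position between the separators matters, not their content.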
Upvotes: 0