Praveen Kumar

Reputation: 88

Extract text from a tag nested inside other nested tags using BeautifulSoup in Python 3

I have an HTML page that repeats the same block of markup with different data, and I need to get the value "709". I can get all the text inside the tr tag, but I don't know how to reach inside the tr and get the text of the td alone. Please help. The HTML is below.

<table class="readonlydisplaytable">
	<tbody>
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Payer Phone #</th>
			<td class="readonlydisplayfielddata">1234</td>
		</tr>
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Name</th>
			<td class="readonlydisplayfielddata">ABC SERVICES</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Package #</th>
			<td class="readonlydisplayfielddata">709</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Case #</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Date</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Adjuster</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Adjuster Phone #</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Adjuster Fax #</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Body Part</th>
			<td class="readonlydisplayfielddata">n/a</td>
		</tr>
		
		<tr class="readonlydisplayfield">
			<th class="readonlydisplayfieldlabel">Deadline</th>
			<td class="readonlydisplayfielddata">11/22/2014</td>
		</tr>			
	</tbody>
</table>

Below is the code I used.

from selenium import webdriver
import os, time, csv, datetime
from selenium.webdriver.common.keys import Keys
import threading
import multiprocessing
from selenium.webdriver.support.select import Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import openpyxl
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd


soup = BeautifulSoup(open("C:\\Users\\mapraveenkumar\\Desktop\\phonepayor.htm"), "html5lib")
a = soup.find_all("table", class_="readonlydisplaytable")
for b in a:
    c = b.find_all("tr", class_="readonlydisplayfield")
    for d in c:
        if "Package #" in d.get_text():
            print(d.get_text())

Upvotes: 0

Views: 2098

Answers (2)

Bill Bell

Reputation: 21643

You want the text inside the td element adjacent to the th element that contains 'Package #'. I begin by looking for that string, then I find its parent and the parent's next siblings. As usual, I find it easiest to work in an interactive environment when I'm trying to elucidate how to capture what I want. I suspect that the main point is to use find_all with string=.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('temp.htm').read(),'lxml')
>>> target = soup.find_all(string='Package #')
>>> target
['Package #']
>>> target[0].find_parent()
<th class="readonlydisplayfieldlabel">Package #</th>
>>> target[0].find_parent().find_next_siblings()
[<td class="readonlydisplayfielddata">709</td>]
>>> tds = target[0].find_parent().find_next_siblings()
>>> tds[0].text
'709'
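
The same lookup can be written as a small script. This is only a sketch: the helper name value_for_label is hypothetical, the HTML is a trimmed copy of the table from the question, and it uses the built-in html.parser instead of lxml so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (an assumption for illustration).
html = """
<table class="readonlydisplaytable">
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Name</th>
    <td class="readonlydisplayfielddata">ABC SERVICES</td>
  </tr>
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Package #</th>
    <td class="readonlydisplayfielddata">709</td>
  </tr>
</table>
"""


def value_for_label(soup, label):
    """Hypothetical helper: text of the td that follows the th whose text is `label`."""
    # Find the th whose only string is exactly the label.
    th = soup.find("th", string=label)
    if th is None:
        return None
    # The next td sibling holds the value; whitespace text nodes are skipped.
    td = th.find_next_sibling("td")
    return td.get_text(strip=True) if td is not None else None


soup = BeautifulSoup(html, "html.parser")
package = value_for_label(soup, "Package #")
print(package)
```

Returning None for a missing label keeps the caller from hitting an IndexError when the page does not contain the field.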

Upvotes: 1

innicoder

Reputation: 2688

from bs4 import BeautifulSoup as bs

html = '''<the HTML from the question>'''
soup = bs(html, 'lxml')

find_tr = soup.find_all('tr')  # find every 'tr' tag
for i in find_tr:
    for j in i.find_all('th'):  # iterate through the 'th' tags in this 'tr'
        print(j)
    for k in i.find_all('td'):  # iterate through the 'td' tags in this 'tr'
        print(k)

This should do the job. We loop over each tr tag and, for each one, run two inner loops that find all of its th and td tags. For example, one tr looks like this:

<tr class="readonlydisplayfield">
        <th class="readonlydisplayfieldlabel">Payer Phone #</th>
        <td class="readonlydisplayfielddata">1234</td>
</tr>

This also works if a row has more than one td or th tag. If each row has only one th and one td, we can do the following:

find_tr = soup.find_all('tr')  # find all tr tags
for i in find_tr:  # go through each tr
    print(i.th.text)  # .th gives the first th tag in this tr
    print(i.td.text)  # .td gives the first td tag; .text is its text
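
To pick out only the "Package #" value instead of printing every row, one option is to collect the rows into a dictionary keyed by the th text. The following is a sketch under assumptions: the HTML is a trimmed copy of the question's table, the variable name fields is made up for illustration, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (an assumption for illustration).
html = """
<table class="readonlydisplaytable">
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Payer Phone #</th>
    <td class="readonlydisplayfielddata">1234</td>
  </tr>
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Package #</th>
    <td class="readonlydisplayfielddata">709</td>
  </tr>
  <tr class="readonlydisplayfield">
    <th class="readonlydisplayfieldlabel">Deadline</th>
    <td class="readonlydisplayfielddata">11/22/2014</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each label (th text) to its value (td text).
fields = {}
for row in soup.find_all("tr", class_="readonlydisplayfield"):
    # Skip rows that lack either cell so .get_text() is never called on None.
    if row.th is None or row.td is None:
        continue
    fields[row.th.get_text(strip=True)] = row.td.get_text(strip=True)

print(fields["Package #"])
```

If the page repeats the same table several times, later rows with the same label overwrite earlier ones in this sketch; building one dictionary per table avoids that.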

Upvotes: 0
