Get the content of tr in tbody

Question

I have the following table:


        
            
Extent of IFRS application | Status | Additional Information
IFRS Standards are required for domestic public companies | (empty) | (empty)
IFRS Standards are permitted but not required for domestic public companies | (tick image) | Permitted, but very few companies use IFRS Standards.
IFRS Standards are required or permitted for listings by foreign companies | (empty) | (empty)
The IFRS for SMEs Standard is required or permitted | (tick image) | The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.
The IFRS for SMEs Standard is under consideration

I am trying to extract the data so that it matches the original source:

This is my code so far:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))

table1 = gdp[0]
body = table1.find_all("tr")
head = body[0] 
body_rows = body[1:] 

headings = []
for item in head.find_all("th"):
    item = (item.text).rstrip("\n")
    headings.append(item)
print(headings)

all_rows = [] 
for row_num in range(len(body_rows)): 
    row = [] 
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(aa)
    all_rows.append(row)

df = pd.DataFrame(data=all_rows,columns=headings)

This is the only output I get:

Number of tables on site:  1
['Extent of IFRS application', 'Status', 'Additional Information']

I want to replace the empty cells with False and the cells containing the check-mark image with True.

Danila Ganchar · Accepted Answer

You need to look for an img element inside the Status td of each row. Here is an example:

data = []
for tr in body_rows:
    cells = tr.find_all('td')
    # Status is True only when the second cell holds the tick image
    img = cells[1].find('img')
    status = img is not None and img.get('src') == '/images/icons/tick.png'

    data.append({
        # get_text(strip=True) returns '' for an empty cell instead of None
        'Extent of IFRS application': cells[0].get_text(strip=True),
        'Status': status,
        'Additional Information': cells[2].get_text(strip=True),
    })

print(pd.DataFrame(data).head())
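
If you want to try the same idea without hitting the site, here is a minimal sketch that runs on an inline sample. Note that it swaps BeautifulSoup for the standard library's html.parser so it has no third-party dependencies. SAMPLE_HTML, StatusTableParser and parse_status_rows are hypothetical names made up for this illustration, and the sample markup is only an assumption about what the page looks like (the real HTML may differ).

```python
from html.parser import HTMLParser

# Hypothetical sample markup (an assumption, not copied from the real page).
SAMPLE_HTML = """
<table class="adoption-status-table">
  <tr><th>Extent of IFRS application</th><th>Status</th><th>Additional Information</th></tr>
  <tr>
    <td>IFRS Standards are required for domestic public companies</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>IFRS Standards are permitted but not required for domestic public companies</td>
    <td><img src="/images/icons/tick.png"></td>
    <td>Permitted, but very few companies use IFRS Standards.</td>
  </tr>
</table>
"""

TICK_SRC = "/images/icons/tick.png"  # image path taken from the answer above


class StatusTableParser(HTMLParser):
    """Collect, for every <tr> with <td> cells, a list of (text, has_tick) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self._cells = None   # cells of the row currently being parsed
        self._text = None    # text fragments of the current <td>
        self._tick = False   # True once the tick image is seen in the current <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text, self._tick = [], False
        elif tag == "img" and self._text is not None:
            if dict(attrs).get("src") == TICK_SRC:
                self._tick = True

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            # Collapse runs of whitespace into single spaces
            text = " ".join("".join(self._text).split())
            self._cells.append((text, self._tick))
            self._text = None
        elif tag == "tr":
            if self._cells:  # header rows have no <td>, so they are skipped
                self.rows.append(self._cells)
            self._cells = None


def parse_status_rows(html):
    """Return one dict per data row, with Status as a boolean."""
    parser = StatusTableParser()
    parser.feed(html)
    parser.close()
    return [
        {
            "Extent of IFRS application": cells[0][0],
            "Status": cells[1][1],
            "Additional Information": cells[2][0],
        }
        for cells in parser.rows
        if len(cells) == 3
    ]


rows = parse_status_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), the same way the data list is used above.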
