Reputation: 1
I have the following table :
<table class="table table-bordered adoption-status-table">
<thead>
<tr>
<th>Extent of IFRS application</th>
<th>Status</th>
<th>Additional Information</th>
</tr>
</thead>
<tbody>
<tr>
<td>IFRS Standards are required for domestic public companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>IFRS Standards are permitted but not required for domestic public companies</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>Permitted, but very few companies use IFRS Standards.</td>
</tr>
<tr>
<td>IFRS Standards are required or permitted for listings by foreign companies</td>
<td>
</td>
<td></td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is required or permitted</td>
<td>
<img src="/images/icons/tick.png" alt="tick">
</td>
<td>The IFRS for SMEs Standard is permitted, but very few companies use it. Nearly all SMEs use Paraguayan national accounting standards.</td>
</tr>
<tr>
<td>The IFRS for SMEs Standard is under consideration</td>
<td>
</td>
<td></td>
</tr>
</tbody>
</table>
I am trying to extract the data like in its original source :
This is my work :
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("table", attrs={"class": "adoption-status-table"})
print("Number of tables on site: ",len(gdp))
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]
body_rows = body[1:]
headings = []
for item in head.find_all("th"):
item = (item.text).rstrip("\n")
headings.append(item)
print(headings)
all_rows = []
for row_num in range(len(body_rows)):
row = []
for row_item in body_rows[row_num].find_all("td"):
aa = re.sub("(\xa0)|(\n)|,","",row_item.text)
row.append(aa)
all_rows.append(row)
df = pd.DataFrame(data=all_rows,columns=headings)
This is the only output I get :
Number of tables on site: 1
['Extent of IFRS application', 'Status', 'Additional Information']
I want to replace the NULL cells by False and the path to the image check by True.
Upvotes: 2
Views: 639
Reputation: 45402
Above answer is good, one other option is to use pandas.read_html
to extract the table into a dataframe and populate the missing Status
column using lxml
xpath (or beautifulsoup if you prefer but it's more verbose) :
import pandas as pd
import requests
from lxml import html
r = requests.get("https://www.ifrs.org/use-around-the-world/use-of-ifrs-standards-by-jurisdiction/paraguay")
table = pd.read_html(r.content)[0]
tree = html.fromstring(r.content)
table["Status"] = [True if t.xpath("img") else False for t in tree.xpath('//table/tbody/tr/td[2]')]
print(table)
Upvotes: 2
Reputation: 11222
You need to look for img
element inside td
. Here is an example:
data = []
for tr in body_rows:
cells = tr.find_all('td')
img = cells[1].find('img')
if img and img['src'] == '/images/icons/tick.png':
status = True
else:
status = False
data.append({
'Extent of IFRS application': cells[0].string,
'Status': status,
'Additional Information': cells[2].string,
})
print(pd.DataFrame(data).head())
Upvotes: 2