Reputation: 129
I have the HTML table shown below. I need to extract specific values from it and assign them to variables; I do not need all the information. For example: flag = "United Arab Emirates", home_port = "Sharjah", etc. Since the th and td elements have no 'class' attribute, how can I extract this data?
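import requests
from bs4 import BeautifulSoup

# imo_number holds the ship's IMO number and is defined elsewhere
r = requests.get('http://maritime-connector.com/ship/' + str(imo_number), headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", {"class": "ship-data-table"})
for row in table.find_all("tr"):
    tname = row.find_all("th")  # header cell(s) of this row
    cells = row.find_all("td")  # value cell(s) of this row
    print(type(tname))
    print(type(cells))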
I am using the Python module BeautifulSoup.
<table class="ship-data-table" style="margin-bottom:3px">
<thead>
<tr>
<th>IMO number</th>
<td>9492749</td>
</tr>
<tr>
<th>Name of the ship</th>
<td>SHARIEF PILOT</td>
</tr>
<tr>
<th>Type of ship</th>
<td>ANCHOR HANDLING VESSEL</td>
</tr>
<tr>
<th>MMSI</th>
<td>470535000</td>
</tr>
<tr>
<th>Gross tonnage</th>
<td>499 tons</td>
</tr>
<tr>
<th>DWT</th>
<td>222 tons</td>
</tr>
<tr>
<th>Year of build</th>
<td>2008</td>
</tr>
<tr>
<th>Builder</th>
<td>NANYANG SHIPBUILDING - JINGJIANG, CHINA</td>
</tr>
<tr>
<th>Flag</th>
<td>UNITED ARAB EMIRATES</td>
</tr>
<tr>
<th>Home port</th>
<td>SHARJAH</td>
</tr>
<tr>
<th>Manager & owner</th>
<td>GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES</td>
</tr>
<tr>
<th>Former names</th>
<td>SUPERIOR PILOT until 2008 Sep</td>
</tr>
</thead>
</table>
Upvotes: 2
Views: 450
Reputation: 2088
I would do something like this:
html = """
<your table>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
flag = soup.find("th", string="Flag").find_next("td").get_text(strip=True)
home_port = soup.find("th", string="Home port").find_next("td").get_text(strip=True)
print(flag)
print(home_port)
That way I'm making sure I match the text only in th elements and get the contents of the next td.
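One caveat with the one-liners above: if a header is missing, find returns None and the chained call raises AttributeError, and string="Flag" only matches when the th text is exactly that string. Here is a minimal sketch of a more forgiving lookup. The helper name value_for and the shortened table are my own assumptions for illustration; they are not part of BeautifulSoup or the original page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question, just enough to run the example.
# The extra spaces around "Home port" are added here on purpose.
html = """
<table class="ship-data-table">
<tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
<tr><th> Home port </th><td>SHARJAH</td></tr>
</table>
"""


def value_for(soup, header, default=None):
    """Return the text of the <td> following the <th> labelled `header`.

    Hypothetical helper: returns `default` when the header or its
    value cell cannot be found.
    """
    for th in soup.find_all("th"):
        # Compare stripped text so surrounding whitespace does not break the match.
        if th.get_text(strip=True) == header:
            td = th.find_next_sibling("td")
            return td.get_text(strip=True) if td else default
    return default


soup = BeautifulSoup(html, "html.parser")
flag = value_for(soup, "Flag")
home_port = value_for(soup, "Home port")
missing = value_for(soup, "Call sign", default="n/a")  # header not in the table
print(flag)
print(home_port)
print(missing)
```

It scans every th on each call, which should be fine for a table of this size.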
Upvotes: 0
Reputation: 473833
Go over all the th elements in the table, get the text and the following td sibling's text:
from pprint import pprint
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "html.parser")
result = {
    header.get_text(strip=True): header.find_next_sibling("td").get_text(strip=True)
    for header in soup.select("table.ship-data-table tr th")
}
pprint(result)
This would construct a nice dictionary with headers as keys and corresponding td texts as values:
{'Builder': 'NANYANG SHIPBUILDING - JINGJIANG, CHINA',
 'DWT': '222 tons',
 'Flag': 'UNITED ARAB EMIRATES',
 'Former names': 'SUPERIOR PILOT until 2008 Sep',
 'Gross tonnage': '499 tons',
 'Home port': 'SHARJAH',
 'IMO number': '9492749',
 'MMSI': '470535000',
 'Manager & owner': 'GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES',
 'Name of the ship': 'SHARIEF PILOT',
 'Type of ship': 'ANCHOR HANDLING VESSEL',
 'Year of build': '2008'}
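To get from that dictionary to the individual variables asked for in the question, something like the following should work. It is only a sketch: the result dict here is a trimmed stand-in for the one built above, and using str.title() to get "United Arab Emirates" style casing is my assumption about the desired output (it would mangle acronyms, e.g. "USA" becomes "Usa").

```python
# Trimmed stand-in for the dictionary built from the table.
result = {
    "Flag": "UNITED ARAB EMIRATES",
    "Home port": "SHARJAH",
}

# dict.get returns None instead of raising KeyError when a header is absent.
flag = result.get("Flag")
home_port = result.get("Home port")

# The page serves values in upper case; title-case them if present.
flag = flag.title() if flag else None
home_port = home_port.title() if home_port else None

print(flag)
print(home_port)
```

Only the keys you need are read, so the rest of the table can be ignored.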
Upvotes: 2