Reputation: 533
I'm trying to extract a table from a webpage. I have managed to get all the data in the table into a list. However all the table data is being put into one list element. I need assistance getting the 'clean' data (i.e. the strings, without all the HTML packaging) from the rows of the table into their own list elements.
So instead of...
list = [<tr>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"/></a>
</th>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
</th>
<td>58
</td>
<td>12
</td>
<td>32]
I would like...
list = ['href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"',
'href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS',
'58',
'12',
'32']
My code and list can be replicated using the following.
#Import Modules
import re
import requests
from bs4 import BeautifulSoup
#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
cartridge_page = requests.get(cartridge_url)
cartridge_soup = BeautifulSoup(cartridge_page.content, 'html.parser')
#This gets the rows of the table I want
list = cartridge_soup.find_all(lambda t: t.name =='tr')
#This gets rid of an element which is not useful
list = [n for n in list if 'class="va-navbox' not in str(n)]
#I had hoped this might assemble a list..
list = [str(n) for n in list]
I'm learning Python and I think I grasp HTML, but I cannot get Python to interact with my bs4.element.ResultSet. I know this is not a sophisticated solution, but I have hit a brick wall after trying a number of different approaches. My 'true' end goal is a list like the following...
list = ['7.62x25mm_TT_AKBS',
'58',
'12',
'32']
Attempts to Implement Suggested Solutions:
---> As suggested by AzyCrw4282
That's an incredible username btw.
(i)
I [think I] can see roughly what I'm supposed to do but I'm failing to properly implement it.
Using...
cartridge_table = cartridge_soup.find_all('table')
I get what looks to be all the right data in HTML format stored inside cartridge_table. However, running...
for row in cartridge_table.find_all("tr")[:1]:
print([cell.get_text(strip=True) for cell in row.find_all("td")])
...returns...
ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
... and replacing find_all with find doesn't remedy the issue.
(ii)
I half-heartedly ran...
for row in cartridge_soup.find_all("tr")[:1]:
print([cell.get_text(strip=True) for cell in row.find_all("td")])
...but this returns an empty list.
(iii)
The question you originally linked to defines a variable called header prior to filling the table variable with the necessary data...
header = soup.find("b", text="Payable")
table = header.find_parent("table")
I'm not grasping what to replace "Payable" with to get this to work for me.
(iv)
I tried to negate the above problem in (iii) by giving this a stab...
cartridge_table = cartridge_soup.find_parent("table")
for row in cartridge_soup.find_all("tr")[:1]:
print([cell.get_text(strip=True) for cell in row.find_all("td")])
But it returns an empty list. When I checked, I found that this is because nothing gets stored in the cartridge_table variable.
(v)
I tried running...
header = cartridge_soup.find("b", text="Payable")
... and replacing "Payable" with a variety of seemingly sensible alternatives to see what would happen, but I got nowhere. Ultimately the header variable always seemed to remain empty.
Examples: "Icon", "Name", "Fragmentation Chance", "wikitable sortable", "7.62x25mm TT LRN", "7.62x25mm_TT_AKBS".
Upvotes: 2
Views: 432
Reputation: 7744
I have played around with the problem, and there seems to be something wrong with the table on the page, or at least that's what I think. Extracting the table should yield n elements, one per row, but for some reason it gives all of the rows as a single element in the array. I did look into this but didn't get far (and I am also short of time).
Given that you are only interested in the cells of the first row, you could target those elements directly with an XPath expression, which would locate the elements and yield the values you require. However, BeautifulSoup does not support XPath.
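Although XPath is not available, BeautifulSoup does accept CSS selectors through select(), which can fill a similar role. The snippet below is only a sketch: the HTML is a trimmed, hypothetical stand-in for the page's table (the "wikitable sortable" class and the "Icon" and "Name" headers are taken from the question, the "Value" header is made up), so the selectors may need adjusting against the live page.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the page's table markup (not the real page).
html = """
<table class="wikitable sortable">
<tr><th>Icon</th><th>Name</th><th>Value</th></tr>
<tr><th></th><th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a></th><td>58</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Links inside header cells of rows belonging to a table with class "wikitable".
names = [a.get_text(strip=True) for a in soup.select("table.wikitable tr th a")]

# Data cells of rows belonging to the same table.
values = [td.get_text(strip=True) for td in soup.select("table.wikitable tr td")]

print(names, values)
```

Selecting by class this way avoids depending on the position of the table or row within the page.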
To solve this problem, I ended up using a hardcoded approach to select the required elements in the array. This targets the first extraction of the name column, followed by the other columns.
Code
import re
import requests
from bs4 import BeautifulSoup
import urllib.request
#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
page = urllib.request.urlopen(cartridge_url)
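#Name the parser explicitly so the result does not depend on which parsers are installed
cartridge_soup = BeautifulSoup(page.read(), 'html.parser')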
tables = cartridge_soup.findChildren('table')
my_table = tables[0]
cartridge_table = my_table.findChildren(['table','th', 'tr'])
dataArray = []
dataArray.append(str(cartridge_table[13]).split('</a>')[0][45:62].replace(" ","_"))
splitChar = str(cartridge_table[13]).split("</td>")
for data in splitChar[:3]:
dataArray.append(data[-3:-1])
print(dataArray)
Gives
['7.62x25mm_TT_AKBS', '58', '12', '32']
Let me know if it solves your problem or if it needs adapting for other use cases.
Upvotes: 1
Reputation: 7744
In my understanding there's no way to parse this
list = [<tr>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"/></a>
</th>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
</th>
<td>58
</td>
<td>12
</td>
<td>32]
to your intended format. There may be a way to do it using regex, but that overly complicates it.
The problem can, however, be solved using other methods. Also, it is not possible to debug your code locally since you haven't defined what dirty_temp_type is [This was an error in variable naming which has since been corrected]. In addition (as already mentioned in the comment), do not use list as a variable name, since that shadows the built-in list type and can cause errors.
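As a small illustration of the naming point, here is a sketch of what can go wrong. The function names shadowed_call and renamed_call are made up for this example, and the shadowing is kept inside a function so it does not leak into the rest of a script.

```python
def shadowed_call():
    # Inside this function, "list" now names a plain list object, not the built-in type.
    list = ["<tr>...</tr>"]
    try:
        return list("abc")  # tries to call the list object, which is not callable
    except TypeError as exc:
        return str(exc)


def renamed_call():
    # A non-clashing name leaves the built-in list() usable.
    rows = ["<tr>...</tr>"]
    return rows, list("abc")


shadow_message = shadowed_call()
rows, letters = renamed_call()
print(shadow_message)  # 'list' object is not callable
print(letters)         # ['a', 'b', 'c']
```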
There's a perfect answer here - Python+BeautifulSoup: scraping a particular table from a webpage - and this shows you exactly what you need to do.
Code snippet from the mentioned link
for row in table.find_all("tr")[:1]:
print([cell.get_text(strip=True) for cell in row.find_all("td")])
Using this would fetch the data from the cells of the first row and will give you the output of your end-goal in a list.
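One caveat when adapting that snippet to the markup in the question: the name sits in a th cell rather than a td, and find_all("table") returns a ResultSet, which has no find_all() of its own. The following is a minimal sketch under those assumptions. It runs against a single row copied from the question (image attributes shortened and wrapped in a bare table for illustration) rather than the live page, so the real table may need extra filtering.

```python
from bs4 import BeautifulSoup

# One row copied from the question (image attributes shortened), wrapped in a table.
html = """
<table>
<tr>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" width="64"/></a>
</th>
<th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
</th>
<td>58
</td>
<td>12
</td>
<td>32
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag; find_all() would return a ResultSet instead.
table = soup.find("table")

rows = []
for row in table.find_all("tr"):
    # Collect both header (th) and data (td) cells, since the name lives in a th.
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    # Drop cells with no text, such as the icon cell that only holds an image.
    cells = [text for text in cells if text]
    if cells:
        # Match the underscore form of the name used in the question's end goal.
        cells[0] = cells[0].replace(" ", "_")
        rows.append(cells)

print(rows)  # [['7.62x25mm_TT_AKBS', '58', '12', '32']]
```

Each inner list corresponds to one table row, so no hardcoded indices or string slicing are needed.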
Upvotes: 1