Web scraping a table to a list

Question

I'm trying to extract a table from a webpage. I have managed to get all the data in the table into a list. However, all the table data is being put into one list element. I need help getting the 'clean' data (i.e. the strings, without all the HTML packaging) from the rows of the table into their own list elements.

So instead of...

list  = [
         
         
         7.62x25mm TT AKBS
         
         58
         
         12
         
         32]

I would like...

list  = ['href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS',
         '58',
         '12',
         '32']

My code and list can be replicated using the following.

#Import Modules
import re
import requests
from bs4 import BeautifulSoup

#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
cartridge_page = requests.get(cartridge_url)
cartridge_soup = BeautifulSoup(cartridge_page.content, 'html.parser')

#This gets the rows of the table I want
list = cartridge_soup.find_all(lambda t: t.name =='tr')

#This gets rid of an element which is not useful
list = [n for n in list if 'class="va-navbox' not in str(n)]

#I had hoped this might assemble a list..  
list = [str(n) for n in list]

I'm learning Python and I think I grasp HTML, but I cannot get Python to work with my bs4.element.ResultSet. I know this is not a sophisticated solution, but I have hit a brick wall after trying a number of different approaches. My 'true' end goal is a list like the following...

list  = ['7.62x25mm_TT_AKBS',
         '58',
         '12',
         '32']

Attempts to Implement Suggested Solutions:

---> As suggested by AzyCrw4282

That's an incredible username btw.

(i)

I [think I] can see roughly what I'm supposed to do but I'm failing to properly implement it.

Using...

cartridge_table = cartridge_soup.find_all('table')

I get what looks to be all the right data in HTML format stored inside cartridge_table. However, running...

for row in cartridge_table.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

...returns...

ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

... and replacing find_all with find doesn't remedy the issue.
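The error message points at the type of cartridge_table: find_all('table') returns a ResultSet (a list of tables), so the search for rows has to run on each table in turn, or the swap to find has to happen on the line that creates cartridge_table. A minimal sketch, assuming a made-up two-table page in place of the real wiki HTML (SAMPLE_HTML and rows_per_table are illustrative names only):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: two small tables (not the real wiki markup).
SAMPLE_HTML = """
<table><tr><td>a</td><td>b</td></tr></table>
<table><tr><td>c</td><td>d</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tables = soup.find_all("table")   # ResultSet: behaves like a list of Tag objects
first_table = soup.find("table")  # a single Tag (or None if nothing matches)

# Search inside one table at a time rather than on the ResultSet itself.
rows_per_table = [
    [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in table.find_all("tr")
    ]
    for table in tables
]
print(rows_per_table)
```

On the real page the same pattern would be cartridge_soup.find('table') followed by .find_all('tr'), assuming the first table is the one wanted.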

(ii)

I half-heartedly ran...

for row in cartridge_soup.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

...but this returns an empty list.
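One likely explanation, not verified against the live page: the [:1] slice keeps only the first row, and a header row is normally built from th cells, so a search for td finds nothing there. A small sketch on a made-up table (the markup and the "Damage" column name are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up table: one header row of <th> cells, one data row of <td> cells.
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>Damage</th></tr>
<tr><td>7.62x25mm TT AKBS</td><td>58</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr")

# The first row has no <td> cells, so this list comes back empty.
header_only_td = [c.get_text(strip=True) for c in rows[0].find_all("td")]

# Asking for both cell types picks up the header text.
header_cells = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]

# Keep only rows that contain at least one <td>, i.e. skip pure header rows.
data_rows = [
    [c.get_text(strip=True) for c in row.find_all("td")]
    for row in rows
    if row.find("td") is not None
]

print(header_only_td, header_cells, data_rows)
```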

(iii)

The question you originally linked to defines a variable called header prior to filling the table variable with the necessary data...

header = soup.find("b", text="Payable")
table = header.find_parent("table")

I'm not grasping what to replace "Payable" with to get this to work for me.

(iv)

I tried to negate the above problem in (iii) by giving this a stab...

cartridge_table = cartridge_soup.find_parent("table")

for row in cartridge_soup.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

But it returns an empty list. When I checked it's because nothing gets stored under the cartridge_table variable.
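The empty cartridge_table is expected behaviour: find_parent walks upwards from an element, and the soup object is the top of the tree, so it has no parent to return. A sketch on a made-up one-cell table, showing the call from the root and from a cell inside the table:

```python
from bs4 import BeautifulSoup

# Made-up markup used only to show the direction in which find_parent searches.
SAMPLE_HTML = '<table id="ammo"><tr><td>58</td></tr></table>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The soup object is the root of the tree, so nothing encloses it.
from_root = soup.find_parent("table")

# Starting from a cell, find_parent climbs up to the enclosing table.
cell = soup.find("td")
from_cell = cell.find_parent("table")

print(from_root)
print(from_cell["id"])
```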

(v)

I tried running...

header = cartridge_soup.find("b", text="Payable")

... and replacing "Payable" with a variety of seemingly sensible alternatives to see what would happen, but I got nowhere. Ultimately the header variable always seemed to remain empty.

Examples: "Icon", "Name", "Fragmentation Chance", "wikitable sortable", "7.62x25mm TT LRN", "7.62x25mm_TT_AKBS".
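A possible reason header stays empty: find("b", ...) only looks at b (bold) tags, and the linked question's page happened to have its label inside one. If this table keeps its labels in th cells, the tag name has to change as well, and surrounding whitespace in the cell can defeat an exact string match. A hedged sketch on made-up markup (the real page structure is an assumption here), which also shows selecting the table by one of its classes:

```python
import re

from bs4 import BeautifulSoup

# Made-up markup: header text sits in <th> cells and carries a trailing newline.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Name
</th><th>Damage
</th></tr>
<tr><td>7.62x25mm TT AKBS
</td><td>58
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No <b> tag exists in this markup, so there is nothing to find.
no_bold = soup.find("b", string="Name")

# Search <th> cells instead, with a pattern that tolerates surrounding whitespace.
header = soup.find("th", string=re.compile(r"^\s*Name\s*$"))
table = header.find_parent("table")

# Alternative: pick the table directly by one of its CSS classes.
by_class = soup.find("table", class_="wikitable")

print(no_bold, table["class"], by_class["class"])
```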

AzyCrw4282 · Accepted Answer

I played around with this, but there seems to be something wrong with the table on the page, or at least that's what I think. Extracting the table should yield n elements, one per row, but for some reason all of the rows come back as a single element in the array. I looked into it but didn't get far (and I am also short of time).

Given that you are only interested in the cells of the first row, you could easily target those elements with XPath, which would locate them and yield the values you require. XPath, however, doesn't work with BeautifulSoup.
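For illustration only, this is roughly what the XPath route could look like with lxml, a different library from BeautifulSoup. The markup below is a made-up stand-in for the page, and the column names other than "Name" are assumptions:

```python
from lxml import html

# Made-up stand-in for the page; the real table layout is an assumption.
SAMPLE_HTML = """
<html><body>
<table class="wikitable sortable">
<tr><th>Name</th><th>Damage</th><th>Penetration</th><th>Armor damage</th></tr>
<tr>
<td><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a></td>
<td>58</td><td>12</td><td>32</td>
</tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# First row in document order that has at least one <td> child.
first_data_row = tree.xpath("(//table//tr[td])[1]")[0]

# Text of each cell in that row, with surrounding whitespace removed.
values = [cell.text_content().strip() for cell in first_data_row.xpath("./td")]
print(values)
```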

To solve this problem, I ended up using a hardcoded approach to select the required elements in the array. This targets the first extraction of the name column, followed by the other columns.

Code

import re
import requests
from bs4 import BeautifulSoup
import urllib.request

#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
page = urllib.request.urlopen(cartridge_url)
cartridge_soup = BeautifulSoup(page.read())
tables = cartridge_soup.findChildren('table')
my_table = tables[0]

cartridge_table = my_table.findChildren(['table','th', 'tr'])
dataArray = []
dataArray.append(str(cartridge_table[13]).split('')[0][45:62].replace(" ","_"))
splitChar = str(cartridge_table[13]).split("")

for data in splitChar[:3]:
    dataArray.append(data[-3:-1])

print(dataArray)

Gives

['7.62x25mm_TT_AKBS', '58', '12', '32']
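If the hardcoded indices turn out to be brittle, a less position-dependent variant is sketched below. It is not the code that produced the output above: it runs on a made-up copy of one table row, assumes the name cell holds a link whose href carries the underscored name, and uses a hypothetical helper called row_to_list.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the wiki table; column names besides "Name" are assumptions.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Damage</th><th>Penetration</th><th>Armor damage</th></tr>
<tr>
<td><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a></td>
<td>58</td><td>12</td><td>32</td>
</tr>
</table>
"""


def row_to_list(row):
    """Hypothetical helper: turn one <tr> into a list of clean strings."""
    values = []
    for cell in row.find_all("td"):
        link = cell.find("a", href=True)
        if link is not None:
            # Use the link target without its leading slash as the name.
            values.append(link["href"].lstrip("/"))
        else:
            values.append(cell.get_text(strip=True))
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")

# Skip rows without <td> cells, such as the header row.
rows = [row_to_list(r) for r in table.find_all("tr") if r.find("td") is not None]
print(rows)
```

Each data row becomes its own list, so rows[0] holds the values for the first cartridge.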

Let me know if it solves your problem or if it needs adapting for other use cases.
