Solebay Sharp

Reputation: 533

Web-scraping a table to a list

I'm trying to extract a table from a webpage. I have managed to get all the data in the table into a list. However, all the table data is being put into one list element. I need assistance getting the 'clean' data (i.e. the strings, without all the HTML packaging) from the rows of the table into their own list elements.

So instead of...

list  = [<tr>
         <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"/></a>
         </th>
         <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
         </th>
         <td>58
         </td>
         <td>12
         </td>
         <td>32]

I would like...

list  = ['href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"',
         'href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS',
         '58',
         '12',
         '32']

My code and list can be replicated using the following.

#Import Modules
import re
import requests
from bs4 import BeautifulSoup

#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
cartridge_page = requests.get(cartridge_url)
cartridge_soup = BeautifulSoup(cartridge_page.content, 'html.parser')

#This gets the rows of the table I want
list = cartridge_soup.find_all('tr')

#This gets rid of an element which is not useful
list = [n for n in list if 'class="va-navbox' not in str(n)]

#I had hoped this might assemble a list..
list = [str(n) for n in list]

I'm learning Python and I think I grasp HTML, but I cannot get Python to interact with my bs4.element.ResultSet. I know this is not a sophisticated solution, but I have hit a brick wall after trying a number of different approaches. My 'true' end goal is a list like the following...

list  = ['7.62x25mm_TT_AKBS',
         '58',
         '12',
         '32']

Attempts to Implement Suggested Solutions:

---> As suggested by AzyCrw4282

That's an incredible username btw.

(i)

I [think I] can see roughly what I'm supposed to do but I'm failing to properly implement it.

Using...

cartridge_table = cartridge_soup.find_all('table')

I get what looks to be all the right data in HTML format stored inside cartridge_table. However, running...

for row in cartridge_table.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

...returns...

ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

... and replacing find_all with find doesn't remedy the issue.
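For reference, the message can be reproduced on a tiny inline document. This is only a sketch: the HTML below is made up for illustration and is not the real page. find_all() always returns a list-like ResultSet, even for a single match, so the find_all that needs changing is the one that creates cartridge_table, not the one inside the loop.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (an assumption for illustration).
html = """
<table>
  <tr><th>Name</th><th>Stat A</th></tr>
  <tr><th>7.62x25mm TT AKBS</th><td>58</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")   # ResultSet (list-like), even with one match
first_table = tables[0]           # index into it to get a single Tag
same_table = soup.find("table")   # or let find() return the first match

# A single Tag does have find_all(); the ResultSet does not.
rows = first_table.find_all("tr")
print(len(tables), len(rows))
```

Either tables[0] or find("table") gives an object that the loop above can call find_all("tr") on.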

(ii)

I half-heartedly ran...

for row in cartridge_soup.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

...but this returns an empty list.
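One likely explanation, shown here on made-up HTML rather than the real page: the [:1] slice keeps only the first row, and if that row is a header row it contains th cells but no td cells. The column names "Stat A" and "Stat B" are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML (an assumption): a header row of <th> cells, then a data row.
html = """
<table>
  <tr><th>Icon</th><th>Name</th><th>Stat A</th></tr>
  <tr><th><a href="/7.62x25mm_TT_AKBS">7.62x25mm TT AKBS</a></th><td>58</td><td>12</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
first_row = soup.find_all("tr")[0]

# Searching only for <td> finds nothing in a header row.
only_td = [c.get_text(strip=True) for c in first_row.find_all("td")]

# Passing a list of tag names matches either kind of cell.
th_and_td = [c.get_text(strip=True) for c in first_row.find_all(["th", "td"])]

print(only_td)
print(th_and_td)
```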

(iii)

The question you originally linked to defines a variable called header prior to filling the table variable with the necessary data...

header = soup.find("b", text="Payable")
table = header.find_parent("table")

I'm not grasping what to replace "Payable" with to get this to work for me.

(iv)

I tried to negate the above problem in (iii) by giving this a stab...

cartridge_table = cartridge_soup.find_parent("table")

for row in cartridge_soup.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

But it returns an empty list. When I checked it's because nothing gets stored under the cartridge_table variable.

(v)

I tried running...

header = cartridge_soup.find("b", text="Payable")

... and replacing "Payable" with a variety of seemingly sensible alternatives to see what would happen, but I got nowhere. Ultimately the header variable always seemed to remain empty.

Examples: "Icon", "Name", "Fragmentation Chance", "wikitable sortable", "7.62x25mm TT LRN", "7.62x25mm_TT_AKBS".

Upvotes: 2

Views: 432

Answers (2)

AzyCrw4282

Reputation: 7744

I have played around with the problem, and there seems to be something wrong with the table on the page, or at least that's what I think. Extracting the table should yield one element per row, but for some reason all of the rows come back as a single element in the array. I looked into this but didn't get far (I am also short of time).

Given that you are only interested in the cells of the first row, you could easily target those elements with XPath, which would locate the elements and yield the values you require. XPath, however, doesn't work with BeautifulSoup.
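As a rough illustration of the XPath idea, here is a sketch using the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup. The snippet is made up for illustration; a real page would more typically be handled with a full XPath library such as lxml, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet (an assumption), not the real page source.
snippet = (
    "<table>"
    "<tr><th>Icon</th><th>Name</th><th>Stat A</th></tr>"
    '<tr><th><a href="/7.62x25mm_TT_AKBS">7.62x25mm TT AKBS</a></th>'
    "<td>58</td><td>12</td><td>32</td></tr>"
    "</table>"
)
root = ET.fromstring(snippet)

# tr[2] is the second <tr> child: the first data row after the header.
link = root.find("./tr[2]/th/a")
name = link.get("href").lstrip("/")
values = [td.text.strip() for td in root.findall("./tr[2]/td")]

result = [name] + values
print(result)
```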

To solve this problem, I ended up using a hardcoded approach to select the required elements in the array. This targets the first extraction of the name column, followed by the other columns.

Code

import re
import requests
from bs4 import BeautifulSoup
import urllib.request

#Get page
cartridge_url = 'https://escapefromtarkov.gamepedia.com/7.62x25mm_Tokarev'
page = urllib.request.urlopen(cartridge_url)
cartridge_soup = BeautifulSoup(page.read())
tables = cartridge_soup.findChildren('table')
my_table = tables[0]

cartridge_table = my_table.findChildren(['table','th', 'tr'])
dataArray = []
dataArray.append(str(cartridge_table[13]).split('</a>')[0][45:62].replace(" ","_"))
splitChar = str(cartridge_table[13]).split("</td>")

for data in splitChar[:3]:
    dataArray.append(data[-3:-1])

print(dataArray)

Gives

['7.62x25mm_TT_AKBS', '58', '12', '32']

Let me know if it solves your problem or if it needs adapting for other use cases.

Upvotes: 1

AzyCrw4282

Reputation: 7744

In my understanding there's no way to parse this

list  = [<tr>
         <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png" decoding="async" height="64" src="https://static.wikia.nocookie.net/escapefromtarkov_gamepedia/images/6/61/TTAKBS.png/revision/latest/scale-to-width-down/64?cb=20190519001904" width="64"/></a>
         </th>
         <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a>
         </th>
         <td>58
         </td>
         <td>12
         </td>
         <td>32]

to your intended format. There may be a way to do it using regex, but that would overcomplicate it.

The problem can, though, be solved using other methods. Also, it is not possible to debug your code locally, since you haven't defined what dirty_temp_type is [This was an error in variable naming which has since been corrected]. In addition (as already mentioned in the comment), do not use list as a variable name: it shadows the built-in list type and can cause errors later in the same scope.
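A small sketch of the kind of error that reusing the name list can cause. The function name shadowing_demo and the sample values are made up for illustration.

```python
rows = ("58", "12", "32")

def shadowing_demo():
    # Inside this function the name "list" now refers to a list object,
    # hiding the built-in list type.
    list = ["a", "b"]
    try:
        return list(rows)  # tries to call the list object, which fails
    except TypeError as exc:
        return str(exc)

message = shadowing_demo()

# Outside the function the built-in is untouched and works as usual.
values = list(rows)

print(message)
print(values)
```

Using a different name, such as rows or cartridge_rows, avoids the problem entirely.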

There's a perfect answer here - Python+BeautifulSoup: scraping a particular table from a webpage - and this shows you exactly what you need to do.

Code snippet from the mentioned link

for row in table.find_all("tr")[:1]:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

Using this would fetch the data from the cells of the first row and give you your end-goal output as a list.
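To make that concrete, here is a minimal sketch on made-up HTML modelled on the row shown in the question. It is not tested against the live page, and the column names "Stat A" to "Stat C" are placeholders. It assumes each data row holds a link whose href carries the name, followed by td cells with the values.

```python
from bs4 import BeautifulSoup

# Made-up HTML modelled on the question's row (an assumption, not the real page).
html = """
<table class="wikitable sortable">
  <tr><th>Icon</th><th>Name</th><th>Stat A</th><th>Stat B</th><th>Stat C</th></tr>
  <tr>
    <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS"><img alt="TTAKBS.png"/></a></th>
    <th><a href="/7.62x25mm_TT_AKBS" title="7.62x25mm TT AKBS">7.62x25mm TT AKBS</a></th>
    <td>58
    </td>
    <td>12
    </td>
    <td>32
    </td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # a single Tag, not a ResultSet

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    link = row.find("a", href=True)
    if not cells or link is None:
        continue  # skip rows without data cells, such as the header row
    name = link["href"].lstrip("/")  # "/7.62x25mm_TT_AKBS" -> "7.62x25mm_TT_AKBS"
    records.append([name] + [c.get_text(strip=True) for c in cells])

print(records)
```

Each entry in records is one row in the end-goal format; records[0] is the list for the first data row.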

Upvotes: 1
