Reputation: 922
I'm new to Python and Beautiful Soup, but I am working on a web scraper that will grab the data from this website:
http://yiimp.eu/site/tx?address=DFc6oo4CAemHF4KerLG39318E1KciTs742
The webpage is pretty simple, basically just a table, so I'm just trying to grab each field within the table. My issue is that for the first field, I want the date stored in the span's title attribute rather than the value that is displayed. I can grab a list of the span titles, or I can grab the information from the other two fields, but I am unable to grab the span title and the other two fields at the same time. Here's an example of what I'm trying to accomplish:
2018-01-20 03:37:00
3.90135252
8ece3baba44382eec3d62fa76b5beba98ae398f81ad2d77556b95c3c1a739b4f
Instead, the best I'm able to do so far is
{'title': '2018-01-20 03:57:00'}
2h ago
{'title': '2018-01-20 03:57:00'}
3.90135252
{'title': '2018-01-20 03:57:00'}
8ece3baba44382eec3d62fa76b5beba98ae398f81ad2d77556b95c3c1a739b4f
This is close, but unfortunately it duplicates the title time, leaves the title tag in the output, and it actually just repeats that same date and time for every single record. What is the best way to achieve the results I'm looking for?
Here is my code
import requests
import time
from bs4 import BeautifulSoup

theurl = "http://yiimp.eu/site/tx?address=DFc6oo4CAemHF4KerLG39318E1KciTs742"
thepage = requests.get(theurl, headers={'User-Agent':'MyAgent'})
soup = BeautifulSoup(thepage.text, "html.parser")

for table in soup.findAll('td'):
    print(table.text)
    for time in soup.findAll('span'):
        print(time.attrs)
        count = 1
        if count == 1:
            count ==0
            break
Upvotes: 0
Views: 3139
Reputation: 7248
Try this for getting the values from all the rows:
for row in soup.find_all('tr', {'class': 'ssrow', 'style': None}):
    time = row.find('span')['title']
    amount = row.find('td', {'align': 'right'}).find('b').text
    tx = row.find('a').text
    # Print these values however you want.
To check the code on the first row:
row = soup.find('tr', {'class': 'ssrow', 'style': None})
time = row.find('span')['title']
amount = row.find('td', {'align': 'right'}).find('b').text
tx = row.find('a').text
print(time, amount, tx)
Output:
2018-01-20 06:56:43 4.42507599 d142445fd36e6a141a18071110faa8f6f3f9f8a42de888a149d8aa9416fe83ce
Explanation:
Every row is a <tr> tag, but the first <tr> tag is the heading. To filter that out, I've added the attribute 'class': 'ssrow', since all the other rows have that attribute. The last row, however, is the total, and its <tr> tag contains style="border-top: 2px solid #eee;". To filter that out, I've added 'style': None, which matches only tags that have no style attribute.
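To see the two filters working without hitting the live site, here is a minimal sketch that runs the same loop against an inline HTML snippet. The markup is a hypothetical stand-in modeled on the structure described above (a heading row, two data rows, and a styled total row), the values in it are shortened placeholders, and parse_rows is just an illustrative helper name; the real page may differ in its details.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the table described above; not copied from the real page.
html = """
<table>
<tr><th>Time</th><th>Amount</th><th>Tx</th></tr>
<tr class="ssrow">
<td><span title="2018-01-20 03:37:00">2h ago</span></td>
<td align="right"><b>3.90135252</b></td>
<td><a href="#">8ece3bab</a></td>
</tr>
<tr class="ssrow">
<td><span title="2018-01-20 06:56:43">10m ago</span></td>
<td align="right"><b>4.42507599</b></td>
<td><a href="#">d142445f</a></td>
</tr>
<tr class="ssrow" style="border-top: 2px solid #eee;">
<td><b>Total</b></td><td align="right"><b>8.32642851</b></td><td></td>
</tr>
</table>
"""


def parse_rows(markup):
    """Return (timestamp, amount, tx) tuples for the data rows only."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    # 'class': 'ssrow' skips the heading row; 'style': None skips the total row.
    for row in soup.find_all("tr", {"class": "ssrow", "style": None}):
        timestamp = row.find("span")["title"]
        amount = row.find("td", {"align": "right"}).find("b").text
        tx = row.find("a").text
        records.append((timestamp, amount, tx))
    return records


records = parse_rows(html)
for timestamp, amount, tx in records:
    print(timestamp, amount, tx)
```

Collecting the three values per row keeps each timestamp paired with its own amount and transaction id, which is what the nested loops in the question could not do.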
Upvotes: 2