Reputation: 31
I'm trying to scrape an HTML table with bs4, but my code is not working. I'd like to get the row data from the td elements so that I can write it to a CSV file. This is my HTML:
<table class="sc-jAaTju bVEWLO">
<thead>
<tr>
<td width="10%">Rank</td>
<td>Trending Topic</td>
<td width="30%">Tweet Volume</td>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td><a href="http:///example.com/search?q=%23One" target="_blank" without="true" rel="noopener noreferrer">#One</a></td>
<td>1006.4K tweets</td>
</tr>
<tr>
<td>2</td>
<td><a href="http:///example.com/search?q=%23Two" target="_blank" without="true" rel="noopener noreferrer">#Two</a></td>
<td>1028.7K tweets</td>
</tr>
<tr>
<td>3</td>
<td><a href="http:///example.com/search?q=%23Three" target="_blank" without="true" rel="noopener noreferrer">#Three</a></td>
<td>Less than 10K tweets</td>
</tr>
</tbody>
</table>
This is my first try:
url = requests.get(f"https://www.exportdata.io/trends/italy/2020-01-01/0")
soup = BeautifulSoup(url.text, "html.parser")
table = soup.find_all("table", attrs={"class":"sc-jAaTju bVEWLO"})
And my second one:
tables = soup.find_all('table')
for table in tables:
    td = tables.td.text.strip()
But neither is working. What am I missing? Thank you
Upvotes: 0
Views: 168
Reputation: 1662
The page loads its data dynamically, so you need to find the underlying API request and substitute the date and time into it:
import requests
import pandas as pd

# the date and hour go straight into the query string
url = "https://api.exportdata.io/trends/locations/it?date=2020-01-01&hour=0"
response = requests.get(url)

# trends without a volume come back as null, so label them
df = pd.DataFrame(response.json()).fillna('Less than 10K tweets')
print(df.to_string(columns=['name', 'tweet_volume']))
OUTPUT:
                name          tweet_volume
0     #lannocheverra  Less than 10K tweets
1     Happy New Year             4948992.0
2          Buon 2020               18359.0
3        #Mattarella               19304.0
4        #skamfrance  Less than 10K tweets
5       Mariah Carey               36853.0
6    #GliAristogatti  Less than 10K tweets
7      Orietta Berti  Less than 10K tweets
8     Gigi D'Alessio  Less than 10K tweets
9         Auguriiiii  Less than 10K tweets
10          #NewYear              163253.0
11      Welcome 2020              101403.0
12      Romina Power  Less than 10K tweets
13     Auguri Matteo  Less than 10K tweets
14           Al Bano  Less than 10K tweets
15     fabrizio moro  Less than 10K tweets
16         Panicucci  Less than 10K tweets
17       John Boyega               78097.0
18            Inizio  Less than 10K tweets
19     Auguri Silvia  Less than 10K tweets
20      Auguri Marco  Less than 10K tweets
21     #Ghostbusters  Less than 10K tweets
22 #thebluesbrothers  Less than 10K tweets
23  #FeliceAnnoNuovo  Less than 10K tweets
24 #bottidicapodanno  Less than 10K tweets
25       #ventiventi  Less than 10K tweets
26        #quirinale  Less than 10K tweets
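Since the end goal in the question is a CSV file, the same DataFrame can be written out with to_csv. A minimal sketch, with hypothetical rows standing in for response.json() and an arbitrary filename trends.csv:

```python
import pandas as pd

# Hypothetical rows standing in for response.json(); the real API returns
# objects with at least "name" and "tweet_volume" fields.
rows = [
    {"name": "#One", "tweet_volume": 1006400},
    {"name": "#Three", "tweet_volume": None},
]

# missing volumes come through as NaN, so label them before writing
df = pd.DataFrame(rows).fillna("Less than 10K tweets")
df.to_csv("trends.csv", index=False, columns=["name", "tweet_volume"])
```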
Upvotes: 1
Reputation: 1
If you want to keep using BS, something like this should help. Note that find_all("td") returns a list of elements, so you have to loop over it rather than call get_text on the list itself:

url = requests.get("https://www.exportdata.io/trends/italy/2020-01-01/0")
soup = BeautifulSoup(url.text, "html.parser")
tables = soup.find_all("table", class_="sc-jAaTju bVEWLO")
for table in tables:
    for td in table.find_all("td"):
        print(td.get_text(strip=True))
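To get from the extracted cells to the CSV the question asks for, the cells can be regrouped per tr and written with the stdlib csv module. A sketch against a trimmed copy of the static HTML from the question (on the live page the table only exists after the JavaScript has run, and trends.csv is an arbitrary filename):

```python
import csv
from bs4 import BeautifulSoup

# trimmed copy of the table from the question
html = """
<table class="sc-jAaTju bVEWLO">
  <thead>
    <tr><td>Rank</td><td>Trending Topic</td><td>Tweet Volume</td></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td><a href="#">#One</a></td><td>1006.4K tweets</td></tr>
    <tr><td>2</td><td><a href="#">#Two</a></td><td>1028.7K tweets</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="sc-jAaTju bVEWLO")

with open("trends.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for tr in table.find_all("tr"):  # the first <tr> is the header row
        writer.writerow(td.get_text(strip=True) for td in tr.find_all("td"))
```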
Upvotes: 0
Reputation: 10460
That page loads its data dynamically. You can get the table in question using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://www.exportdata.io/trends/italy/2020-01-01/0'
browser.get(url)
table = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "bVEWLO")))
dfs = pd.read_html(browser.page_source)  # page_source is already a string
browser.quit()
print(dfs[0])
Result:
Rank Trending Topic Tweet Volume
0 1.0 #lannocheverra Less than 10K tweets
1 2.0 Happy New Year 4949K tweets
2 3.0 Buon 2020 18.4K tweets
3 4.0 #Mattarella 19.3K tweets
4 5.0 #skamfrance Less than 10K tweets
[...]
Upvotes: 1
Reputation: 129
I think pandas can help you here. You can pass the rendered HTML to its built-in read_html function and it will do the job. Note that it expects a string (or file-like object), not a BeautifulSoup object, so convert the soup first:
dfs = pd.read_html(str(soup))
table = dfs[0]
I have verified that it works. :)
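For reference, read_html can be tried directly on the markup from the question (trimmed below). It needs a parser backend such as lxml or html5lib installed, recent pandas versions want literal HTML wrapped in StringIO, and since the live page builds the table with JavaScript, whatever string you pass in must already contain the rendered table:

```python
from io import StringIO
import pandas as pd

# trimmed copy of the table from the question
html = """
<table class="sc-jAaTju bVEWLO">
  <thead>
    <tr><td>Rank</td><td>Trending Topic</td><td>Tweet Volume</td></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td><a href="#">#One</a></td><td>1006.4K tweets</td></tr>
    <tr><td>2</td><td><a href="#">#Two</a></td><td>1028.7K tweets</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found;
# the <thead> row becomes the column header
dfs = pd.read_html(StringIO(html))
print(dfs[0])
```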
Upvotes: 1