Kome Gognome

Reputation: 31

Scrape html table row with Beautiful Soup

I'm trying to scrape an HTML table with bs4, but my code isn't working. I'd like to get the td row data so that I can write it to a CSV file. This is my HTML:

<table class="sc-jAaTju bVEWLO">
    <thead>
        <tr>
            <td width="10%">Rank</td>
            <td>Trending Topic</td>
            <td width="30%">Tweet Volume</td>
        </tr>
        </thead>
        <tbody>
        <tr>
            <td>1</td>
            <td><a href="http://example.com/search?q=%23One" target="_blank" without="true" rel="noopener noreferrer">#One</a></td>
            <td>1006.4K tweets</td>
        </tr>
        <tr>
            <td>2</td>
            <td><a href="http://example.com/search?q=%23Two" target="_blank" without="true" rel="noopener noreferrer">#Two</a></td>
            <td>1028.7K tweets</td>
        </tr>
        <tr>
            <td>3</td>
            <td><a href="http://example.com/search?q=%23Three" target="_blank" without="true" rel="noopener noreferrer">#Three</a></td>
            <td>Less than 10K tweets</td>
        </tr>
    </tbody>
</table>

This is my first try:

url = requests.get(f"https://www.exportdata.io/trends/italy/2020-01-01/0")
soup = BeautifulSoup(url.text, "html.parser")

table = soup.find_all("table", attrs={"class":"sc-jAaTju bVEWLO"})

And my second one:

tables = soup.find_all('table') 


for table in tables:
    td = tables.td.text.strip()

But neither is working. What am I missing? Thank you

Upvotes: 0

Views: 168

Answers (4)

Sergey K

Reputation: 1662

The page loads its data dynamically, so you need to find the underlying API request and substitute the date and hour into its URL:

import requests
import pandas as pd


url = "https://api.exportdata.io/trends/locations/it?date=2020-01-01&hour=0"
response = requests.get(url)
df = pd.DataFrame(response.json()).fillna('Less than 10K tweets')
print(df.to_string(columns=['name', 'tweet_volume']))

OUTPUT:

                 name          tweet_volume
0      #lannocheverra  Less than 10K tweets
1      Happy New Year             4948992.0
2           Buon 2020               18359.0
3         #Mattarella               19304.0
4         #skamfrance  Less than 10K tweets
5        Mariah Carey               36853.0
6     #GliAristogatti  Less than 10K tweets
7       Orietta Berti  Less than 10K tweets
8      Gigi D'Alessio  Less than 10K tweets
9          Auguriiiii  Less than 10K tweets
10           #NewYear              163253.0
11       Welcome 2020              101403.0
12       Romina Power  Less than 10K tweets
13      Auguri Matteo  Less than 10K tweets
14            Al Bano  Less than 10K tweets
15      fabrizio moro  Less than 10K tweets
16          Panicucci  Less than 10K tweets
17        John Boyega               78097.0
18             Inizio  Less than 10K tweets
19      Auguri Silvia  Less than 10K tweets
20       Auguri Marco  Less than 10K tweets
21      #Ghostbusters  Less than 10K tweets
22  #thebluesbrothers  Less than 10K tweets
23   #FeliceAnnoNuovo  Less than 10K tweets
24  #bottidicapodanno  Less than 10K tweets
25        #ventiventi  Less than 10K tweets
26         #quirinale  Less than 10K tweets

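Since the end goal was a CSV file, the resulting DataFrame can be written out directly with to_csv. A minimal sketch, using hypothetical sample rows that mirror the shape of the API's JSON response:

```python
import pandas as pd

# Hypothetical rows mirroring the shape of the API's JSON response
rows = [
    {"name": "#One", "tweet_volume": 1006400},
    {"name": "#Two", "tweet_volume": None},
]

# Same cleanup as above: missing volumes become a placeholder string
df = pd.DataFrame(rows).fillna("Less than 10K tweets")

# Write just the two columns of interest to a CSV file
df.to_csv("trends.csv", columns=["name", "tweet_volume"], index=False)
```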
Upvotes: 1

Jaybe999

Reputation: 1

If you want to keep using BeautifulSoup, something like this should help:

import requests
from bs4 import BeautifulSoup

url = requests.get("https://www.exportdata.io/trends/italy/2020-01-01/0")
soup = BeautifulSoup(url.text, "html.parser")

tables = soup.find_all("table", class_="sc-jAaTju bVEWLO")

for table in tables:
    for td in table.find_all("td"):
        print(td.get_text(strip=True))
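For reference, this kind of row-by-row extraction does work against the static HTML sample posted in the question. A sketch (the live page still needs the API or Selenium approaches from the other answers, since its table is rendered dynamically):

```python
from bs4 import BeautifulSoup

# The static table markup posted in the question (abbreviated)
html = """
<table class="sc-jAaTju bVEWLO">
  <tbody>
    <tr><td>1</td><td><a href="#">#One</a></td><td>1006.4K tweets</td></tr>
    <tr><td>2</td><td><a href="#">#Two</a></td><td>1028.7K tweets</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="sc-jAaTju bVEWLO")

# One list of cell texts per <tr>
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(rows)
```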

Upvotes: 0

Barry the Platipus

Reputation: 10460

That page is loading data dynamically. You can get the table in question using selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
# chrome_options.add_argument("--headless")


webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://www.exportdata.io/trends/italy/2020-01-01/0'

browser.get(url)
table = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "bVEWLO")))
dfs = pd.read_html(str(browser.page_source))
browser.quit()
print(dfs[0])

Result:

   Rank  Trending Topic          Tweet Volume
0   1.0  #lannocheverra  Less than 10K tweets
1   2.0  Happy New Year          4949K tweets
2   3.0       Buon 2020          18.4K tweets
3   4.0     #Mattarella          19.3K tweets
4   5.0     #skamfrance  Less than 10K tweets
[...]

Upvotes: 1

Muhammad Ahsan

Reputation: 129

I think pandas can help you here. You can pass the HTML of your page to pandas' built-in read_html function (as a string) and it will do the job:

import pandas as pd

dfs = pd.read_html(str(soup))
table = dfs[0]

I have verified that it works. :)
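For the static table markup from the question, a sketch of this approach (abbreviated sample; read_html needs an HTML parser such as lxml or bs4 installed):

```python
import pandas as pd
from io import StringIO

# The static table markup from the question (abbreviated)
html = """
<table class="sc-jAaTju bVEWLO">
  <thead><tr><td>Rank</td><td>Trending Topic</td><td>Tweet Volume</td></tr></thead>
  <tbody>
    <tr><td>1</td><td>#One</td><td>1006.4K tweets</td></tr>
    <tr><td>2</td><td>#Two</td><td>1028.7K tweets</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> found in the markup
dfs = pd.read_html(StringIO(html))
table = dfs[0]
print(table.columns.tolist())
```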

Upvotes: 1
