loxm
loxm

Reputation: 27

Trying to extract a table from webpage using BeautifulSoup (table inconsistent with real data)

So far I have exported the link to my notebook are parsed the phrase using beautiful soup:

html_data = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')

Then I tried to basically make a table that's only containing revenue (Telsa Quarterly Revenue) here (trying to omit Nan values):

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.find('table', attrs={'class': 'historical_data_table table'})
for result in table:
    if table.find('th').getText().startswith("Tesla Quarterly Revenue"):
        for row in result.find_all('tbody').find_all("tr"):
            col = row.find("td")
            if len(col) != 2: continue
            Date = col[0].text
            Revenue = col[1].text
            tesla_revenue = tesla_revenue.append({"Date":Date,  "Revenue":Revenue}, ignore_index=True)

tesla_revenue = tesla_revenue.apply (pd.to_numeric, errors='coerce')
tesla_revenue = tesla_revenue.dropna()

Then when I tried to print out the tail of the table, I just get this:

| Date | Revenue |

(only the headers)

I think I might done something wrong when I made my table, but I can't be sure. Any help would be appreciated.

Upvotes: 0

Views: 3119

Answers (5)

TsSaha
TsSaha

Reputation: 1

Here you can try this:

tesla_revenue = pd.read_html(url, match = "Tesla Quarterly Revenue", flavor='bs4')[0]
print("Dataframe columns: ", tesla_revenue.columns)
tesla_revenue = tesla_revenue.rename(columns = {"Tesla Quarterly Revenue(Millions of US $)":"Date"})
tesla_revenue = tesla_revenue.rename(columns = {"Tesla Quarterly Revenue(Millions of US $).1":"Revenue"})  
tesla_revenue.head()

Upvotes: 0

Ajay
Ajay

Reputation: 5347

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.find_all('table', attrs={'class': 'historical_data_table table'})

for result in table:
    if result.find('th').getText().startswith("Tesla Quarterly Revenue"):
        for row in result.find_all('tbody'):
            for col in row.find_all('tr'):
                #print(col)
                items = col.text.split('$')
                items1=[i.strip('\n') for i in items]
                #print(items1)
                if len(items1) ==2:
                    
                    Date=items1[0]
                    Revenue = '$'+items1[1]
                    tesla_revenue = tesla_revenue.append({"Date":Date,  "Revenue":Revenue}, ignore_index=True)



pd.to_datetime(tesla_revenue['Date'])
pd.to_numeric(tesla_revenue['Revenue'],errors='coerce')

tesla_revenue = tesla_revenue.dropna()
tesla_revenue

Upvotes: 0

norie
norie

Reputation: 9857

Try using select to get all the table data elements in the table.

import requests
from bs4 import BeautifulSoup
import pandas as pd

html_data = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')#

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.select('table.historical_data_table:contains("Quarterly") tr td')

for idx in range(0, len(table), 2):
  Date = table[idx].getText()
  Revenue = table[idx+1].getText()[1:]
  tesla_revenue = tesla_revenue.append({"Date":Date,  "Revenue":Revenue}, ignore_index=True)

tesla_revenue['Date'] = pd.to_datetime(tesla_revenue['Date'])
tesla_revenue['Revenue'] = pd.to_numeric(tesla_revenue['Revenue'])

print(tesla_revenue)

Upvotes: 0

chitown88
chitown88

Reputation: 28565

As stated by furas, you need to pull out the correct table tag (so please accept his solution). But I also wanted to add to consider using pandas .read_html() method as it does the hard work of parsing the table tags for you (and uses beautifulsoup under the hood).

So .read_html() will return a list of dataframes. I believe the dataframe you want was the 'Tesla Quarterly Revenue', which is the second table (in index position 1):

import pandas as pd

df = pd.read_html('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')[1]

Output:

print(df)
   Tesla Quarterly Revenue(Millions of US $) Tesla Quarterly Revenue(Millions of US $).1
0                                 2020-12-31                                     $10,744
1                                 2020-09-30                                      $8,771
2                                 2020-06-30                                      $6,036
3                                 2020-03-31                                      $5,985
4                                 2019-12-31                                      $7,384
5                                 2019-09-30                                      $6,303
6                                 2019-06-30                                      $6,350
7                                 2019-03-31                                      $4,541
8                                 2018-12-31                                      $7,226
9                                 2018-09-30                                      $6,824
10                                2018-06-30                                      $4,002
11                                2018-03-31                                      $3,409
12                                2017-12-31                                      $3,288
13                                2017-09-30                                      $2,985
14                                2017-06-30                                      $2,790
15                                2017-03-31                                      $2,696
16                                2016-12-31                                      $2,285
17                                2016-09-30                                      $2,298
18                                2016-06-30                                      $1,270
19                                2016-03-31                                      $1,147
20                                2015-12-31                                      $1,214
21                                2015-09-30                                        $937
22                                2015-06-30                                        $955
23                                2015-03-31                                        $940
24                                2014-12-31                                        $957
25                                2014-09-30                                        $852
26                                2014-06-30                                        $769
27                                2014-03-31                                        $621
28                                2013-12-31                                        $615
29                                2013-09-30                                        $431
30                                2013-06-30                                        $405
31                                2013-03-31                                        $562
32                                2012-12-31                                        $306
33                                2012-09-30                                         $50
34                                2012-06-30                                         $27
35                                2012-03-31                                         $30
36                                2011-12-31                                         $39
37                                2011-09-30                                         $58
38                                2011-06-30                                         $58
39                                2011-03-31                                         $49
40                                2010-12-31                                         $36
41                                2010-09-30                                         $31
42                                2010-06-30                                         $28
43                                2010-03-31                                         $21
44                                2009-12-31                                         NaN
45                                2009-09-30                                         $46
46                                2009-06-30                                         $27
47                                2008-12-31         

                            NaN

Upvotes: 2

furas
furas

Reputation: 142641

There are few mistakes in this code but main problem is there are 4 tables in HTML but you use find('table', ...) instead of find_all('table',...) so you get only first table but Revenue is in other table (probably in second table).

import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(response.text, 'lxml')

all_tables = soup.find_all('table', attrs={'class': 'historical_data_table table'})

tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])

for table in all_tables:
    if table.find('th').getText().startswith("Tesla Quarterly Revenue"):
        for row in table.find_all("tr"):
            col = row.find_all("td")  
            if len(col) == 2: 
                date = col[0].text
                revenue = col[1].text.replace('$', '').replace(',', '')
                tesla_revenue = tesla_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)

#tesla_revenue = tesla_revenue.apply(pd.to_numeric, errors='coerce')
#tesla_revenue = tesla_revenue.dropna()

print(tesla_revenue)

Upvotes: 2

Related Questions