Reputation: 27
So far I have exported the link to my notebook are parsed the phrase using beautiful soup:
html_data = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')
Then I tried to basically make a table that's only containing revenue (Telsa Quarterly Revenue) here (trying to omit Nan values):
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.find('table', attrs={'class': 'historical_data_table table'})
for result in table:
if table.find('th').getText().startswith("Tesla Quarterly Revenue"):
for row in result.find_all('tbody').find_all("tr"):
col = row.find("td")
if len(col) != 2: continue
Date = col[0].text
Revenue = col[1].text
tesla_revenue = tesla_revenue.append({"Date":Date, "Revenue":Revenue}, ignore_index=True)
tesla_revenue = tesla_revenue.apply (pd.to_numeric, errors='coerce')
tesla_revenue = tesla_revenue.dropna()
Then when I tried to print out the tail of the table, I just get this:
| Date | Revenue |
(only the headers)
I think I might done something wrong when I made my table, but I can't be sure. Any help would be appreciated.
Upvotes: 0
Views: 3119
Reputation: 1
Here you can try this:
tesla_revenue = pd.read_html(url, match = "Tesla Quarterly Revenue", flavor='bs4')[0]
print("Dataframe columns: ", tesla_revenue.columns)
tesla_revenue = tesla_revenue.rename(columns = {"Tesla Quarterly Revenue(Millions of US $)":"Date"})
tesla_revenue = tesla_revenue.rename(columns = {"Tesla Quarterly Revenue(Millions of US $).1":"Revenue"})
tesla_revenue.head()
Upvotes: 0
Reputation: 5347
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.find_all('table', attrs={'class': 'historical_data_table table'})
for result in table:
if result.find('th').getText().startswith("Tesla Quarterly Revenue"):
for row in result.find_all('tbody'):
for col in row.find_all('tr'):
#print(col)
items = col.text.split('$')
items1=[i.strip('\n') for i in items]
#print(items1)
if len(items1) ==2:
Date=items1[0]
Revenue = '$'+items1[1]
tesla_revenue = tesla_revenue.append({"Date":Date, "Revenue":Revenue}, ignore_index=True)
pd.to_datetime(tesla_revenue['Date'])
pd.to_numeric(tesla_revenue['Revenue'],errors='coerce')
tesla_revenue = tesla_revenue.dropna()
tesla_revenue
Upvotes: 0
Reputation: 9857
Try using select to get all the table data elements in the table.
import requests
from bs4 import BeautifulSoup
import pandas as pd
html_data = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')#
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
table = soup.select('table.historical_data_table:contains("Quarterly") tr td')
for idx in range(0, len(table), 2):
Date = table[idx].getText()
Revenue = table[idx+1].getText()[1:]
tesla_revenue = tesla_revenue.append({"Date":Date, "Revenue":Revenue}, ignore_index=True)
tesla_revenue['Date'] = pd.to_datetime(tesla_revenue['Date'])
tesla_revenue['Revenue'] = pd.to_numeric(tesla_revenue['Revenue'])
print(tesla_revenue)
Upvotes: 0
Reputation: 28565
As stated by furas, you need to pull out the correct table tag (so please accept his solution). But I also wanted to add to consider using pandas
.read_html()
method as it does the hard work of parsing the table tags for you (and uses beautifulsoup under the hood).
So .read_html()
will return a list of dataframes. I believe the dataframe you want was the 'Tesla Quarterly Revenue'
, which is the second table (in index position 1):
import pandas as pd
df = pd.read_html('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')[1]
Output:
print(df)
Tesla Quarterly Revenue(Millions of US $) Tesla Quarterly Revenue(Millions of US $).1
0 2020-12-31 $10,744
1 2020-09-30 $8,771
2 2020-06-30 $6,036
3 2020-03-31 $5,985
4 2019-12-31 $7,384
5 2019-09-30 $6,303
6 2019-06-30 $6,350
7 2019-03-31 $4,541
8 2018-12-31 $7,226
9 2018-09-30 $6,824
10 2018-06-30 $4,002
11 2018-03-31 $3,409
12 2017-12-31 $3,288
13 2017-09-30 $2,985
14 2017-06-30 $2,790
15 2017-03-31 $2,696
16 2016-12-31 $2,285
17 2016-09-30 $2,298
18 2016-06-30 $1,270
19 2016-03-31 $1,147
20 2015-12-31 $1,214
21 2015-09-30 $937
22 2015-06-30 $955
23 2015-03-31 $940
24 2014-12-31 $957
25 2014-09-30 $852
26 2014-06-30 $769
27 2014-03-31 $621
28 2013-12-31 $615
29 2013-09-30 $431
30 2013-06-30 $405
31 2013-03-31 $562
32 2012-12-31 $306
33 2012-09-30 $50
34 2012-06-30 $27
35 2012-03-31 $30
36 2011-12-31 $39
37 2011-09-30 $58
38 2011-06-30 $58
39 2011-03-31 $49
40 2010-12-31 $36
41 2010-09-30 $31
42 2010-06-30 $28
43 2010-03-31 $21
44 2009-12-31 NaN
45 2009-09-30 $46
46 2009-06-30 $27
47 2008-12-31
NaN
Upvotes: 2
Reputation: 142641
There are few mistakes in this code but main problem is there are 4 tables in HTML but you use find('table', ...)
instead of find_all('table',...)
so you get only first table but Revenue
is in other table (probably in second table).
import requests
from bs4 import BeautifulSoup
import pandas as pd
response = requests.get('https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue')
soup = BeautifulSoup(response.text, 'lxml')
all_tables = soup.find_all('table', attrs={'class': 'historical_data_table table'})
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for table in all_tables:
if table.find('th').getText().startswith("Tesla Quarterly Revenue"):
for row in table.find_all("tr"):
col = row.find_all("td")
if len(col) == 2:
date = col[0].text
revenue = col[1].text.replace('$', '').replace(',', '')
tesla_revenue = tesla_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)
#tesla_revenue = tesla_revenue.apply(pd.to_numeric, errors='coerce')
#tesla_revenue = tesla_revenue.dropna()
print(tesla_revenue)
Upvotes: 2