Reputation: 1
I am trying to combine the box-office tables from several pages into one dataframe. I can grab the first table, so I think the scraping itself works; the problem seems to be in the step that brings all of the tables together.
I've tried declaring the dataframe early on and then having each loop iteration add its table data to it.
names = {'Iron-Man',
'Incredible-Hulk-The',
'Thor',
'Iron-Man-2',
'Captain-America-The-First-Avenger',
'Avengers-The-(2012)',
'Iron-Man-3',
'Thor-The-Dark-World',
'Captain-America-The-Winter-Soldier',
'Guardians-of-the-Galaxy',
'Avengers-Age-of-Ultron',
'Ant-Man',
'Captain-America-Civil-War',
'Doctor-Strange-(2016)',
'Guardians-of-the-Galaxy-Vol-2',
'Spider-Man-Homecoming',
'Thor-Ragnarok',
'Black-Panther',
'Avengers-Infinity-War',
'Ant-Man-and-the-Wasp',
'Captain-Marvel-(2019)',
'Avengers-Endgame-(2019)'
}
This piece of code works for grabbing a single page's table:
data = requests.get('https://www.the-numbers.com/movie/Iron-Man#tab=box-office')
soup = BeautifulSoup(data.text, 'html.parser')
data = []
div = soup.find('div' , {'id': 'box_office_chart'})
table = div.find('table')
tbody = table.find('tbody')
html = table.encode().decode('utf8')
dfs = pd.read_html(html,header=0)
df = dfs[0]
df
This piece of code is where I'm expecting it to loop through everything and grab it.
for name in names:
    print(name)
    data = requests.get('https://www.the-numbers.com/movie/' + name + '#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2
    df.append(df2)
    print(name)
df
All of the movie names printed twice, so I know the loop at least visited each page. Here is the output, which doesn't include any of the other movies.
df Output:
Date Rank Gross % Change Theaters Per Theaters Total Gross Week movie
0 May 2, 2008 1 $102,118,668 NaN 4105 $24,877 $102,118,668 1 Iron-Man
1 May 9, 2008 1 $51,190,629 -50% 4111 $12,452 $177,825,024 2 Iron-Man
2 May 16, 2008 2 $31,838,996 -38% 4154 $7,665 $223,124,385 3 Iron-Man
3 May 23, 2008 3 $20,447,253 -36% 3915 $5,223 $252,614,669 4 Iron-Man
4 May 30, 2008 4 $13,541,264 -34% 3650 $3,710 $276,166,336 5 Iron-Man
5 Jun 6, 2008 6 $7,477,439 -45% 2931 $2,551 $288,847,640 6 Iron-Man
6 Jun 13, 2008 7 $5,620,375 -25% 2403 $2,339 $297,918,329 7 Iron-Man
7 Jun 20, 2008 9 $4,030,272 -28% 1912 $2,108 $304,816,141 8 Iron-Man
8 Jun 27, 2008 11 $2,257,113 -44% 1379 $1,637 $309,179,318 9 Iron-Man
9 Jul 4, 2008 12 $1,459,613 -35% 1019 $1,432 $311,708,133 10 Iron-Man
10 Jul 11, 2008 14 $939,134 -36% 710 $1,323 $313,421,025 11 Iron-Man
11 Jul 18, 2008 16 $451,838 -52% 375 $1,205 $314,376,968 12 Iron-Man
12 Jul 25, 2008 22 $310,654 -31% 274 $1,134 $314,925,955 13 Iron-Man
13 Aug 1, 2008 16 $580,179 +87% 407 $1,426 $315,687,768 14 Iron-Man
14 Aug 8, 2008 19 $426,502 -26% 45 $1,236 $316,468,817 15 Iron-Man
15 Aug 15, 2008 23 $341,178 -20% 315 $1,083 $317,058,295 16 Iron-Man
16 Aug 22, 2008 29 $243,342 -29% 257 $947 $317,473,452 17 Iron-Man
17 Aug 29, 2008 33 $223,636 -8% 220 $1,017 $317,794,156 18 Iron-Man
18 Sep 5, 2008 38 $126,734 -43% 205 $618 $318,006,770 19 Iron-Man
19 Sep 12, 2008 39 $94,816 -25% 156 $608 $318,134,740 20 Iron-Man
20 Sep 19, 2008 43 $59,037 -38% 124 $476 $318,219,154 21 Iron-Man
21 Sep 26, 2008 48 $58,364 -1% 121 $482 $318,298,180 22 Iron-Man
I keep expecting to have all of the tables from the other pages added to df. Not sure where I'm going wrong.
EDIT: So I got rid of the first attempt at grabbing data and just used a bunch of elif statements to create all 22 dataframes. Thanks to everyone for the suggestions.
Upvotes: 0
Views: 95
Reputation: 28565
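No need for all the elif statements. df.append(df2) does not modify df in place; it returns a new DataFrame, which your loop throws away. To accumulate each table from the loop into a final results df, assign the result back: df = df.append(df2)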
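To see why the assignment matters, here is a minimal sketch with two toy frames (the values are made up for illustration, not scraped). It uses pd.concat rather than DataFrame.append, since DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; both return a new frame and leave the original untouched.

```python
import pandas as pd

# Toy frames standing in for two scraped tables (illustrative values only)
df = pd.DataFrame({"Week": [1, 2], "movie": ["Iron-Man", "Iron-Man"]})
df2 = pd.DataFrame({"Week": [1], "movie": ["Thor"]})

# The combined frame is built and immediately discarded: df is unchanged
pd.concat([df, df2])
rows_without_assignment = len(df)

# Rebinding the name keeps the combined frame
df = pd.concat([df, df2], ignore_index=True)
rows_with_assignment = len(df)

print(rows_without_assignment, rows_with_assignment)  # 2 3
```

The first call is the same mistake as a bare df.append(df2) in the loop: the work is done, but nothing holds on to the result.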
import requests
import pandas as pd
from bs4 import BeautifulSoup
names = {'Iron-Man',
'Incredible-Hulk-The',
'Thor',
'Iron-Man-2',
'Captain-America-The-First-Avenger',
'Avengers-The-(2012)',
'Iron-Man-3',
'Thor-The-Dark-World',
'Captain-America-The-Winter-Soldier',
'Guardians-of-the-Galaxy',
'Avengers-Age-of-Ultron',
'Ant-Man',
'Captain-America-Civil-War',
'Doctor-Strange-(2016)',
'Guardians-of-the-Galaxy-Vol-2',
'Spider-Man-Homecoming',
'Thor-Ragnarok',
'Black-Panther',
'Avengers-Infinity-War',
'Ant-Man-and-the-Wasp',
'Captain-Marvel-(2019)',
'Avengers-Endgame-(2019)'
}
df = pd.DataFrame()
for name in names:
    print(name)
    url = 'https://www.the-numbers.com/movie/' + name + '#tab=box-office'
    data = requests.get(url)
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2['movie'] = name
    df = df.append(df2)
    print(name)
df = df.reset_index(drop=True)
Output:
print (df)
Date Rank ... Week movie
0 Mar 8, 2019 1 ... 1 Captain-Marvel-(2019)
1 Mar 15, 2019 1 ... 2 Captain-Marvel-(2019)
2 Mar 22, 2019 2 ... 3 Captain-Marvel-(2019)
3 Mar 29, 2019 3 ... 4 Captain-Marvel-(2019)
4 Apr 5, 2019 5 ... 5 Captain-Marvel-(2019)
5 Apr 12, 2019 6 ... 6 Captain-Marvel-(2019)
6 Apr 19, 2019 4 ... 7 Captain-Marvel-(2019)
7 Apr 26, 2019 2 ... 8 Captain-Marvel-(2019)
8 Apr 27, 2018 1 ... 1 Avengers-Infinity-War
9 May 4, 2018 1 ... 2 Avengers-Infinity-War
10 May 11, 2018 1 ... 3 Avengers-Infinity-War
11 May 18, 2018 2 ... 4 Avengers-Infinity-War
12 May 25, 2018 3 ... 5 Avengers-Infinity-War
13 Jun 1, 2018 4 ... 6 Avengers-Infinity-War
14 Jun 8, 2018 5 ... 7 Avengers-Infinity-War
15 Jun 15, 2018 8 ... 8 Avengers-Infinity-War
16 Jun 22, 2018 9 ... 9 Avengers-Infinity-War
17 Jun 29, 2018 12 ... 10 Avengers-Infinity-War
18 Jul 6, 2018 15 ... 11 Avengers-Infinity-War
19 Jul 13, 2018 16 ... 12 Avengers-Infinity-War
20 Jul 20, 2018 20 ... 13 Avengers-Infinity-War
21 Jul 27, 2018 21 ... 14 Avengers-Infinity-War
22 Aug 3, 2018 24 ... 15 Avengers-Infinity-War
23 Aug 10, 2018 26 ... 16 Avengers-Infinity-War
24 Aug 17, 2018 37 ... 17 Avengers-Infinity-War
25 Aug 24, 2018 42 ... 18 Avengers-Infinity-War
26 Aug 31, 2018 44 ... 19 Avengers-Infinity-War
27 Sep 7, 2018 52 ... 20 Avengers-Infinity-War
28 Apr 26, 2019 1 ... 1 Avengers-Endgame-(2019)
29 May 5, 2017 1 ... 1 Guardians-of-the-Galaxy-Vol-2
.. ... ... ... ... ...
367 Aug 1, 2008 16 ... 14 Iron-Man
368 Aug 8, 2008 19 ... 15 Iron-Man
369 Aug 15, 2008 23 ... 16 Iron-Man
370 Aug 22, 2008 29 ... 17 Iron-Man
371 Aug 29, 2008 33 ... 18 Iron-Man
372 Sep 5, 2008 38 ... 19 Iron-Man
373 Sep 12, 2008 39 ... 20 Iron-Man
374 Sep 19, 2008 43 ... 21 Iron-Man
375 Sep 26, 2008 48 ... 22 Iron-Man
376 Jul 7, 2017 1 ... 1 Spider-Man-Homecoming
377 Jul 14, 2017 2 ... 2 Spider-Man-Homecoming
378 Jul 21, 2017 3 ... 3 Spider-Man-Homecoming
379 Jul 28, 2017 5 ... 4 Spider-Man-Homecoming
380 Aug 4, 2017 6 ... 5 Spider-Man-Homecoming
381 Aug 11, 2017 7 ... 6 Spider-Man-Homecoming
382 Aug 18, 2017 7 ... 7 Spider-Man-Homecoming
383 Aug 25, 2017 7 ... 8 Spider-Man-Homecoming
384 Sep 1, 2017 7 ... 9 Spider-Man-Homecoming
385 Sep 8, 2017 7 ... 10 Spider-Man-Homecoming
386 Sep 15, 2017 9 ... 11 Spider-Man-Homecoming
387 Sep 22, 2017 11 ... 12 Spider-Man-Homecoming
388 Sep 29, 2017 18 ... 13 Spider-Man-Homecoming
389 Oct 6, 2017 20 ... 14 Spider-Man-Homecoming
390 Oct 13, 2017 20 ... 15 Spider-Man-Homecoming
391 Oct 20, 2017 27 ... 16 Spider-Man-Homecoming
392 Oct 27, 2017 33 ... 17 Spider-Man-Homecoming
393 Nov 3, 2017 37 ... 18 Spider-Man-Homecoming
394 Nov 10, 2017 42 ... 19 Spider-Man-Homecoming
395 Nov 17, 2017 46 ... 20 Spider-Man-Homecoming
396 Nov 24, 2017 51 ... 21 Spider-Man-Homecoming
[397 rows x 9 columns]
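On pandas 2.0 and later, where DataFrame.append no longer exists, the same idea could be sketched as follows: collect each table in a plain list and call pd.concat once at the end. The helper name combine_tables and the toy tables are assumptions for illustration; in the real script each table would be the dfs[0] returned by pd.read_html for one page.

```python
import pandas as pd


def combine_tables(tables_by_name):
    """Tag each per-movie table with its name and stack them into one frame."""
    frames = []
    for name, table in tables_by_name.items():
        tagged = table.copy()      # leave the caller's frame untouched
        tagged["movie"] = name     # same labelling step as in the loop above
        frames.append(tagged)      # list.append does mutate the list in place
    # One concat at the end; ignore_index renumbers rows 0..n-1
    return pd.concat(frames, ignore_index=True)


# Toy stand-ins for the scraped tables (illustrative values only)
tables = {
    "Iron-Man": pd.DataFrame({"Week": [1, 2], "Rank": [1, 1]}),
    "Thor": pd.DataFrame({"Week": [1], "Rank": [1]}),
}

combined = combine_tables(tables)
print(combined.shape)  # (3, 3)
```

Concatenating once also avoids copying the growing frame on every iteration. As a side note, names is a set, which has no defined order; that is why the movies in the output above are not in release order. Using a list instead would keep the order in which the names are written.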
Upvotes: 1