Reputation: 31
I want to scrape the genre and running time for a list of 250 films. A list named 'links' contains the URLs of these 250 movie pages. I have written code that extracts the genre and running time from a single URL in 'links'.
links=['https://www.imdb.com/title/tt0093603/','https://www.imdb.com/title/tt8176054/','https://www.imdb.com/title/tt0367495/','https://www.imdb.com/title/tt0048473/','https://www.imdb.com/title/tt0079221/','https://www.imdb.com/title/tt7391996/','https://www.imdb.com/title/tt0052572/','https://www.imdb.com/title/tt0237376/','https://www.imdb.com/title/tt0214915/','https://www.imdb.com/title/tt5311546/','https://www.imdb.com/title/tt7019842/','https://www.imdb.com/title/tt0105575/','https://www.imdb.com/title/tt0400234/','https://www.imdb.com/title/tt8413338/','https://www.imdb.com/title/tt12361178/','https://www.imdb.com/title/tt4991384/','https://www.imdb.com/title/tt1187043/','https://www.imdb.com/title/tt8948790/','https://www.imdb.com/title/tt0986264/','https://www.imdb.com/title/tt10189514/','https://www.imdb.com/title/tt0101649/','https://www.imdb.com/title/tt5074352/','https://www.imdb.com/title/tt9477520/','https://www.imdb.com/title/tt7060344/','https://www.imdb.com/title/tt9900782/','https://www.imdb.com/title/tt0291855/','https://www.imdb.com/title/tt0048956/','https://www.imdb.com/title/tt0085743/','https://www.imdb.com/title/tt0050870/','https://www.imdb.com/title/tt7738784/','https://www.imdb.com/title/tt5959980/','https://www.imdb.com/title/tt0059246/','https://www.imdb.com/title/tt4987556/','https://www.imdb.com/title/tt0312859/','https://www.imdb.com/title/tt0072783/','https://www.imdb.com/title/tt0119385/','https://www.imdb.com/title/tt0292246/','https://www.imdb.com/title/tt10214826/','https://www.imdb.com/title/tt7019942/','https://www.imdb.com/title/tt3417422/','https://www.imdb.com/title/tt7465992/','https://www.imdb.com/title/tt5867800/','https://www.imdb.com/title/tt6148156/','https://www.imdb.com/title/tt8239946/',
'https://www.imdb.com/title/tt0466460/','https://www.imdb.com/title/tt0459516/','https://www.imdb.com/title/tt4679210/','https://www.imdb.com/title/tt0376127/','https://www.imdb.com/title/tt0066763/','https://www.imdb.com/title/tt3973410/','https://www.imdb.com/title/tt3668162/','https://www.imdb.com/title/tt0220656/','https://www.imdb.com/title/tt6380520/','https://www.imdb.com/title/tt0195231/','https://www.imdb.com/title/tt8108198/','https://www.imdb.com/title/tt4429128/','https://www.imdb.com/title/tt2877108/','https://www.imdb.com/title/tt2181831/','https://www.imdb.com/title/tt3569782/','https://www.imdb.com/title/tt0376076/','https://www.imdb.com/title/tt1954470/','https://www.imdb.com/title/tt1620933/','https://www.imdb.com/title/tt5312232/','https://www.imdb.com/title/tt2356180/','https://www.imdb.com/title/tt0242519/','https://www.imdb.com/title/tt4934950/','https://www.imdb.com/title/tt0367110/','https://www.imdb.com/title/tt0073707/','https://www.imdb.com/title/tt2218988/','https://www.imdb.com/title/tt0871510/','https://www.imdb.com/title/tt0375611/','https://www.imdb.com/title/tt0104561/','https://www.imdb.com/title/tt0054098/','https://www.imdb.com/title/tt1562872/','https://www.imdb.com/title/tt4430212/','https://www.imdb.com/title/tt4851630/','https://www.imdb.com/title/tt5005684/','https://www.imdb.com/title/tt10324144/','https://www.imdb.com/title/tt1639426/','https://www.imdb.com/title/tt0057935/','https://www.imdb.com/title/tt7060460/','https://www.imdb.com/title/tt1280558/','https://www.imdb.com/title/tt3322420/','https://www.imdb.com/title/tt4635372/','https://www.imdb.com/title/tt0242256/','https://www.imdb.com/title/tt0200087/','https://www.imdb.com/title/tt0374887/','https://www.imdb.com/title/tt0139876/','https://www.imdb.com/title/tt0292490/','https://www.imdb.com/title/tt0105271/','https://www.imdb.com/title/tt9052870/','https://www.imdb.com/title/tt2283748/','https://www.imdb.com/title/tt0405508/','https://www.imdb.com/title/tt0364647/'
,'https://www.imdb.com/title/tt0169102/','https://www.imdb.com/title/tt1821480/','https://www.imdb.com/title/tt0109117/','https://www.imdb.com/title/tt8291224/','https://www.imdb.com/title/tt2338151/','https://www.imdb.com/title/tt2358592/','https://www.imdb.com/title/tt0453729/','https://www.imdb.com/title/tt0319736/','https://www.imdb.com/title/tt0843326/','https://www.imdb.com/title/tt2082197/','https://www.imdb.com/title/tt5571734/','https://www.imdb.com/title/tt0112553/','https://www.imdb.com/title/tt0379370/','https://www.imdb.com/title/tt8144834/','https://www.imdb.com/title/tt0488414/','https://www.imdb.com/title/tt0116630/','https://www.imdb.com/title/tt13299890/','https://www.imdb.com/title/tt0456144/','https://www.imdb.com/title/tt7822438/','https://www.imdb.com/title/tt5824826/','https://www.imdb.com/title/tt4849438/','https://www.imdb.com/title/tt0072860/','https://www.imdb.com/title/tt1695800/','https://www.imdb.com/title/tt2564144/','https://www.imdb.com/title/tt1261047/','https://www.imdb.com/title/tt0063404/','https://www.imdb.com/title/tt0471571/','https://www.imdb.com/title/tt7392212/','https://www.imdb.com/title/tt3390572/','https://www.imdb.com/title/tt0112870/','https://www.imdb.com/title/tt6315524/','https://www.imdb.com/title/tt5906392/','https://www.imdb.com/title/tt0213969/','https://www.imdb.com/title/tt2882328/','https://www.imdb.com/title/tt0050188/','https://www.imdb.com/title/tt1821317/','https://www.imdb.com/title/tt2377938/','https://www.imdb.com/title/tt7838252/','https://www.imdb.com/title/tt10919240/','https://www.imdb.com/title/tt1180583/','https://www.imdb.com/title/tt1773764/','https://www.imdb.com/title/tt3394420/','https://www.imdb.com/title/tt7725596/','https://www.imdb.com/title/tt2395469/','https://www.imdb.com/title/tt1327035/','https://www.imdb.com/title/tt3863552/','https://www.imdb.com/title/tt1649431/','https://www.imdb.com/title/tt0051792/','https://www.imdb.com/title/tt0220832/','https://www.imdb.com/title/tt1857670/',
'https://www.imdb.com/title/tt3614516/','https://www.imdb.com/title/tt7180544/','https://www.imdb.com/title/tt0296574/','https://www.imdb.com/title/tt7294534/','https://www.imdb.com/title/tt3449292/','https://www.imdb.com/title/tt11581174/','https://www.imdb.com/title/tt2585562/','https://www.imdb.com/title/tt1188996/','https://www.imdb.com/title/tt5082014/','https://www.imdb.com/title/tt3124456/',
'https://www.imdb.com/title/tt8110330/',
'https://www.imdb.com/title/tt0347304/',
'https://www.imdb.com/title/tt1093370/',
'https://www.imdb.com/title/tt2924472/',
'https://www.imdb.com/title/tt1609168/',
'https://www.imdb.com/title/tt6167894/',
'https://www.imdb.com/title/tt0118751/',
'https://www.imdb.com/title/tt7485048/',
'https://www.imdb.com/title/tt2325915/',
'https://www.imdb.com/title/tt0375878/',
'https://www.imdb.com/title/tt1417299/',
'https://www.imdb.com/title/tt7218518/',
'https://www.imdb.com/title/tt0323013/',
'https://www.imdb.com/title/tt8108200/',
'https://www.imdb.com/title/tt2631186/',
'https://www.imdb.com/title/tt0455829/',
'https://www.imdb.com/title/tt0824316/',
'https://www.imdb.com/title/tt0222012/',
'https://www.imdb.com/title/tt11322920/',
'https://www.imdb.com/title/tt3848892/',
'https://www.imdb.com/title/tt10717738/',
'https://www.imdb.com/title/tt4387040/',
'https://www.imdb.com/title/tt5764096/',
'https://www.imdb.com/title/tt0366840/',
'https://www.imdb.com/title/tt2181931/',
'https://www.imdb.com/title/tt1517561/',
'https://www.imdb.com/title/tt0373856/',
'https://www.imdb.com/title/tt2926068/',
'https://www.imdb.com/title/tt2350496/',
'https://www.imdb.com/title/tt1077248/',
'https://www.imdb.com/title/tt0402014/',
'https://www.imdb.com/title/tt13206926/',
'https://www.imdb.com/title/tt8130968/',
'https://www.imdb.com/title/tt0816258/',
'https://www.imdb.com/title/tt6108090/',
'https://www.imdb.com/title/tt4169250/',
'https://www.imdb.com/title/tt0291376/',
'https://www.imdb.com/title/tt2317337/',
'https://www.imdb.com/title/tt0093578/',
'https://www.imdb.com/title/tt7098658/',
'https://www.imdb.com/title/tt4434004/',
'https://www.imdb.com/title/tt1907761/',
'https://www.imdb.com/title/tt7758160/',
'https://www.imdb.com/title/tt0077451/',
'https://www.imdb.com/title/tt4432480/',
'https://www.imdb.com/title/tt1230165/',
'https://www.imdb.com/title/tt0420332/',
'https://www.imdb.com/title/tt3822396/',
'https://www.imdb.com/title/tt1851988/',
'https://www.imdb.com/title/tt5121000/',
'https://www.imdb.com/title/tt1288638/',
'https://www.imdb.com/title/tt0499375/',
'https://www.imdb.com/title/tt0431619/',
'https://www.imdb.com/title/tt2187153/',
'https://www.imdb.com/title/tt0196069/',
'https://www.imdb.com/title/tt2213054/',
'https://www.imdb.com/title/tt3801314/',
'https://www.imdb.com/title/tt1292703/',
'https://www.imdb.com/title/tt4981966/',
'https://www.imdb.com/title/tt1266583/',
'https://www.imdb.com/title/tt1839596/',
'https://www.imdb.com/title/tt0422320/',
'https://www.imdb.com/title/tt7998242/',
'https://www.imdb.com/title/tt2258337/',
'https://www.imdb.com/title/tt0110222/',
'https://www.imdb.com/title/tt0109555/',
'https://www.imdb.com/title/tt6484982/',
'https://www.imdb.com/title/tt4900716/',
'https://www.imdb.com/title/tt3320542/',
'https://www.imdb.com/title/tt7142506/',
'https://www.imdb.com/title/tt1241195/',
'https://www.imdb.com/title/tt8108268/',
'https://www.imdb.com/title/tt0150433/',
'https://www.imdb.com/title/tt2855648/',
'https://www.imdb.com/title/tt0098999/',
'https://www.imdb.com/title/tt0432047/',
'https://www.imdb.com/title/tt3447364/',
'https://www.imdb.com/title/tt1014672/',
'https://www.imdb.com/title/tt1926313/',
'https://www.imdb.com/title/tt5286444/',
'https://www.imdb.com/title/tt2980794/',
'https://www.imdb.com/title/tt8042292/',
'https://www.imdb.com/title/tt1447500/',
'https://www.imdb.com/title/tt0106333/',
'https://www.imdb.com/title/tt2140465/',
'https://www.imdb.com/title/tt0920464/',
'https://www.imdb.com/title/tt5310090/',
'https://www.imdb.com/title/tt7212754/',
'https://www.imdb.com/title/tt1324059/',
'https://www.imdb.com/title/tt3767372/',
'https://www.imdb.com/title/tt2375559/',
'https://www.imdb.com/title/tt6027478/',
'https://www.imdb.com/title/tt8590896/',
'https://www.imdb.com/title/tt0172684/',
'https://www.imdb.com/title/tt6206564/',
'https://www.imdb.com/title/tt0449994/']
Now I have to do that for all 250 URLs in the list. When I looped this process, I got the info for the last URL only.
Here is the code I have written for one URL:
import requests
from bs4 import BeautifulSoup

def get_movie_info(a_tag, div_tag):
    # returns all the required info about a movie
    span_tags1 = a_tag.find_all('span')
    genre = span_tags1[0].text.strip()
    li_tags = div_tag.find_all('li')
    length_of_film = li_tags[1].text.strip()
    return genre, length_of_film

movie_page_url = links[0]  # 1st url in the list
response = requests.get(movie_page_url)
# parse the downloaded page (this step was missing from the snippet)
movie_doc = BeautifulSoup(response.text, 'html.parser')
# get a tags
a_tags = movie_doc.find_all('a', attrs={'class': "GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"})
# get div tags
div_tags = movie_doc.find_all('div', attrs={'class': "TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"})
movie_dict = {
    'genre1': [],
    'length_of_movie': []}
a_tag = a_tags[0]
div_tag = div_tags[0]
movie_info = get_movie_info(a_tag, div_tag)
movie_dict['genre1'].append(movie_info[0])
movie_dict['length_of_movie'].append(movie_info[1])
Output is
movie_dict = {'genre1': ['Crime'], 'length_of_movie': ['2h 25min']}
The output should be a dataframe with the columns 'genre1' and 'length_of_movie' and 250 rows, one per movie, holding that movie's genre and running time.
Upvotes: 0
Views: 609
Reputation: 195573
Loop over your list of movie URLs and append each result to the lists in the dictionary, which is created once before the loop. As a last step, create the dataframe:
import pandas as pd
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.imdb.com/title/tt0093603/",
    "https://www.imdb.com/title/tt8176054/",
    "https://www.imdb.com/title/tt0367495/",
    # ... rest of your URLs
]

def get_movie_info(a_tag, div_tag):
    span_tags1 = a_tag.find_all("span")
    genre = span_tags1[0].text.strip()
    # pick the <li> whose text contains "min" instead of relying on its position
    li_tag = div_tag.find(lambda tag: tag.name == "li" and "min" in tag.text)
    length_of_film = li_tag.text.strip()
    return genre, length_of_film

# create the dictionary once, before the loop
movie_dict = {"genre1": [], "length_of_movie": []}

for movie_page_url in links:
    response = requests.get(movie_page_url)
    movie_doc = BeautifulSoup(response.content, "html.parser")
    # get a tags
    a_tags = movie_doc.find_all(
        "a",
        attrs={
            "class": "GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"
        },
    )
    # get div tags
    div_tags = movie_doc.find_all(
        "div",
        attrs={
            "class": "TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"
        },
    )
    a_tag = a_tags[0]
    div_tag = div_tags[0]
    movie_info = get_movie_info(a_tag, div_tag)
    movie_dict["genre1"].append(movie_info[0])
    movie_dict["length_of_movie"].append(movie_info[1])

df = pd.DataFrame(movie_dict)
print(df)
Prints:
genre1 length_of_movie
0 Crime 2h 25min
1 Drama 2h 34min
2 Adventure 2h 40min
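One common reason a loop like this ends up with only the last film is that the dictionary is re-created, or the appends are placed, outside the loop body. The following is a minimal sketch of the accumulate-inside-the-loop pattern, not a definitive implementation. To keep it runnable without network access, the page parsing is replaced by a hypothetical `scrape_one` callable; `collect_movie_info` and `fake_scrape` are illustrative names that do not appear in the code above, and the sample values are the ones from the printed output. It also assumes you want a `None` placeholder when an expected tag is missing (for example, `a_tags[0]` raising `IndexError` on a page whose markup differs).

```python
def collect_movie_info(links, scrape_one):
    """Call scrape_one(url) for every URL and accumulate the results.

    scrape_one is a hypothetical callable returning (genre, length_of_film);
    in real use it would wrap requests.get, BeautifulSoup and get_movie_info.
    """
    # Create the accumulator once, before the loop, so every iteration
    # appends to the same lists instead of overwriting earlier results.
    movie_dict = {"genre1": [], "length_of_movie": []}
    for url in links:
        try:
            genre, length_of_film = scrape_one(url)
        except (IndexError, AttributeError):
            # An expected tag was not found on this page; store a
            # placeholder so the rows stay aligned with `links`.
            genre, length_of_film = None, None
        movie_dict["genre1"].append(genre)
        movie_dict["length_of_movie"].append(length_of_film)
    return movie_dict


def fake_scrape(url):
    # Stand-in for the real scraping step, for illustration only.
    pages = {
        "url-1": ("Crime", "2h 25min"),
        "url-2": ("Drama", "2h 34min"),
    }
    if url not in pages:
        # Simulates indexing into an empty a_tags list.
        raise IndexError("no matching tag")
    return pages[url]


result = collect_movie_info(["url-1", "url-missing", "url-2"], fake_scrape)
print(result)
```

Passing the returned dictionary to `pd.DataFrame` gives one row per URL, with missing values where a page could not be parsed.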
Upvotes: 0