Felipe Mezzarana

Reputation: 97

How to make Web Scraping faster?

I wrote this code to extract lyrics from a website, given the artist and song name.

The code works, but the problem is that I have a DataFrame (named years_1920_2020) with 10,000 songs, and it took about an hour and a half to retrieve all the lyrics.

Is there a way to do it faster?

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def url_lyric(music, artist):
    url = ''.join(("https://www.letras.mus.br/", str(artist), "/", str(music), "/"))
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        webpage = urlopen(req).read()
        bs = BeautifulSoup(webpage, 'html.parser')
        lines = bs.find('div', {'class': 'cnt-letra p402_premium'})
        final_lines = lines.find_all('p')
        return final_lines
    except Exception:
        return 0


final_lyric_series = pd.Series(name="lyrics")

for year in range(1920, 2021):
    lyrics_serie = lyrics_from_year(year)
    final_lyric_series = pd.concat([final_lyric_series, lyrics_serie])
    print(year)

The function lyrics_from_year(year) uses the function url_lyric, performs some re tasks, and returns a pd.Series with all the lyrics of the chosen year.

Upvotes: 0

Views: 1647

Answers (2)

mama

Reputation: 2227

Here is a simple example of how you could do it:

import aiohttp
import asyncio
import requests, bs4

async def main():
    async with aiohttp.ClientSession() as session:
        # Collect the song URLs from the artist's "most played" page.
        urls = [f"https://www.letras.mus.br{x['href']}" for x in bs4.BeautifulSoup(requests.get(
            url = 'https://www.letras.mus.br/adele/mais-tocadas.html'
        ).content, 'html.parser').find_all('a', {'class':'song-name'})]

        # Fetch each song page and extract the lyrics block.
        for url in urls:
            async with session.get(url) as r:
                lyrics = bs4.BeautifulSoup(await r.text(), 'html.parser').find('div', {'class':'cnt-letra'}).text
                print('\n'.join(x.strip() for x in lyrics.strip().split('\n')))

asyncio.run(main())
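Note that the loop above still awaits each page one at a time; the session is asynchronous, but the requests don't overlap. To actually run the downloads concurrently you could hand all the coroutines to asyncio.gather at once. A minimal sketch of that idea, using a placeholder coroutine with asyncio.sleep standing in for session.get so it runs without network access (the URLs here are hypothetical):

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for an HTTP request: pretend each one takes 0.2 s of I/O.
    await asyncio.sleep(0.2)
    return f"lyrics for {url}"

async def main():
    urls = [f"https://example.com/song/{i}" for i in range(10)]
    # gather() schedules every coroutine up front, so the ten simulated
    # requests overlap instead of running back to back.
    return await asyncio.gather(*(fetch(u) for u in urls))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(len(results))       # 10
print(round(elapsed, 1))  # roughly 0.2, not 2.0
```

With real requests you would pass `session.get(url)`-based coroutines to gather the same way; the total time then approaches that of the slowest single request rather than the sum of all of them.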

Upvotes: 2

raviraj

Reputation: 375

You can solve this using Python's asyncio module. Please refer to this article; it's not an exact solution, but it's similar to your problem.

import asyncio
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def url_lyric(music, artist):
    pass


def lyrics_from_year(year):
    music = None
    artist = None
    return url_lyric(music, artist)


async def get_work_done():
    with ThreadPoolExecutor(max_workers=10) as executor:
        loop = asyncio.get_event_loop()
        tasks = [
            loop.run_in_executor(
                executor,
                lyrics_from_year,
                year  # arguments after the callable are passed through to `lyrics_from_year`
            )
            for year in range(1920, 2021)
        ]

    return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(get_work_done())
results = loop.run_until_complete(future)

final_lyric_series = pd.Series(name="lyrics")


for result in results:
    final_lyric_series = pd.concat([final_lyric_series, result])
    print(result)
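The thread-pool idea can also be tested on its own, without asyncio or network access. Here is a self-contained sketch using concurrent.futures directly, with a time.sleep stand-in for the real scraping work (the function body is hypothetical), showing that 10 workers cut the wall-clock time roughly tenfold:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def lyrics_from_year(year):
    # Stand-in for the real scraping: pretend each year takes 0.1 s of I/O.
    time.sleep(0.1)
    return f"lyrics for {year}"

years = range(1920, 1940)  # 20 "years" of work

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    # executor.map preserves input order, so results line up with `years`.
    results = list(executor.map(lyrics_from_year, years))
elapsed = time.perf_counter() - start

print(len(results))       # 20
print(round(elapsed, 1))  # roughly 0.2 with 10 workers, vs 2.0 sequentially
```

Because the work is I/O-bound (waiting on HTTP responses), threads are enough here; the GIL is released while a thread is blocked on network or sleep.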

Upvotes: 2
