scalamardo

Reputation: 27

How to do web scraping in Python?

I want to web scrape a particular finance website, but I have never done that in my life. I don't understand HTML, so it's very difficult for me. I want to learn, because I need an example to start scraping a lot of tables. The site belongs to a Chilean institution named "Comisión para el Mercado Financiero". The URL is: "http://www.cmfchile.cl/institucional/inc/valores_cuota/valor_serie.php?v1=C1KB5&v2=LPKA0ISQAKEHITB64IBM&v3=4ABCIV864AJ35MN64IBM&v4=V864A4ABCI&v5=J35MNS8IYM&v6=4ABCIV864A4ABCIV864A&v7=V864AISQAK&v8=V864A64IBM&v9=37G70LN68AGLD87IEAIXGLD87OL18863409LN68AOL188JKT99QHFLBMLXL410163LN68A&v10=21QYE48BCX99KWAEF88BWM6YB&v11=63409LN68AGLD8737GH0J35MN&v12=63409LN68AGLD8737GH04ABCI"

Can someone tell me how to do that? I know I can do it with the BeautifulSoup and requests modules, but nothing more. A book on web scraping in Python would also be very helpful, if there is one.

Upvotes: 0

Views: 192

Answers (3)

shekhar chander

Reputation: 618

Fully working code. You may have to wait a while (approximately 10 minutes or less) to get your results, so please reply after it is done.

As your link has dynamic data, the page takes time to load. BeautifulSoup hits a timeout, which is why it reports no response. A good idea is to use Selenium here, since it waits until the page is fully loaded.

After trying many ways to get the data, here's the final solution.

from selenium import webdriver

rows = []
# Path to your local chromedriver executable
driver = webdriver.Chrome('P:/selenium/driver/chromedriver.exe')
driver.get('YOUR LINK HERE')
# Grab every row of the quotes table
data = driver.find_elements_by_xpath('//*[@id="main"]/div/div[2]/table/tbody/tr')
for i in range(len(data)):
    if i == 0:  # skip the first row, which contains the header, not data
        continue
    rows.append(data[i].text.split(" "))
driver.quit()
print(rows)

Upvotes: 0

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
}


def main(url):
    # Fetch the page with a browser-like User-Agent, then let pandas
    # parse the first <table> in the response into a DataFrame
    r = requests.get(url, headers=headers)
    df = pd.read_html(r.content)[0]
    print(df)


main("http://www.cmfchile.cl/institucional/inc/valores_cuota/valor_serie.php?v1=C1KB5&v2=LPKA0ISQAKEHITB64IBM&v3=4ABCIV864AJ35MN64IBM&v4=V864A4ABCI&v5=J35MNS8IYM&v6=4ABCIV864A4ABCIV864A&v7=V864AISQAK&v8=V864A64IBM&v9=37G70LN68AGLD87IEAIXGLD87OL18863409LN68AOL188JKT99QHFLBMLXL410163LN68A&v10=21QYE48BCX99KWAEF88BWM6YB&v11=63409LN68AGLD8737GH0J35MN&v12=63409LN68AGLD8737GH04ABCI")
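The same pd.read_html approach can be tried offline against any HTML table, which is a good way to see what the parser does before pointing it at the real page. A minimal sketch with an inline table (the column names and values here are made up, not from the CMF site):

```python
from io import StringIO
import pandas as pd

# An inline HTML table with made-up values, standing in for r.content
html = """
<table>
  <tr><th>Fecha</th><th>Valor cuota</th></tr>
  <tr><td>2020-06-01</td><td>1234.56</td></tr>
  <tr><td>2020-06-02</td><td>1240.10</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the document;
# the <th> row is picked up as the header automatically
df = pd.read_html(StringIO(html))[0]
print(df)
```

Note that pd.read_html needs an HTML parser backend (lxml or html5lib) installed alongside pandas.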

Upvotes: 0

srinivas-vaddi

Reputation: 140

As you rightly mentioned, this is "web scraping", and Python has amazing modules for it. It is important to understand the technicalities before we proceed further.

One of the most used modules is BeautifulSoup.

So, to get the info from any webpage:

  • you would first need to understand the structure of the webpage.
  • Also, in some cases this might not be fully legal, considering that we are using this info from the webpage for other purposes.
  • The bigger challenge is: does the webpage support scraping? This is important to settle before proceeding.
    • How can you find out? By looking at the source of the webpage.
    • If the text/info you want to grab is visible in the source, or in one of the hrefs, then it should be possible to scrape it using BeautifulSoup.
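That "is it in the source?" check can be done without any parser at all: fetch the raw HTML and look for a value you expect to see. A minimal sketch, using a hardcoded snippet in place of a real request (in practice you would use requests.get(url).text; the table contents here are made up):

```python
# A tiny, made-up sample of page source; in practice this string would
# come from requests.get(url).text
html = """
<table>
  <tr><th>Fecha</th><th>Valor cuota</th></tr>
  <tr><td>2020-06-01</td><td>1234.56</td></tr>
</table>
"""

# If the value you want appears verbatim in the raw source, the page is
# scrapeable without a browser; if not, the data is loaded dynamically
# and you will need something like Selenium
scrapeable = "1234.56" in html
print(scrapeable)
```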

Solution -

  • Before you arrive at a solution you must understand the HTML structure and the ways in which you can identify any element on a webpage.
  • There are many ways, like:

    • using the "id" of any element on the webpage
    • using the class or tag name directly
    • using the XPath of the element
    • or a combination of any or all of the above
  • Once you reach this point, it should be clear how to proceed further.

import requests
from bs4 import BeautifulSoup

#make a request to the webpage, and grab the HTML response
page = requests.get("your url here")

#pass it on to BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

#depending on what you want to find, you can search by tag, class,
#id and other attributes
soup.find_all('your tag')
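The selection methods listed above can be sketched on a self-contained, made-up HTML snippet (the ids, classes and values below are illustrative, not the actual CMF page). Note that BeautifulSoup has no XPath support; its CSS-selector method select() covers the same ground:

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for the real page source
html = """
<div id="main">
  <table class="quotes">
    <tr><th>Fecha</th><th>Valor</th></tr>
    <tr><td>2020-06-01</td><td>1234.56</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# 1. by the "id" of an element
main_div = soup.find(id="main")

# 2. by tag name and class
table = soup.find("table", class_="quotes")

# 3. by CSS selector (BeautifulSoup's alternative to XPath)
cells = [td.get_text() for td in soup.select("#main table tr td")]
print(cells)
```

Running this prints ['2020-06-01', '1234.56'], i.e. only the data cells, since the header row uses th rather than td.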

Upvotes: 1
