Reputation: 37
I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code I have written is below, and it runs and produces output.
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)
print(df)
But the result contains some unwanted data, and I want only the data in the table. Can someone please help me with this?
I have added an image of the output with the unwanted data circled in red.
Upvotes: 2
Views: 781
Reputation: 514
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)  # this gives you a list of DataFrames, one per HTML table on the page
print(df[3])
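If you are not sure which position your table sits at, a quick way to check (just a sketch; index 3 is specific to this page's current layout) is to look at the size of each DataFrame in the list and preview the likely candidate:
import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
tables = pd.read_html(url)  # one DataFrame per <table> element pandas could parse

# Print the position and shape of every table found, then preview the candidate
for i, t in enumerate(tables):
    print(i, t.shape)
print(tables[3].head())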
Upvotes: 2
Reputation: 15568
The way you used .read_html returns a list of all the tables on the page. Your table is at index 3:
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)[3]
print(df)
.read_html makes the request to the URL and parses the response with an HTML parser (lxml or BeautifulSoup) under the hood. You can change the parser, match a table by name, or pass header just as you would with .read_csv. Check the .read_html documentation for more details.
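For example, here is a minimal sketch of passing a couple of those options; the match string below is only an illustrative guess at text that appears in the target table, not something confirmed from the page:
import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
# match keeps only tables whose text contains the given string/regex,
# header says which row to use for column names (as in .read_csv).
# Note: read_html raises ValueError if no table matches.
tables = pd.read_html(url, match="Gold", header=0)
print(len(tables), "matching table(s)")
print(tables[0].head())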
For speed, use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. By default pandas tries lxml first and falls back to html5lib (via BeautifulSoup) if lxml is not installed; html5lib is the second slowest flavor, and html.parser is the slowest of them all.
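If you want to check the speed difference on your own machine, here is a rough sketch; it assumes both lxml and html5lib are installed, and it fetches the page once so the comparison measures parsing rather than the network:
import time
import urllib.request
from io import StringIO

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")  # fetch once

# Parse the same HTML with each flavor and time it
for flavor in ("lxml", "html5lib"):
    start = time.perf_counter()
    pd.read_html(StringIO(html), flavor=flavor)
    print(flavor, f"{time.perf_counter() - start:.3f}s")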
Upvotes: 0
Reputation: 53
Use BeautifulSoup for this; the code below works.
import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

data = s.find_all("td")  # every table cell on the page
data = data[11:]         # skip the leading cells that are not part of the price table
for i in range(0, len(data), 2):
    # cells alternate date, price, date, price, ...
    print(data[i].text.strip(), " ", data[i + 1].text.strip())
Another advantage of using BeautifulSoup is that it is much faster than your original code.
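If you end up wanting the result as a DataFrame anyway, one possible follow-up (a sketch; the column names are my own labels, not taken from the page) is to pair up the cells and hand them to pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

cells = [td.text.strip() for td in s.find_all("td")][11:]  # same slice as above
rows = list(zip(cells[0::2], cells[1::2]))                 # (date, price) pairs
df = pd.DataFrame(rows, columns=["date", "price_lkr"])     # hypothetical column names
print(df.head())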
Upvotes: 0