Reputation: 37
I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code I have written is below, and it runs and produces output.
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)
print(df)
But the result contains some unwanted data, and I want only the data in the table. Can someone please help me with this?
I have added an image of the output with the unwanted data circled in red.
Upvotes: 2
Views: 781
Reputation: 514
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)  # this gives you a list of DataFrames, one per HTML table on the page
print(df[3])
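If you are not sure which position your table sits at, a quick way to check (just a sketch; index 3 is specific to this page's current layout) is to look at the size of each DataFrame in the list and preview the likely candidate:
import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
tables = pd.read_html(url)  # one DataFrame per <table> element pandas could parse

# Print the position and shape of every table found, then preview the candidate
for i, t in enumerate(tables):
    print(i, t.shape)
print(tables[3].head())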
Upvotes: 2
Reputation: 15568
The way you used .read_html returns a list of all the tables on the page. Your table is at index 3:
import pandas as pd
url = "http://goldpricez.com/gold/history/lkr/years-3"
df = pd.read_html(url)[3]
print(df)
.read_html makes the request to the URL and parses the response with an HTML parser (lxml or BeautifulSoup) under the hood. You can change the parser, match a table by name, or pass header just as you would with .read_csv. Check the .read_html documentation for more details.
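For example, here is a minimal sketch of passing a couple of those options; the match string below is only an illustrative guess at text that appears in the target table, not something confirmed from the page:
import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
# match keeps only tables whose text contains the given string/regex,
# header says which row to use for column names (as in .read_csv).
# Note: read_html raises ValueError if no table matches.
tables = pd.read_html(url, match="Gold", header=0)
print(len(tables), "matching table(s)")
print(tables[0].head())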
For speed, use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. By default pandas tries lxml first and falls back to html5lib (via BeautifulSoup) if lxml is not installed; html5lib is the second slowest flavor, and html.parser is the slowest of them all.
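If you want to check the speed difference on your own machine, here is a rough sketch; it assumes both lxml and html5lib are installed, and it fetches the page once so the comparison measures parsing rather than the network:
import time
import urllib.request
from io import StringIO

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")  # fetch once

# Parse the same HTML with each flavor and time it
for flavor in ("lxml", "html5lib"):
    start = time.perf_counter()
    pd.read_html(StringIO(html), flavor=flavor)
    print(flavor, f"{time.perf_counter() - start:.3f}s")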
Upvotes: 0
Reputation: 53
Use BeautifulSoup for this; the code below works.
import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

data = s.find_all("td")  # every table cell on the page
data = data[11:]         # skip the leading cells that are not part of the price table
for i in range(0, len(data), 2):
    # cells alternate date, price, date, price, ...
    print(data[i].text.strip(), " ", data[i + 1].text.strip())
Another advantage of using BeautifulSoup is that it is much faster than your original code.
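If you end up wanting the result as a DataFrame anyway, one possible follow-up (a sketch; the column names are my own labels, not taken from the page) is to pair up the cells and hand them to pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

cells = [td.text.strip() for td in s.find_all("td")][11:]  # same slice as above
rows = list(zip(cells[0::2], cells[1::2]))                 # (date, price) pairs
df = pd.DataFrame(rows, columns=["date", "price_lkr"])     # hypothetical column names
print(df.head())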
Upvotes: 0