kbgo

Reputation: 75

How to modify the user-agent used by pandas.read_html()?

I'm trying to scrape English football stats from various HTML tables on the Transfermarkt website using the pandas.read_html() function.

Example:

import pandas as pd
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
df = pd.read_html(url)

However, this code raises a "ValueError: Invalid URL" error.

I then attempted to fetch the same page using the urllib2.urlopen() function. This time I got an "HTTPError: HTTP Error 404: Not Found". After the usual trial-and-error fault finding, it turns out that urllib2 sends a Python-like User-Agent header to the web server, which I presume the server doesn't recognize.
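For illustration, the default agent can be inspected without making any request. This is a minimal sketch that assumes Python 3, where urllib2's functionality lives in urllib.request; the variable names are only illustrative.

```python
import urllib.request

# build_opener() pre-populates addheaders with urllib's default User-Agent.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
default_agent = default_headers["User-agent"]

# The value has the form "Python-urllib/<version>", which some sites reject.
print(default_agent)
```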

Now, if I modify urllib2's User-Agent and parse the response with BeautifulSoup, I'm able to read the table without a problem.

Example:

from BeautifulSoup import BeautifulSoup
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = r'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
response = opener.open(url)
html = response.read()
soup = BeautifulSoup(html)
table = soup.find("table")

How do I modify the User-Agent header that pandas passes to urllib2 so that read_html() can scrape this website?

Thanks

Upvotes: 6

Views: 15717

Answers (2)

user459872

Reputation: 24827

Starting from pandas 2.1.0, custom headers can be sent with HTTP(S) requests by passing a dictionary of header key-value pairs to the storage_options keyword argument of read_html().

import pandas as pd

headers = {"User-Agent": "Mozilla/5.0"}
df = pd.read_html(
    "http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html",
    storage_options=headers
)

Upvotes: 2

Viktor Kerkez

Reputation: 46636

Currently you cannot. Relevant piece of code:

if _is_url(io): # io is the url
    try:
        with urlopen(io) as url:
            raw_text = url.read()
    except urllib2.URLError:
        raise ValueError('Invalid URL: "{0}"'.format(io))

As you can see, it just passes the URL to urlopen and reads the data, so there is no hook for setting headers. You can file an issue requesting this feature, but I assume you don't have time to wait for it to be resolved, so I would suggest fetching the HTML yourself with a custom User-Agent and then passing the downloaded text to read_html() to load it into a DataFrame.

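import urllib2
import pandas as pd

url = 'http://www.transfermarkt.co.uk/en/premier-league/gegentorminuten/wettbewerb_GB1.html'
# Send a browser-like User-Agent instead of urllib2's default one
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
# Parse the downloaded HTML and keep the first table with the given class
tables = pd.read_html(response.read(), attrs={"class":"tabelle_grafik"})[0]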

Or if you can use requests:

import requests

tables = pd.read_html(requests.get(url,
                                   headers={'User-agent': 'Mozilla/5.0'}).text,
                      attrs={"class":"tabelle_grafik"})[0]
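Note that urllib2 exists only on Python 2. On Python 3 the same idea could look roughly like the sketch below, which uses urllib.request from the standard library. The helper names build_request and fetch_tables are made up for this example, and the fallback to UTF-8 when the server declares no charset is an assumption.

```python
import io
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_tables(url, user_agent="Mozilla/5.0"):
    """Download the page with a custom User-Agent and parse its tables."""
    # Imported here so build_request stays usable without pandas installed.
    import pandas as pd

    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        # Assumption: fall back to UTF-8 if the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset)

    # Wrapping the text in StringIO makes pandas treat it as markup,
    # not as a URL or file path.
    return pd.read_html(io.StringIO(html))
```

Calling fetch_tables(url) should then return a list of DataFrames, one per table found; extra read_html arguments such as attrs could be passed through in the same way as above.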

Upvotes: 10
