The Novice

Reputation: 144

Python: Scrape Data from Web after Inputting Info

Could anyone help me revise this Python program so that it correctly submits information to the "Date Range" query and then extracts the returned "Close" data? I am scraping data from the following URL:

http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices

And this is my current code, which returns "[ ]".

from lxml import html
import requests


def historic_quotes(symbol, stMonth, stDate, stYear, enMonth, enDate, enYear):
    url = 'https://finance.yahoo.com/q/hp?s=%s+Historical+Prices' % (symbol)

    form_data = {
        'a': stMonth,  #00 is January, 01 is Feb., etc.
        'b': stDate,
        'c': stYear,
        'd': enMonth,  #00 is January, 01 is Feb., etc.
        'e': enDate,
        'f': enYear,
        'submit': 'submit',
    }
    response = requests.post(url, data=form_data)

    tree = html.document_fromstring(response.content)
    p = tree.xpath('//*[@id="yfncsumtab"]/tbody/tr[2]/td[1]/table[4]/tbody/tr/td/table/tbody/tr[2]/td[7]/text()')
    print(p)

historic_quotes('baba',00,11,2010,00,11,2012)

I am an overall Python novice, and greatly appreciate any and all help. Thanks for reading!

Also, I realize now that the HTML source may be of help, but it is huge, so here's an XPath to the relevant part:

//*[@id="daterange"]/table

The expected output is a list of the "Close" values for the different dates. As previously stated, the current output is just "[ ]". I believe something may be incorrect in the form_data, perhaps the "submit" entry.

Upvotes: 1

Views: 208

Answers (2)

alecxe

Reputation: 474171

The main issue was that you needed to make a GET request, not a POST.

Plus, @Paul Lo is right about the date ranges. For the sake of example, I'm querying from 2010 to 2015.

Also, you have to pass the query parameters as strings. The literal 00 evaluates to the integer 0, and requests converts the integer 0 to the string "0". As a result, the month was sent as 0 instead of 00.
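To see the difference between 00 and '00' without touching the network, here is a minimal sketch using the standard library's urlencode. I am assuming requests encodes simple query parameters in a comparable way; build_query is just a made-up helper name for this illustration.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_query(month):
    # Hypothetical helper: encode a single 'a' (start month) parameter
    # the way it would appear in the URL's query string.
    return urlencode({'a': month})


# The integer literal 00 is just 0, so the leading zero is lost.
as_int = build_query(00)

# A string keeps its leading zero.
as_str = build_query('00')

print(as_int)
print(as_str)
```

The first call yields a=0 and the second a=00, which is why the call in the fixed version passes every value as a string.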

Here is a fixed version with a modified part that gets the amounts:

from lxml import html
import requests

def historic_quotes(symbol, stMonth, stDate, stYear, enMonth, enDate, enYear):
    url = 'https://finance.yahoo.com/q/hp?s=%s+Historical+Prices' % symbol

    params = {
        'a': stMonth,
        'b': stDate,
        'c': stYear,
        'd': enMonth,
        'e': enDate,
        'f': enYear,
        'submit': 'submit',
    }
    response = requests.get(url, params=params)

    tree = html.document_fromstring(response.content)
    for amount in tree.xpath('//table[@class="yfnc_datamodoutline1"]//tr[td[@class="yfnc_tabledata1"]]//td[5]/text()'):
        print(amount)

historic_quotes('baba', '00', '11', '2010', '00', '11', '2015')

Prints:

105.95
105.95
105.52
108.77
110.65
109.25
109.02
105.77
104.70
105.11
104.97
103.88
107.48
105.07
107.90
...
90.57
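If typing zero-based months by hand feels error-prone, the six date parameters can be built from ordinary dates instead. This is only a sketch: date_range_params is a hypothetical helper, and it assumes the convention from the question's comments (00 is January, 01 is February, and so on) with every value sent as a string.

```python
from datetime import date


def date_range_params(start, end):
    """Build the a-f date parameters as strings from two date objects."""
    return {
        'a': '%02d' % (start.month - 1),  # zero-based month: January -> '00'
        'b': str(start.day),
        'c': str(start.year),
        'd': '%02d' % (end.month - 1),    # zero-based month: January -> '00'
        'e': str(end.day),
        'f': str(end.year),
    }


# Same range as the example call: 11 January 2010 to 11 January 2015.
params = date_range_params(date(2010, 1, 11), date(2015, 1, 11))
print(sorted(params.items()))
```

The resulting dictionary holds the same six values as the call above; the 'submit' entry would still need to be added before passing it as params to requests.get.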

Upvotes: 2

Paul Lo

Reputation: 6148

I doubt that Alibaba (BABA) has data from 2010/1/11 to 2012/1/11, since its IPO was only recently. You might need to check the raw data in response.content first, and then try changing the range, e.g. historic_quotes('baba',00,11,2014,00,11,2015)

Upvotes: 1
