claudiaann1

Reputation: 237

Scrape webpage after form fill

I'm trying to scrape the response from this website using a pre-filled zip code: zipwho.com (i.e. the zip code is already filled in). I tried to do this using the scrapy shell as follows:

scrapy shell 'http://zipwho.com/?zip=77098&mode=zip'

but the response does not contain the form-filled page, only the content of the main zipwho.com page, without the details specific to that zip code. I also tried filling in the form using requests and lxml, but clearly I am doing something wrong.

import requests
import lxml.html as lh
url = 'http://zipwho.com'

form_data = {
    'zip': '77098'
    }
response = requests.post(url, data=form_data)
tree = lh.document_fromstring(response.content)
tree.xpath('//td[@class="keysplit"]')

and the table element for the data (td where class = 'keysplit') still does not exist. If you have ideas for getting this working (ideally with something simple like requests and lxml), that would be best.

Upvotes: 2

Views: 435

Answers (3)

claudiaann1

Reputation: 237

With thanks and a bit of both previous answers, a fully functioning solution is as follows:

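import requests
import lxml.html as lh

url = 'http://zipwho.com/?zip=77098&mode=zip'
# The page is fetched with GET; nothing is posted to it.
response = requests.get(url)

tree = lh.document_fromstring(response.content)

# The data is a string literal inside the script that defines getData().
scriptText = tree.xpath("//script[contains(., 'function getData()')]")[0].text

# The string holds a header row and a data row separated by a literal "\n".
splitVals = scriptText.split('"')[1].split('\\n')

if len(splitVals) >= 2:
    headers = splitVals[0].split(',')
    data = splitVals[1].split(',')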

Upvotes: 0

Padraic Cunningham

Reputation: 180411

The data is inside a script tag, which you can parse with a regex. Your approach will not work in scrapy or with requests because nothing is posted to the page: the data is retrieved with a GET request whose parameters are mode and zip. A working example:

import requests
import lxml.html as lh
import re

url = 'http://zipwho.com'

params = {
    'zip': '77098',
    "mode":"zip"
    }
response = requests.get(url, params=params)
tree = lh.document_fromstring(response.content)
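# Use // so the script tag is found anywhere in the document, not only at the root.
script = tree.xpath("//script[contains(., 'function getData()')]//text()")[0]
# The data is the first double-quoted string in the script.
data = re.search(r'"(.*?)"', script).group(1)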
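
To see why a GET with these params matches the URL from the question, here is a small offline sketch using only the standard library. It builds the query string by hand instead of calling requests, so the exact URL that requests generates is an assumption here; the point is only that the params dict and the query string in the question carry the same information.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'http://zipwho.com'
params = {'zip': '77098', 'mode': 'zip'}

# Build the URL that a GET request with these params would ask for.
full_url = base_url + '/?' + urlencode(params)
print(full_url)

# Parsing the query string back gives the same parameters.
query = parse_qs(urlsplit(full_url).query)
print(query)
```

Nothing goes into a request body here, which is why posting form data to the bare URL returns only the main page.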

Upvotes: 2

kmaork

Reputation: 6012

The reason that you can't find this data in the HTML is that it's generated dynamically with a script. If you look at the first script in the HTML, you'll see a function called getData that contains the data that you want. Another script later uses this function to build what you see in your browser.

So to scrape this data I'd just extract it directly from the script: get the string that the function returns, split it on commas, and so on.
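
A rough sketch of that extraction, assuming the string has a header row and a data row separated by a literal \n (as the other answers here suggest). The helper name parse_get_data and the sample script are made up for illustration; the sample values are placeholders, not real zipwho.com output.

```python
import re

def parse_get_data(script_text):
    """Return {header: value} from the first double-quoted string in script_text."""
    match = re.search(r'"(.*?)"', script_text)
    if match is None:
        return {}
    # The JavaScript source holds a backslash followed by "n", not a real newline.
    rows = match.group(1).split('\\n')
    if len(rows) < 2:
        return {}
    headers = rows[0].split(',')
    values = rows[1].split(',')
    return dict(zip(headers, values))

# Hypothetical script text with placeholder data, only to show the shape.
sample_script = r'function getData() { return "zip,city,state\n00000,Sampletown,XX"; }'
record = parse_get_data(sample_script)
print(record)
```

Pairing headers with values in a dict makes it easier to look up a field by name than indexing into two parallel lists.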

Good luck!

Upvotes: 1
