Reputation: 237
I'm trying to scrape the response from this website using a pre-filled zip code: zipwho.com (i.e. the zip code is already filled in). I tried to do this using the scrapy shell as follows:
scrapy shell http://zipwho.com/?zip=77098&mode=zip
However, the response does not contain the form-filled page. It contains only the content of the main zipwho.com page, without the details specific to that zip code. I tried filling in the form information using requests and lxml, but clearly I am doing something wrong:
import requests
import lxml.html as lh
url = 'http://zipwho.com'
form_data = {
'zip': '77098'
}
response = requests.post(url, data=form_data)
tree = lh.document_fromstring(response.content)
tree.xpath('//td[@class="keysplit"]')
The table element for the data (td where class = 'keysplit') still does not exist in the result. If you have ideas for getting this working (ideally with something simple like requests and lxml), that would be best.
Upvotes: 2
Views: 435
Reputation: 237
With thanks and a bit of both previous answers, a fully functioning solution is as follows:
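import requests
import lxml.html as lh
url = 'http://zipwho.com/?zip=77098&mode=zip'
response = requests.post(url)
tree = lh.document_fromstring(response.content)
# The data lives in the script that defines getData()
scriptText = tree.xpath("//script[contains(., 'function getData()')]")[0].text
# The first quoted string holds the rows, separated by a literal backslash-n
splitVals = scriptText.split('"')[1].split('\\n')
if len(splitVals) >= 2:
    headers = splitVals[0].split(',')
    data = splitVals[1].split(',')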
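If you want the header and value lists paired up, something like the following might help. It is only a sketch: `rows_to_record` is a name I made up, and the sample payload is hypothetical (the real column names and values come from the page). It assumes the same layout as above, a header row and a data row separated by a literal backslash-n.

```python
def rows_to_record(payload):
    """Pair the header row with the first data row of a getData()-style string."""
    # The page source contains a backslash followed by "n", not a real newline.
    rows = payload.split("\\n")
    if len(rows) < 2:
        return {}
    headers = rows[0].split(",")
    values = rows[1].split(",")
    return dict(zip(headers, values))


# Hypothetical payload for illustration only.
sample = "zip,city,state\\n77098,Houston,TX"
record = rows_to_record(sample)
print(record["city"])
```

Note that `zip` stops at the shorter list, so a row with missing trailing fields silently yields fewer keys.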
Upvotes: 0
Reputation: 180411
The data is inside a script tag, which you can parse with a regex. Your approach is not going to work in scrapy or with requests, though, because nothing is posted to the page: the data is retrieved with a GET request whose params are mode and zip. A working example:
import requests
import lxml.html as lh
import re
url = 'http://zipwho.com'
params = {
'zip': '77098',
"mode":"zip"
}
response = requests.get(url, params=params)
tree = lh.document_fromstring(response.content)
# Use // so the script tag is matched anywhere in the document
script = tree.xpath("//script[contains(., 'function getData()')]//text()")[0]
data = re.search(r'"(.*?)"', script).group(1)
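To make the GET point concrete, here is a small standard-library sketch of how the two params end up in the query string, which is roughly what requests does with `params=`. The helper name `build_lookup_url` is hypothetical, and the trailing `/?` is simply how I chose to join the pieces.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_lookup_url(zip_code, base="http://zipwho.com"):
    """Build the GET URL carrying the zip and mode parameters."""
    query = urlencode({"zip": zip_code, "mode": "zip"})
    return f"{base}/?{query}"


url = build_lookup_url("77098")
print(url)
# Round-trip the query string back into a dict of lists.
print(parse_qs(urlsplit(url).query))
```

Nothing goes in a request body here, which is why sending form data with a POST never reaches the zip-specific page.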
Upvotes: 2
Reputation: 6012
The reason that you can't find this data in the HTML is that it's generated dynamically with a script. If you look at the first script in the HTML, you'll see a function called getData that contains the data that you want. Another script later uses this function to build what you see in your browser.
So to scrape this data I'd just extract it directly from the script: get the string that the function returns, split it by ',' and so on.
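As a rough sketch of that idea using only the standard library: the script text below is a hypothetical stand-in for what the page contains, and `extract_get_data` is a name I invented. It assumes the first double-quoted string in the script is the one getData returns, with rows separated by a literal backslash-n.

```python
import re


def extract_get_data(script_text):
    """Return the rows of the first quoted string, each split on commas."""
    match = re.search(r'"(.*?)"', script_text)
    if match is None:
        return []
    # Rows are separated by backslash + "n" as written in the JavaScript source.
    return [row.split(",") for row in match.group(1).split("\\n") if row]


# Hypothetical script body; the real one comes from the downloaded page.
sample_script = r'''
function getData()
{
    return "zip,city,state\n77098,Houston,TX";
}
'''
rows = extract_get_data(sample_script)
print(rows)
```

If the real script has another quoted string before the data, the regex would need to be anchored more tightly, for example on the `return` keyword.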
Good luck!
Upvotes: 1