jerry9855

Reputation: 3

How to use Beautiful Soup to extract function string in <script> tag from a website?

On this page, how can I use Beautiful Soup to extract the string returned by function getData() in the HTML source? I get this error:

print(pattern.search(script.text).group(1))
AttributeError: 'NoneType' object has no attribute 'text'

import re  # needed for re.compile below
import os, sys, urllib, urllib2
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup


url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip"
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r'return "(.*?)";$', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))

Upvotes: 0

Views: 1724

Answers (1)

roy

Reputation: 97

I tried it on my machine (with requests instead of urllib2) and got this:

print(script)
>>> None

This is why you get:

AttributeError: 'NoneType' object has no attribute 'text'

I'm not sure what your regex is trying to achieve, but check it again: soup.find() returned None, which means no <script> tag matched the pattern. Try testing the pattern first on the string you expect to get.
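A minimal sketch of that kind of test, run against a made-up sample rather than the live page. The sample text and the variable names are assumptions for illustration. It shows one way this pattern can miss: with re.MULTILINE, `$` matches only right before "\n" or at the end of the string, so Windows-style "\r\n" line endings defeat it. I have not confirmed that this is what the site actually sends.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'return "(.*?)";$', re.MULTILINE | re.DOTALL)

# Hypothetical sample of the script body, with Unix line endings.
sample_lf = 'function getData()\n{\n    return "zip,city";\n}\n'
# The same text with Windows line endings.
sample_crlf = sample_lf.replace("\n", "\r\n")

match_lf = pattern.search(sample_lf)
match_crlf = pattern.search(sample_crlf)

# Guard against None instead of calling .group() blindly.
print(match_lf.group(1) if match_lf else None)
print(match_crlf.group(1) if match_crlf else None)
```

If the pattern fails on the sample, loosen it (for example, drop the trailing `$`) before pointing it at the real page.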

Edit: try this:

url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip"
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script")
print(script.text)

The output:

function getData()
{
    return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank,MedianMortgageToIncomeRatio,MedianMortgageToIncomeRank,OwnerOccupiedHomesPercent,OwnerOccupiedHomesRank,MedianRoomsInHome,MedianRoomsInHomeRank,CollegeDegreePercent,CollegeDegreeRank,ProfessionalPercent,ProfessionalRank,Population,PopulationRank,AverageHouseholdSize,AverageHouseholdSizeRank,MedianAge,MedianAgeRank,MaleToFemaleRatio,MaleToFemaleRank,MarriedPercent,MarriedRank,DivorcedPercent,DivorcedRank,WhitePercent,WhiteRank,BlackPercent,BlackRank,AsianPercent,AsianRank,HispanicEthnicityPercent,HispanicEthnicityRank\n91709,Chino Hills,CA,78336,96,260.8,93,25.6,92,84.9,81,6.4,90,37.5,87,44.9,88,66693,99,3.3,96,32.3,13,93.6,57,66.9,83,6.3,11,43.7,10,5.4,68,21.0,98,25.6,92";
}

function getResultsCount()
{
    return "1";
}

It's a string:

type(script.text)
>>> <class 'str'>

So now you can match a regex against script.text to get the result you want.
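One way to do that match, as a sketch rather than the only approach. The helper name extract_return and the shortened script_text are made up for illustration; in real use script_text would be script.text. It assumes getData() takes no arguments and returns a single double-quoted literal, as in the output above. Note that the "\n" in the printed output is a literal backslash followed by "n" inside the JavaScript string, not a real newline, so the split is on "\\n".

```python
import re

# Shortened stand-in for script.text (the real header and data rows are longer).
script_text = r'''
function getData()
{
    return "zip,city,state,MedianIncome\n91709,Chino Hills,CA,78336";
}

function getResultsCount()
{
    return "1";
}
'''

def extract_return(js_source, function_name):
    """Return the string literal returned by the named JS function, or None."""
    pattern = re.compile(
        r'function\s+' + re.escape(function_name)
        + r'\s*\(\)\s*\{\s*return\s+"(.*?)";',
        re.DOTALL,
    )
    match = pattern.search(js_source)
    return match.group(1) if match else None

raw = extract_return(script_text, "getData")

# Split the header row from the value row on the literal backslash-n.
header_line, value_line = raw.split("\\n")
record = dict(zip(header_line.split(","), value_line.split(",")))

print(record["city"])
print(extract_return(script_text, "getResultsCount"))
```

Anchoring the pattern on the function name avoids picking up the return value of getResultsCount() by accident.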

My code:

import requests
from bs4 import BeautifulSoup

url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip"
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
script = soup.find('script')
print(script.text)

Note that I'm using requests instead of urllib2 (go ahead and try it).

Upvotes: 2
