Reputation: 3
In a given .html page, I have a script tag like the one below. How can I use Beautiful Soup to extract the string returned by "function getData()"?
<script>
function getData()
{
return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank\n10452,Bronx,NY,20606,2,147.7,74";
}
function getResultsCount()
{
return "1";
}
</script>
Upvotes: 0
Views: 2740
Reputation: 473763
One way, arguably the simplest, is to use a regular expression to both locate the element and to extract the desired string:
import re
from bs4 import BeautifulSoup
data = """
<script>
function getData()
{
return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank\n10452,Bronx,NY,20606,2,147.7,74";
}
function getResultsCount()
{
return "1";
}
</script>
"""
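soup = BeautifulSoup(data, "html.parser")
# Capture the first return "..."; statement. DOTALL lets the match span the newline inside the string.
pattern = re.compile(r'return "(.*?)";$', re.MULTILINE | re.DOTALL)
# string= is the current name of the older text= argument (Beautiful Soup 4.4.0+).
script = soup.find("script", string=pattern)
print(pattern.search(script.string).group(1))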
Prints:
zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank
10452,Bronx,NY,20606,2,147.7,74
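Note that the pattern above captures the first return statement in the script, which only happens to belong to getData() here. If the function order is not guaranteed, a pattern anchored on the function name may be safer. The following is a minimal sketch using only the standard library. It assumes the script text has already been pulled out of the tag (for example via script.string), that the returned string contains no escaped double quotes, and that in the real page the \n arrives as a literal backslash followed by n. The helper names extract_get_data and parse_rows are made up for illustration.

```python
import csv
import io
import re

# Script text as it might come out of the <script> tag. A raw string is used
# so that \n stays a literal backslash + n, as it would in the page source.
script_text = r'''
function getData()
{
return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank\n10452,Bronx,NY,20606,2,147.7,74";
}
function getResultsCount()
{
return "1";
}
'''

# Anchor on the function name so another function's return is never matched.
GET_DATA_RE = re.compile(
    r'function\s+getData\s*\(\s*\)\s*\{\s*return\s+"(.*?)"\s*;',
    re.DOTALL,
)


def extract_get_data(text):
    """Return the string literal returned by getData(), or None if absent."""
    match = GET_DATA_RE.search(text)
    if match is None:
        return None
    # Turn the JavaScript escape sequence \n into a real newline.
    return match.group(1).replace("\\n", "\n")


def parse_rows(csv_text):
    """Parse the extracted CSV text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


csv_text = extract_get_data(script_text)
rows = parse_rows(csv_text)
print(rows[0]["city"], rows[0]["MedianIncome"])
```

All values come back as strings, so convert them (for example int(rows[0]["MedianIncome"])) if you need numbers.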
Alternatively, you can use a JavaScript parser such as slimit (example here).
Upvotes: 1