Reputation: 11
I'm trying to scrape some data with Python and BeautifulSoup. I know how to get the text from the script tag. The data between [ ] is valid JSON.
<script>
dataLayer =
[
{
"p":{
"t":"text1",
"lng":"text2",
"vurl":"text3"
},
"c":{ },
"u":{ },
"d":{ },
"a":{ }
}
]
</script>
I've read this response and it almost does what I want: Extract content of <Script with BeautifulSoup
Here is my code:
import urllib.request
from bs4 import BeautifulSoup
import json
url = "https://www.example.com"  # urlopen needs a scheme such as https://
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
raw_data = soup.find("script")
I would then ideally do:
json_dict = json.loads(raw_data)
And access the data through the dictionary. But this is not working because of
"<script> dataLayer ="
preceding the valid JSON, and the closing </script> tag at the end. I've tried trimming the raw_data as a string, like this:
raw_data[20:]
But this didn't work because soup.find() returns a Tag object, not a string.
How can I get the raw_data variable to contain ONLY the text between the square brackets [ ]?
EDIT: this seems to work. It avoids regex and solves the problem of the trailing chars as well. Thanks for your suggestions.
url = "https://www.example.com"  # urlopen needs a scheme such as https://
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# get the script tag data and convert soup into a string
data = str(soup.find("script"))
# cut the <script> tag and some other things from the beginning and end to get valid JSON
cut = data[27:-13]
# load the data as a json dictionary
jsoned = json.loads(cut)
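The fixed offsets in data[27:-13] only hold while the whitespace around the tag stays exactly the same. A minimal sketch of a less brittle variant, assuming the page contains a single [...] block in that script tag: locate the first "[" and the last "]" instead of counting characters. The script_html sample and the extract_json_array helper are hypothetical names used for illustration; script_html stands in for str(soup.find("script")).

```python
import json

# Hypothetical sample standing in for str(soup.find("script"))
script_html = """<script>
dataLayer =
[
{
"p":{
"t":"text1",
"lng":"text2",
"vurl":"text3"
},
"c":{ },
"u":{ },
"d":{ },
"a":{ }
}
]
</script>"""


def extract_json_array(text):
    """Parse the JSON array spanning the first '[' to the last ']' in text."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no [...] block found")
    # Include both brackets so json.loads receives a complete array
    return json.loads(text[start:end + 1])


data_layer = extract_json_array(script_html)
print(data_layer[0]["p"]["t"])
```

This still assumes nothing after the array contains a "]", so it is a sketch rather than a general JavaScript parser.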
Upvotes: 0
Views: 2062
Reputation: 827
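You can do this with a regex: write a pattern that matches only the text between the square brackets [ ] and run it against the script's text. Note that passing a compiled pattern to soup.find_all() matches tag names, not tag contents, so the pattern has to be applied to the string itself. re.DOTALL lets the match span multiple lines.
>>> import re
>>> match = re.search(r"\[.*\]", soup.find("script").string, re.DOTALL)
>>> json_dict = json.loads(match.group(0))
Related: common regex usage within BeautifulSoup, and a regex to extract text from between square brackets.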
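As a rough end-to-end sketch of the regex approach, assuming the script text holds one JSON array: the pattern is applied to the script's text rather than to the soup. Here script_text is a hypothetical sample standing in for soup.find("script").string.

```python
import json
import re

# Hypothetical sample standing in for soup.find("script").string
script_text = """
dataLayer =
[
{
"p":{
"t":"text1",
"lng":"text2",
"vurl":"text3"
},
"c":{ }
}
]
"""

# Greedy match with DOTALL: from the first "[" to the last "]", across newlines
match = re.search(r"\[.*\]", script_text, re.DOTALL)
if match is None:
    raise ValueError("no [...] block found in the script text")

json_list = json.loads(match.group(0))  # the brackets are part of the match
json_dict = json_list[0]                # the array holds a single object here
print(json_dict["p"]["lng"])
```

The greedy .* is deliberate: a non-greedy .*? would stop at the first "]", which breaks as soon as the data contains a nested array.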
Upvotes: 0
Reputation: 19154
Use .text to get the content inside the <script> tag, then remove the "dataLayer =" prefix:
raw_data = soup.find("script")
raw_data = raw_data.text.replace('dataLayer =', '')
json_dict = json.loads(raw_data)
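The replace() call works as long as the script contains nothing but that one assignment. If the real page adds a trailing semicolon or further statements, json.loads will fail on the extra text. One possible workaround, sketched here under that assumption, is json.JSONDecoder().raw_decode, which parses a single JSON value from a given position and ignores whatever follows. The script_text sample and the parse_after_assignment helper are hypothetical; script_text stands in for soup.find("script").text.

```python
import json

# Hypothetical sample standing in for soup.find("script").text,
# with trailing JavaScript that would break a plain json.loads
script_text = """
dataLayer =
[
{"p": {"t": "text1", "lng": "text2", "vurl": "text3"}, "c": {}}
];
console.log("loaded");
"""


def parse_after_assignment(text, name="dataLayer"):
    """Parse the JSON array that follows `name` in text, ignoring trailing code."""
    pos = text.index(name)          # raises ValueError if the name is missing
    start = text.index("[", pos)    # raw_decode must start exactly on the value
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


json_list = parse_after_assignment(script_text)
print(json_list[0]["p"]["t"])
```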
Upvotes: 1