Reputation: 33
So far I have managed to write this:
from bs4 import BeautifulSoup
import requests

def function():
    url = 'https://dynasty-scans.com/chapters/liar_satsuki_can_see_death_ch28_6#6'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # collect every <script> tag on the page
    script = soup.find_all('script')
    # the second <script> tag is the one that defines "pages"
    print(script[1])

function()
output:
<script>
//<![CDATA[
var pages = [{"image":"/system/releases/000/036/945/1.png","name":"1"},{"image":"/system/releases/000/036/945/2.png","name":"2"},{"image":"/system/releases/000/036/945/3.png","name":"3"},{"image":"/system/releases/000/036/945/4.png","name":"4"},{"image":"/system/releases/000/036/945/5.png","name":"5"},{"image":"/system/releases/000/036/945/6.png","name":"6"},{"image":"/system/releases/000/036/945/7.png","name":"7"},{"image":"/system/releases/000/036/945/credits.png","name":"credits"}];
//]]>
</script>
I'm trying to extract the values of "image" as strings, for example:
"/system/releases/000/036/945/7.png"
How can I do that?
Upvotes: 1
Views: 51
Reputation: 1146
You can use a regular expression to extract the array assigned to the variable "pages", then parse it as JSON:
import json
import re

import requests

url = 'https://dynasty-scans.com/chapters/liar_satsuki_can_see_death_ch28_6#6'
r = requests.get(url)
# extract the array literal assigned to "pages"
# (raw string, so \[ and \] reach the regex engine unchanged)
match = re.search(r'var pages = (\[.*?\]);', r.text).group(1)
# parse the JSON text into a list of dicts
match_json = json.loads(match)
# collect the "image" value of each entry
images = [img['image'] for img in match_json]
output:
['/system/releases/000/036/945/1.png',
'/system/releases/000/036/945/2.png',
'/system/releases/000/036/945/3.png',
'/system/releases/000/036/945/4.png',
'/system/releases/000/036/945/5.png',
'/system/releases/000/036/945/6.png',
'/system/releases/000/036/945/7.png',
'/system/releases/000/036/945/credits.png']
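If you want something a bit more defensive, here is a minimal sketch of the same regex-plus-JSON idea wrapped in a function. It returns an empty list when the page has no "var pages" assignment instead of raising, and it turns the relative paths into absolute URLs with urljoin. The function name extract_image_urls and the shortened SAMPLE_HTML string are made up for illustration, and I'm assuming the images are served from the same host as the chapter page, which the question does not confirm.

```python
import json
import re
from urllib.parse import urljoin

# Hypothetical, shortened sample mirroring the <script> block from the question,
# so the sketch can run without a network request.
SAMPLE_HTML = """<script>
//<![CDATA[
  var pages = [{"image":"/system/releases/000/036/945/1.png","name":"1"},{"image":"/system/releases/000/036/945/credits.png","name":"credits"}];
//]]>
</script>"""

# Non-greedy match up to the first "];" after "var pages = ".
PAGES_RE = re.compile(r'var pages = (\[.*?\]);', re.DOTALL)


def extract_image_urls(html, base_url):
    """Return absolute image URLs from the 'var pages = [...]' array, or [] if it is absent."""
    match = PAGES_RE.search(html)
    if match is None:
        return []
    pages = json.loads(match.group(1))
    # Assumption: image paths are relative to the same host as base_url.
    return [urljoin(base_url, page['image']) for page in pages]


urls = extract_image_urls(SAMPLE_HTML, 'https://dynasty-scans.com/')
print(urls)
```

With a live page you would pass requests.get(url).text as the first argument instead of SAMPLE_HTML.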
Upvotes: 2