Reputation: 1367
I am trying to write a Python script that reads a Crunchyroll page and gives me the ssid of the subtitle.
For example: http://www.crunchyroll.com/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035
If you view the page source and search for ssid, you will find elements like the one below. I want to extract the number after ssid from this element:
<a href="/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035?ssid=154757" title="English (US)">English (US)</a>
I want to extract "154757", but I can't get my script to work.
This is my current script:
import feedparser
import re
import urllib2
from urllib2 import urlopen
from bs4 import BeautifulSoup
feed = feedparser.parse('http://www.crunchyroll.com/rss/anime')
url1 = feed['entries'][0]['link']
soup = BeautifulSoup(urlopen(url1), 'html.parser')
How can I modify my code to search and extract that particular number?
Upvotes: 0
Views: 1214
Reputation: 4389
This should get you started on extracting the ssid for each entry. Note that some of those links don't have any ssid, so you'll have to account for that with some error handling. There is no need for the re or urllib2 modules here.
import feedparser
import requests
from bs4 import BeautifulSoup

d = feedparser.parse('http://www.crunchyroll.com/rss/anime')
for entry in d.entries:
    # Fetch the episode page linked from each feed entry
    r = requests.get(entry.link)
    soup = BeautifulSoup(r.text, 'html.parser')
    # The subtitle links are inside <span class="showmedia-subtitle-text">
    subtitles = soup.find_all('span', {'class': 'showmedia-subtitle-text'})
    for span in subtitles:
        for a in span.find_all('a'):
            print(a['href'])
Output:
--snip--
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165817
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165819
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166783
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165839
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165989
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166051
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166011
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165995
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165997
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166033
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165825
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166013
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166009
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166003
/etotama/episode-11-catrat-shuffle-678659?ssid=166007
/etotama/episode-11-catrat-shuffle-678659?ssid=165969
/etotama/episode-11-catrat-shuffle-678659?ssid=166489
/etotama/episode-11-catrat-shuffle-678659?ssid=166023
/etotama/episode-11-catrat-shuffle-678659?ssid=166015
/etotama/episode-11-catrat-shuffle-678659?ssid=166049
/etotama/episode-11-catrat-shuffle-678659?ssid=165993
/etotama/episode-11-catrat-shuffle-678659?ssid=165981
--snip--
There are more, but I left them out for brevity. From these results you should be able to parse out the ssid with some slicing, since the ssid values all appear to be 6 digits long. Something like:
print(a['href'][-6:])
would do the trick and get you just the ssid.
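Slicing only works while every ssid is exactly six digits long. As a minimal sketch of a more defensive alternative, assuming the links keep the ?ssid=... query-string form seen in the output above, you could parse the query string with the standard library instead. The helper name extract_ssid is hypothetical and not part of any library:

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def extract_ssid(href):
    """Return the ssid query parameter of href as a string, or None if absent."""
    query = urlparse(href).query
    values = parse_qs(query).get('ssid')
    return values[0] if values else None


# Sample hrefs: one taken from the output above, one without any ssid
hrefs = [
    '/etotama/episode-11-catrat-shuffle-678659?ssid=166007',
    '/etotama/episode-11-catrat-shuffle-678659',
]
for href in hrefs:
    ssid = extract_ssid(href)
    if ssid is not None:  # skip links that carry no ssid
        print(ssid)
```

Because the helper returns None when there is no ssid, it also covers the links without one, so no exception handling is needed for that case.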
Upvotes: 1