Xonshiz
Xonshiz

Reputation: 1367

search a specific word in BeautifulSoup python

I am trying to make a python script that reads crunchyroll's page and gives me the ssid of the subtitle.

For example :- http://www.crunchyroll.com/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035

Go to the source code and look for ssid,I want to extract the numbers after ssid of this element

 <a href="/i-cant-understand-what-my-husband-is-saying/episode-1-wriggling-memories-678035?ssid=154757" title="English (US)">English (US)</a>

I want to extract "154757", but I can't seem to get my script working

This is my current script:

import feedparser
import re
import urllib2
from urllib2 import urlopen
from bs4 import BeautifulSoup


feed = feedparser.parse('http://www.crunchyroll.com/rss/anime')
url1 = feed['entries'][0]['link']
soup = BeautifulSoup(urlopen(url1), 'html.parser')

How can I modify my code to search and extract that particular number?

Upvotes: 0

Views: 1214

Answers (1)

serk
serk

Reputation: 4389

This should get you started with being able to extract the ssid for each entry. Note that some of those link don't have any ssid so you'll have to account for that with some error catching. No need for re or the urllib2 modules here.

import feedparser
import requests
from bs4 import BeautifulSoup


d = feedparser.parse('http://www.crunchyroll.com/rss/anime')
for url in d.entries:
    #print url.link
    r = requests.get(url.link)
    soup = BeautifulSoup(r.text)
    #print soup
    subtitles = soup.find_all('span',{'class':'showmedia-subtitle-text'})
    for ssid in subtitles:
        x = ssid.findAll('a')
        for a in x:
            print a['href']

Output:

--snip--
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166035
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165817
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165819
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166783
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165839
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=165989
/i-cant-understand-what-my-husband-is-saying/episode-12-baby-skip-beat-678057?ssid=166051
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166011
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165995
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165997
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166033
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=165825
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166013
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166009
/urawa-no-usagi-chan/episode-11-if-i-retort-i-lose-678873?ssid=166003
/etotama/episode-11-catrat-shuffle-678659?ssid=166007
/etotama/episode-11-catrat-shuffle-678659?ssid=165969
/etotama/episode-11-catrat-shuffle-678659?ssid=166489
/etotama/episode-11-catrat-shuffle-678659?ssid=166023
/etotama/episode-11-catrat-shuffle-678659?ssid=166015
/etotama/episode-11-catrat-shuffle-678659?ssid=166049
/etotama/episode-11-catrat-shuffle-678659?ssid=165993
/etotama/episode-11-catrat-shuffle-678659?ssid=165981
--snip--

There are more but I left them out for brevity. From these results you should be able to easily parse out the ssid with some slicing since it looks like the ssid are all 6 digits long. Doing something like:

print a['href'][-6:]

would do the trick and get you just the ssid.

Upvotes: 1

Related Questions