Reputation: 933
I am trying to extract some links for a specific filehoster from the watchseriesfree.to website. In this case I want the rapidvideo links, so I use a regex to filter for the tags whose text contains rapidvideo:
import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html

def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"
    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))
    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
    return firstVod

print(findLatest())
However, the above code returns an empty list. What am I doing wrong?
Upvotes: 3
Views: 943
Reputation: 473873
The problem is here:
firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
When BeautifulSoup applies your text regex, it matches it against the .string attribute of every tr element. The .string attribute has an important caveat: when an element has more than one child, .string is None:

"If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None."

Hence, you get no results.
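To see the caveat in action, here is a minimal sketch (the HTML snippet is made up for illustration) showing that .string is None for a tr with several children, which is why the text filter never matches:

from bs4 import BeautifulSoup

html = """
<table>
  <tr><td><a href="http://example.com/xyz">rapidvideo</a></td><td>HD</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')
print(row.string)      # None - the <tr> has more than one child
print(row.get_text())  # 'rapidvideoHD' - the combined text is still there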
What you can do instead is check the actual text of the tr elements, using a searching function that calls .get_text():
soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
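For completeness, here is a minimal sketch of the question's findLatest with that fix plugged in (kept on Python 2 / urllib2 to match the original; pulling the hoster URL from the first <a href> in each matched row is an assumption about the page layout, so adjust as needed):

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    return urllib2.urlopen(req).read()

def findLatest():
    head = "https://watchseriesfree.to"
    soup = BeautifulSoup(gethtml(head + "/serie/Madam-Secretary"), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))
    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')

    # match rows whose combined text mentions the hoster name
    rows = soup.find_all(lambda tag: tag.name == 'tr'
                                     and 'rapidvideo' in tag.get_text())

    # assumption: each matched row carries the hoster link in an <a href=...> tag
    return [row.find('a', href=True)['href'] for row in rows
            if row.find('a', href=True)]

print(findLatest())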
Upvotes: 6