Reputation: 53
I'm extracting iframes from a website using BeautifulSoup:
iframes = soup.find_all('iframe')
I want to find all the src attributes in those iframes that contain 2 or 3 given words.
Let's say I have a src link that looks like this: "https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules"
I know how to extract the links that contain the word "xyz":
srcs = []
iframes = soup.find_all('iframe')
for iframe in iframes:
    try:
        if iframe['src'].find('xyz') >= 0:
            srcs.append(iframe['src'])
    except KeyError:
        continue
My question is how to extract all the links that contain 2 words, like "xyz" and "true", or 3 words. It should work like a filter: if the words don't all appear in a link, don't scrape it.
Upvotes: 1
Views: 202
Reputation: 7238
You can use a custom function to check whether the src attribute contains all the words you want.
For example, you can use something like this:
soup.find_all('iframe', src=lambda s: all(word in s for word in ('xyz', 'true')))
Demo:
from bs4 import BeautifulSoup

html = '''
<iframe src="https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules">...</iframe>
<iframe src="foo">...</iframe>
<iframe src="xyz">...</iframe>
<iframe src="xyz.true">...</iframe>
'''
soup = BeautifulSoup(html, 'html.parser')
iframes = soup.find_all('iframe', src=lambda s: all(word in s for word in ('xyz', 'true')))
print(iframes)
Output:
[<iframe src="https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules">...</iframe>, <iframe src="xyz.true">...</iframe>]
Note:
If any of the <iframe> tags does not contain a src attribute, the above function will raise an error. In that case, change the function to:
soup.find_all('iframe', src=lambda s: s and all(word in s for word in ('xyz', 'true')))
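If you need the same filter for different sets of words (2 words in one place, 3 in another), the lambda could be wrapped in a small factory. This is only a sketch: contains_all is a hypothetical helper name, not part of BeautifulSoup, and the example runs the predicate on plain strings so it works without any parsing.

```python
def contains_all(*words):
    """Build a predicate that is True only if a string contains every word."""
    def predicate(value):
        # The value may be None when a tag has no src attribute,
        # so guard against it before doing the substring checks.
        return value is not None and all(word in value for word in words)
    return predicate


# Hypothetical filters for the 2-word and 3-word cases from the question.
has_two = contains_all('xyz', 'true')
has_three = contains_all('xyz', 'true', '72p')

link = 'https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules'
print(has_two(link))        # True
print(has_three(link))      # True
print(has_two('xyz'))       # False, "true" is missing
print(has_three('xyz.true'))  # False, "72p" is missing
print(has_two(None))        # False, no src attribute
```

The predicate can then be passed the same way as the lambda above, i.e. soup.find_all('iframe', src=contains_all('xyz', 'true')), and the links themselves collected with [iframe['src'] for iframe in ...] over the result.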
Upvotes: 0