Mouhcin Iuoanid

Reputation: 53

How to find all links that contain 2 or 3 words in tag using Beautiful Soup in Python

I'm extracting iframes from a website using Beautiful Soup with iframes = soup.find_all('iframe'). I want to find all the src attributes in those iframes that contain 2 or 3 given words. Say a src link looks like this: "https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules". I know how to extract the links that contain the word "xyz":

srcs = []
iframes = soup.find_all('iframe')
for iframe in iframes:
    try:
        # Keep the link only if its src contains 'xyz'
        if iframe['src'].find('xyz') >= 0:
            srcs.append(iframe['src'])
    except KeyError:
        # Skip iframes that have no src attribute
        continue

My question is how to extract all the links that contain 2 words, like "xyz" and "true", or 3 words. It should work like a filter: if those words are not all present in the link, don't scrape it.

Upvotes: 1

Views: 202

Answers (1)

Keyur Potdar

Reputation: 7238

You can use a custom function to check whether the src contains all the words you want.

For example, you can use something like this:

soup.find_all('iframe', src=lambda s: all(word in s for word in ('xyz', 'true')))

Demo:

from bs4 import BeautifulSoup

html = '''
    <iframe src="https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules">...</iframe>
    <iframe src="foo">...</iframe>
    <iframe src="xyz">...</iframe>
    <iframe src="xyz.true">...</iframe>
'''

soup = BeautifulSoup(html, 'html.parser')
iframes = soup.find_all('iframe', src=lambda s: all(word in s for word in ('xyz', 'true')))
print(iframes)

Output:

[<iframe src="https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules">...</iframe>, <iframe src="xyz.true">...</iframe>]

Note:

If any of the <iframe> tags has no src attribute, Beautiful Soup passes None to the function, and the "in" check raises a TypeError. In that case, change the function to:

soup.find_all('iframe', src=lambda s: s and all(word in s for word in ('xyz', 'true')))
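If you need this filter for different sets of words (2 words in one place, 3 in another), it may be easier to build it once with a small helper instead of repeating the lambda. The following is only a sketch: src_has_all is a made-up name, not part of Beautiful Soup, and the check below runs on plain strings so it does not need any parsing.

```python
def src_has_all(*words):
    """Build a filter that accepts a src value only if it contains every word.

    Hypothetical helper, not a Beautiful Soup function.
    """
    def check(src):
        # Beautiful Soup passes None when the tag has no src attribute,
        # so guard against it before doing the substring checks.
        return src is not None and all(word in src for word in words)
    return check


# Quick check on plain strings
has_both = src_has_all('xyz', 'true')
print(has_both('https://xyz.co/embed/TNagkx3oHj8/The.Tale.S001.true.72p.x264-QuebecRules'))
print(has_both('xyz'))
print(has_both(None))
```

The returned function can be passed in place of the lambda, for example soup.find_all('iframe', src=src_has_all('xyz', 'true')), or with a third word added to the argument list.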

Upvotes: 0
