Kapilan Navaratnam

Reputation: 33

Web scraping for a specific word and a few words before and after it

import bs4 as bs
import urllib.request
import re

sauce = urllib.request.urlopen('url').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.text)

test = soup.findAll(text=re.compile('risk'))
print(test)

I am looking for a specific word, 'risk', within a paragraph. Can someone help me write code that checks whether the word exists within the paragraph and, if it does, extracts the 6 words before and after the keyword? Thanks in advance.

Upvotes: 0

Views: 3088

Answers (2)

Bitto

Reputation: 8255

I think this solution should work. It still gives you output when there are fewer than 6 words before or after the keyword in the string. It also matches 'risk' as a whole word, so it won't match something like 'risky'.

You'll have to make some modifications to fit your use case.

from bs4 import BeautifulSoup
import urllib.request
import re

url = 'https://www.investing.com/analysis/2-reasons-merck-200373488'
# Send a browser-like User-Agent with the request
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
sauce = urllib.request.urlopen(req).read()
soup = BeautifulSoup(sauce, 'html.parser')

# Matches 'Risk', 'risk.' and 'risk ' but NOT 'risky'
pattern = re.compile(r'risk[. ]', re.IGNORECASE)
no_of_words = 6

for elem in soup(text=pattern):
    # Avoid shadowing the built-ins str and list
    text = elem.parent.text
    words = text.split(' ')
    # Append ' ' to each word so that a bare 'risk' conforms to the pattern
    indices = [i for i, x in enumerate(words) if pattern.match(x.strip() + ' ')]
    for index in indices:
        start = max(index - no_of_words, 0)
        end = index + no_of_words + 1  # slicing past len(words) is safe
        print(' '.join(words[start:end]).strip())
        print("List of Word Before: ", words[start:index])  # words before
        print("List of Words After: ", words[index + 1:end])  # words after
        print()

Output

Risk Warning
List of Word Before:  []
List of Words After:  ['Warning']

Risk Disclosure:
List of Word Before:  []
List of Words After:  ['Disclosure:']

Risk Disclosure: Trading in financial instruments and/or
List of Word Before:  []
List of Words After:  ['Disclosure:', 'Trading', 'in', 'financial', 'instruments', 'and/or']

cryptocurrencies involves high risks including the risk of losing some, or all, of
List of Word Before:  ['cryptocurrencies', 'involves', 'high', 'risks', 'including', 'the']
List of Words After:  ['of', 'losing', 'some,', 'or', 'all,', 'of']

investment objectives, level of experience, and risk appetite, and seek professional advice where
List of Word Before:  ['investment', 'objectives,', 'level', 'of', 'experience,', 'and']
List of Words After:  ['appetite,', 'and', 'seek', 'professional', 'advice', 'where']

investment objectives, level of experience, and risk appetite, and seek professional advice where
List of Word Before:  ['investment', 'objectives,', 'level', 'of', 'experience,', 'and']
List of Words After:  ['appetite,', 'and', 'seek', 'professional', 'advice', 'where']
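
The matching and windowing step can also be tried on a plain string, without fetching a page. The following is only a sketch and not the code above: `context_windows`, its parameters, and the sample sentence are made-up names for illustration. It assumes a token counts as the keyword once the punctuation around it is stripped, so 'risk,' and 'Risk:' would count while 'risky' and 'risks' would not.

```python
import string


def context_windows(text, keyword="risk", n=6):
    """Return (before, match, after) word lists for every whole-word hit."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation around the token.
        if word.strip(string.punctuation).lower() == keyword.lower():
            start = max(0, i - n)
            hits.append((words[start:i], word, words[i + 1:i + 1 + n]))
    return hits


# Hypothetical sample text, loosely based on the output shown above.
sample = ("Trading involves high risks including the risk of losing "
          "some, or all, of your money.")
for before, match, after in context_windows(sample):
    print(" ".join(before + [match] + after))
```

In the scraper, the same function could be fed `elem.parent.text` for each matching element, if this looser notion of a whole word suits your data.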

Upvotes: 1

chitown88

Reputation: 28650

Here's a quick example. Note that I didn't account for the situation where there are fewer than 6 words before or after the keyword, but this should give you the general idea.

from bs4 import BeautifulSoup
import requests
import re

key_word = 'risk'
url = 'https://www.investing.com/analysis/2-reasons-merck-200373488'

with requests.Session() as s:
    s.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    # Make the request while the session is still open
    response = s.get(url)

soup = BeautifulSoup(response.text, "html.parser")

paragraphs = soup.find_all(text=re.compile(key_word))

if len(paragraphs) == 0:
    print('"%s" not found.' % key_word)

else:
    for paragraph in paragraphs:
        #print (paragraph.strip())
        alpha = paragraph.strip().split(' ')

        try:
            # list.index raises ValueError when the exact token is absent
            idx = alpha.index(key_word)
        except ValueError:
            continue

        # Six words before the keyword, the keyword itself, and six words after
        six_words = alpha[idx-6: idx] + alpha[idx: idx+7]
        print(' '.join(six_words) + '\n')

Output:

cryptocurrencies involves high risks including the risk of losing some, or all, of

investment objectives, level of experience, and risk appetite, and seek professional advice where
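
One way to cover the case this answer leaves open (fewer than six words before the keyword) is to clamp the start of the slice at zero, because a negative start index in Python counts from the end of the list. The sketch below uses a hypothetical helper named `window` and a made-up sentence; it also loops over every occurrence, whereas `list.index` reports only the first one.

```python
def window(words, idx, n=6):
    """Slice up to n words on either side of words[idx]."""
    start = max(0, idx - n)  # a negative start would count from the end
    return words[start:idx + n + 1]  # slicing past the end is safe


# Hypothetical sentence with the keyword near both ends.
words = "the risk of losing some or all of your money is a real risk".split()
for idx, word in enumerate(words):
    if word == "risk":
        print(" ".join(window(words, idx)))
```

The end of the slice needs no clamping, since Python truncates a slice that runs past the end of the list.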

Upvotes: 0
