Removing additional data (HTML tags) from output?

Question

I scraped a list of stocks and appended the items to a list, but my bs4 query also picked up extra HTML elements.

Here is my reproducible code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = 'https://bullishbears.com/russell-2000-stocks-list/'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(url,headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page)

divTag = soup.find_all("div", {"class": "thrv_wrapper thrv_text_element"})

stock_list = []
for tag in divTag:
    strongTags = tag.find_all("strong")
    for tag in strongTags:
        for x in tag:      
            stock_list.append(x)

Looking at the resulting list, I'm happy with the format of the stock entries: each ticker is a string, followed by a comma. As you can see, though, I'm also getting other HTML elements mixed in that I want removed.

stock_list =

[RUSSELL 2000 STOCKS LIST,
  We provide you a list of Russell 2000 stocks and companies below. ,
  We provide you a list of Russell 2000 stocks and companies below. ,
  We provide you a list of Russell 2000 stocks and companies below. ,
  We provide you a list of Russell 2000 stocks and companies below,
 . ,
 'List of Russell 2000 Stocks & Updated Chart',
 'IWM',
 <br/>,
 'SPSM',
 <br/>,
 'VTWO',
 '/RTY',
 <br/>,
 '/M2K',
 'AAN',
 <br/>,
 'AAOI',
 <br/>,
 'AAON',
 <br/>,
 'AAT',
 <br/>,
 'AAWW',
 <br/>,
 'AAXN',
 <br/>,
 'ABCB',
 <br/>,
 'ABEO',
 <br/>,
 'ABG',
 <br/>,
 'ABM',
 <br/>,
 'ABTX',
 <br/>,
 'AC',
 <br/>,
 'ACA',
 <br/>,
 'ACAD',
 <br/>,
 'ACBI',
 <br/>,
 'ACCO',
# The list continues; I removed the rest for brevity.

How can I fine-tune my bs4 query so that I only get a list of stocks?

uingtea · Accepted Answer

You need to split the value, because each <strong> tag contains multiple stocks separated by <br/> tags:

<strong>AAN<br/>AAOI<br/>AAON<br/>AAT<br/>....</strong>
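To see why the original loop collects tags as well as text, it may help to note that iterating over a bs4 Tag yields all of its direct children, text nodes and nested tags alike. Here is a minimal sketch on made-up markup (the sample fragment is an assumption for illustration, not the page's actual source):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment imitating one <strong> block; the real page may differ.
soup = BeautifulSoup("<strong>AAN<br/>AAOI<br/>AAON</strong>", "html.parser")
strong = soup.find("strong")

# Iterating a Tag walks its direct children: text nodes and Tag objects.
children = list(strong)
kinds = [type(child).__name__ for child in children]
print(kinds)

# Keeping only the text nodes drops the <br/> elements.
tickers = [str(child) for child in children if isinstance(child, NavigableString)]
print(tickers)
```

The `for x in tag` loop in the question appends every child, so the `<br/>` Tag objects end up in the list next to the ticker strings.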

The code:

# A CSS selector is simpler and more precise here.
# `soup` is the BeautifulSoup object created in the question.
strongTags = soup.select('.tcb-col .thrv_wrapper.thrv_text_element strong')

stock_list = []
for s in strongTags:
    # .decode_contents() returns the tag's inner HTML as a string
    stocks = s.decode_contents().split('<br/>')
    for stock in stocks:
        stock_list.append(stock)

print(stock_list)
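As a possible alternative to splitting the inner HTML by hand, bs4's `stripped_strings` generator yields only the text nodes of a tag, with surrounding whitespace trimmed. The sketch below runs on a hypothetical snippet that imitates the structure described above; the markup is an assumption, so the selector may need adjusting for the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page structure; the real page may differ.
sample_html = """
<div class="tcb-col">
  <div class="thrv_wrapper thrv_text_element">
    <p><strong>AAN<br/>AAOI<br/>AAON<br/>AAT</strong></p>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

stock_list = []
for strong in soup.select(".tcb-col .thrv_wrapper.thrv_text_element strong"):
    # stripped_strings yields text nodes only, so <br/> tags never reach the list.
    stock_list.extend(strong.stripped_strings)

print(stock_list)
```

This avoids depending on how `<br/>` is serialized, at the cost of also collecting any non-ticker text that happens to sit inside a matching `<strong>` tag.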
