user15215612

Not able to scrape the description properly using Beautiful Soup and Python

I am scraping this page with bs4 and Python: https://www.americanexpress.com/in/credit-cards/smart-earn-credit-card/?linknav=in-amex-cardshop-allcards-learn-SmartEarnCreditCard-carousel

I am basically grabbing the key benefits from that website using the following code.

from urllib.request import urlopen

from bs4 import BeautifulSoup

link = 'https://www.americanexpress.com/in/credit-cards/smart-earn-credit-card/?linknav=in-amex-cardshop-allcards-learn-SmartEarnCreditCard-carousel'
html = urlopen(link)
soup = BeautifulSoup(html, 'lxml')

details = []

# Pair each benefit title with the first <span> that follows it
for span in soup.select(".why-amex__subtitle span"):
    details.append(f'{span.get_text(strip=True)}: {span.find_next("span").get_text(strip=True)}')

print(details)

Output

['Accelerated Earn Rate: Earn 10X Membership Rewards® Points2on your spending on Flipkart and Uber and earn 5X Membership Rewards Points2on Amazon, Swiggy, BookMyShow and more.', 'Welcome Bonus: Rs. 500 cashback as Welcome Gift on eligible spends1of Rs. 10,000 in the first 90 days of Cardmembership', 'Renewal Fee Waiver: Get a renewal fee waiver on eligible spends3of Rs.40,000 and above in the previous year of Cardmembership', 'AMERICAN EXPRESS EMI: Convert purchases into']


The last item in this list is incomplete: the description is cut off because there is a hyperlink in the middle of the text, so find_next("span") only returns the first span before the link.

Below is the html code corresponding to that problem:

<div class="why-amex__col"><span class="icons  why-amex__lrgIcon icon-Amex-Icons-2016-85"></span><h4 class="why-amex__subtitle"><div><span>AMERICAN EXPRESS EMI</span></div></h4><div class="why-amex__copy"><div class="description_text"><div><span>Convert purchases into </span><a href="https://www.americanexpress.com/india/membershiprewards/cardmember_offers/viewmore.html" target="_blank">EMI</a><span> at the point of sale with an interest rate as low as 12% p.a. and zero foreclosure charges</span></div></div></div></div>

I'd like to get the full description of the last item, including the link text and everything after it.

Upvotes: 0

Views: 80

Answers (1)

Cyrus

Reputation: 691

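Store the description elements themselves (the Tag objects, not their text) in details, then loop over each element's direct child tags to build the text.

Something like:

# Assumes details holds bs4 Tag objects, e.g. the inner <div> of each description
texts = []
for detail in details:
    # Collect the text of every direct child tag (<span>, <a>, ...)
    parts = [tag.get_text(strip=True) for tag in detail.find_all(recursive=False)]
    # Join with a space so the link text does not run into its neighbours
    texts.append(' '.join(parts))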
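
An alternative that may be simpler is to read the whole description container in one call instead of walking its children. The sketch below runs against the HTML fragment from the question only. It assumes every benefit on the live page uses the same why-amex__col / why-amex__subtitle / why-amex__copy structure, which I have not checked, and it uses the built-in "html.parser" just to avoid needing lxml.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question; a real run would parse the fetched page instead.
html = '''
<div class="why-amex__col"><span class="icons  why-amex__lrgIcon icon-Amex-Icons-2016-85"></span>
<h4 class="why-amex__subtitle"><div><span>AMERICAN EXPRESS EMI</span></div></h4>
<div class="why-amex__copy"><div class="description_text"><div><span>Convert purchases into </span><a href="https://www.americanexpress.com/india/membershiprewards/cardmember_offers/viewmore.html" target="_blank">EMI</a><span> at the point of sale with an interest rate as low as 12% p.a. and zero foreclosure charges</span></div></div></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

details = []
for col in soup.select(".why-amex__col"):
    title = col.select_one(".why-amex__subtitle")
    copy = col.select_one(".why-amex__copy")
    if title is None or copy is None:
        # Skip columns that do not follow the assumed structure
        continue
    # get_text() without strip keeps the spaces around the <a> text;
    # split() + join() then collapses any runs of whitespace.
    description = " ".join(copy.get_text().split())
    details.append(f"{title.get_text(strip=True)}: {description}")

print(details)
```

Because the text is taken from the container rather than from a single span, the link text and the span after it are both included.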

Upvotes: 1
