Kenneth Choo

Reputation: 13

Python BeautifulSoup 'NavigableString' object has no attribute 'get_text'

This might seem simple, but I couldn't get it to work. I started learning web scraping recently and ran into this problem. The code seems to work when I try it in the Python REPL, but it fails when I run it as a script, and I'm not sure why.

My code is below. I'm trying to extract each article's title, link, and picture for my program.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import json

beauty_result=[]

def scrape_b2():
    soup = BeautifulSoup(urlopen('https://www.instyle.com/beauty'), 'lxml')
    url = 'https://www.instyle.com'
    for article in soup.find_all('article',class_='component tile media image-top type-article'):
        for img in article.find_all('div',class_='component lazy-image thumbnail'):
            for a in article.find('h3'):
                beauty_result.append(json.dumps({
                    'title':a.get_text(strip=True),
                    'link':url+article.find('a')['href'],
                    'image':img.get('data-src')
                }))
    print(beauty_result)

if __name__ == '__main__':
    scrape_b2()

And this is the whole traceback of the error that I got:

D:\Coding\Python\webscrape env>python app.py
Traceback (most recent call last):
File "app.py", line 37, in <module> scrape_b2()
File "app.py", line 28, in scrape_b2 'title':a.get_text(strip=True),
File "D:\Coding\Tools\Anaconda3\envs\webscraper_practice\lib\site-packages\bs4\element.py", line 742, in getattr self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'get_text' 

This is how I solved it:

def scrape_b2():
    soup = BeautifulSoup(urlopen('https://www.instyle.com/beauty'), 'lxml')
    url = 'https://www.instyle.com'
    for article in soup.find_all('article',class_='component tile media image-top type-article'):
        for img in article.find_all('div',class_='component lazy-image thumbnail'):
            h3 = article.find('h3')
            a_link = h3.find('a')
            beauty_result.append(json.dumps({
                'title': a_link.get_text(strip=True),
                'link': url + a_link.get('href'),
                'image': img.get('data-src')
                }))
    print(beauty_result)

Upvotes: 1

Views: 1465

Answers (2)

SIM

Reputation: 22440

The following script gives you the article titles and their corresponding links from that site. The content of that page looks as though it is generated dynamically, but it is not: it is present in the page source under different class names.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

URL = "https://www.instyle.com/beauty"

def get_article_info(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for article in soup.select('.media-body h3.headline a[href^="/"]'):
        title = article.get_text().strip()
        # Use a separate name so the base URL in `link` is not overwritten on each pass
        article_url = urljoin(link, article.get("href").strip())
        yield {"title": title, "url": article_url}

if __name__ == '__main__':
    for item in get_article_info(URL):
        print(item['title'],item['url'])
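
As a side note on why the script uses urljoin instead of string concatenation, here is a minimal sketch. The two href values are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

BASE = "https://www.instyle.com/beauty"

# Hypothetical href values, used only to illustrate the behaviour.
relative_href = "/beauty/example-article"
absolute_href = "https://example.com/other-page"

# A root-relative href is resolved against the scheme and host of BASE.
relative_url = urljoin(BASE, relative_href)

# An href that is already absolute is returned unchanged.
absolute_url = urljoin(BASE, absolute_href)

print(relative_url)
print(absolute_url)
```

Plain concatenation such as 'https://www.instyle.com' + href would produce a broken URL for the second case, which is one reason to prefer urljoin.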

Upvotes: 0

Maaz

Reputation: 2445

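The error occurs because article.find('h3') returns a single Tag, and looping over a Tag iterates over its children. Some of those children are NavigableString objects (plain text nodes such as whitespace), and in the bs4 version shown in your traceback a NavigableString has no get_text() method. Call get_text() on a Tag instead.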

What you can do is:

h3 = article.find('h3')
a_link = h3.find('a')
beauty_result.append(json.dumps({
    'title': a_link.get_text(strip=True),
    'link': url + a_link.get('href'),
    'image': img.get('data-src')
     }))

The code above replaces the loop that begins with: for a in article.find('h3'):
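
To see what iterating over a Tag actually yields, here is a small sketch. The HTML snippet is invented for illustration and only resembles the real page, and it uses the built-in html.parser rather than lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup, loosely modelled on an article tile; not copied from the site.
html = '<article><h3>\n<a href="/beauty/example">Example title</a>\n</h3></article>'
soup = BeautifulSoup(html, 'html.parser')

h3 = soup.find('h3')

# Iterating over a Tag walks its direct children: text nodes and nested tags.
child_types = [type(child).__name__ for child in h3]
print(child_types)

# Looking up the <a> tag explicitly always gives a Tag, so get_text() is safe.
a_link = h3.find('a')
title = a_link.get_text(strip=True)
href = a_link.get('href')
print(title, href)
```

The list of child types contains NavigableString entries for the whitespace around the link, which is where the original loop failed.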

Upvotes: 1
