Reputation:
Below is my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary = article.p.text
    link = "https://www.vanglaini.org" + article.a['href']

    #print(headline)
    #print(summary)
    #print(link)
    #print()

news_csv = pd.DataFrame({'Headline': headline,
                         'Summary': summary,
                         'Link': link,
                         })
print(news_csv)
I got this error:

headline = article.a.text
AttributeError: 'NoneType' object has no attribute 'text'

Help!
Upvotes: 0
Views: 106
Reputation: 142734
As already mentioned in my comments and in @AmiTavory's (deleted) answer: not all articles have a link, so sometimes article.a gives None, and then None.text gives you this error.
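You can reproduce it with a small made-up HTML snippet (the HTML here is only an illustration, not taken from the real page):

from bs4 import BeautifulSoup

# an <article> without any <a> tag - .a returns None
snippet = BeautifulSoup('<article><p>only a paragraph</p></article>', 'lxml')

print(snippet.article.a)       # None
print(snippet.article.a.text)  # AttributeError: 'NoneType' object has no attribute 'text'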
You have to check whether article.a is None, like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    if article.a is None:
        continue

    headline = article.a.text
    summary = article.p.text
    link = "https://www.vanglaini.org" + article.a['href']

    print(headline)
    print(summary)
    print(link)
and it works.
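If some article tags also lack a p tag (an assumption about the page layout, your traceback only shows the problem with article.a), the same kind of guard can cover the summary too:

import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

for article in soup.find_all('article'):
    # skip articles missing either the link or the summary paragraph
    if article.a is None or article.p is None:
        continue

    headline = article.a.text
    summary = article.p.text
    link = "https://www.vanglaini.org" + article.a['href']

    print(headline)
    print(link)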
EDIT: You can get the error

raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index

for a totally different reason, and normally it should be a new question on a new page.
It is a problem in DataFrame: you have only the last value in headline, summary and link, but DataFrame expects lists

{
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
}
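For example, this tiny made-up call reproduces the same ValueError, while wrapping the values in lists (even one-element lists) avoids it:

import pandas as pd

# all scalar values and no index -> ValueError
# pd.DataFrame({'Headline': 'only one headline', 'Link': 'only one link'})

# lists work, even with a single element
print(pd.DataFrame({'Headline': ['only one headline'], 'Link': ['only one link']}))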
You should create empty lists before the for-loop
list_with_headlines = []
list_with_summaries = []
list_with_links = []
and inside the for-loop you should append() values to the lists
list_with_headlines.append(headline)
list_with_summaries.append(summary)
list_with_links.append(link)
and later create the DataFrame using these lists
news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
})
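As an aside, pd.DataFrame also accepts a list of dicts (one dict per row), so another way to structure the same data is to append one dict per article inside the loop; a minimal sketch with a made-up row:

import pandas as pd

# normally filled inside the for-loop with rows.append({...})
rows = [
    {'Headline': 'Example headline',
     'Summary': 'Example summary',
     'Link': 'https://www.vanglaini.org/example'},
]

news_csv = pd.DataFrame(rows)
print(news_csv)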
Full code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
list_with_headlines = []
list_with_summaries = []
list_with_links = []

for article in soup.find_all('article'):
    # skip articles without an <a> tag (no link)
    if article.a is None:
        continue

    headline = article.a.text.strip()
    summary = article.p.text.strip()
    link = "https://www.vanglaini.org" + article.a['href']

    list_with_headlines.append(headline)
    list_with_summaries.append(summary)
    list_with_links.append(link)

news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
})
print(news_csv)
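If (as the name news_csv suggests) you want to save it to a CSV file, you can add one more line at the end; the filename here is only an example:

# continues the full code above
news_csv.to_csv('news.csv', index=False)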
Upvotes: 1