Reputation: 979
I am extracting the text and some other info from a webpage using this script:
r = requests.get('https://www.horizont.net/marketing/nachrichten/anzeige.-digitalisierung-wie-software-die-kreativitaet-steigert-178413')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
print(soup.prettify())
and then selected just what I needed:
all = soup.select('.PageArticle')
title = []
author = []
publish_date = []
article_main_content = []
article_body = []
for item in all:
    t = item.find_all('h1')[0].text
    title.append(t)
    a = item.find_all('span')[2].text
    author.append(a)
    p = item.find_all('span')[5].text
    publish_date.append(p)
    amc = item.select('.PageArticle_lead-content')[0].text
    article_main_content.append(amc)
    a_body = item.select('.PageArticle_body')[0].text
    article_body.append(article_body)
and put them into a df like this:
df = pd.DataFrame({"Title":title, "Author": author, "Publish_date": publish_date,
"Article_Main_Content": article_main_content, "Article_Body": article_body })
I am having two problems:
First problem: When I try to get the contents of an article of about 500-800 words, I get an empty string. Is there some length limit? Is there any way to solve this?
Second problem:
I have a list of URLs for which I want to run the same procedure and store all the info in the same df. How can I use the list of URLs to collect such data?
Upvotes: 0
Views: 57
Reputation: 1872
First problem: you have a typo in the last line of the loop.
# Change this:
article_body.append(article_body)
# to this:
article_body.append(a_body)
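To see why the typo matters: `article_body.append(article_body)` appends the list to itself, so it stores a self-referencing list instead of the extracted text. A minimal demonstration (the string value is just a placeholder):

```python
# The typo: the list ends up containing a reference to itself,
# not the article text that was extracted into a_body.
a_body = "some article text"

article_body = []
article_body.append(article_body)   # stores the list inside itself
print(article_body[0] is article_body)  # True

# The fix: append the extracted string instead.
article_body = []
article_body.append(a_body)
print(article_body)  # ['some article text']
```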
Second problem: loop over the list of URLs.
for url in url_list:
    # Your code
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, 'html.parser')
    print(soup.prettify())
    # The rest of your code...
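Putting it together, one way to collect everything into a single DataFrame is to build a list of row dicts across all pages and construct the frame once at the end. The sketch below is not the original code; the inline HTML snippets stand in for `requests.get(url).content`, and only `Title` and `Article_Body` are shown, but the pattern extends to the other columns:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-ins for fetched pages; in practice each entry would be
# requests.get(url).content for a url in url_list.
pages = [
    '<div class="PageArticle"><h1>First title</h1>'
    '<div class="PageArticle_body">Body one</div></div>',
    '<div class="PageArticle"><h1>Second title</h1>'
    '<div class="PageArticle_body">Body two</div></div>',
]

rows = []
for html in pages:
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.select('.PageArticle'):
        rows.append({
            'Title': item.find('h1').text,
            'Article_Body': item.select('.PageArticle_body')[0].text,
        })

# One DataFrame holding the rows from every page.
df = pd.DataFrame(rows)
print(df)
```

Building a list of dicts and converting once is usually simpler than maintaining a separate list per column, since each row stays together even if a field is missing on some page.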
Upvotes: 1