Reputation: 21
I am trying to do scraping from Google News with pygooglenews
.
I am trying to scrape more than 100 articles at a time (as google sets limit at 100) by changing the target dates using for loop. The below is what I have so far but I keep getting error message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-4ada7169ebe7> in <module>
----> 1 df = pd.DataFrame(get_news('Banana'))
2 writer = pd.ExcelWriter('My Result.xlsx', engine='xlsxwriter')
3 df.to_excel(writer, sheet_name='Results', index=False)
4 writer.save()
<ipython-input-79-c5266f97934d> in get_titles(search)
9
10 for date in date_list[:-1]:
---> 11 search = gn.search(search, from_=date, to_=date_list[date_list.index(date)])
12 newsitem = search['entries']
13
~\AppData\Roaming\Python\Python37\site-packages\pygooglenews\__init__.py in search(self, query, helper, when, from_, to_, proxies, scraping_bee)
140 if from_ and not when:
141 from_ = self.__from_to_helper(validate=from_)
--> 142 query += ' after:' + from_
143
144 if to_ and not when:
TypeError: unsupported operand type(s) for +=: 'dict' and 'str'
import pandas as pd
from pygooglenews import GoogleNews
import datetime
gn = GoogleNews()
def get_news(search):
stories = []
start_date = datetime.date(2021,3,1)
end_date = datetime.date(2021,3,5)
delta = datetime.timedelta(days=1)
date_list = pd.date_range(start_date, end_date).tolist()
for date in date_list[:-1]:
search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
newsitem = search['entries']
for item in newsitem:
story = {
'title':item.title,
'link':item.link,
'published':item.published
}
stories.append(story)
return stories
df = pd.DataFrame(get_news('Banana'))
Thank you in advance.
Upvotes: 2
Views: 5010
Reputation: 3907
It looks like you are correctly passing in a string into get_news()
which is then passed on as the first argument (search
) into gn.search()
.
However, you're reassigning search
to the result of gn.search()
in the line:
search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
# ^^^^^^
# gets overwritten with the result of gn.search()
In the next iteration this reassigned search
is passed into gn.search()
which it doesn't like.
If you look at the code in pygooglenews
, it looks like gn.search()
is returning a dict
which would explain the error.
To fix this, simply use a different variable, e.g.:
result = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
newsitem = result['entries']
Upvotes: 1
Reputation: 31
I know that pygooglenews has a limit of 100 articles, so you must to make a loop in which it will scrape every day separately.
Upvotes: 0