Adriano Patruno

Reputation: 39

Python Web Scraping - Find only n items

I am running a scraping script built with Beautiful Soup. I scrape results from Google News and I want to store only the first n results in a variable as tuples. Each tuple is made of a news title and a news link. In the full script I have a list of keywords like ['crisis', 'finance'] and so on; you can disregard that part. Here's the code:

import re
import bs4, requests

keyword_list = ['crisis', 'finance']  # shortened; the full script has more keywords
articles_list = []

base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen'
request = requests.get(base_url)
webcontent = bs4.BeautifulSoup(request.content, 'lxml')

for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))

Written like this, it appends a tuple for every news item and link that fulfills the if statement, which may result in a long list. I'd like to take only the first n news items, let's say five, and then have the script stop.

I tried:

for _ in range(5):

but I don't understand where exactly to add it, because either the code doesn't run or it appends the same news item 5 times.
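
For example, placing it around the append, which I think is roughly what I did, just appends the same news item five times:

for i in webcontent.findAll('div', {'jslog': '93789'}):
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            for _ in range(5):  # wrong spot: re-appends the same article 5 times
                articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))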

I also tried:

while len(articles_list)<5:

but since that statement sits inside a for loop and the variable articles_list is global, it also stops appending for the next objects of the scraping.

And finally I tried:

for tuples in articles_list[0:5]:  # iterate over the first five tuples
    for element in tuples:  # print title and link
        print(element)
    print('-' * 80)  # divider

I am OK with this last option if there is no alternative, but I'd rather avoid it, since articles_list would still contain more elements than I need.

Can you please help me understand what I am missing?

Thanks!

Upvotes: 0

Views: 123

Answers (1)

Mike67

Reputation: 11342

You have a double loop in your code. To exit both of them, you will need to use break twice, once for each loop. You can break on the same condition in both loops.

Try this code:

import re
import bs4,requests

keyword_list = ['health','Coronavirus','travel']
articles_list = []

base_url = 'https://news.google.com/search?q=TEST%20when%3A3d&hl=en-US&gl=US&ceid=US%3Aen' 
request = requests.get(base_url) 
webcontent = bs4.BeautifulSoup(request.content,'lxml') 

maxcnt = 5  # max number of articles
     
for i in webcontent.findAll('div', {'jslog': '93789'}):
    if len(articles_list) == maxcnt: break   # exit outer loop
    for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
        if any(keyword in i.select_one('h3').getText() for keyword in keyword_list):
            articles_list.append((i.select_one('h3').getText(), "https://news.google.com" + str(link.get('href'))))
            if len(articles_list) == maxcnt: break  # exit inner loop

print(str(len(articles_list)), 'articles')
print('\n'.join(['> '+a[0] for a in articles_list]))  # article titles

Output

5 articles
> Why Coronavirus Tests Come With Surprise Bills
> It’s Not Easy to Get a Coronavirus Test for a Child
> Britain’s health secretary says the asymptomatic don’t need tests. Critics say that sends a mixed message.
> Coronavirus testing shifts focus from precision to rapidity
> Coronavirus testing at Boston lab suspended after nearly 400 false positives
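
As a side note, here is an untested sketch of an alternative that avoids the double break: wrap the filtering in a generator and use itertools.islice to take only the first maxcnt matches. It reuses webcontent, keyword_list and maxcnt from the code above.

import itertools
import re

def matching_articles(webcontent, keyword_list):
    # yield (title, link) tuples whose headline contains one of the keywords
    for i in webcontent.findAll('div', {'jslog': '93789'}):
        for link in i.findAll('a', attrs={'href': re.compile("/articles/")}, limit=1):
            title = i.select_one('h3').getText()
            if any(keyword in title for keyword in keyword_list):
                yield (title, "https://news.google.com" + str(link.get('href')))

# islice stops pulling from the generator as soon as maxcnt matches have been yielded
articles_list = list(itertools.islice(matching_articles(webcontent, keyword_list), maxcnt))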

Upvotes: 1
