Jimmy

Reputation: 71

How to scrape Google News results with Puppeteer in JS?

I am trying to scrape Google News pages with Puppeteer, but my code always returns an empty result.

Here is my code:

const puppeteer = require('puppeteer')
const cheerio = require('cheerio')

const getNewsData = async (query) => {
  let title = [], url = [], snippet = [], imgSrc = [], lastUpdated = [], source = [];
  const browser = await puppeteer.connect({
    browserWSEndpoint: `wss://chrome-us.browsercloud.io?token=hided`,
  });
  const page = await browser.newPage();

  try {
    await page.goto("https://www.google.com/search?q=" + query + "&tbm=nws&gl=us")
    const elmHandle = await page.$("div.iRPxbe > div.mCBkyc");

    title.push(elmHandle.textContent)

    await browser.close();
    console.log(title);
  } catch (error) {
    console.log("Error : " + error)
  }
  return [];
  // Remember to catch errors and close!
};

getNewsData("football");

Please also help me scrape the news source, thumbnail, and date.

Upvotes: 1

Views: 1177

Answers (2)

Darshan

Reputation: 122

You can get Google News results like this:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getNewsData = () => {
  return unirest
    .get("https://www.google.com/search?q=football&gl=us&tbm=nws")
    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    })
    .then((response) => {
      const $ = cheerio.load(response.body);

      const news_results = [];

      $(".BGxR7d").each((i, el) => {
        news_results.push({
          link: $(el).find("a").attr("href"),
          title: $(el).find("div.mCBkyc").text(),
          snippet: $(el).find(".GI74Re").text(),
          date: $(el).find(".ZE0LJd span").text(),
          thumbnail: $(el).find(".NUnG9d img").attr("src"),
        });
      });

      console.log(news_results);
    });
};

getNewsData();

If you need an explanation of this code, I have also written a blog post on how to scrape Google News results: https://serpdog.io/blog/web-scraping-google-news-results-with-node-js.html
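The request above hard-codes the query `football`. As a minimal sketch of how an arbitrary query could be turned into a safely encoded URL, the snippet below uses Node's built-in `URLSearchParams`. The name `buildNewsUrl` and its option names are hypothetical and not part of the code above.

```javascript
// Hypothetical helper: builds a Google News search URL from a raw query string.
// URLSearchParams percent-encodes reserved characters and writes spaces as "+".
function buildNewsUrl(query, { gl = "us", hl = "en" } = {}) {
  const params = new URLSearchParams({ q: query, tbm: "nws", gl, hl });
  return `https://www.google.com/search?${params.toString()}`;
}

console.log(buildNewsUrl("premier league football"));
// → https://www.google.com/search?q=premier+league+football&tbm=nws&gl=us&hl=en
```

The resulting string can be passed to `.get(...)` in place of the hard-coded URL.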

Alternative:

You can use the Google News API by Serpdog. Serpdog also offers 100 free credits on first signup.

Scraping can be time-consuming, but this API returns structured JSON, which makes the work easier. You also don't have to keep updating Google's CSS selectors whenever they change, which is a big headache.

How to use:

const axios = require('axios');

axios.get('https://api.serpdog.io/news?api_key=APIKEY&q=football&gl=us')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });

Results:

"news_results": [
{
  "title": "Martin Bengtsson: football’s Swedish wonderkid whose dream died at Inter",
  "snippet": "If Martin Bengtsson feels stressed he kicks a football around on his own and, almost immediately, the tension begins to ebb away.",
  "source": "The Guardian",
  "imgSrc": "",
  "lastUpdated": "3 hours ago",
  "rank": "1"
},
.....

Disclaimer: I am the founder of serpdog.io

Upvotes: 0

Mikhail Zub

Reputation: 474

You don't need any browser automation to get this information: a simple HTTP request returns it and uses fewer resources. Check how to do this in the online IDE:

const cheerio = require("cheerio");
const axios = require("axios");

const searchString = "football";                     // what we want to search for

const AXIOS_OPTIONS = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36",
    },                                                  // the User-Agent header is one way to reduce the chance of the request being blocked
    params: {
        q: searchString,                                // raw search string; axios URL-encodes params itself, so pre-encoding would double-encode
        tbm: "nws",                                     // type of search ("nws" means news)
        hl: 'en',                                       // language to use for the Google search
        gl: 'us'                                        // country to use for the Google search
    },
};

function getNewsInfo() {
    return axios
        .get(`https://www.google.com/search`, AXIOS_OPTIONS)
        .then(function ({ data }) {
            let $ = cheerio.load(data);

            const pattern = /s='(?<img>[^']+)';\w+\s\w+=\['(?<id>\w+_\d+)'];/gm;
            const images = [...data.matchAll(pattern)].map(({ groups }) => ({ id: groups.id, img: groups.img.replace('\\x3d', '') }))

            const allNewsInfo = Array.from($('.WlydOe')).map((el) => {
                return {
                    link: $(el).attr('href'),
                    source: $(el).find('.CEMjEf span').text().trim(),
                    title: $(el).find('.mCBkyc').text().trim().replace('\n', ''),
                    snippet: $(el).find('.GI74Re').text().trim().replace('\n', ''),
                    image: images.find(({ id, img }) => id === $(el).find('.uhHOwf img').attr('id'))?.img || "No image",
                    date: $(el).find('.ZE0LJd span').text().trim(),
                }
            });

            return allNewsInfo;
        });
}

getNewsInfo().then(console.log);

Output:

[
   {
      "link":"https://www.cardchronicle.com/2022/7/11/23077819/madden-sanker-commits-to-louisville-football",
      "source":"Card Chronicle",
      "title":"Madden Sanker Commits to Louisville Football",
      "snippet":"Louisville lands their highest rated offensive line recruit in program history.",
      "image":"",
      "date":"8 hours ago"
   },
   ...and other results
]
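The `pattern` regex in the code above targets inline scripts in which each thumbnail's data URI appears to be assigned to a variable next to the `id` of the `<img>` it belongs to. The sketch below runs the same regex against a made-up HTML fragment. `extractInlineImages` and `sampleHtml` are hypothetical, and decoding every `\xNN` escape (instead of removing the first `\x3d`, as the original `.replace` does) is an assumption about what the final data URI should look like.

```javascript
// Same regex as in the answer: captures the data URI (img) and the <img> id.
const IMAGE_PATTERN = /s='(?<img>[^']+)';\w+\s\w+=\['(?<id>\w+_\d+)'];/g;

// Hypothetical helper: returns [{ id, img }] for every inline thumbnail script.
function extractInlineImages(html) {
  return [...html.matchAll(IMAGE_PATTERN)].map(({ groups }) => ({
    id: groups.id,
    // Assumption: turn each literal \xNN escape back into its character,
    // e.g. \x3d becomes "=" (base64 padding).
    img: groups.img.replace(/\\x([0-9a-fA-F]{2})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16))
    ),
  }));
}

// Made-up fragment; String.raw keeps the backslashes literal, as in raw HTML.
const sampleHtml = String.raw`<script>(function(){var s='data:image/jpeg;base64,AAAA\x3d\x3d';var ii=['dimg_1'];_setImagesSrc(ii,s);})();</script>`;

console.log(extractInlineImages(sampleHtml));
```

Each returned `id` can then be matched against the `id` attribute of the `.uhHOwf img` element, as the `image` field in the code above does.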

You can also check my blog post Web Scraping Google News with Nodejs if you want to know more about this topic.
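If you would rather stay with Puppeteer as in the question, note that `page.$()` resolves to an `ElementHandle`, which has no `textContent` property, so `title.push(elmHandle.textContent)` pushes `undefined`, and the function then returns `[]` regardless. A minimal sketch, assuming the question's `div.mCBkyc` selector still matches the titles: read the text inside the browser with `page.$$eval`. The `extractTitles` helper and the `fakePage` stub are hypothetical; the stub only stands in for a real Puppeteer page so the snippet runs without a browser.

```javascript
// Hypothetical helper: `page` is expected to be a Puppeteer Page.
async function extractTitles(page, selector = "div.mCBkyc") {
  // $$eval runs the callback in the page with all elements matching the selector
  // and returns the callback's serializable result.
  return page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );
}

// Stand-in for a real page: calls the callback with two fake elements.
const fakePage = {
  async $$eval(selector, pageFunction) {
    const fakeElements = [
      { textContent: " First headline " },
      { textContent: "Second headline" },
    ];
    return pageFunction(fakeElements);
  },
};

extractTitles(fakePage).then((titles) => console.log(titles));
// → [ 'First headline', 'Second headline' ]
```

With a real browser, the same call would be `await extractTitles(page)` after `page.goto(...)`, and the function should return the collected array instead of `[]`.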

Upvotes: 3
