Anthony Madle

Reputation: 393

How to loop through URLs hosted in a Google Sheet

It's been a step-by-step process getting the code to this point. The goal is to visit a list of URLs and scrape specific data from each page, which the script below accomplishes:

import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']


for url in urls:

    r = requests.get(url, headers=headers)

    soup = bs(r.text, 'html.parser')

    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)

    print('Title:', json_obj['props']['pageProps']
          ['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])

I had an idea I hoped to implement, and I feel I'm very close, but I'm not sure why it isn't working. Rather than keeping a static list of URLs in the script, I wanted to use a Google Sheet as the source: a column on one tab holds the list of URLs that need to be scraped.

From there, when run, the script should pull the URLs from the first tab, scrape each page, and then push the results to the second tab.

I've been able to print the URLs in the terminal, so connecting to the sheet and requesting all the records works.
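Roughly, that step looks like this (same creds.json and "NBA Stories" sheet as in the full script below):

import gspread

gc = gspread.service_account(filename='creds.json')
freshurls = gc.open("NBA Stories").get_worksheet(1)

# Print every record pulled from the URL tab
for record in freshurls.get_all_records():
    print(record)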

I thought I'd then be able to loop through those links in the same way (new code):

from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = freshurls.get_all_records()


for url in urls:
    try:
        r = requests.get(url, headers=headers)

        soup = bs(r.text, 'html.parser')

        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)

        title = (json_obj['props']['pageProps']['story']['header']['headline'])
        date = (json_obj['props']['pageProps']['story']['date'])
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)

    except:
        skip

Even if I switch the final steps (building AddData and appending the row) to simply printing the results, I'm not seeing anything.

Seems like I'm missing a step? Is there something I could do differently here to leverage those URLs right from the sheet, instead of having to paste them in the script every time?

Upvotes: 0

Views: 581

Answers (2)

SputnikDrunk2

Reputation: 4038

SUGGESTION

You can use the batch_get method in a separate script file to fetch the URL data from a sheet tab, then import that data into your scraping script and loop over it. This reduces complexity and keeps your script readable. For more context, see the sample script below.

In my understanding, your goal is to:

  • Put a list of URLs on a specific sheet tab in a spreadsheet file.
  • Get the URL data from that sheet tab in Python.
  • Loop through it in your Python script and scrape the data for each URL.
  • Append each scraped result to a second sheet tab.

Sample Script

The getURLsFromSpreadsheet.py file

import gspread

gc = gspread.service_account(filename='creds.json')

# Open a spreadsheet by ID
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

# Get the sheets
wk = sh.worksheet("Sheet1")
appendWk = sh.worksheet("Sheet2")

# E.g. the URLs are listed on Sheet1, column A
urls = wk.batch_get(('A2:A',))[0]
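Note that batch_get returns a list of value ranges, and taking index [0] leaves urls as a two-dimensional list with one inner list per spreadsheet row, which is why the scraping script below reads url[0]. With the example URLs from the question, urls would look roughly like this:

[['https://www.nba.com/game/bkn-vs-phi-0022100993'],
 ['https://www.nba.com/game/was-vs-lac-0022100992']]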

The scrapeScript.py file

from getURLsFromSpreadsheet import *
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}


for url in urls:

    r = requests.get(url[0], headers=headers)

    soup = bs(r.text, 'html.parser')

    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                 ]]
    appendWk.append_rows(samplelist)

Demonstration

Sample spreadsheet file (the URLs are listed in column A).

The Sheet 2 tab after running the scrapeScript.py file.


Upvotes: 1

Tanaike

Reputation: 201378

According to the gspread documentation, get_all_records() returns a list of dictionaries. Under this condition, when for url in urls: runs, each url is a dictionary like {"header1": "value1", ...}. I think this might be the reason for your issue.
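For example, if the header of the URL column were "URL" (the actual header name depends on your sheet), each iteration of the loop would give a dictionary rather than a URL string:

urls = freshurls.get_all_records()
for url in urls:
    print(url)  # {'URL': 'https://www.nba.com/game/...'} -- a dict, not a string

So requests.get(url, headers=headers) fails, and the bare except in your loop silently hides the error, which is why nothing is printed or appended.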

Unfortunately, I can't tell from your question which column contains the URLs. Assuming, for example, that column "A" has the URLs you want to use, how about the following modification?

From:

urls = freshurls.get_all_records()


for url in urls:

To:

column = 1  # This means column "A".
urls = freshurls.get_all_values()
for e in urls[1:]:
    url = e[column - 1]
    # print(url) # You can check the URL.
  • In this modification, the values are retrieved from column "A" using get_all_values, and the first header row is skipped. get_all_values returns the values as a two-dimensional array.

  • If your actual situation uses a different column from column "A", please modify the script.
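For reference, merging this modification into your original loop might look something like the following sketch (still assuming the URLs are in column "A" of the second worksheet, below a header row):

import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

column = 1  # This means column "A".
urls = freshurls.get_all_values()

for e in urls[1:]:  # Skip the header row.
    url = e[column - 1]
    try:
        r = requests.get(url, headers=headers)
        soup = bs(r.text, 'html.parser')

        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)

        title = json_obj['props']['pageProps']['story']['header']['headline']
        date = json_obj['props']['pageProps']['story']['date']
        content = str(json_obj['props']['pageProps']['story']['content'])
        worksheet.append_row([url, title, date, content])
    except Exception:
        continue  # Skip URLs that fail to download or parse.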


Upvotes: 1
