Reputation: 393
It's been a step-by-step process getting the code to this point. The goal was to visit a list of URLs and scrape specific data, which has been accomplished with the script below:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']
for url in urls:
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])
I had an idea I hoped to implement -- I feel I'm very close, but I'm not sure why it isn't running. Basically, rather than having the static list of URLs, I wanted to use a Google Sheet as the source of URLs; a column on one tab will hold the URL list that needs to be scraped.
From there, when run, the script will pull the URLs from the first tab, scrape the data, and then push the results to the second tab.
I've been able to print the URLs in the terminal -- basically, connecting to the sheet and requesting all records.
I thought I'd then be able to loop through those links in the same way (new code):
from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
urls = freshurls.get_all_records()
for url in urls:
    try:
        r = requests.get(url, headers=headers)
        soup = bs(r.text, 'html.parser')
        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)
        title = json_obj['props']['pageProps']['story']['header']['headline']
        date = json_obj['props']['pageProps']['story']['date']
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)
    except:
        skip
Even if I switch the ending actions (AddData and append_row) to just print the results, I'm not seeing anything.
It seems like I'm missing a step. Is there something I could do differently here to use those URLs right from the sheet, instead of having to paste them into the script every time?
Upvotes: 0
Views: 581
Reputation: 4038
You can try using the batch_get method in a separate script file to get the URL data from a sheet tab, and then import that URL data into the loop of your scraping script. This reduces complexity and improves the readability of your script. For more context, see the sample scripts below.
In my understanding, your goal is to pull the URLs from one sheet tab, scrape each page, and then append the results to a second tab.
The getURLsFromSpreadsheet.py file:
import gspread
gc = gspread.service_account(filename='creds.json')
# Open a spreadsheet by ID
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
# Get the sheets
wk = sh.worksheet("Sheet1")
apprendWk = sh.worksheet("Sheet2")
# E.G. the URLs are listed on Sheet 1 on Column A
urls = wk.batch_get(('A2:A',) )[0]
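# Note: batch_get returns each row as a list, so every entry in urls looks like
# ['https://...']; this is why the scraping loop below uses url[0].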
The scrapeScript.py file:
from getURLsFromSpreadsheet import *
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]]]
    apprendWk.append_rows(samplelist)
Sample spreadsheet file (the URLs are listed in Column A).
The Sheet 2 tab after running the scrapeScript.py file.
Upvotes: 1
Reputation: 201378
According to the gspread documentation, it seems that get_all_records() returns a list of dictionaries (Ref). Under this condition, when for url in urls: is run, each url is a dictionary like {"header1": "value1", ...}. I thought that this might be the reason for your issue.
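For illustration (the header name "URL" here is only an assumption about your sheet), the loop in your question effectively does this:
urls = freshurls.get_all_records()
# urls is something like: [{'URL': 'https://www.nba.com/game/bkn-vs-phi-0022100993'}, ...]
for url in urls:
    r = requests.get(url, headers=headers)  # url is a dict, not a URL string, so this raises,
                                            # and the bare except in your script silently skips the row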
Unfortunately, from your question I couldn't tell which column the URLs are in. For example, if column "A" has the URLs you want to use, how about the following modification?
From:
urls = freshurls.get_all_records()
for url in urls:
To:
column = 1  # This means column "A".
urls = freshurls.get_all_values()
for e in urls[1:]:
    url = e[column - 1]
    # print(url) # You can check the URL.
In this modification, the values are retrieved from column "A" using get_all_values, and the first header row is skipped. get_all_values returns the values as a 2-dimensional array.
If your actual situation uses a different column from column "A", please modify the script.
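For completeness, here is a minimal sketch of how the modified loop might look in your script (the sheet key, tab names, and column "A" are taken from your question and may need adjusting):
import json
import gspread
import requests
from bs4 import BeautifulSoup as bs

gc = gspread.service_account(filename='creds.json')
worksheet = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I').sheet1  # results tab
freshurls = gc.open("NBA Stories").get_worksheet(1)  # tab holding the URLs
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
column = 1  # column "A"
for row in freshurls.get_all_values()[1:]:  # skip the header row
    url = row[column - 1]
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    story = json_obj['props']['pageProps']['story']
    worksheet.append_row([url, story['header']['headline'], story['date'], str(story['content'])])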
Upvotes: 1