Reputation: 521
I have scraped a couple of things from a list of websites and I am trying to save them in a pandas DataFrame. The DataFrame is created, but it is not what I expect: I get a separate DataFrame for each date instead of a single one. For example, this is the output I am getting:
Date News URLs
13th July 2021 [someurl, someurl12,...]
Date News URLs
11th July 2021 [someurl58675, someurl12979,...]
Date News URLs
12th July 2021 [someurl47539, someurl657637,...]
I want to get a single DataFrame like this:
Date News URLs
13th July 2021 [someurl, someurl12,...]
11th July 2021 [someurl58675, someurl12979,...]
12th July 2021 [someurl47539, someurl657637,...]
Here is my code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import numpy as np
import pandas as pd
import time

options = webdriver.ChromeOptions()
options.add_argument('start-maximized')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)

def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/','https://www.thespiritsbusiness.com/tag/gin/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)
        news_links = browser.find_elements_by_xpath("//div[@id='archivewrapper']")
        newsdata = {}
        for news in news_links:
            datecheck = news.find_element_by_tag_name("small").get_attribute("innerText").replace(' ','')
            links = news.find_element_by_xpath("//div[@id='archivewrapper']//h3/a").get_attribute("href")
            if datecheck in newsdata:
                newsdata[datecheck].append(links)
            else:
                newsdata[datecheck] = [links]
        dates = "July{}th,2021"
        for i in range(1, 12):
            if dates.format(i) in newsdata:
                df = pd.DataFrame({'News urls': newsdata[dates.format(i)], 'Dates': dates.format(i)})
                print(("{} : {}".format(dates.format(i), newsdata[dates.format(i)])))
                print(df.head())
Upvotes: 0
Views: 269
Reputation: 120479
Can you try the function below? I can add some explanation if the output is what you expect.
def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/','https://www.thespiritsbusiness.com/tag/gin/']
    # Collect (date, link) pairs from every page in one list
    newsdata = []
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)
        news_links = browser.find_elements_by_xpath("//div[@id='archivewrapper']")
        for news in news_links:
            datecheck = news.find_element_by_tag_name("small").get_attribute("innerText").replace(' ','')
            # Note: an XPath starting with '//' is evaluated from the document root,
            # not from `news`; './/h3/a' would restrict the search to this element.
            link = news.find_element_by_xpath("//div[@id='archivewrapper']//h3/a").get_attribute("href")
            newsdata.append((datecheck, link))
    # Build a single DataFrame once, after all pages have been scraped
    df = pd.DataFrame(newsdata, columns=['Date', 'News urls']).drop_duplicates()
    df['Date'] = pd.to_datetime(df['Date'])
    # One row per date, with that date's links gathered into a list
    df = df.groupby('Date')['News urls'].apply(list).reset_index()
    return df

df = sb_rum()
>>> df
Date News urls
0 2021-05-24 [https://www.thespiritsbusiness.com/2021/07/ru...
1 2021-05-26 [https://www.thespiritsbusiness.com/2021/07/ru...
2 2021-05-28 [https://www.thespiritsbusiness.com/2021/07/ru...
3 2021-06-02 [https://www.thespiritsbusiness.com/2021/07/ru...
4 2021-06-04 [https://www.thespiritsbusiness.com/2021/07/ru...
5 2021-06-07 [https://www.thespiritsbusiness.com/2021/07/ru...
6 2021-06-11 [https://www.thespiritsbusiness.com/2021/07/ru...
7 2021-06-18 [https://www.thespiritsbusiness.com/2021/07/ru...
8 2021-06-21 [https://www.thespiritsbusiness.com/2021/07/ru...
9 2021-06-24 [https://www.thespiritsbusiness.com/2021/07/ru...
10 2021-06-29 [https://www.thespiritsbusiness.com/2021/07/ru...
11 2021-06-30 [https://www.thespiritsbusiness.com/2021/07/ru...
12 2021-07-06 [https://www.thespiritsbusiness.com/2021/07/ru...
13 2021-07-08 [https://www.thespiritsbusiness.com/2021/07/ru...
14 2021-07-13 [https://www.thespiritsbusiness.com/2021/07/ru...
15 2021-07-19 [https://www.thespiritsbusiness.com/2021/07/ru...
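If you want to try the grouping step without a browser, here is a minimal sketch. The `rows` list and the example.com URLs are made up and only stand in for the scraped `(date, link)` pairs, `group_urls_by_date` is a hypothetical helper name, and the dates are assumed to already be ISO-formatted strings.

```python
import pandas as pd

# Hypothetical (date, url) pairs standing in for the scraped results.
rows = [
    ("2021-07-13", "https://example.com/a"),
    ("2021-07-11", "https://example.com/b"),
    ("2021-07-13", "https://example.com/c"),
    ("2021-07-13", "https://example.com/a"),  # exact duplicate, dropped below
    ("2021-07-12", "https://example.com/d"),
]

def group_urls_by_date(rows):
    """Build one DataFrame with a single row per date and a list of URLs."""
    df = pd.DataFrame(rows, columns=["Date", "News urls"]).drop_duplicates()
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    # groupby sorts by date; apply(list) gathers each date's URLs into a list
    return df.groupby("Date")["News urls"].apply(list).reset_index()

grouped = group_urls_by_date(rows)
print(grouped)
```

The point is that the DataFrame is built once from all collected rows, rather than once per date inside the loop, which is what produced the repeated headers in the question.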
I do not want all the URLs from the website. That is why I created a for loop that specifically picks out the URLs dated from the 1st to the 12th of July.
>>> df[df['Date'].between('2021-07-01', '2021-07-12')]
Date News urls
12 2021-07-06 [https://www.thespiritsbusiness.com/2021/07/ru...
13 2021-07-08 [https://www.thespiritsbusiness.com/2021/07/ru...
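The date filter can also be checked on its own. This is a small sketch with invented data (the `u1`..`u4` placeholders are not real URLs). `Series.between` includes both endpoints by default, so the 12th is kept. Filtering on parsed dates also sidesteps two limits of the string approach in the question: a template like "July{}th,2021" cannot match the 1st, 2nd or 3rd, and `range(1, 12)` stops at 11.

```python
import pandas as pd

# Invented sample in the same shape as the grouped result above.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-06-30", "2021-07-06", "2021-07-12", "2021-07-13"]),
    "News urls": [["u1"], ["u2"], ["u3"], ["u4"]],
})

start = pd.Timestamp("2021-07-01")
end = pd.Timestamp("2021-07-12")

# Boolean mask: True where start <= Date <= end (both ends inclusive)
mask = df["Date"].between(start, end)
july = df[mask]
print(july)
```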
Upvotes: 1