technophile_3

Reputation: 521

Pandas Dataframe Column Names getting displayed multiple times in output

I have scraped a few things from a list of websites and am trying to save them in a pandas DataFrame. The DataFrame is created, but it is not the one I expect. For example, this is the DataFrame I am getting:

Date News URLs
13th July 2021 [someurl, someurl12,...]
Date News URLs
11th July 2021 [someurl58675, someurl12979,...]
Date News URLs
12th July 2021 [someurl47539, someurl657637,...]

I intend to get a DataFrame like this:

Date News URLs
13th July 2021 [someurl, someurl12,...]
11th July 2021 [someurl58675, someurl12979,...]
12th July 2021 [someurl47539, someurl657637,...]

And here's my code for the same.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import numpy as np
import pandas as pd
import time

options = webdriver.ChromeOptions()
options.add_argument('start-maximized')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
browser = webdriver.Chrome(ChromeDriverManager().install(), options=options)

def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/','https://www.thespiritsbusiness.com/tag/gin/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)

        news_links = browser.find_elements_by_xpath("//div[@id='archivewrapper']")
        newsdata = {}
        for news in news_links:
            datecheck = news.find_element_by_tag_name("small").get_attribute("innerText").replace(' ','')
            links = news.find_element_by_xpath("//div[@id='archivewrapper']//h3/a").get_attribute("href")
            if datecheck in newsdata:
                newsdata[datecheck].append(links)
            else:
                newsdata[datecheck] = [links]
        dates = "July{}th,2021"
        for i in range(1, 12):
            if dates.format(i) in newsdata:
                df = pd.DataFrame({'News urls': newsdata[dates.format(i)], 'Dates': dates.format(i)})
                print(("{} : {}".format(dates.format(i), newsdata[dates.format(i)])))
        print(df.head())

Answers (1)

Corralien

Reputation: 120479

Can you try the function below? I can add some explanation if the output is what you expect.

def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/','https://www.thespiritsbusiness.com/tag/gin/']
    newsdata = []
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)

        news_links = browser.find_elements_by_xpath("//div[@id='archivewrapper']")
        for news in news_links:
            datecheck = news.find_element_by_tag_name("small").get_attribute("innerText").replace(' ','')
            link = news.find_element_by_xpath("//div[@id='archivewrapper']//h3/a").get_attribute("href")
            newsdata.append((datecheck, link))

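    # Build one DataFrame from all (date, link) pairs after both sites are scraped
    df = pd.DataFrame(newsdata, columns=['Date', 'News urls']).drop_duplicates()
    # Parse the date strings so rows can be sorted and filtered by date
    df['Date'] = pd.to_datetime(df['Date'])
    # Collect every link that shares a date into a single list per row
    df = df.groupby('Date')['News urls'].apply(list).reset_index()
    return df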

df = sb_rum()

>>> df
         Date                                          News urls
0  2021-05-24  [https://www.thespiritsbusiness.com/2021/07/ru...
1  2021-05-26  [https://www.thespiritsbusiness.com/2021/07/ru...
2  2021-05-28  [https://www.thespiritsbusiness.com/2021/07/ru...
3  2021-06-02  [https://www.thespiritsbusiness.com/2021/07/ru...
4  2021-06-04  [https://www.thespiritsbusiness.com/2021/07/ru...
5  2021-06-07  [https://www.thespiritsbusiness.com/2021/07/ru...
6  2021-06-11  [https://www.thespiritsbusiness.com/2021/07/ru...
7  2021-06-18  [https://www.thespiritsbusiness.com/2021/07/ru...
8  2021-06-21  [https://www.thespiritsbusiness.com/2021/07/ru...
9  2021-06-24  [https://www.thespiritsbusiness.com/2021/07/ru...
10 2021-06-29  [https://www.thespiritsbusiness.com/2021/07/ru...
11 2021-06-30  [https://www.thespiritsbusiness.com/2021/07/ru...
12 2021-07-06  [https://www.thespiritsbusiness.com/2021/07/ru...
13 2021-07-08  [https://www.thespiritsbusiness.com/2021/07/ru...
14 2021-07-13  [https://www.thespiritsbusiness.com/2021/07/ru...
15 2021-07-19  [https://www.thespiritsbusiness.com/2021/07/ru...
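
To show what the last three pandas lines of the function do without running a browser, here is a minimal sketch on made-up data. The `rows` list and the example.com URLs are placeholders, and the dates are written as ISO strings purely to keep the parsing unambiguous; the scraped strings look different.

```python
import pandas as pd

# Hypothetical (date, url) pairs standing in for the scraped newsdata list.
rows = [
    ("2021-07-13", "https://example.com/a"),
    ("2021-07-11", "https://example.com/b"),
    ("2021-07-13", "https://example.com/c"),
    ("2021-07-13", "https://example.com/a"),  # exact duplicate of the first row
]

# One DataFrame for everything, so the column names appear only once.
df = pd.DataFrame(rows, columns=["Date", "News urls"]).drop_duplicates()

# Convert the strings to datetimes so they sort and compare as dates.
df["Date"] = pd.to_datetime(df["Date"])

# Gather all URLs that share a date into one list per date.
grouped = df.groupby("Date")["News urls"].apply(list).reset_index()
print(grouped)
```

Because the frame is built once, after all rows are collected, the header is printed a single time and each date gets one row.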

I do not want all the URLs from the website; that is why I created a for loop that fetches only the URLs dated from the 1st to the 12th of July.

>>> df[df['Date'].between('2021-07-01', '2021-07-12')]
         Date                                          News urls
12 2021-07-06  [https://www.thespiritsbusiness.com/2021/07/ru...
13 2021-07-08  [https://www.thespiritsbusiness.com/2021/07/ru...
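
As a small illustration of the date filter above, the sketch below applies `Series.between` to a made-up frame. The dates and URLs are placeholders, not scraped data. Both endpoints are included by default, so an entry dated exactly on the 1st or the 12th is kept.

```python
import pandas as pd

# Hypothetical frame shaped like the output of sb_rum().
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-06-30", "2021-07-01", "2021-07-06", "2021-07-12", "2021-07-13",
    ]),
    "News urls": [
        ["https://example.com/1"],
        ["https://example.com/2"],
        ["https://example.com/3"],
        ["https://example.com/4"],
        ["https://example.com/5"],
    ],
})

# Boolean mask: True where the date lies in the range, endpoints included.
mask = df["Date"].between("2021-07-01", "2021-07-12")
july = df[mask]
print(july)
```

Filtering the finished DataFrame this way replaces the loop over formatted date strings, which only matches dates whose text happens to end in "th".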
