AATU

Reputation: 87

Advice sought on scraping a website with Python

I'm trying to scrape a blog listing page (the URL is in the code below) and want to extract three things from it: 1. the href (hyperlink), 2. the publishing date, and 3. the article description.

I have managed to scrape the "href", but I'm struggling with the publishing date and the article description. Here is the code I used:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://orangecyberdefense.com/global/blog/')
soup = BeautifulSoup(page.content, 'html.parser')

main_table = soup.find('section', attrs={'class':'section articles'})
links = main_table.find_all('a')

Hyperlinks = []
Date = []
Description = []

for link in links:
    Hyperlinks.append(link.attrs['href'])
    Date.append(link.attrs['time'])
    Description.append(link.attrs['description'])

How should I go about extracting the "date" and "description"?

Upvotes: 0

Views: 56

Answers (2)

Pirate X

Reputation: 3093

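link.attrs holds only the attributes of the a tag itself (such as href), so the date and description have to be read from other tags. The dates can be found with find_all(['time']):

# find the time tags and add all the dates to the list
dates = []
date_list = main_table.find_all(['time'])
for date in date_list:
    dates.append(date.get_text())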

For the description, you can select the div tags by their class.

desc = main_table.find_all('div', {'class' : 'description'})

for i in desc:
    Description.append(i.get_text(strip=True))

Output for Dates

['07 May. 2020',
 '07 May. 2020',
 '06 May. 2020',
 '04 May. 2020',
 '04 May. 2020',
 '30 Apr. 2020']
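If the dates are needed for sorting or filtering rather than display, the strings can be converted to date objects. The sketch below is only an illustration: parse_blog_date is a hypothetical helper, and it assumes an English locale and that every date follows the 'DD Mon. YYYY' pattern shown in the output above.

```python
from datetime import datetime

# Date strings in the format shown in the output above.
raw_dates = ['07 May. 2020', '06 May. 2020', '30 Apr. 2020']


def parse_blog_date(text):
    """Parse a string such as '07 May. 2020' into a datetime.date.

    Assumption: English month abbreviation followed by a period.
    Any other spelling raises ValueError.
    """
    return datetime.strptime(text.strip(), '%d %b. %Y').date()


parsed = [parse_blog_date(d) for d in raw_dates]
print(parsed[0].isoformat())
```

Date objects compare chronologically, so max(parsed) gives the most recent article date.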

Output for Description

['While these concerns are warranted, we feel that there has also been a fair amount of hyperbole involved, which was part of our motivation for writing this report.',
 'In this final piece, we’ll look at how the impact of this pandemic and our collective response hold valuable lessons for security practitioners.',
 'Videoconferencing is an essential tool, especially with the COVID-19-lockdown. Zoom, Teams, Webex, Skype: we have checked 10 business solutions for security.',
 'Back to normality: these are the three main things we expect businesses will see when employees make the exodus back to their respective workplaces.',
 'Discover our experts’ ploys to hack the galaxy’s most secure datacenter.',
 'We can’t control the threat, but we can control the vulnerability, so we should focus on that. Our guidelines for responding to the cyber crisis.']

Full Code

import requests
from bs4 import BeautifulSoup
page = requests.get('https://orangecyberdefense.com/global/blog/')
soup = BeautifulSoup(page.content, 'html.parser')

Hyperlinks = []
dates = []
Description = []

main_table = soup.find('section', attrs={'class':'section articles'})
links = main_table.find_all(['a'])

for link in links:
    Hyperlinks.append(link.attrs['href'])

# find time tags
date_list = main_table.find_all(['time'])

for date in date_list:
    dates.append(date.get_text())

# find divs with the description class
desc = main_table.find_all('div', {'class': 'description'})

for i in desc:
    Description.append(i.get_text(strip=True))
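One caveat with the code above: the three lists are filled independently, so if the section contains an extra link, or a card without a date, the entries no longer line up by index. A possible alternative is to read the date and description from inside each link. The sketch below runs on made-up markup (SAMPLE_HTML is hypothetical, as is the text_or_none helper) and assumes the time and description tags are nested inside each a tag, which may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the blog listing; the real page may differ.
SAMPLE_HTML = """
<section class="section articles">
  <a href="/blog/first/">
    <p class="card-title">First post</p>
    <time>07 May. 2020</time>
    <div class="description">First description.</div>
  </a>
  <a href="/blog/second/">
    <p class="card-title">Second post</p>
    <div class="description">No date on this card.</div>
  </a>
</section>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
main_table = soup.select_one('section.articles')

# Build one record per link so href, date and description stay together.
rows = []
for link in main_table.find_all('a'):
    rows.append({
        'href': link.get('href'),
        'date': text_or_none(link.find('time')),
        'description': text_or_none(link.find('div', class_='description')),
    })

for row in rows:
    print(row)
```

A card with a missing piece then shows up as None in its own record instead of shifting every later entry.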

Upvotes: 1

Andrej Kesely

Reputation: 195428

You can use zip() to iterate over the titles, dates and descriptions in parallel.

For example:

import requests
from bs4 import BeautifulSoup

url = 'https://orangecyberdefense.com/global/blog/'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for title, tm, desc in zip(soup.select('p.card-title'), soup.select('time'), soup.select('.description')):
    print(title.get_text(strip=True), tm.get_text(strip=True))
    print('-' * 80)
    print(desc.get_text(strip=True))
    print()

Prints:

Let's examine Cisco Webex - A visionary player 21 May. 2020
--------------------------------------------------------------------------------
CISCO WebEx is a common solution for webinars and videoconferencing. Does it live up to its reputation regarding security?

In-depth product analysis - Zoom & Microsoft Teams 07 May. 2020
--------------------------------------------------------------------------------
While these concerns are warranted, we feel that there has also been a fair amount of hyperbole involved, which was part of our motivation for writing this report.

Lessons learned: How COVID-19 has had a knock-on effect on our businesses 07 May. 2020
--------------------------------------------------------------------------------
In this final piece, we’ll look at how the impact of this pandemic and our collective response hold valuable lessons for security practitioners.

Video killed the conferencing star 06 May. 2020
--------------------------------------------------------------------------------
Videoconferencing is an essential tool, especially with the COVID-19-lockdown. Zoom, Teams, Webex, Skype: we have checked 10 business solutions for security.

COVID-19: when it’s all over 04 May. 2020
--------------------------------------------------------------------------------
Back to normality: these are the three main things we expect businesses will see when employees make the exodus back to their respective workplaces.

Star Wars Day: Orange Cyberdefense hacks the Death Star 04 May. 2020
--------------------------------------------------------------------------------
Discover our experts’ ploys to hack the galaxy’s most secure datacenter.

COVID-19: responding to the cyber part of the crisis 30 Apr. 2020
--------------------------------------------------------------------------------
We can’t control the threat, but we can control the vulnerability, so we should focus on that. Our guidelines for responding to the cyber crisis.
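Note that zip() stops at the shortest input, so if one selector matches fewer elements than the others, the extra items are dropped without any error. The sketch below uses made-up lists (not data from the page) to show the difference between zip() and itertools.zip_longest().

```python
from itertools import zip_longest

# Made-up data: the dates list is one item short.
titles = ['Post A', 'Post B', 'Post C']
dates = ['07 May. 2020', '06 May. 2020']
descriptions = ['About A.', 'About B.', 'About C.']

# zip() truncates to the shortest list, so 'Post C' disappears.
paired = list(zip(titles, dates, descriptions))

# zip_longest() keeps every item and pads the gaps with fillvalue.
padded = list(zip_longest(titles, dates, descriptions, fillvalue=None))

print(len(paired))   # 2
print(padded[-1])    # ('Post C', None, 'About C.')
```

zip_longest() only makes the mismatch visible; it cannot tell which article the missing date belonged to. On Python 3.10 and later, zip(..., strict=True) raises ValueError when the lengths differ, which is another way to catch the problem early.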

Upvotes: 0
