Christopher

Reputation: 2232

Webscraping: Crawling Pages and Storing Content in DataFrame

The following code reproduces my web scraping task for three example URLs:

Code:

import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

# Would otherwise load a csv file with 100+ urls into a DataFrame
# Example data:
links = {'url': ['https://www.apple.com/education/', 'https://www.apple.com/business/', 'https://www.apple.com/environment/']}
urls = pd.DataFrame(data=links)

def scrape_content(url):

    r = requests.get(url)
    html = r.content
    soup = BeautifulSoup(html,"lxml")

    # Get page title
    title = soup.find("meta",attrs={"property":"og:title"})["content"].strip()
    # Get content from paragraphs
    content = soup.find("div", {"class":"section-content"}).find_all('p')

    print(title)

    for p in content:
        p = p.get_text(strip=True)
        print(p)

Apply scraping to each url:

urls['url'].apply(scrape_content)

Out:

Education
Every child is born full of creativity. Nurturing it is one of the most important things educators do. Creativity makes your students better communicators and problem solvers. It prepares them to thrive in today’s world — and to shape tomorrow’s. For 40 years, Apple has helped teachers unleash the creative potential in every student. And today, we do that in more ways than ever. Not only with powerful products, but also with tools, inspiration, and curricula to help you create magical learning experiences.
Watch the keynote
Business
Apple products have always been designed for the way we work as much as for the way we live. Today they help employees to work more simply and productively, solve problems creatively, and collaborate with a shared purpose. And they’re all designed to work together beautifully. When people have access to iPhone, iPad, and Mac, they can do their best work and reimagine the future of their business.
Environment
We strive to create products that are the best in the world and the best for the world. And we continue to make progress toward our environmental priorities. Like powering all Apple facilities worldwide with 100% renewable energy. Creating the next innovation in recycling with Daisy, our newest disassembly robot. And leading the industry in making our materials safer for people and for the earth. In every product we make, in every innovation we create, our goal is to leave the planet better than we found it. Read the 2018 Progress Report

0    None
1    None
2    None
Name: url, dtype: object

Problems:

  1. The code currently only outputs content for the first paragraph of every page. I'd like to get the data for every p in the given selector.
  2. For the final data, I need a data frame that contains the url, title, and content. How can I write the scraped information into a data frame?

Thank you for your help.

Upvotes: 0

Views: 142

Answers (1)

ASGM

Reputation: 11381

Your problem is in this line:

content = soup.find("div", {"class":"section-content"}).find_all('p')

find_all() is getting all the <p> tags, but only within the result of .find(), which returns just the first element that meets the criteria. So you're getting all the <p> tags in the first div.section-content only. It's not entirely clear what the right criteria are for your use case, but if you simply want all the <p> tags on the page you can use:

content = soup.find_all('p')
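To see the difference, here is a minimal made-up snippet with two matching divs (parsed with the built-in html.parser so lxml isn't required):

```python
from bs4 import BeautifulSoup

# Hypothetical two-section page, just to illustrate the selector behavior
html = """
<div class="section-content"><p>first</p></div>
<div class="section-content"><p>second</p><p>third</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching div, so only its <p> tags are returned
first_only = soup.find("div", {"class": "section-content"}).find_all("p")
print([p.get_text() for p in first_only])   # ['first']

# find_all() on the soup itself collects <p> tags from the whole page
all_ps = soup.find_all("p")
print([p.get_text() for p in all_ps])       # ['first', 'second', 'third']
```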

Then you can make scrape_content() merge the <p> tag text and return it along with the title:

content = '\r'.join([p.get_text(strip=True) for p in content])
return title, content
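Putting it together, one possible revision of the function (a sketch: it splits parsing from fetching so the parsing can be tested on static HTML, uses the urllib.request import the question already has, and adds a fallback to the <title> tag as an assumption in case some page lacks og:title):

```python
import urllib.request
from bs4 import BeautifulSoup

def parse_content(html):
    # Parsing is separated from fetching so it can be tested offline
    soup = BeautifulSoup(html, "html.parser")  # the question used "lxml"; either works

    tag = soup.find("meta", attrs={"property": "og:title"})
    if tag is not None:
        title = tag["content"].strip()
    elif soup.title is not None:
        # Fallback assumption: not every page is guaranteed to have og:title
        title = soup.title.get_text(strip=True)
    else:
        title = ""

    content = '\r'.join(p.get_text(strip=True) for p in soup.find_all('p'))
    return title, content

def scrape_content(url):
    with urllib.request.urlopen(url) as r:   # or requests.get(url).content
        return parse_content(r.read())
```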

Outside the function, you can build the dataframe:

url_list = urls['url'].tolist()
results = [scrape_content(url) for url in url_list]
title_list = [r[0] for r in results]
content_list = [r[1] for r in results]
df = pd.DataFrame({'url': url_list, 'title': title_list, 'content': content_list})
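Equivalently, you can transpose the list of (title, content) tuples in one step with zip (shown here with stand-in results rather than live scrapes):

```python
import pandas as pd

# Stand-in results, shaped like the (title, content) tuples scrape_content returns
url_list = ['https://www.apple.com/education/', 'https://www.apple.com/business/']
results = [('Education', 'edu text'), ('Business', 'biz text')]

# zip(*results) transposes the list of tuples into (titles, contents)
titles, contents = zip(*results)
df = pd.DataFrame({'url': url_list, 'title': titles, 'content': contents})
print(df.shape)   # (2, 3)
```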

Upvotes: 1
