Why doesn't this function return the same output in both situations (web scraping project)?

import requests
import re
from bs4 import BeautifulSoup

#The website I like to get, converts the contents of the web page to lxml format
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "lxml")

#Modifies the given string to look visually good. Like this:
#['21 / JulZaterdag2018'] becomes 21 Jul 2018
def remove_char(string):
    #All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]

    #Replace every blacklisted character with white space
    for char in blacklist:
        string = string.replace(char,' ')

    #Collapse runs of two or more whitespace characters into a single space
    string = re.sub(r"\s\s+", " ", string)


#Gets the date of the festival I'm interested in
def get_date_info():
    #Makes a list for the data
    raw_info = []

    #Adds every "div" with a certain name to list, and converts it to text
    for link in soup.find_all("div", {"class": "event-single-data"}):
        raw_info.append(link.text)

    #Converts list into string, because remove_char() only accepts strings    
    raw_info = str(raw_info)

    #Modifies the string as explained above
    final_date = remove_char(raw_info)

    #Prints the date in this format: 21 Jul 2018(example)
    print(final_date)


get_date_info()

Hi there! I'm currently working on a small web scraping project to get more experienced with Python. It collects festival information such as date, time and price and writes it to a text file. I'm using BeautifulSoup to parse and navigate the web page. The links are at the bottom of this post.

But now I've run into a problem and I can't figure out what's wrong. Maybe I'm overlooking something obvious. When I run this program it should print 21 Jul 2018, but instead it prints None. It seems as if every character in the string gets removed.

I tried running remove_char() on its own, with the same list (converted to a string first) as input. This worked perfectly: it returned "21 Jul 2018" as it was supposed to. So I'm quite sure the error is not in this function.

So somehow I'm missing something. Maybe it has to do with BeautifulSoup and how it handles things?

Hope someone can help me out!

BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Web page: https://festivalfans.nl/event/dominator-festival

Upvotes: 0

Answers (3)

ThunderHorn

Reputation: 2035

import requests

from bs4 import BeautifulSoup

base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content , "html.parser")

def get_date_info():

    for link in soup.find_all("div", {"class": "event-single-data"}):
        day = link.find('div', {"class":"event-single-day"}).text.replace(" ", '')
        month = link.find('div', {"class": "event-single-month"}).text.replace('/', "").replace(' ', '')
        year = link.find('div', {"class": "event-single-year"}).text.replace(" ", '')
        print(day, month, year)

get_date_info()

Here is a simpler version that doesn't need re.
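As for why the original code prints None: remove_char() as posted in the question has no return statement, and in Python a function that ends without one returns None. The following is a minimal sketch of that effect, not a drop-in fix for the scraper. The names remove_char_no_return and BLACKLIST and the added strip() call are illustrative assumptions; the sample string is the example input from the question.

```python
import re

# Characters and Dutch day names to strip, copied from the question's blacklist.
BLACKLIST = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
             "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]


def remove_char_no_return(string):
    # Same steps as the question's function: the cleaned value is only
    # bound to the local name and is discarded when the function ends.
    for char in BLACKLIST:
        string = string.replace(char, " ")
    string = re.sub(r"\s\s+", " ", string)


def remove_char(string):
    # Identical cleaning steps, but the result is handed back to the caller.
    for char in BLACKLIST:
        string = string.replace(char, " ")
    # strip() is an addition here (assumption) to drop leading/trailing spaces.
    return re.sub(r"\s\s+", " ", string).strip()


sample = "['21 / JulZaterdag2018']"  # example input taken from the question
print(remove_char_no_return(sample))  # None
print(remove_char(sample))            # 21 Jul 2018
```

With the return in place, final_date in get_date_info() would receive the cleaned string instead of None.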

Upvotes: 0