Reputation: 31
import requests
import re
from bs4 import BeautifulSoup

#The website I like to get, converts the contents of the web page to lxml format
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "lxml")

#Modifies the given string to look visually good. Like this:
#['21 / JulZaterdag2018'] becomes 21 Jul 2018
def remove_char(string):
    #All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    #Replace every blacklisted character with white space
    for char in blacklist:
        string = string.replace(char,' ')
    #Replace more than 2 consecutive white spaces
    string = re.sub("\s\s+", " ", string)

#Gets the date of the festival I'm interested in
def get_date_info():
    #Makes a list for the data
    raw_info = []
    #Adds every "div" with a certain name to list, and converts it to text
    for link in soup.find_all("div", {"class": "event-single-data"}):
        raw_info.append(link.text)
    #Converts list into string, because remove_char() only accepts strings
    raw_info = str(raw_info)
    #Modifies the string as explained above
    final_date = remove_char(raw_info)
    #Prints the date in this format: 21 Jul 2018(example)
    print(final_date)

get_date_info()
Hi there! I'm currently working on a small web scraping project to get more experienced with Python. It fetches festival information such as date, time and price and writes it to a text file. I'm using BeautifulSoup to parse and navigate the web page. Links are at the bottom!
But now I'm running into a problem and can't figure out what's wrong. Maybe I'm overlooking something. When I run this program it should print 21 Jul 2018, but instead it prints None. For some reason it looks like every character in the string gets removed.
I tried running remove_char() on its own, with the same list (converted to a string first) as input. This worked perfectly: it gave "21 Jul 2018" like it was supposed to. So I'm quite sure the error is not in this function.
So somehow I'm missing something. Maybe it has to do with BeautifulSoup and how it handles things?
Hope someone can help me out!
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Web page: https://festivalfans.nl/event/dominator-festival
Upvotes: 0
Views: 61
Reputation: 2035
import requests
from bs4 import BeautifulSoup

base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content , "html.parser")

def get_date_info():
    for link in soup.find_all("div", {"class": "event-single-data"}):
        day = link.find('div', {"class":"event-single-day"}).text.replace(" ", '')
        month = link.find('div', {"class": "event-single-month"}).text.replace('/', "").replace(' ', '')
        year = link.find('div', {"class": "event-single-year"}).text.replace(" ", '')
        print(day, month, year)

get_date_info()
Here is a simpler version that reads each part of the date directly, so there is no need for re.
Upvotes: 0
Reputation: 2485
You forgot to return the value in the remove_char() function.
That's it!
Upvotes: 2
Reputation: 184200
Neither of your functions has a return statement, and so both return None by default. remove_char() should end with return string, for example.
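To make this concrete, here is a minimal sketch that runs without the network. It compares the question's remove_char() as posted (renamed remove_char_no_return here purely for illustration) with a version that ends in a return. The sample string is the one from the question's comment. The .strip() call and the raw-string regex are small additions of mine, not part of the original code.

```python
import re

# Same blacklist as in the question
BLACKLIST = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
             "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]

def remove_char_no_return(string):
    """The question's logic as posted: the cleaned string is never returned."""
    for char in BLACKLIST:
        string = string.replace(char, " ")
    string = re.sub(r"\s\s+", " ", string)
    # Falling off the end of a function returns None implicitly

def remove_char(string):
    """Same logic, but the result is handed back to the caller."""
    for char in BLACKLIST:
        string = string.replace(char, " ")
    # Collapse runs of two or more whitespace characters into one space
    string = re.sub(r"\s\s+", " ", string)
    # strip() is an addition here: it drops the leading/trailing space
    return string.strip()

sample = "['21 / JulZaterdag2018']"
print(remove_char_no_return(sample))  # None
print(remove_char(sample))            # 21 Jul 2018
```

The same applies to get_date_info(): it only prints, so if you later want to write the date to a text file, have it return final_date as well.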
Upvotes: 0