Reputation: 284
I have scraped data from a URL using Beautiful Soup. After cleaning, there are still a number of blank spaces, whitespace characters and newlines in the cleaned data. I tried the .strip() function to remove them, but they are still present.
Code
from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)
Output
America the Beautiful: A Virtual Patriotic Salute Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 Violin Virtuoso Beethoven Virtual 5k In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events Violin Virtuoso Beethoven Virtual 5k Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members 2021 Flagstaff Symphony Orchestra.
Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
In the above code I replaced the non-ASCII characters with ' ' (a blank space). If I didn't replace them with a blank space, several words would be joined together. What I am trying to obtain is a single string with no unnecessary spaces or newlines.
Added Question
I tried every method I could think of, such as strip() and re.sub(), to remove the space at the beginning of some lines in the text, but nothing works for the following data:
Subscription Tickets
All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
Violin Virtuoso
Beethoven Virtual 5k
How can I remove those spaces?
Upvotes: 0
Views: 84
Reputation: 333
It's not clear whether you want to retain some whitespace for readability. In case you do, you can try this approach:
Update: Added code to retain only alphanumeric characters plus the characters in an explicit list (named exclude_list in the code; these are the extra characters that are kept, not removed).
Code:
from bs4 import BeautifulSoup
import requests
def clean_scraped_text(raw_text):
    # strip whitespace from the start and end of the raw text
    stripped_text = raw_text.strip()
    processed_text = ''
    for i, char in enumerate(stripped_text):
        # add a single '\n' to processed_text for every run of '\n'
        if char == '\n':
            if stripped_text[i - 1] != '\n':
                processed_text += '\n'
        else:
            # if the character is not '\n', add it to processed_text
            processed_text += char
    # clean whitespace from each line in processed_text
    cleaned_text = ''
    for line in processed_text.splitlines():
        # only retain alphanumeric characters and the listed characters
        exclude_list = [' ', '\xa0', '-']
        line = ''.join(x for x in line if x.isalnum() or (x in exclude_list))
        cleaned_text += line.strip() + '\n'
    return cleaned_text
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
text = BeautifulSoup(html_content, "lxml").text
print(clean_scraped_text(text))
Output:
America the Beautiful A Virtual Patriotic Salute Flagstaff Symphony Orchestra
Contact
Hit enter to search or ESC to close
About
Our Team
Our Conductor
Orchestra Members
Concerts Events
Season 72 Concerts
Subscribe
Venue Parking Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
All Events
This event has passed
America the Beautiful A Virtual Patriotic Salute
July 4 2020
Violin Virtuoso
Beethoven Virtual 5k
In place of our traditional 4th of July concert at the Pepsi Amphitheater the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4 2020 at 11am The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians coming together virtually to celebrate our nations independence
CLICK HERE FOR DETAILS
Google Calendar iCal Export
Details
Date
July 4 2020
Event Category Concerts and Events
Violin Virtuoso
Beethoven Virtual 5k
Concert InfoConcerts
Concerts and Events FAQs
FSO InfoAbout FSO Mission and History
Our Team
Our Conductor
Orchestra Members
Support FSOMake a Donation
Underwriting a Concert
Sponsor a Chair
Advertise with FSO
Volunteer
Leave a Legacy
Donor Bill of Rights
Code of Ethical Standards Used by permission of the Association of Fundraising Professionals
ResourcesCommunity Education
For Musicians
For Board Members
2021 Flagstaff Symphony Orchestra
Copyright 2019 Flagstaff Symphony Association
About
Our Team
Our Conductor
Orchestra Members
Concerts Events
Season 72 Concerts
Subscribe
Venue Parking Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
Contact
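If punctuation should be kept and only the stray whitespace removed (including the leading spaces on individual lines mentioned in the added question), the line-by-line idea above can be sketched more compactly. This is a minimal sketch, not the code that produced the output above: the function name clean_lines and the sample string are made up for illustration.

```python
def clean_lines(raw_text):
    """Collapse whitespace inside each line and drop lines that end up empty."""
    cleaned = []
    for line in raw_text.splitlines():
        # split() with no arguments splits on any run of whitespace,
        # including tabs and non-breaking spaces (\xa0)
        words = line.split()
        if words:
            cleaned.append(' '.join(words))
    return '\n'.join(cleaned)


# hypothetical sample imitating the indented lines from the question
sample = "  Subscription Tickets\n\n\t All Events\n\xa0This event has passed.  \n"
print(clean_lines(sample))
```

Calling strip() once on the whole string only trims its two ends, which is probably why the spaces at the start of the inner lines survived; here every line is handled separately.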
Upvotes: 1
Reputation: 4510
Try this:
from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
# collapse every run of whitespace (spaces, tabs, newlines) into a single space
text = re.sub(r'\s+', ' ', clean_data)
print(text)
with open('read.txt', 'w') as file:
    file.write(text)
Output:
America the Beautiful: A Virtual Patriotic Salute – Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets « All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 « Violin Virtuoso Beethoven Virtual 5k » In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of “America the Beautiful” performed by 60 of their professional musicians, coming together virtually, to celebrate our nation’s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events « Violin Virtuoso Beethoven Virtual 5k » Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members © 2021 Flagstaff Symphony Orchestra. © Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
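The whitespace collapse can also be combined with the non-ASCII replacement from the question in one small helper. The sketch below is only an illustration under assumptions: the name normalize_whitespace, its ascii_only flag and the sample string are hypothetical and not part of the original code.

```python
import re


def normalize_whitespace(text, ascii_only=False):
    """Return text on one line with single spaces between words."""
    if ascii_only:
        # replace each run of non-ASCII characters with a space so that
        # the words on either side do not get joined together
        text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # \s matches spaces, tabs, newlines and non-breaking spaces
    return re.sub(r'\s+', ' ', text).strip()


# hypothetical sample containing an en dash, a tab and a non-breaking space
sample = "\n\n  America the Beautiful \u2013 Flagstaff\tSymphony\xa0Orchestra \n"
print(normalize_whitespace(sample))
print(normalize_whitespace(sample, ascii_only=True))
```

The final strip() removes the single space that can be left at either end after the substitution.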
Upvotes: 2