Aniiya0978

Reputation: 284

Removing whitespaces/blankspaces/newlines from scraped data

I scraped data from a URL using Beautiful Soup. After cleaning, the text still contains many blank spaces and newlines. I tried the .strip() method to remove them, but they are still present.

Code

from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)

Output

   America the Beautiful: A Virtual Patriotic Salute   Flagstaff Symphony Orchestra                                                                                           Contact             Hit enter to search or ESC to close                                     About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets                  All Events   This event has passed. America the Beautiful: A Virtual Patriotic Salute  July 4, 2020         Violin Virtuoso Beethoven Virtual 5k             In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of  America the Beautiful  performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS   + Google Calendar+ iCal Export     Details    Date:    July 4, 2020   Event Category: Concerts and Events             Violin Virtuoso Beethoven Virtual 5k                   Concert InfoConcerts Concerts and Events FAQs     FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members     Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards  (Used by permission of the Association of Fundraising Professionals)     ResourcesCommunity & Education For Musicians For Board Members             2021 Flagstaff Symphony Orchestra. 
           Copyright 2019 Flagstaff Symphony Association                             About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets   Contact  

In the code above I replaced non-ASCII (Unicode) characters with ' ' (a blank space). If I didn't replace them with a blank space, several words would be joined together. What I am trying to obtain is a single string with no unnecessary spaces or newlines.

Added Question

I tried methods such as strip() and re.sub() to remove the space at the beginning of some lines in the text, but nothing works for the following data:

Subscription Tickets
 All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
 Violin Virtuoso
Beethoven Virtual 5k 

How can we remove those spaces?

Upvotes: 0

Views: 84

Answers (3)

jollibobert

Reputation: 333

It's not clear whether you want to retain some whitespace for readability. In case you do, you can try this approach:

Update: Added code to retain only alphanumeric characters, plus the characters listed in exclude_list (which are exempt from removal).

Code:

from bs4 import BeautifulSoup
import requests


def clean_scraped_text(raw_text):

    # strip whitespaces from start and end of raw text
    stripped_text = raw_text.strip()

    processed_text = ''
    for i, char in enumerate(stripped_text):
        # add a single '\n' to processed_text for every sequence of '\n'
        if char == '\n':
            if stripped_text[i - 1] != '\n':
                processed_text += '\n'
        else:
            # if character is not '\n' add it to processed_text
            processed_text += char

    # clean whitespace from each line in processed_text
    cleaned_text = ''
    for line in processed_text.splitlines():
        # only retain alphanumeric characters and listed characters 
        exclude_list = [' ', '\xa0', '-']
        line = ''.join(x for x in line if x.isalnum() or (x in exclude_list))
        cleaned_text += line.strip() + '\n'

    return cleaned_text

URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
text = BeautifulSoup(html_content, "lxml").text
print(clean_scraped_text(text))

Output:

America the Beautiful A Virtual Patriotic Salute  Flagstaff Symphony Orchestra

Contact
Hit enter to search or ESC to close


About
Our Team
Our Conductor
Orchestra Members
Concerts  Events
Season 72 Concerts
Subscribe
Venue Parking  Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
All Events
This event has passed
America the Beautiful A Virtual Patriotic Salute
July 4 2020
Violin Virtuoso
Beethoven Virtual 5k
In place of our traditional 4th of July concert at the Pepsi Amphitheater the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4 2020 at 11am The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians coming together virtually to celebrate our nations independence
CLICK HERE FOR DETAILS
Google Calendar iCal Export
Details
Date
July 4 2020
Event Category Concerts and Events

Violin Virtuoso
Beethoven Virtual 5k

Concert InfoConcerts
Concerts and Events FAQs

FSO InfoAbout FSO Mission and History
Our Team
Our Conductor
Orchestra Members
Support FSOMake a Donation
Underwriting a Concert
Sponsor a Chair
Advertise with FSO
Volunteer
Leave a Legacy
Donor Bill of Rights
Code of Ethical Standards  Used by permission of the Association of Fundraising Professionals
ResourcesCommunity  Education
For Musicians
For Board Members
2021 Flagstaff Symphony Orchestra
Copyright 2019 Flagstaff Symphony Association


About
Our Team
Our Conductor
Orchestra Members
Concerts  Events
Season 72 Concerts
Subscribe
Venue Parking  Concerts FAQs
Support The FSO
Donate to FSO
Sponsor a Chair
Funding and Impact
Videos
Donate
Subscription Tickets
Contact
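Note that the alphanumeric filter above also drops punctuation such as ':' and ',' (for example "July 4, 2020" becomes "July 4 2020"). If the punctuation should survive, a minimal sketch of a per-line cleanup that only touches whitespace could look like the following. The function name normalize_lines and the sample string are made up for illustration; only built-in string methods are used.

```python
def normalize_lines(raw_text):
    """Tidy whitespace line by line while keeping punctuation and line breaks."""
    cleaned_lines = []
    for line in raw_text.splitlines():
        # split() with no arguments splits on any run of whitespace
        # (including '\xa0') and ignores leading/trailing whitespace,
        # so joining the pieces with ' ' rebuilds a tidy line
        line = ' '.join(line.split())
        if line:  # skip lines that were empty or whitespace-only
            cleaned_lines.append(line)
    return '\n'.join(cleaned_lines)


# hypothetical sample modelled on the lines shown in the question
sample = "Subscription Tickets\n All Events\nThis event has passed.\n\n\n Violin Virtuoso\nBeethoven Virtual 5k \n"
print(normalize_lines(sample))
```

Because each line is handled separately, the leading space in " All Events" and the trailing space in "Beethoven Virtual 5k " are removed, which a single strip() on the whole text would not do.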

Upvotes: 1

Sabil

Reputation: 4510

Try this:

from bs4 import BeautifulSoup
import requests
import re


URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
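# collapse every run of whitespace (spaces, tabs, newlines) into one space;
# the raw string avoids an invalid-escape warning for '\s'
text = re.sub(r'\s+', ' ', clean_data).strip()
print(text)
with open('read.txt', 'w', encoding='utf-8') as file:
    file.write(text)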

Output:

America the Beautiful: A Virtual Patriotic Salute – Flagstaff Symphony Orchestra Contact Hit enter to search or ESC to close About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets « All Events This event has passed. America the Beautiful: A Virtual Patriotic Salute July 4, 2020 « Violin Virtuoso Beethoven Virtual 5k » In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of “America the Beautiful” performed by 60 of their professional musicians, coming together virtually, to celebrate our nation’s independence. CLICK HERE FOR DETAILS + Google Calendar+ iCal Export Details Date: July 4, 2020 Event Category: Concerts and Events « Violin Virtuoso Beethoven Virtual 5k » Concert InfoConcerts Concerts and Events FAQs FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards (Used by permission of the Association of Fundraising Professionals) ResourcesCommunity & Education For Musicians For Board Members © 2021 Flagstaff Symphony Orchestra. © Copyright 2019 Flagstaff Symphony Association About Our Team Our Conductor Orchestra Members Concerts & Events Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs Support The FSO Donate to FSO Sponsor a Chair Funding and Impact Videos Donate Subscription Tickets Contact
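The output above still contains non-ASCII characters such as « and ©. If an ASCII-only result is wanted, as in the question, the two substitutions can be combined. This is only a sketch: the function name ascii_single_spaced and the sample text are assumptions for illustration, and replacing non-ASCII characters also turns typographic apostrophes into spaces (so "nation’s" becomes "nation s", as in the question's own output).

```python
import re


def ascii_single_spaced(text):
    """Replace non-ASCII runs with a space, then collapse all whitespace."""
    # a space (not an empty string) is substituted so that words on either
    # side of a removed character are not joined together
    ascii_only = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # every run of spaces, tabs and newlines becomes a single space
    return re.sub(r'\s+', ' ', ascii_only).strip()


# hypothetical sample modelled on the scraped page text
sample = "Tickets \u00ab All Events\n\nJuly 4, 2020 \u00ab Violin Virtuoso\tBeethoven Virtual 5k \u00bb"
print(ascii_single_spaced(sample))
```

Doing the non-ASCII replacement first matters: it introduces extra spaces, and the second substitution then cleans those up together with the original whitespace.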

Upvotes: 2

Muhammad Hassan

Reputation: 4229

You can try:

print(re.sub(r'\s+', ' ', text))
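As for why strip() on its own did not help: it only trims whitespace at the two ends of the whole string, while the pattern \s+ matches whitespace runs anywhere in it. A small illustration, using a made-up raw string rather than the real page text:

```python
import re

# hypothetical input with whitespace at the ends and in the middle
raw = "\n\t  Contact \n\n\n   Hit enter to search   or ESC to close\t\n"

# strip() removes whitespace only at the very start and end of the string
only_stripped = raw.strip()

# r'\s+' matches every run of spaces, tabs and newlines, wherever it occurs
collapsed = re.sub(r'\s+', ' ', raw).strip()

print(repr(only_stripped))
print(repr(collapsed))
```

The first result still has the newlines and repeated spaces in the middle; the second is a single line with single spaces between words.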

Upvotes: 2
