Reputation: 31
As part of Information Retrieval project in Python (building a mini search engine), I want to keep clean text from downloaded tweets (.csv data set of tweets - 27000 tweets to be exact), a tweet will look like:
"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL
or
"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX
I want, using regex, to remove unnecessary parts of the tweets, like URL, punctuation and etc
So the result will be:
"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"
and
"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"
tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]')
, but it doesn't do a perfect job, as parts of the URL for example is still present in the result.
Please help me find a regex pattern that will do what i want.
Upvotes: 3
Views: 193
Reputation: 82755
This might help.
Demo:
import re
s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""
def cleanString(text):
res = []
for i in text.strip().split():
if not re.search(r"(https?)", i): #Removes URL..Note: Works only if http or https in string.
res.append(re.sub(r"[^A-Za-z\.]", "", i).replace(".", " ")) #Strip everything that is not alphabet(Upper or Lower)
return " ".join(map(str.strip, res))
print(cleanString(s1))
print(cleanString(s2))
Upvotes: 1