Geralyn Feltner

Reputation: 31

Remove URLs and punctuation from tweet text with regex

As part of an Information Retrieval project in Python (building a mini search engine), I want to clean the text of downloaded tweets (a .csv data set of 27,000 tweets). A tweet looks like this:

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL

or

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX

Using regex, I want to remove the unnecessary parts of each tweet, such as URLs and punctuation.

The results should be:

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

and

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job: parts of the URL, for example, are still present in the result.

Please help me find a regex pattern that does what I want.

Upvotes: 3

Views: 193

Answers (1)

Rakesh

Reputation: 82755

This might help.

Demo:

import re

s1 = """"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX"""
s2 = """"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL"""    

def cleanString(text):
    res = []
    for token in text.strip().split():
        # Skip URL tokens. Note: this only works if the token contains http or https.
        if re.search(r"https?", token):
            continue
        # Drop everything except letters and dots, then split on dots so that
        # "dignity...these" yields two words and no stray spaces remain.
        letters = re.sub(r"[^A-Za-z.]", "", token)
        res.extend(letters.replace(".", " ").split())
    return " ".join(res)

print(cleanString(s1))
print(cleanString(s2))
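
The token-by-token approach above can also be written as two substitutions over the whole string. The following is only a sketch under stated assumptions: the names clean_tweet, URL_RE and NON_LETTER_RE are illustrative and not from the question, and a URL is assumed to start with http://, https:// or www. and to run up to the next whitespace.

```python
import re

# Assumption: a URL starts with http://, https:// or www. and ends at whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Any run of characters that are not ASCII letters (digits, punctuation, spaces).
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")

def clean_tweet(text):
    """Remove URLs, then replace each run of non-letters with a single space."""
    without_urls = URL_RE.sub(" ", text)
    return NON_LETTER_RE.sub(" ", without_urls).strip()

tweets = [
    '"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX',
    '"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL',
]

for tweet in tweets:
    print(clean_tweet(tweet))
```

Removing the URL first matters, because otherwise its letters (https, twitter, com, and so on) would survive the second substitution. Note that this variant turns every non-letter into a space, so a word such as "don't" would come out as "don t", and non-ASCII letters are dropped; adjust the character class if that is not what you want.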

Upvotes: 1
