Keep text clean from url

Question

As part of an Information Retrieval project in Python (building a mini search engine), I want to clean the text of downloaded tweets (a .csv data set of 27,000 tweets). A tweet looks like:

"The basic longing to live with dignity...these yearnings are universal. They burn in every human heart 1234." —@POTUS https://twitter.com/OZRd5o4wRL

or

"Democracy...allows us to peacefully work through our differences, and move closer to our ideals" —@POTUS in Greece https://twitter.com/PIO9dG2qjX

Using regex, I want to remove the unnecessary parts of the tweets, such as URLs, punctuation, and so on.

So the result will be:

"The basic longing to live with dignity these yearnings are universal They burn in every human heart POTUS"

and

"Democracy allows us to peacefully work through our differences and move closer to our ideals POTUS in Greece"

I tried this: pattern = RegexpTokenizer(r'[A-Za-z]+|^[0-9]'), but it doesn't do a perfect job: parts of the URL, for example, are still present in the result.

Please help me find a regex pattern that will do what I want.

Answers (1)
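
One possible approach, offered as a minimal sketch rather than a definitive answer: the pattern in the question keeps every run of letters, so the letter runs inside the URL (https, twitter, com, ...) survive as tokens. Removing the URL as a whole first, and only then stripping everything that is not a letter, avoids that. The names URL_RE, NON_LETTER_RE and clean_tweet below are made up for illustration, and the sketch assumes the tweets are English text where digits and all punctuation should be dropped, as in the expected results above.

```python
import re

# What the original pattern matches inside the URL: every run of letters.
# (re.findall is used here as a stand-in for the tokenizer in the question.)
leftover = re.findall(r"[A-Za-z]+|^[0-9]", "https://twitter.com/OZRd5o4wRL")
print(leftover)  # ['https', 'twitter', 'com', 'OZRd', 'o', 'wRL']

# Hypothetical helper patterns (not from the question):
# an http(s) or www. link, up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Any run of characters that are neither ASCII letters nor whitespace
# (digits, punctuation, symbols, non-ASCII characters).
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]+")


def clean_tweet(text):
    """Return text with URLs, digits and punctuation removed."""
    text = URL_RE.sub(" ", text)         # 1. drop URLs while they are still intact
    text = NON_LETTER_RE.sub(" ", text)  # 2. replace everything else unwanted with a space
    return " ".join(text.split())        # 3. collapse runs of whitespace


tweet = ('"The basic longing to live with dignity...these yearnings are universal. '
         'They burn in every human heart 1234." \u2014@POTUS '
         'https://twitter.com/OZRd5o4wRL')
print(clean_tweet(tweet))
```

The order of the two substitutions matters: if punctuation were stripped first, the URL would be broken into plain words and could no longer be recognised. Note that this sketch also splits contractions ("don't" becomes "don t") and discards accented or non-Latin letters, so the character class would need adjusting if those matter for the search engine. The function can be applied to each tweet string read from the .csv file before tokenizing.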
