Sudhanshu

Reputation: 732

How to remove urls between texts in pandas dataframe rows?

I am trying to solve an NLP problem. The text column of my dataframe has many rows containing URLs such as http.somethingsomething. Some of the URLs have no space between them and the surrounding text, for example ':http:\\something', ';http:\\something', ',http:\\something'.

So sometimes a comma comes directly before the URL with no space, and sometimes another character, but it is mostly one of , . : ;. The URL is either at the start or at the end of the text.

id text target
1 we always try to bring the heavy metal rt http:\\something11 1
4 on plus side look at the sky last night it was ablaze ;http:\\somethingdifferent 1
6 inec office in abia set ablaze :http:\\itsjustaurl 1
3 .http:\\something11 we always try to bring the heavy metal rt 1

So I want to know how I can remove these links. I am using Python for this task.

Upvotes: 1

Views: 1536

Answers (1)

Tim Biegeleisen

Reputation: 522817

A simple approach would be to just remove any URL starting with http or https:

df["text"] = df["text"].str.replace(r'\s*https?://\S+(\s+|$)', ' ', regex=True).str.strip()

There is some subtle logic in the above line of code which merits some explanation. The pattern matches a URL, with optional whitespace on the left and mandatory whitespace on the right (except when the URL runs to the end of the text). We then replace that match with a single space, and call str.strip() in case this operation leaves dangling whitespace at the start or end.
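To see the same logic outside of pandas, here is a minimal sketch using only the standard re module. The names strip_urls and URL_PATTERN are made up for illustration, and example.com is a placeholder URL. It also assumes that the http:\\ form in the question's sample rows is literal, so the pattern accepts two backslashes as well as two forward slashes; drop that part if your data only has http://.

```python
import re

# Assumption: a URL starts with http: or https: followed by two slashes,
# either forward (//) or, as in the question's sample rows, backward (\\).
# Optional whitespace before the URL and whitespace (or end of string)
# after it are consumed together with the URL.
URL_PATTERN = re.compile(r'\s*https?:[/\\]{2}\S+(?:\s+|$)')


def strip_urls(text: str) -> str:
    """Replace each URL and its surrounding whitespace with one space, then trim."""
    return URL_PATTERN.sub(' ', text).strip()


samples = [
    r"we always try to bring the heavy metal rt http:\\something11",
    r"inec office in abia set ablaze :http:\\itsjustaurl",
    "see https://example.com/page for details",
]
cleaned = [strip_urls(s) for s in samples]
for line in cleaned:
    print(line)
# we always try to bring the heavy metal rt
# inec office in abia set ablaze :
# see for details
```

Note that punctuation glued to the front of a URL (the ":" in the second row) is left in place, since it is not part of the match. If the column contains only strings, the function can be applied with df["text"] = df["text"].map(strip_urls).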

Upvotes: 2
