Reputation: 732
I am trying to solve an NLP problem. The text column of my dataframe has many rows containing URLs like http.somethingsomething. Some of the URLs have no space between them and the surrounding text, for example ':http:\\something', ';http:\\something', ',http:\\something'. So sometimes there is a ',' directly before the URL with no space, and sometimes another character, but mostly ',', '.', ':' or ';'. The URL is either at the start or at the end of the text.
id | text | target |
---|---|---|
1 | we always try to bring the heavy metal rt http:\\something11 | 1 |
4 | on plus side look at the sky last night it was ablaze ;http:\\somethingdifferent | 1 |
6 | inec office in abia set ablaze :http:\\itsjustaurl | 1 |
3 | .http:\\something11 we always try to bring the heavy metal rt | 1 |
How can I remove these links? I am using Python for this task.
Upvotes: 1
Views: 1536
Reputation: 522817
A simple approach would be to just remove any URL starting with http or https:
df["text"] = df["text"].str.replace(r'\s*https?://\S+(\s+|$)', ' ', regex=True).str.strip()
There is some subtle logic in the above line of code which merits some explanation. We match a URL, with optional whitespace on the left and mandatory whitespace on the right (except when the URL runs to the end of the string). Then we replace the whole match with a single space, and use strip() in case this operation leaves dangling whitespace at the start or end.
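To see how this behaves on rows like the ones in the question, here is a minimal sketch. The sample rows are hypothetical and only modeled on the question's table. The `basic` pattern is the one from this answer; the `variant` pattern is an assumption of mine, not part of the original answer, that additionally drops one ',', '.', ':' or ';' glued to the front of the URL and accepts `\\` as well as `//` after the scheme, since the question's sample data is written that way.

```python
import pandas as pd

# Hypothetical sample rows modeled on the question's table.
df = pd.DataFrame({
    "text": [
        "we always try to bring the heavy metal rt http://something11",
        "inec office in abia set ablaze :http://itsjustaurl",
        ".http://something11 we always try to bring the heavy metal rt",
        r"last night it was ablaze ;http:\\somethingdifferent",
    ]
})

# Pattern from the answer: optional whitespace, the URL, then whitespace or end of string.
basic = r"\s*https?://\S+(\s+|$)"
df["basic"] = df["text"].str.replace(basic, " ", regex=True).str.strip()

# Assumed variant (not from the answer): also remove one punctuation character
# directly attached to the URL, and allow two slashes or two backslashes.
variant = r"\s*[,.:;]?https?:[/\\]{2}\S+(\s+|$)"
df["variant"] = df["text"].str.replace(variant, " ", regex=True).str.strip()

print(df[["basic", "variant"]].to_string())
```

With the basic pattern, punctuation that was attached to the URL stays behind (for example a trailing ':'), and the backslash form is left untouched. Whether the variant is appropriate depends on whether that punctuation carries meaning in your data.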
Upvotes: 2