Removing urls from a data-frame column with targetblank tag

Question

I want to remove URLs from a column in a data-frame. The column I am interested in is called comment, and an example entry in comment is:

|comment|
|:-----:|
| """Drone Strikes Up 432 Percent Under. Donald Trump"" by Joe Wolverton, II, J.D. https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-st... |
| ""Trump is weighing  major escalation in Yemen's devastating war |
| The war has already killed at   least 10,000, displaced 3 million, and.  left millions more at risk of famine."" |
| " |

The entry above shows the issue I am trying to solve. I want to completely remove:

https://www.thenewamerican.com/usnews/foreign-policy/item/25604-drone-st...

I've tried:

df['comment'] = df['comment'].replace(r'https\S+', ' ', regex=True).replace(r'www\S+', ' ', regex=True).replace(r'http\S+', ' ', regex=True)

However, this leaves me with:

href title targetblank com

Corralien · Accepted Answer

Try:

df['comment'] = df['comment'].str.replace(r'<a[^>]*.*?<\/a>', '', regex=True)

Output:

>>> df.loc[0, 'comment']

'Drone Strikes Up 432 Percent Under. Donald Trump"" by Joe Wolverton, II, J.D. 
""Trump is weighing  major escalation in Yemen\'s devastating war
The war has already killed at   least 10,000, displaced 3 million, and.  left millions more at risk of famine."" 
'
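
For anyone who wants to see why the URL-only patterns leave `href title targetblank` behind, here is a minimal sketch. The sample row and its anchor markup (`example.com`, the `title` and `target` attributes) are assumptions made up for illustration, since the raw HTML of the comment is not shown in the question. The pattern is a slightly stricter variant that requires the closing `>` of the opening tag, not necessarily the exact one used above.

```python
import pandas as pd

# Hypothetical sample row: the anchor markup below is an assumption,
# not the asker's real data.
df = pd.DataFrame({
    "comment": [
        'See <a href="https://example.com/some/page" title="" target="_blank">'
        'https://example.com/some/pa...</a> for details'
    ]
})

# Stripping only the URL text leaves the tag and its attribute names behind.
url_only = df["comment"].str.replace(r"https?\S+", " ", regex=True)

# Removing the whole anchor element drops the URL and its attributes together.
cleaned = df["comment"].str.replace(r"<a\b[^>]*>.*?</a>", "", regex=True)

print(url_only[0])
print(cleaned[0])
```

Note that `regex=True` has to be passed explicitly in recent pandas versions, because `Series.str.replace` treats the pattern as a literal string by default. If the link text inside an anchor can contain newlines, passing `flags=re.DOTALL` (after `import re`) lets `.` match across lines.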
