Joe Smith

Reputation: 63

Removing URL from a column in Pandas Dataframe

I have a small dataframe and am trying to remove the URL from the end of each string in the Links column. The code below works on rows where the URL is on its own, but as soon as there is a sentence before the URL, the URL is not removed.

Here is the data: https://docs.google.com/spreadsheets/d/10LV8BHgofXKTwG-MqRraj0YWez-1vcwzzTJpRhdWgew/edit?usp=sharing (link to spreadsheet)

import pandas as pd  

df = pd.read_csv('TestData.csv')    

df['Links'] = df['Links'].replace(to_replace=r'^https?:\/\/.*[\r\n]*',value='',regex=True)

df.head()

Thanks!

Upvotes: 4

Views: 19190

Answers (3)

Isurie

Reputation: 320

For a DataFrame df, URLs can be removed with a regex as follows:

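import pandas as pd

df = pd.read_csv('./data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Replace each URL in the text column with a single space.
    # regex=True is needed on pandas 2.0+, where str.replace defaults to a literal match.
    dataframe['text'] = dataframe['text'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', regex=True)

clean_data(df)
print(df['text'])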

Upvotes: 1

Philip DiSarro

Reputation: 1025

Try a cleaner regex:

df['example'] = df['example'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Before using a regex in pandas .replace(), or anywhere else for that matter, test the pattern with re.sub() on a single basic string. When faced with a big problem, break it down into a smaller one.
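As a minimal sketch of that advice, the patterns can be tried with re.sub() on one string before they go anywhere near pandas. The sample sentence and URL below are made up for illustration; the anchored pattern is the one from the question, included only to show why it leaves the string unchanged when text comes before the URL.

```python
import re

# Hypothetical sample string; the URL is invented for illustration.
sample = "random words to see if it works now https://example.com/page?id=1"

# Pattern from the question: '^' only matches when the URL starts the string,
# so nothing is removed here.
anchored = re.sub(r'^https?:\/\/.*[\r\n]*', '', sample)

# Unanchored pattern: remove anything starting with http or www
# up to the next whitespace, wherever it appears.
cleaned = re.sub(r'http\S+|www\.\S+', '', sample).strip()

print(repr(anchored))
print(repr(cleaned))
```

Once the pattern behaves as expected on a few such strings, it can be passed to the DataFrame call unchanged.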

Alternatively, we could use the str.replace method:

df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
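A small sketch of the str.replace approach on a made-up frame, assuming a column named Links as in the question. The rows and URLs are invented, and the trailing .str.strip() is an addition here to drop the space left behind where a URL used to be.

```python
import pandas as pd

# Hypothetical data standing in for the Links column; the URLs are invented.
df = pd.DataFrame({
    'Links': [
        'https://example.com/only-a-url',
        'one last try please work http://example.org/a?b=c',
        'see www.example.net for details',
    ]
})

# regex=True is passed explicitly: pandas 2.0+ treats the pattern
# as a literal string unless told otherwise.
df['cleanLinks'] = (
    df['Links']
    .str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
    .str.strip()  # remove leading/trailing whitespace left after the URL is gone
)

print(df['cleanLinks'].tolist())
```

Note that a URL removed from the middle of a sentence leaves a double space behind; collapsing it would need one more replace on r'\s+'.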

Upvotes: 8

Vishnu Kunchur

Reputation: 1726

Try this:

import re
# Keep only the text before the first URL; 'https?' matches both http and https.
df['cleanLinks'] = df['Links'].apply(lambda x: re.split(r'https?://.*', str(x))[0])

Output:

df['cleanLinks']

    cleanLinks
0   random words to see if it works now 
1   more stuff that doesn't mean anything 
2   one last try please work 

Upvotes: 8
