Removing portions of string in Pandas: not working + errors

Question

I have a pandas DataFrame named full_list with a string column named domains. Part of it is shown here:

  domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 https://www.cernovich.com
4 https://www.christianpost.com
5 http://evolutionnews.org
6 http://www.greenmedinfo.com
7 http://www.magapill.com8
8 https://needtoknow.news

I need to remove the https:// OR http:// from the website names.

I checked multiple pandas posts on SO dealing with vaguely similar issues and have tried all of these methods:

full_list['domains'] = full_list['domains'].apply(lambda x: x.lstrip('http://')) but that erroneously removes leading t, h and p letters from the domain itself, e.g. "truththeory.com" (index 1) becomes "ruththeory.com"
full_list['domains'] = full_list['domains'].replace(('http://', '')) but this makes no changes to the strings at all: the values in domains are the same before and after the line runs
full_list['domains'] = full_list['domains'].str.replace(('http://', '')) gives the error replace() missing 1 required positional argument: 'repl'
full_list['domains'] = full_list['domains'].str.split('//', n=1).str.get(1) makes the first 3 rows (index 0, 1, 2) nan

For the life of me, I am unable to see what I am doing wrong. Any help is appreciated.

jezrael · Accepted Answer

Use Series.str.replace with a regex: ^ anchors the match to the start of the string and [s]* matches zero or more s characters, so both http:// and https:// are removed:

df['domains'] = df['domains'].str.replace(r'^http[s]*://', '', regex=True)
print(df)
                   domains
0     naturalhealth365.com
1          truththeory.com
2  themillenniumreport.com
3        www.cernovich.com
4    www.christianpost.com
5        evolutionnews.org
6     www.greenmedinfo.com
7        www.magapill.com8
8          needtoknow.news
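
To see why the attempts in the question behave the way they do, here is a minimal, self-contained sketch. It rebuilds a few rows of the question's data by hand, so the DataFrame contents and the variable names stripped, unchanged and cleaned are assumptions made for illustration. It also uses s? rather than [s]* in the pattern, which is a small variation on the answer above that matches at most one s.

```python
import pandas as pd

# Hand-built sample mirroring a few rows of the question's data (assumed).
full_list = pd.DataFrame({
    "domains": [
        "naturalhealth365.com",
        "truththeory.com",
        "https://www.cernovich.com",
        "http://evolutionnews.org",
    ]
})

# lstrip treats its argument as a SET of characters (h, t, p, :, /),
# not as a prefix, so it also eats leading letters of bare domains
# and stops at the first character outside the set.
stripped = full_list["domains"].str.lstrip("http://")

# Series.replace (without .str) compares whole cell values by default,
# so a substring such as "http://" never matches and nothing changes.
unchanged = full_list["domains"].replace("http://", "")

# Series.str.replace with an anchored regex removes only a leading
# http:// or https:// and leaves every other string untouched.
cleaned = full_list["domains"].str.replace(r"^https?://", "", regex=True)

print(cleaned.tolist())
```

The fourth attempt returns NaN for rows without a scheme because splitting on '//' yields a one-element list there, so .str.get(1) has nothing to return. Anchoring the regex with ^ avoids that, since rows that do not start with a scheme simply do not match.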
