NPD

Reputation: 337

Removing portions of string in Pandas: not working + errors

I have a pandas DataFrame named full_list with a string column named domains. A snippet is shown here:

  domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 https://www.cernovich.com
4 https://www.christianpost.com
5 http://evolutionnews.org
6 http://www.greenmedinfo.com
7 http://www.magapill.com8
8 https://needtoknow.news

I need to remove the https:// OR http:// from the website names.

I checked multiple pandas posts on SO dealing with vaguely similar issues, and I have tried all of these methods:

  1. full_list['domains'] = full_list['domains'].apply(lambda x: x.lstrip('http://')) but that erroneously removes leading t, h and p letters from names that have no prefix, i.e. "truththeory.com" (index 1) becomes "ruththeory.com"

  2. full_list['domains'] = full_list['domains'].replace(('http://', '')) and this makes no changes to the strings AT ALL. The values in domains are identical before and after the line runs

  3. full_list['domains'] = full_list['domains'].str.replace(('http://', '')) gives the error replace() missing 1 required positional argument: 'repl'

  4. full_list['domains'] = full_list['domains'].str.split('//', n=1).str.get(1) makes the first 3 rows (index 0, 1, 2) NaN

For the life of me, I cannot see what I am doing wrong. Any help is appreciated.

Upvotes: 2

Views: 155

Answers (2)

U13-Forward

Reputation: 71610

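Try str.replace with a regular expression. Pass regex=True explicitly, because since pandas 2.0 the pattern is treated as a literal string by default:

>>> df['domains'].str.replace(r'http(s|)://', '', regex=True)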
0       naturalhealth365.com
1            truththeory.com
2    themillenniumreport.com
3          www.cernovich.com
4      www.christianpost.com
5          evolutionnews.org
6       www.greenmedinfo.com
7          www.magapill.com8
8            needtoknow.news
Name: domains, dtype: object
>>> 
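As a side note on attempt 1 in the question: str.lstrip treats its argument as a set of characters to strip, not as a prefix, which is why names without a scheme lose leading letters. The following is a minimal plain-Python sketch of that behavior next to a regex-based alternative; strip_scheme is a hypothetical helper name used only for illustration.

```python
import re

# str.lstrip removes every leading character that appears in the given set,
# so "http://" means "strip leading h, t, p, : and /", not "strip this prefix".
mangled = "truththeory.com".lstrip("http://")
print(mangled)  # ruththeory.com (the leading "t" is in the set, "r" is not)


def strip_scheme(url: str) -> str:
    """Remove a leading http:// or https:// (hypothetical helper)."""
    # ^ anchors the match at the start; s? makes the "s" optional.
    return re.sub(r"^https?://", "", url)


print(strip_scheme("https://www.cernovich.com"))  # www.cernovich.com
print(strip_scheme("truththeory.com"))            # truththeory.com
```

The same pattern can be passed to Series.str.replace with regex=True to apply it to the whole column.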

Upvotes: 1

jezrael

Reputation: 863281

Use Series.str.replace with the regex anchor ^ to match only at the start of the string and [s]* to allow an optional s:

df['domains'] = df['domains'].str.replace(r'^http[s]*://', '', regex=True)
print(df)
                   domains
0     naturalhealth365.com
1          truththeory.com
2  themillenniumreport.com
3        www.cernovich.com
4    www.christianpost.com
5        evolutionnews.org
6     www.greenmedinfo.com
7        www.magapill.com8
8          needtoknow.news
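For anyone who wants to reproduce this end to end, here is a small self-contained sketch. It assumes a shortened copy of the question's data and uses https? in place of [s]*, which is an equivalent way to express an optional s here. It also shows why attempt 2 from the question changed nothing: Series.replace only matches whole cell values unless regex=True is passed.

```python
import pandas as pd

# Shortened sample of the data from the question (assumed for illustration).
full_list = pd.DataFrame({
    "domains": [
        "naturalhealth365.com",
        "truththeory.com",
        "https://www.cernovich.com",
        "http://evolutionnews.org",
    ]
})

pattern = r"^https?://"  # scheme at the start of the string, optional "s"

# Substring replacement through the .str accessor.
cleaned = full_list["domains"].str.replace(pattern, "", regex=True)

# Series.replace performs the same substitution once regex=True is set.
also_cleaned = full_list["domains"].replace(pattern, "", regex=True)

# Without regex=True, Series.replace looks for cells equal to "http://",
# finds none, and returns the column unchanged.
unchanged = full_list["domains"].replace("http://", "")

print(cleaned.tolist())
```

Assigning cleaned back to full_list['domains'] updates the column in the original DataFrame.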

Upvotes: 1
