Remove rows from a Pandas df that contain an IP address

Question

I am working with a dataset of request URLs (strings) that looks like this in a pandas DataFrame:

df
  request_url                                  count
0 https://login.microsoftonline.com            24521
1 https://dt.adsafeprotected.com               11521
2 http://209.53.113.23/                        225211
3 https://googleads.g.doubleclick.net          6252
4 https://fls-na.amazon.com                    65225 
5 https://v10.vortex-win.data.microsoft.com    7852222 
6 https://ib.adnxs.com                         12
7 http://177.41.65.207/read.txt                188

Desired output:

newdf
  request_url                                  count
0 https://login.microsoftonline.com            24521
1 https://dt.adsafeprotected.com               11521
2 https://googleads.g.doubleclick.net          6252
3 https://fls-na.amazon.com                    65225
4 https://v10.vortex-win.data.microsoft.com    7852222
5 https://ib.adnxs.com                         12

I am then going to use the tld library on the data. I want to get rid of these rows because the tld library doesn't know what to do with an IP address in place of the domain. Is there an easy way to remove rows from the DataFrame that contain an IP address?

BENY · Accepted Answer

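You can check with str.findall and the regex [0-9]+(?:\.[0-9]+){3}, which matches IPv4-like substrings (four dot-separated groups of digits). findall returns a list of matches for each row, and astype(bool) converts every empty list to False and every non-empty list to True. Negating that mask with ~ keeps only the rows without a match: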

df[~df.request_url.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
Out[908]: 
                                 request_url
0          https://login.microsoftonline.com
1             https://dt.adsafeprotected.com
3        https://googleads.g.doubleclick.net
4                  https://fls-na.amazon.com
5  https://v10.vortex-win.data.microsoft.com
6                       https://ib.adnxs.com
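
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and shows two variants. The first uses str.contains with the same regex, which returns a boolean mask directly and should behave like findall plus astype(bool) here. The second is not from the answer above: host_is_ip is a hypothetical helper that parses the hostname with the standard library and checks it with ipaddress, on the assumption that you only care about an IP in the host part and not elsewhere in the URL. Both call reset_index(drop=True) so the index runs 0 to 5 as in the desired output.

```python
import ipaddress
from urllib.parse import urlsplit

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "request_url": [
        "https://login.microsoftonline.com",
        "https://dt.adsafeprotected.com",
        "http://209.53.113.23/",
        "https://googleads.g.doubleclick.net",
        "https://fls-na.amazon.com",
        "https://v10.vortex-win.data.microsoft.com",
        "https://ib.adnxs.com",
        "http://177.41.65.207/read.txt",
    ],
    "count": [24521, 11521, 225211, 6252, 65225, 7852222, 12, 188],
})

# Variant 1: same regex as the answer, but str.contains yields the mask directly.
mask = df["request_url"].str.contains(r"[0-9]+(?:\.[0-9]+){3}", regex=True)
newdf = df[~mask].reset_index(drop=True)


# Variant 2: hypothetical helper (not part of the original answer) that only
# looks at the hostname, so digits elsewhere in the URL are ignored.
def host_is_ip(url: str) -> bool:
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
    except ValueError:
        return False
    return True


newdf_strict = df[~df["request_url"].map(host_is_ip)].reset_index(drop=True)

print(newdf)
```

The regex variant is shorter, but it would also drop a URL that merely contains something like 1.2.3.4 in its path. The hostname check avoids that, at the cost of a Python-level loop over the rows.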
