Reputation: 2117
I am working with a dataset of request URLs (strings) that looks like this in a pandas DataFrame:
df
   request_url                                  count
0  https://login.microsoftonline.com            24521
1  https://dt.adsafeprotected.com               11521
2  http://209.53.113.23/                       225211
3  https://googleads.g.doubleclick.net           6252
4  https://fls-na.amazon.com                    65225
5  https://v10.vortex-win.data.microsoft.com  7852222
6  https://ib.adnxs.com                            12
7  http://177.41.65.207/read.txt                  188
Desired output:
newdf
   request_url                                  count
0  https://login.microsoftonline.com            24521
1  https://dt.adsafeprotected.com               11521
2  https://googleads.g.doubleclick.net           6252
3  https://fls-na.amazon.com                    65225
4  https://v10.vortex-win.data.microsoft.com  7852222
5  https://ib.adnxs.com                            12
I am then going to use the tld library on the data. I want to get rid of these rows because the tld library doesn't know what to do with an IP address in place of a domain. Is there an easy way to remove rows from the DataFrame that contain an IP address?
Upvotes: 2
Views: 701
Reputation: 7221
Create a function to check each row and filter by the result:
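import re

def has_no_ip(row):
    # True when the URL does not start with a scheme followed by a dotted-quad IPv4 host
    return re.match(r"https?://\d+\.\d+\.\d+\.\d+", row["request_url"]) is None

# Keep only the rows whose URL has no IP address as host
newdf = df[df.apply(has_no_ip, axis=1)]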
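If the data might also contain IPv6 literals or hosts with ports, a regex gets fiddly. As a rough sketch of an alternative (not the approach above), the host can be pulled out with urllib.parse and handed to the standard-library ipaddress module. The helper name host_is_ip and the short urls list are made up for illustration.

```python
import ipaddress
from urllib.parse import urlsplit

def host_is_ip(url):
    """Return True if the URL's host is an IPv4 or IPv6 literal."""
    try:
        # hostname is None when the URL has no network location
        host = urlsplit(url).hostname
        ipaddress.ip_address(host or "")
    except ValueError:
        # Raised for ordinary domain names and for malformed URLs
        return False
    return True

# Illustrative sample taken from the question's data
urls = [
    "https://login.microsoftonline.com",
    "http://209.53.113.23/",
    "http://177.41.65.207/read.txt",
]
kept = [u for u in urls if not host_is_ip(u)]
print(kept)
```

With the DataFrame from the question, something like df[~df["request_url"].map(host_is_ip)] should then keep only the rows with named hosts.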
Upvotes: 3
Reputation: 323326
You can check with str.findall and the regex [0-9]+(?:\.[0-9]+){3}; astype(bool) then converts every empty list to False, so negating the mask with ~ keeps only the rows without a match.
df[~df.request_url.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
Out[908]:
request_url
0 https://login.microsoftonline.com
1 https://dt.adsafeprotected.com
3 https://googleads.g.doubleclick.net
4 https://fls-na.amazon.com
5 https://v10.vortex-win.data.microsoft.com
6 https://ib.adnxs.com
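One caveat: the pattern matches four dot-separated numbers anywhere in the string, so a URL with a dotted version number in its path would be dropped too. If that matters, one option is to anchor the pattern to the host with str.contains. This is only a sketch, assuming every URL starts with http:// or https://; the example.com row is invented to show the difference, and reset_index is added to get the 0..n-1 index from the question's desired output.

```python
import pandas as pd

df = pd.DataFrame({
    "request_url": [
        "https://login.microsoftonline.com",
        "http://209.53.113.23/",
        "https://example.com/release/1.2.3.4/notes",  # made-up URL, not from the question
        "http://177.41.65.207/read.txt",
    ],
    "count": [24521, 225211, 7, 188],
})

# Scheme, then four dot-separated number groups, then a port, a path, or end of string
ip_host = r"^https?://\d{1,3}(?:\.\d{1,3}){3}(?:[:/]|$)"

# Drop rows whose host is an IPv4 address and renumber the index
mask = df["request_url"].str.contains(ip_host, regex=True)
newdf = df[~mask].reset_index(drop=True)
print(newdf)
```

The row with the version number in its path survives here, whereas an unanchored pattern would remove it.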
Upvotes: 3