Reputation: 1713
I am trying to subset a dataframe using 'pandas' if the column matches a particular pattern. Below is a reproducible example for reference.
import pandas as pd
# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})
# Create new dataframe by filtering out all rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]
# Below is what the new dataframe looks like
print(new_df)
URL code
0 www.abc.de 1
1 https://www.abc.fr/-de 1
6 www.abc.de 1
Below are the dtypes for reference.
print(new_df.dtypes)
URL object
code int64
dtype: object
# Now I am trying to exclude all those rows where the 'URL' column does not have .de as the pattern. This should retain only the 2nd row in new_df from above output
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]
# Below is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []
Below are my questions.
1) Why is the 'URL' column appearing first even though I defined the 'code' column first?
2) What is wrong in my code when I am trying to remove all those rows where the 'URL' column does not have the pattern .de? In R, I would simply use the below code to get the desired result easily.
new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]
Desired output should be as below.
# Desired output for new_df
URL code
https://www.abc.fr/-de 1
Any guidance on this would be really appreciated.
Upvotes: 2
Views: 799
Reputation: 403218
Why is the 'URL' column appearing first even though I defined the 'code' column first?
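This is a consequence of building the DataFrame from a dictionary. Before Python 3.6, dictionaries did not preserve insertion order, and older pandas versions sort the dictionary keys when creating the columns, which puts 'URL' before 'code' because uppercase letters sort before lowercase ones. With Python 3.6+ and pandas 0.23 or later, the columns follow the insertion order of the dictionary.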
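If the column order matters, one option is to state it explicitly through the `columns` argument of `pd.DataFrame`. A minimal sketch using a shortened version of the question's data (the name `df_ordered` is only for illustration):

```python
import pandas as pd

# Shortened version of the data from the question
data = {'code': [1, 1, 2],
        'URL': ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr']}

# Passing columns= fixes the column order regardless of how the dict is ordered
df_ordered = pd.DataFrame(data, columns=['code', 'URL'])

print(df_ordered.columns.tolist())
```

Reordering an existing frame with `df[['code', 'URL']]` should work equally well.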
What is wrong in my code when I am trying to remove all those rows where the 'URL' column does not have the pattern .de?
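You'd need to escape the ., because that's a special regex meta-character. Unescaped, it matches any character, so .de also matches the -de in the second URL; every row then matches, and negating with ~ leaves an empty DataFrame.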
df[df.code.eq(1) & ~df.URL.str.contains(r'\.de$', case=True)]
URL code
1 https://www.abc.fr/-de 1
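To see the effect of the escaping in isolation, here is a small sketch on three of the URLs from the question. It also shows `regex=False`, which makes `str.contains` treat the pattern as a plain substring and is probably the closest counterpart to `fixed = TRUE` in R. The variable names are only illustrative.

```python
import pandas as pd

urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr'])

# Unescaped dot: "." matches any character, so "-de" counts as a hit too
unescaped = urls.str.contains(r'.de')

# Escaped dot: only a literal ".de" matches
escaped = urls.str.contains(r'\.de')

# regex=False: the pattern is a plain substring, no escaping needed
literal = urls.str.contains('.de', regex=False)

print(unescaped.tolist())  # [True, True, False]
print(escaped.tolist())    # [True, False, False]
print(literal.tolist())    # [True, False, False]
```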
This may not be sufficient if .de is the TLD but is not at the very end of the string (for example, when it is followed by a path). Here's a more general solution addressing that limitation:
import re

p = r'''.*   # match anything, greedily
\.           # literal dot
de           # "de"
(?!.*        # negative lookahead: what follows must not contain...
\.           # ...a literal dot
)'''
df[df.code.eq(1) & ~df.URL.str.contains(p, case=True, flags=re.VERBOSE)]
URL code
1 https://www.abc.fr/-de 1
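As a quick sanity check, the same pattern can be tried with the standard `re` module on a few strings. The last two sample URLs are made up for illustration and do not appear in the question's data.

```python
import re

# Compact, non-verbose equivalent of the pattern above
pattern = re.compile(r'.*\.de(?!.*\.)')

samples = [
    'www.abc.de',              # ".de" at the very end
    'https://www.abc.fr/-de',  # no literal ".de"
    'www.abc.de/-fr',          # hypothetical: ".de" followed by a path
    'www.de.abc.fr',           # hypothetical: ".de" followed by another dot
]

# True where the pattern finds ".de" with no further dot after it
matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, False, True, False]
```

Note that a dot later in the path (say, a file extension) would also make the lookahead fail, so this remains a heuristic rather than a full URL parser.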
Upvotes: 3