Reputation: 339
I have the following dataframe, called pop_df:
pop_df
Out[33]:
2014 2015 2016 2017 2018 2019
Geo
AL 2892394.0 2885796.0 2875592.0 2876591.0 2870324.0 2862427.0
AL0 2892394.0 2885796.0 2875592.0 2876591.0 2870324.0 2862427.0
AL01 844921.0 836448.0 830981.0 826904.0 819793.0 813758.0
AL011 134332.0 131054.0 129056.0 125579.0 120978.0 118948.0
AL012 276058.0 277989.0 280205.0 284823.0 289626.0 290126.0
... ... ... ... ... ...
UKN12 142028.0 142756.0 143363.0 143746.0 144105.0 144367.0
UKN13 139774.0 140222.0 140752.0 141368.0 141994.0 142565.0
UKN14 137722.0 139426.0 140691.0 141917.0 143286.0 144771.0
UKN15 136332.0 136904.0 137492.0 138000.0 138441.0 138948.0
UKN16 114696.0 115171.0 115581.0 116057.0 116612.0 117051.0
[2034 rows x 6 columns]
I also have a list of EU country codes:
EuropeanUnion=["BE","BG","CZ","DK","DE","EE","IE","EL","ES","FR","HR","IT","CY","LV","LT","LU","HU","MT","NL","AT","PL","PT","RO","SI","SK","FI","SE"]
I would like to filter the dataframe to keep only the rows starting by the country codes that are in the list. I tried with pop_df.index.str.startswith
but I couldn't make it work. I'd appreciate any help you can provide. Thanks in advance!
Upvotes: 3
Views: 476
Reputation: 92
I would recommend you to use slicing indexes (as string) up to two first letters and use pandas .isin method applied to country codes variable as a boolean mask:
eu_countries_filtered = pop_df[pop_df.index.str[:2].isin(EuropeanUnion)]
Upvotes: 2
Reputation: 61910
It seems that Geo
is the index, so you could do:
result = df[df.index.str.match(rf'\b{"|".join(EuropeanUnion)}')]
Output (dummy)
2014 2015 2016 2017 2018 2019
Geo
BE011 134332.0 131054.0 129056.0 125579.0 120978.0 118948.0
DE13 139774.0 140222.0 140752.0 141368.0 141994.0 142565.0
From the documentation on str.match:
Determine if each string starts with a match of a regular expression.
The expression rf'\b{"|".join(EuropeanUnion)}'
, builds a regular expression pattern that will match any of the country codes.
Upvotes: 3