Reputation: 1971
Set-up
I scrape housing ad data and analyse it with pandas. I have computed average statistics and inserted them in a pandas dataframe: district_df.
One of the district_df columns contains district names: district_df['district'].
Another of the district_df columns contains subdistrict names: district_df['subdistrict'].
They look like this:
district         subdistrict
Bergen-Enkheim   Bergen-Enkheim
Bornheim/Ostend  Bornheim
Bornheim/Ostend  Ostend
Harheim          Harheim
Innenstadt I     Altstadt
Innenstadt I     Bahnhofsviertel
Innenstadt I     Gallus
Innenstadt II    Bockenheim
Innenstadt II    Westend-Nord
⋮                ⋮
Problem
I create one district table (district_table) from district_df per district, i.e. for the rows above I create five district tables. I do this with the following code:
for district in d_set:  # d_set is a set containing all district names
    district_table = district_df[district_df['district'].str.match(district)]
This code works in the sense that a table is created per district.
However, the table for Innenstadt II also contains the subdistricts of Innenstadt I.
It seems to me that .str.match(district) matches partially rather than exactly, i.e. Innenstadt I will match Innenstadt II.
My actual district_df columns contain more than what I display here; the issue occurs for a variety of district names.
How do I get exact matches?
Upvotes: 1
Views: 2657
Reputation: 862661
I think you need boolean indexing in a loop:
d_set = district_df['district'].unique()
for district in d_set:
    district_table = district_df[district_df['district'] == district]
    print(district_table)
         district     subdistrict
0  Bergen-Enkheim  Bergen-Enkheim
          district subdistrict
1  Bornheim/Ostend    Bornheim
2  Bornheim/Ostend      Ostend
   district subdistrict
3   Harheim     Harheim
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
        district   subdistrict
7  Innenstadt II    Bockenheim
8  Innenstadt II  Westend-Nord
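As to why the original code over-matches: as far as I know, .str.match only anchors the pattern at the start of each string, so a shorter district name can also hit longer names that begin with it. Below is a minimal sketch on a few rows taken from the question. The variable names (col, prefix_mask, exact_mask, full_mask) are made up for illustration, and .str.fullmatch assumes pandas 1.1 or newer.

```python
import re
import pandas as pd

# A few rows copied from the question's table.
district_df = pd.DataFrame({
    "district": ["Innenstadt I", "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Altstadt", "Gallus", "Bockenheim"],
})

col = district_df["district"]

# str.match anchors only at the start of the string, so the pattern
# "Innenstadt I" also matches the longer name "Innenstadt II".
prefix_mask = col.str.match("Innenstadt I")

# == compares whole strings, so only exact names are selected.
exact_mask = col == "Innenstadt I"

# str.fullmatch requires the whole string to match the regex;
# re.escape keeps the district name literal.
full_mask = col.str.fullmatch(re.escape("Innenstadt I"))

print(prefix_mask.tolist())  # [True, True, True]
print(exact_mask.tolist())   # [True, True, False]
print(full_mask.tolist())    # [True, True, False]
```

For plain names like these, == is the simplest choice; fullmatch is probably only worth it if you really need a regex.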
If you need a dict of DataFrames, it is better to convert the groupby object:
a = dict(tuple(district_df.groupby('district')))
print(a['Innenstadt I'])
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
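A self-contained variant of the same idea, in case it helps: iterating a groupby object yields (key, sub-DataFrame) pairs, so the dict can also be built with a comprehension. The sample rows come from the question; the names tables and sizes are only illustrative.

```python
import pandas as pd

# A few rows copied from the question's table.
district_df = pd.DataFrame({
    "district": ["Bornheim/Ostend", "Bornheim/Ostend",
                 "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Bornheim", "Ostend", "Altstadt", "Bockenheim"],
})

# Grouping by a single column name yields (district name, sub-DataFrame) pairs.
# Groups are formed on exact key equality, so "Innenstadt I" and
# "Innenstadt II" end up in separate tables.
tables = {name: table for name, table in district_df.groupby("district")}

# Number of rows per district table.
sizes = {name: len(table) for name, table in tables.items()}
print(sizes)
# {'Bornheim/Ostend': 2, 'Innenstadt I': 1, 'Innenstadt II': 1}
```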
Upvotes: 2
Reputation: 249153
I'd do it this way:
{ dist: df[df.district == dist] for dist in df.district.unique() }
But then again you might be better off using a MultiIndex:
df.set_index(['district', 'subdistrict'], inplace=True)
This is a lot like the dict solution, but downstream processing is likely to be faster.
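A rough sketch of how lookups could work once the MultiIndex is in place. The avg_rent column and its values are hypothetical stand-ins for the averaged statistics mentioned in the question, and I use the non-inplace set_index plus sort_index here, which is just one way to do it.

```python
import pandas as pd

df = pd.DataFrame({
    "district": ["Innenstadt I", "Innenstadt I", "Innenstadt II"],
    "subdistrict": ["Altstadt", "Gallus", "Bockenheim"],
    "avg_rent": [15.0, 13.5, 14.2],  # hypothetical statistic, made-up values
})

# Build the two-level index; sorting it keeps label lookups efficient.
indexed = df.set_index(["district", "subdistrict"]).sort_index()

# Selecting on the first level is an exact label lookup,
# so "Innenstadt I" does not pull in "Innenstadt II".
inner_one = indexed.loc["Innenstadt I"]
print(inner_one.index.tolist())  # ['Altstadt', 'Gallus']

# A (district, subdistrict) tuple addresses a single row.
rent = indexed.loc[("Innenstadt II", "Bockenheim"), "avg_rent"]
print(rent)  # 14.2
```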
Upvotes: 2