LucSpan

Reputation: 1971

Exact string match in a pandas column

Set-up

I scrape housing ad data and analyse it with pandas. I have computed average statistics and stored them in a pandas DataFrame: district_df.

One of the district_df columns contains district names: district_df['district'].

Another of the district_df columns contains subdistrict names: district_df['subdistrict']

They look like this:

        district           subdistrict      
     Bergen-Enkheim      Bergen-Enkheim    
    Bornheim/Ostend            Bornheim
    Bornheim/Ostend              Ostend
            Harheim             Harheim
       Innenstadt I            Altstadt
       Innenstadt I     Bahnhofsviertel
       Innenstadt I              Gallus
      Innenstadt II          Bockenheim 
      Innenstadt II        Westend-Nord
                  ⋮                   ⋮

Problem

I create one district table (district_table) per district from district_df, i.e. five district tables for the excerpt above. I do this with the following code:

for district in d_set: # d_set is a set containing all district names 
    district_table = district_df[district_df['district'].str.match(district)]

This code works, that is: a table per district is created.

However, the table for Innenstadt I also contains the subdistricts of Innenstadt II.

It seems to me that .str.match(district) does not match exactly, but only partially. I.e. Innenstadt I will match Innenstadt II.

My actual district_df columns contain more than what I display here; the issue occurs for a variety of district names.

How do I get exact matches?

Upvotes: 1

Views: 2657

Answers (2)

jezrael

Reputation: 862661

I think you need boolean indexing in a loop:

d_set = district_df['district'].unique()

for district in d_set: 
    district_table = district_df[district_df['district'] == district]
    print(district_table)

         district     subdistrict
0  Bergen-Enkheim  Bergen-Enkheim
          district subdistrict
1  Bornheim/Ostend    Bornheim
2  Bornheim/Ostend      Ostend
  district subdistrict
3  Harheim     Harheim
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus
        district   subdistrict
7  Innenstadt II    Bockenheim
8  Innenstadt II  Westend-Nord
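
For context on why the original loop over-matched: Series.str.match is based on re.match, which anchors the pattern only at the start of each string, so the pattern 'Innenstadt I' also matches 'Innenstadt II'. Below is a minimal sketch on a small sample frame built from the question's excerpt, comparing it with == and with Series.str.fullmatch (assuming pandas 1.1 or newer). The re.escape call is a precaution I added, in case a district name contains regex metacharacters.

```python
import re
import pandas as pd

# Sample rows taken from the excerpt in the question
district_df = pd.DataFrame({
    'district': ['Innenstadt I', 'Innenstadt I', 'Innenstadt II', 'Innenstadt II'],
    'subdistrict': ['Altstadt', 'Gallus', 'Bockenheim', 'Westend-Nord'],
})

name = 'Innenstadt I'

# str.match anchors only at the start, so 'Innenstadt II' matches as well
prefix_mask = district_df['district'].str.match(name)

# Plain equality compares whole strings; no regex is involved
exact_mask = district_df['district'] == name

# str.fullmatch requires the entire string to match the (escaped) pattern
full_mask = district_df['district'].str.fullmatch(re.escape(name))

print(prefix_mask.tolist())  # [True, True, True, True]
print(exact_mask.tolist())   # [True, True, False, False]
print(full_mask.tolist())    # [True, True, False, False]
```

If the names are plain strings rather than patterns, == is probably the simplest choice, since it avoids regex handling altogether.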

If you need a dict of DataFrames, it is better to convert the groupby object:

a = dict(tuple(district_df.groupby('district')))

print(a['Innenstadt I'])
       district      subdistrict
4  Innenstadt I         Altstadt
5  Innenstadt I  Bahnhofsviertel
6  Innenstadt I           Gallus

Upvotes: 2

John Zwinck

Reputation: 249153

I'd do it this way:

{ dist: df[df.district == dist] for dist in df.district.unique() }

But then again you might be better off using a MultiIndex:

df.set_index(['district', 'subdistrict'], inplace=True)

This is a lot like the dict solution, but downstream processing is likely to be faster.
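
A rough sketch of how lookups on such a MultiIndex might look. The avg_rent column and its values are made up for illustration, and sort_index is an extra step I added because lookups on a sorted index tend to behave better:

```python
import pandas as pd

df = pd.DataFrame({
    'district': ['Innenstadt I', 'Innenstadt I', 'Innenstadt II', 'Innenstadt II'],
    'subdistrict': ['Altstadt', 'Gallus', 'Bockenheim', 'Westend-Nord'],
    'avg_rent': [14.0, 12.5, 13.0, 15.5],  # hypothetical statistic, made-up values
})

# Non-inplace variant of the set_index call above, followed by a sort
indexed = df.set_index(['district', 'subdistrict']).sort_index()

# Selecting on the first level is an exact label lookup, not a prefix match
innenstadt_1 = indexed.loc['Innenstadt I']
print(innenstadt_1)

# A full (district, subdistrict) key returns a single row as a Series
row = indexed.loc[('Innenstadt II', 'Bockenheim')]
print(row['avg_rent'])  # 13.0
```

Here indexed.loc['Innenstadt I'] returns only the Altstadt and Gallus rows, so the 'Innenstadt II' rows are not picked up.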

Upvotes: 2
