Falc
Falc

Reputation: 317

For each column value [substring], find match in other column [string]

I'm quite new to python to this might be a basic question. If so, sorry in advance!

I'm trying to accomplish the following:

  1. For each row, search for the value of df3['court_short'] in column court_region_df[['court_long']].
  2. If there is a match in the column court_region_df[['court_long']] then select that row's value for court_region_df[['hmcts_region']] and copy it to df3['region']

The value that df3['court_short'] takes is a string, typically a town name something like that, whereas court_region_df[['court_long']] takes the value of a longer string for example a full address.

I'm a bit lost about first how to accomplish step 1. Most of the python search functions I've found seem don't seem suited to finding a particular pattern in a longer string.

Upvotes: 0

Views: 1410

Answers (1)

Josh Friedlander
Josh Friedlander

Reputation: 11657

This is a toy example, but it's roughly the same as yours:

d = pd.DataFrame([['aa', 'bb'], ['cc', 'dd']], columns=['a', 'b'])
e = pd.DataFrame([['a', 'E'], ['c', '.']], columns=['a', 'b'])

e['c'] = e['a'].apply(lambda x: (d[d['a'].str.contains(x)]['b']))[0]

Output:

    a   b   c
0   a   E   bb
1   c   .   NaN

EDIT: I tried with your data, and was able to come up with a non-vectorised solution, which is a little slow but manageable (takes a couple of seconds with the file you shared). Here it is:

court_region_df = pd.read_csv('courts (1).csv')
df3 = pd.read_csv('cjs-national-court-codes-9 (2).csv', names=['court_short'], header=0)
court_region_df = court_region_df.rename(columns={'court':'court_long'})

for idx in df3.index:
    x = df3.loc[idx,'court_short']
    region = court_region_df[court_region_df.iloc[:,0].str.contains(x, case=False, regex=False)]['hmcts_region']
    df3.loc[idx, 'region'] = [region.values[0] if len(region) > 0 else np.nan][0]

Slightly faster is to stash them in a dictionary and avoid using Pandas magic at all:

court_to_region = pd.Series(court_region_df.hmcts_region.values, index=court_region_df.court_long.str.lower()).to_dict()

regions = []
for i in df3.court_short.values:
    region = np.nan
    for key in court_to_region.keys():
        if i.lower() in key:
            region = court_to_region[key]
    regions.append(region)
df3['region'] = regions

Result:

df3[~df3.isna().any(axis=1)]

    court_short                 region
280     Birmingham Crown Court  midlands
301     Ipswich Crown Court     south_east

As you can see, given the data you shared, this returns just two matches. You might want to look into a fuzzier matching system.

Upvotes: 1

Related Questions