Bowen Liu

Reputation: 1127

Pandas DataFrame: How to extract the last two digits from a string column that doesn't always end with those two digits

Sorry for the possible confusion in the title, here's what I'm trying to do:

I'm trying to merge my Parcels data frame with my Municipality Code look up table. The Parcels dataframe:

df1.head()

    PARID           OWNER1
0   B10 2 1 0131    WILSON ROBERT JR
1   B10 2 18B 0131  COMUNALE MICHAEL J & MARY ANN
2   B10 2 18D 0131  COMUNALE MICHAEL J & MARY ANN
3   B10 2 19F 0131  MONROE & JEFFERSON HOLDINGS LLC
4   B10 4 11 0131   NOEL JAMES H

The Municipality Code dataframe:

df_LU.head()
  PARID  Municipality
0   01  Allen Twp.
1   02  Bangor
2   03  Bath
3   04  Bethlehem
4   05  Bethlehem Twp.

The last two digits in the first column of df1 ('31' in 'B10 2 1 0131') are the Municipality Code that I need to merge with the Municipality Code DataFrame. But of my 30,000 or so records, about 200 end with a letter, as shown below:

        PARID           OWNER1  
299    D11 10 10 0131F  HOWARD THEODORE P & CLAUDIA S   
1007    F10 4 3 0134F   KNEEBONE JUDY ANN   
1011    F10 5 2 0134F   KNEEBONE JUDY ANN   
1114    F8 18 10 0626F  KNITTER WILBERT D JR & AMY J    
1115    F8 18 8 0626F   KNITTER DONALD  

For these rows, the two digits before the trailing letter are the code that I need to extract (like '31' in 'D11 10 10 0131F').

If I just use pd.DataFrame(df1['PARID'].str[-2:]), I get:

PARID
...
299 1F
...

While what I need is:

PARID
...
299 31
...

My code for accomplishing this is pretty lengthy. It pretty much involves:

  1. Join all the rows that end with 2 numbers.
  2. Find out the index of the rows that end with a letter in the 'PARID' field
  3. Join the results from step 2 again with the Municipality look up dataframe.

Here is the code:

#Do the extraction and merge for the rows that end with numbers
df_2015= df1[['PARID','OWNER1']]
df_2015['PARID'] = df_2015['PARID'].str[-2:]
df_15r =pd.merge(df_2015, df_LU, how = 'left', on = 'PARID')
df_15r

#The pivot result for rows generated from above.
Result15_First = df_15r.groupby('Municipality').count()
Result15_First.to_clipboard()

#Check the ID field for rows that end with letters
check15 = df_2015['PARID'].unique()
check15
C = pd.DataFrame({'ID':check15})
NC = C.dropna()
LNC = NC[NC['ID'].str.endswith('F')]
MNC = NC[NC['ID'].str.endswith('A')]
F = [LNC, MNC]
NNC = pd.concat(F, axis = 0)


s = NNC['ID'].tolist()
s

# Identify the records in s

df_p15 = df_2015.loc[df_2015['PARID'].isin(s)]
df_p15

# Separate out a dataframe with just the rows that end with a letter
df15= df1[['PARID','OWNER1']]
df15c = df15[df15.index.isin(df_p15.index)]
df15c

#This step is to create the look up field from the new data frame, the two numbers before the ending letter.
df15c['PARID1'] = df15c['PARID'].str[-3:-1]
df15c

#Then I will join the look up table
df_15t =df15c.merge(df_LU.set_index('PARID'), left_on = 'PARID1', right_index = True)

df_15b = df_15t.groupby('Municipality').count()
df_15b

It wasn't until I finished that I realized how lengthy my code was for a seemingly simple task. If there is a better way to achieve this, which there surely is, please let me know. Thanks.

Upvotes: 2

Views: 10281

Answers (3)

Vaishali

Reputation: 38415

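You can use pandas string methods to extract the last two digits. The greedy .* pushes the match as far right as possible, so the captured group is the last pair of consecutive digits in each string:

df1['PARID'].str.extract(r'.*(\d{2})', expand=False)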

You get

0    31
1    31
2    13
3    13
4    31
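As a rough sketch of how the extracted code could feed into the merge from the question: the snippet below assumes the lookup table stores the code as a two-character string (as df_LU.head() suggests). The sample rows come from the question, but the municipality names are placeholders and MUNI_CODE is a hypothetical column name introduced only for this example.

```python
import pandas as pd

# Sample parcel rows copied from the question
df1 = pd.DataFrame({
    'PARID': ['B10 2 1 0131', 'B10 2 18B 0131', 'D11 10 10 0131F',
              'F10 4 3 0134F', 'F8 18 10 0626F'],
    'OWNER1': ['WILSON ROBERT JR', 'COMUNALE MICHAEL J & MARY ANN',
               'HOWARD THEODORE P & CLAUDIA S', 'KNEEBONE JUDY ANN',
               'KNITTER WILBERT D JR & AMY J'],
})

# Lookup table; the municipality names here are placeholders, not real data
df_LU = pd.DataFrame({
    'PARID': ['26', '31', '34'],
    'Municipality': ['Placeholder A', 'Placeholder B', 'Placeholder C'],
})

# MUNI_CODE (hypothetical name) holds the last pair of consecutive digits
df1['MUNI_CODE'] = df1['PARID'].str.extract(r'.*(\d{2})', expand=False)

# Rename the lookup key so the two PARID columns do not clash, then left-join
merged = df1.merge(df_LU.rename(columns={'PARID': 'MUNI_CODE'}),
                   how='left', on='MUNI_CODE')

# Count parcels per municipality, like the groupby in the question
counts = merged.groupby('Municipality')['PARID'].count()
print(counts)
```

This handles the rows ending in digits and the rows ending in a letter in one pass, so the separate second join should not be needed.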

Upvotes: 3

ababuji

Reputation: 1731

import pandas as pd
df = pd.DataFrame([['M3N6V2 B7 13A 0131','M3N6V2 B7 13B 0131','Y2 7 B13 0213', 'Y2 7 B14 0213', 'M5 N4 12 0231A' ], ['Tom', 'Jerry', 'Jack', 'Chris', 'Alex']])
df = df.T
df.columns = ['PARID', 'Owner']
print(df)

This prints your left DataFrame:

                PARID  Owner
0  M3N6V2 B7 13A 0131    Tom
1  M3N6V2 B7 13B 0131  Jerry
2       Y2 7 B13 0213   Jack
3       Y2 7 B14 0213  Chris
4      M5 N4 12 0231A   Alex

Now add a column holding the two-digit code that will be joined against your right DataFrame:

df['IDPART'] = None
for row in df.index:

    if df.at[row, 'PARID'][-1].isalpha():
        # ID ends with a letter: take the two characters before it
        df.at[row, 'IDPART'] = df.at[row, 'PARID'][-3:-1]

    else:
        # ID ends with digits: take the last two characters
        df.at[row, 'IDPART'] = df.at[row, 'PARID'][-2:]

df['IDPART'] = df['IDPART'].apply(int)  # Convert the join column to integers
print(df)

gives:

                PARID  Owner  IDPART
0  M3N6V2 B7 13A 0131    Tom      31
1  M3N6V2 B7 13B 0131  Jerry      31
2       Y2 7 B13 0213   Jack      13
3       Y2 7 B14 0213  Chris      13
4      M5 N4 12 0231A   Alex      31

and then merge (here otherdf is the municipality lookup table, with the codes stored as integers in its PARID column):

merged = pd.merge(df, otherdf, how='left', left_on='IDPART', right_on='PARID')
print(merged)

gives:

              PARID_x  Owner  IDPART  PARID_y Municipality
0  M3N6V2 B7 13A 0131    Tom      31       31       Tatamy
1  M3N6V2 B7 13B 0131  Jerry      31       31       Tatamy
2       Y2 7 B13 0213   Jack      13       13    Allentown
3       Y2 7 B14 0213  Chris      13       13    Allentown
4      M5 N4 12 0231A   Alex      31       31       Tatamy
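For larger frames, the row-by-row loop could also be written as one vectorized extraction. The following is only a sketch, assuming every PARID ends in two digits optionally followed by a single letter. The otherdf lookup table is reconstructed from the two code/name pairs visible in the merged output above, so it is illustrative rather than the full table.

```python
import pandas as pd

df = pd.DataFrame({
    'PARID': ['M3N6V2 B7 13A 0131', 'M3N6V2 B7 13B 0131', 'Y2 7 B13 0213',
              'Y2 7 B14 0213', 'M5 N4 12 0231A'],
    'Owner': ['Tom', 'Jerry', 'Jack', 'Chris', 'Alex'],
})

# Lookup table rebuilt from the merged output shown above (illustrative only)
otherdf = pd.DataFrame({
    'PARID': [13, 31],
    'Municipality': ['Allentown', 'Tatamy'],
})

# Capture the two digits at the end of the string, allowing one trailing letter
df['IDPART'] = (
    df['PARID']
    .str.extract(r'(\d{2})[A-Za-z]?$', expand=False)
    .astype(int)  # would raise if any row failed to match (NaN)
)

merged = pd.merge(df, otherdf, how='left', left_on='IDPART', right_on='PARID')
print(merged)
```

If some IDs might not match the pattern, it is probably safer to inspect the rows where the extraction returns NaN before converting to int.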

Upvotes: 1

Elke

Reputation: 511

You can use str.replace to remove all non-digits. After that, you should be able to use .str[-2:].

import pandas as pd

df1 = pd.DataFrame({'PARID': pd.Series(["M3N6V2 B7 13A 0131", "M3N6V2 B7 13B 0131",
                                        "Y2 7 B13 0213", "Y2 7 B14 0213", "M5 N4 12 0231A"]),
                    'Owner': pd.Series(["Tom", "Jerry", "Jack", "Chris", "Alex"])})


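# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal by default
df1['PARID'].str.replace(r'\D+', '', regex=True).str[-2:]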
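To see what this produces, here is a small self-contained check. It is a sketch on a handful of IDs taken from this thread, and the variable names parid and codes are only for illustration.

```python
import pandas as pd

# A few IDs from the thread, with and without a trailing letter
parid = pd.Series(['M3N6V2 B7 13A 0131', 'Y2 7 B13 0213',
                   'M5 N4 12 0231A', 'D11 10 10 0131F'])

# Drop every non-digit character, then keep the last two remaining digits
codes = parid.str.replace(r'\D+', '', regex=True).str[-2:]
print(codes.tolist())  # ['31', '13', '31', '31']
```

This relies on the municipality code always being the last two digits anywhere in the string, so it would give a wrong result if a trailing suffix ever contained a digit.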

Upvotes: 3
