Reputation: 321

Split column data based on Condition Pandas Dataframe

I have a large data set and looking for something that will split my Street Address into two columns Street Number and Street Name.

I am trying to figure out how can I do this efficiently since I first need to process the street address and then check if the first index of the split has a digit or not.

So far I have a working code that looks like this. I created a two function one for extracting street number data from the street address, while the other one replaces the first occurrence of that street number from the street address.

def extract_street_number(row):
      if any(map(str.isdigit, row.split(" ")[0])):
        return row.split(" ")[0]


def extract_street_name(address, streetnumber):
    if streetnumber:
        return address.replace(streetnumber, "", 1)
    else:
        return address

Then using the apply function to have the two columns.

df[street_number] = df.apply(lambda row: extract_street_number(row[address_col]), axis=1)
df[street_name] = df.apply(lambda row: extract_street_name(row[address_col], row[street_number]), axis=1)

I'm wondering if there is a more efficient way to do this? Based on this current routine I need to build first the Street Number Column before I process the street name column.

I'm thinking of something like building the two series on the first iteration of the address column. The pseudo-code is something like this I just can't figure it out how can I code it in python.

Pseudocode:

Split Address into two columns based on first space that encounters a non-numeric character: street_data = address.split(" ", maxsplit=1)
If street_data[0] has digits then return the columns on this way:

      df[street_number] = street_data[0]
      df[street_name] = street_data[1]

Else if street_data[0] is not digit then return the columns on this way:

      df[street_number] = ""
      df[street_name] = street_data[0] + " " + street_data[1]
      # or just simply the address
      df[street_name] = address

By the way this is the working sample of the data:

# In
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St', '513 Banks St', '5600 77 Center Dr', '1013 1/2 E Main St', '1234C Main St', '37-01 Fair Lawn Ave']})

# Out
   Street_Number    Street_Name
0           111     Rubin Center
1                   Monroe St
2           513     Banks St
3        560 77     Center Dr
4      1013 1/2     E Main St
5         1234C     Main St
6         37-01     Fair Lawn Ave

Upvotes: 6

Answers (3)

rohetoric

Reputation: 354

TL;DR: This can be achieved in three steps-

Step 1-

df['Street Number'] = [street_num[0] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]

Step 2-

df['Street Address'] = [street_num[1] if any(i.isdigit() for i in street_num[0]) else 'N/A' for street_num in df.Address.apply(lambda s: s.split(" ",1))]

Step 3-

df['Street Address'].loc[df['Street Address'].str.contains("N/A") == True] = df1['Address'].loc[df1['Street Address'].str.contains("N/A") == True]

Explanation-

Added two more test cases in the dataframe for code flexibility (Row 7,8)-

Step 1 - We separate the street numbers from the address here. This is done by slicing the first element from the list after splitting the address string and initialising to Street Number column. If the first element doesn't contain a number, N/A is appended in the Street Number column.

Step 2 - As the first element in the sliced string contains the Street Number, the second element has to be the Street Address hence is appended to the Street Address column.

Step 3 - Due to step two, the Street Address become 'N/A' for the 'Address` that do not contain a number and that is resolved by this -

Hence, we can solve this in three steps after hours of struggle put in.

Upvotes: 3

sammywemmy

Reputation: 28644

#mock test
df = pd.DataFrame({'Address':['111 Rubin Center', 'Monroe St',
                              '513 Banks St', 'Banks 513 St',
                              'Rub Cent 111']})

unless i'm missing something, a bit of regex should solve ur request:

#gets number only if it starts the line
df['Street_Number'] = df.Address.str.extract(r'(^\d+)')
#splits only if number is at the start of the line
df['Street_Name'] = df.Address.str.split('^\d+').str[-1]


  Address           street_number   street_name
0   111 Rubin Center    111         Rubin Center
1   Monroe St           NaN         Monroe St
2   513 Banks St        513         Banks St
3   Banks 513 St        NaN         Banks 513 St
4   Rub Cent 111        NaN         Rub Cent 111

let me know where this falls flat

Upvotes: 0

Oskar_U

Reputation: 482

solution reflecting your pseudocode is below. First lets divide "Address" and store is somewhere

new = df["Address"].str.split(" ", n = 1, expand = True)
df["First Part"]= new[0]
df["Last Part"]= new[1]

Next let's write down conditions

cond1 = df['First Part'].apply(str.isdigit)
cond2 = df['Last Part'].apply(str.isdigit)

Now check what meets given conditions

df.loc[cond1 & ~cond2, "Street"] = df.loc[cond1 & ~cond2, "Last Part"]
df.loc[cond1 & ~cond2, "Number"] = df.loc[cond1 & ~cond2, "First Part"]
df.loc[~cond1 & ~cond2, "Street"] = df.loc[~cond1 & ~cond2, ['First Part', 'Last Part']].apply(lambda x: x[0] + ' ' + x[1], axis = 1)

Finally let's clean-up those auxiliary columns

df.drop(["First Part", "Last Part"], axis = 1, inplace=True)
df

   Address            Street         Number
0  111 Rubin Center   Rubin Center   111
1  Monroe St          Monroe St      NaN
2  513 Banks St       Banks St       513

Upvotes: 0

Split column data based on Condition Pandas Dataframe

Answers (3)

Related Questions