user16170404

Reputation: 3

Split DataFrame column based on regex expression with OR

I have a dataframe with several columns of user information where I have the columns "Contact 1" and "Contact 2".

import pandas as pd

d = {'Contact 1': ['1234567891 1234567891', '12345678 12345678', '12345678 1234567891',
                   '1234567891 12345678', '1234567 1234567891', '1234567891',
                   '123456789 12345678911', None],
     'Contact 2': [None, None, None, None, None, '12345678', None, None]}

df = pd.DataFrame(data=d)
Contact 1              Contact 2
1234567891 1234567891  None
12345678 12345678      None
12345678 1234567891    None
1234567891 12345678    None
1234567 1234567891     None
1234567891             12345678
123456789 12345678911  None
None                   None

I want to split the "Contact 1" column on the space between the numbers, but only when the value consists of an 8- or 10-digit number, a space, and another 8- or 10-digit number. The few values I already have in the "Contact 2" column should be preserved.

I tried the following code:


df[['Contact 1', 'Contact 2']]=df['Contact 1'].str.split(r'(?<=^\d{8}|\d{10})\s(?=\d{8}|\d{10}$)', n=1, expand=True)

but I get the error "re.error: look-behind requires fixed-width pattern"

I would like to get the following result:

Contact 1              Contact 2
1234567891             1234567891
12345678               12345678
12345678               1234567891
1234567891             12345678
1234567 1234567891     None
1234567891             12345678
123456789 12345678911  None
None                   None

Upvotes: 0

Views: 88

Answers (2)

Tim Biegeleisen

Reputation: 521194

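Using str.extract (assuming numpy is imported as np). Passing expand=False makes it return a Series, which np.where needs here:

df["Contact 2"] = np.where(df["Contact 2"].isnull(),
                           df["Contact 1"].str.extract(r'^(?:\d{8}|\d{10}) (\d{8}|\d{10})$', expand=False),
                           df["Contact 2"])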

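We also need to update the first column. This has to run after the step above, since that step reads the original "Contact 1" values:

df["Contact 1"] = df["Contact 1"].str.replace(r'^(\d{8}|\d{10}) (?:\d{8}|\d{10})$', r'\1', regex=True)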
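
As a variation on the same idea, both numbers can be captured in a single str.extract call and written back only for the rows that matched. This is a minimal sketch against the sample data from the question; the names pattern, parts, matched and fill are illustrative, and it assumes the two numbers are separated by exactly one space.

```python
import pandas as pd

df = pd.DataFrame({
    "Contact 1": ["1234567891 1234567891", "12345678 12345678",
                  "12345678 1234567891", "1234567891 12345678",
                  "1234567 1234567891", "1234567891",
                  "123456789 12345678911", None],
    "Contact 2": [None, None, None, None, None, "12345678", None, None],
})

# Two capture groups; each must be exactly 8 or exactly 10 digits.
pattern = r"^(\d{8}|\d{10}) (\d{8}|\d{10})$"

# One column per group (0 and 1); rows that do not match are NaN in both.
parts = df["Contact 1"].str.extract(pattern)

matched = parts[0].notna()
# Only fill "Contact 2" where it is currently empty, so existing values survive.
fill = matched & df["Contact 2"].isna()

df.loc[fill, "Contact 2"] = parts.loc[fill, 1]
df.loc[matched, "Contact 1"] = parts.loc[matched, 0]
```

Rows that do not match the pattern (7, 9 or 11 digits, a single number, or None) are left untouched in both columns.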

Upvotes: 2

Chris

Reputation: 16147

If you are interested in a non-regex solution:

Create a mask of rows that meet your conditions (the isinstance check skips missing values, which str.split leaves as non-lists)

m = df['Contact 1'].str.split().apply(lambda x: isinstance(x, list) and len(x) == 2 and all(len(n) in (8, 10) for n in x))

Update df with the split/expanded values

df.update(df.loc[m, 'Contact 1'].str.split(expand=True).rename(columns={0: 'Contact 1',
                                                                         1: 'Contact 2'}), overwrite=True)
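
The same rule can also be written as a small plain-Python helper, which makes it easy to test outside pandas. This is only a sketch: split_contacts is a hypothetical name, and the digit check via str.isdigit is an added assumption (the mask above only checks lengths).

```python
def split_contacts(value):
    """Return [first, second] if value holds two 8- or 10-digit numbers, else None."""
    if not isinstance(value, str):
        return None  # covers None / NaN entries
    parts = value.split()
    if len(parts) == 2 and all(p.isdigit() and len(p) in (8, 10) for p in parts):
        return parts
    return None


samples = ['12345678 1234567891', '1234567 1234567891', '1234567891', None]
results = [split_contacts(s) for s in samples]
# results: [['12345678', '1234567891'], None, None, None]
```

It could be applied with df['Contact 1'].map(split_contacts), writing the non-null results back to the two columns.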

Upvotes: 0
