Split DataFrame column based on regex expression with OR

Question

I have a DataFrame with several columns of user information, including the columns "Contact 1" and "Contact 2".

import pandas as pd

d = {'Contact 1': ['1234567891 1234567891', '12345678 12345678', '12345678 1234567891', '1234567891 12345678', '1234567 1234567891',
                   '1234567891', '123456789 12345678911', None],
     'Contact 2': [None, None, None, None, None, '12345678', None, None]}

df = pd.DataFrame(data=d)

Contact 1	Contact 2
1234567891 1234567891	None
12345678 12345678	None
12345678 1234567891	None
1234567891 12345678	None
1234567 1234567891	None
1234567891	12345678
123456789 12345678911	None
None	None

I want to split the "Contact 1" column on the space between the numbers, but only when the value is an 8- or 10-digit number, followed by a space, followed by another 8- or 10-digit number. I also need to preserve the little information I have in the "Contact 2" column.

I tried the following code:


df[['Contact 1', 'Contact 2']]=df['Contact 1'].str.split(r'(?<=^\d{8}|\d{10})\s(?=\d{8}|\d{10}$)', n=1, expand=True)

but I get the error "re.error: look-behind requires fixed-width pattern".

I would like to get the following result:

Contact 1	Contact 2
1234567891	1234567891
12345678	12345678
12345678	1234567891
1234567891	12345678
1234567 1234567891	None
1234567891	12345678
123456789 12345678911	None
None	None

Chris · Accepted Answer

If you are interested in a non-regex solution:

Create a mask of the rows that meet your conditions:

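# Missing values are not lists after .str.split(), so guard before iterating
m = df['Contact 1'].str.split().apply(lambda x: isinstance(x, list) and all(len(n) in (8, 10) for n in x))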

Update df with the split/expanded values:

df.update(df.loc[m, 'Contact 1'].str.split(expand=True).rename(columns={0: 'Contact 1',
                                                                        1: 'Contact 2'}), overwrite=True)
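
For comparison, here is a rough sketch of a regex-based variant. The error in the question comes from Python's re module, which needs every alternative inside a look-behind to have the same width, so \d{8}|\d{10} is rejected there. One way around it, assuming a full-match pattern with two capture groups is acceptable for your data, is to use str.extract instead of str.split. The names pattern and parts are just illustrative.

```python
import pandas as pd

d = {'Contact 1': ['1234567891 1234567891', '12345678 12345678', '12345678 1234567891',
                   '1234567891 12345678', '1234567 1234567891', '1234567891',
                   '123456789 12345678911', None],
     'Contact 2': [None, None, None, None, None, '12345678', None, None]}
df = pd.DataFrame(data=d)

# Two capture groups, each exactly 8 or 10 digits, separated by one whitespace
# character. No look-behind is involved, so the fixed-width restriction does not apply.
pattern = r'^(\d{8}|\d{10})\s(\d{8}|\d{10})$'

# Rows that do not match the whole pattern come back as NaN in both columns.
parts = df['Contact 1'].str.extract(pattern)
parts.columns = ['Contact 1', 'Contact 2']

# df.update only copies non-NA values, so unmatched rows (and the existing
# "Contact 2" value on the single-number row) are left untouched.
df.update(parts)

print(df)
```

Unlike the mask approach, this only touches rows that contain exactly two numbers, and it also rejects tokens that have the right length but are not all digits.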
