Reshape python dataframe

Question

I have dataframe like this.

description
Brian
No.22
Tel:+00123456789
email:brain@email.com

Sandra
No:43
Tel:+00312456789

Michel
No:593

Kent
No:13
Engineer
Tel:04512367890
email:kent@yahoo.com

and I want it like this.

name	address	designation	telephone	email
Brian	No:22	null	Tel:+00123456789	email:brain@email.com
Sandra	No:43	null	Tel:+00312456789	null
Michel	No:593	null	null	null
Kent	No:13	Engineer	Tel:04512367890	email:kent@yahoo.com

How to do this in python.

Corralien · Accepted Answer

Use np.where to label each row then pivot your dataframe.

Step 1.

condlist = [df['description'].shift(fill_value='').eq(''),
            df['description'].str.contains('^No[:.]'),
            df['description'].str.startswith('Tel:'),
            df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)

# Output:
              description       column
0                   Brian         name
1                   No.22      address
2        Tel:+00123456789    telephone
3   email:brain@email.com        email
4                          designation
5                  Sandra         name
6                   No:43      address
7        Tel:+00312456789    telephone
8                          designation
9                  Michel         name
10                 No:593      address
11                         designation
12                   Kent         name
13                  No:13      address
14               Engineer  designation
15        Tel:04512367890    telephone
16   email:kent@yahoo.com        email

Step 2. Now remove empty rows and create an index to allow the pivot:

df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)

# Output:
              description       column  index
0                   Brian         name      1
1                   No.22      address      1
2        Tel:+00123456789    telephone      1
3   email:brain@email.com        email      1
5                  Sandra         name      2
6                   No:43      address      2
7        Tel:+00312456789    telephone      2
9                  Michel         name      3
10                 No:593      address      3
12                   Kent         name      4
13                  No:13      address      4
14               Engineer  designation      4
15        Tel:04512367890    telephone      4
16   email:kent@yahoo.com        email      4

Step 3. Pivot your dataframe:

cols = ['name', 'address', 'designation', 'telephone', 'email']
out = df.pivot('index', 'column', 'description')[cols] \
        .rename_axis(index=None, columns=None)
print(out)

# Output:
     name address designation         telephone                  email
1   Brian   No.22         NaN  Tel:+00123456789  email:brain@email.com
2  Sandra   No:43         NaN  Tel:+00312456789                    NaN
3  Michel  No:593         NaN               NaN                    NaN
4    Kent   No:13    Engineer   Tel:04512367890   email:kent@yahoo.com

Edit

There is an error at final step" ValueError: Index contains duplicate entries, cannot reshape" how can I overcome this.

There is no magic to solve this problem because your data are mess. The designation label is the fallback if the row was not tagged to name, address, telephone and email. So there is a great chance, you have multiple rows labelled designation for a same person.

At then end of this step, check if you have duplicates (person/label -> index/column) with this command:

df.value_counts(['index', 'column']).loc[lambda x: x > 1]

Probably (and I hope for you), the output should indicate only designation label under column column unless one person can have multiple telephone or email. Now you can adjust the condlist to catch a maximum of pattern. I don't know anything about your data so I can't help you much.

Reshape python dataframe

Answers (1)

Related Questions