Anton Asiri
Anton Asiri

Reputation: 23

Reshape python dataframe

I have dataframe like this.

description
Brian
No.22
Tel:+00123456789
email:[email protected]
Sandra
No:43
Tel:+00312456789
Michel
No:593
Kent
No:13
Engineer
Tel:04512367890
email:[email protected]

and I want it like this.

name address designation telephone email
Brian No:22 null Tel:+00123456789 email:[email protected]
Sandra No:43 null Tel:+00312456789 null
Michel No:593 null null null
Kent No:13 Engineer Tel:04512367890 email:[email protected]

How to do this in python.

Upvotes: 0

Views: 97

Answers (1)

Corralien
Corralien

Reputation: 120409

Use np.where to label each row then pivot your dataframe.

Step 1.

condlist = [df['description'].shift(fill_value='').eq(''),
            df['description'].str.contains('^No[:.]'),
            df['description'].str.startswith('Tel:'),
            df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)

# Output:
              description       column
0                   Brian         name
1                   No.22      address
2        Tel:+00123456789    telephone
3   email:[email protected]        email
4                          designation
5                  Sandra         name
6                   No:43      address
7        Tel:+00312456789    telephone
8                          designation
9                  Michel         name
10                 No:593      address
11                         designation
12                   Kent         name
13                  No:13      address
14               Engineer  designation
15        Tel:04512367890    telephone
16   email:[email protected]        email

Step 2. Now remove empty rows and create an index to allow the pivot:

df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)

# Output:
              description       column  index
0                   Brian         name      1
1                   No.22      address      1
2        Tel:+00123456789    telephone      1
3   email:[email protected]        email      1
5                  Sandra         name      2
6                   No:43      address      2
7        Tel:+00312456789    telephone      2
9                  Michel         name      3
10                 No:593      address      3
12                   Kent         name      4
13                  No:13      address      4
14               Engineer  designation      4
15        Tel:04512367890    telephone      4
16   email:[email protected]        email      4

Step 3. Pivot your dataframe:

cols = ['name', 'address', 'designation', 'telephone', 'email']
out = df.pivot('index', 'column', 'description')[cols] \
        .rename_axis(index=None, columns=None)
print(out)

# Output:
     name address designation         telephone                  email
1   Brian   No.22         NaN  Tel:+00123456789  email:[email protected]
2  Sandra   No:43         NaN  Tel:+00312456789                    NaN
3  Michel  No:593         NaN               NaN                    NaN
4    Kent   No:13    Engineer   Tel:04512367890   email:[email protected]

Edit

There is an error at final step" ValueError: Index contains duplicate entries, cannot reshape" how can I overcome this.

There is no magic to solve this problem because your data are mess. The designation label is the fallback if the row was not tagged to name, address, telephone and email. So there is a great chance, you have multiple rows labelled designation for a same person.

At then end of this step, check if you have duplicates (person/label -> index/column) with this command:

df.value_counts(['index', 'column']).loc[lambda x: x > 1]

Probably (and I hope for you), the output should indicate only designation label under column column unless one person can have multiple telephone or email. Now you can adjust the condlist to catch a maximum of pattern. I don't know anything about your data so I can't help you much.

Upvotes: 4

Related Questions