Reputation: 23
I have dataframe like this.
description |
---|
Brian |
No.22 |
Tel:+00123456789 |
email:[email protected] |
Sandra |
No:43 |
Tel:+00312456789 |
Michel |
No:593 |
Kent |
No:13 |
Engineer |
Tel:04512367890 |
email:[email protected] |
and I want it like this.
name | address | designation | telephone | |
---|---|---|---|---|
Brian | No:22 | null | Tel:+00123456789 | email:[email protected] |
Sandra | No:43 | null | Tel:+00312456789 | null |
Michel | No:593 | null | null | null |
Kent | No:13 | Engineer | Tel:04512367890 | email:[email protected] |
How to do this in python.
Upvotes: 0
Views: 97
Reputation: 120409
Use np.where
to label each row then pivot
your dataframe.
Step 1.
condlist = [df['description'].shift(fill_value='').eq(''),
df['description'].str.contains('^No[:.]'),
df['description'].str.startswith('Tel:'),
df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)
# Output:
description column
0 Brian name
1 No.22 address
2 Tel:+00123456789 telephone
3 email:[email protected] email
4 designation
5 Sandra name
6 No:43 address
7 Tel:+00312456789 telephone
8 designation
9 Michel name
10 No:593 address
11 designation
12 Kent name
13 No:13 address
14 Engineer designation
15 Tel:04512367890 telephone
16 email:[email protected] email
Step 2. Now remove empty rows and create an index to allow the pivot:
df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)
# Output:
description column index
0 Brian name 1
1 No.22 address 1
2 Tel:+00123456789 telephone 1
3 email:[email protected] email 1
5 Sandra name 2
6 No:43 address 2
7 Tel:+00312456789 telephone 2
9 Michel name 3
10 No:593 address 3
12 Kent name 4
13 No:13 address 4
14 Engineer designation 4
15 Tel:04512367890 telephone 4
16 email:[email protected] email 4
Step 3. Pivot your dataframe:
cols = ['name', 'address', 'designation', 'telephone', 'email']
out = df.pivot('index', 'column', 'description')[cols] \
.rename_axis(index=None, columns=None)
print(out)
# Output:
name address designation telephone email
1 Brian No.22 NaN Tel:+00123456789 email:[email protected]
2 Sandra No:43 NaN Tel:+00312456789 NaN
3 Michel No:593 NaN NaN NaN
4 Kent No:13 Engineer Tel:04512367890 email:[email protected]
Edit
There is an error at final step" ValueError: Index contains duplicate entries, cannot reshape" how can I overcome this.
There is no magic to solve this problem because your data are mess. The designation
label is the fallback if the row was not tagged to name
, address
, telephone
and email
. So there is a great chance, you have multiple rows labelled designation
for a same person.
At then end of this step, check if you have duplicates (person/label -> index/column) with this command:
df.value_counts(['index', 'column']).loc[lambda x: x > 1]
Probably (and I hope for you), the output should indicate only designation
label under column
column unless one person can have multiple telephone
or email
. Now you can adjust the condlist
to catch a maximum of pattern. I don't know anything about your data so I can't help you much.
Upvotes: 4