Reputation: 47
Good afternoon, Here is my problem: I have a Pandas dataframe that looks like it:
Birth_Year Name
1964 [X, Y]
1965 [A, B]
1964 [F, G, H]
I need to transform it to the following:
Birth_Year Name_1 Name_2 Name_3
1964 X Y
1965 A B
1964 F G H
So I need to get the max length of lists in column 'Name', create the datafame with column for each item in the list, and populate it with the values from the list.
Would really appreciate an advice on how to handle this problem.
I have no good solution for it.
Upvotes: 1
Views: 1039
Reputation: 4108
One way of implementing this is by splitting array and returning multiple columns form pd.apply
:
df.apply(lambda row: [row["Name"][i-1] if i <= len(row["Name"]) else "" for i in range(1, max_len+1)], axis=1)
Full working example (with corner-case tested when array is empty):
data=[
[1964,["X", "Y"]],
[1965,["A", "B"]],
[1964,["F", "G", "H"]],
[1966,["P"]],
[1967,[]],
]
columns = ["Birth_Year","Name"]
df = pd.DataFrame(data=data, columns=columns)
max_len = df["Name"].apply(len).max()
df[[f"Name_{i}" for i in range(1, max_len+1)]] = df.apply(lambda row: [row["Name"][i-1] if i <= len(row["Name"]) else "" for i in range(1, max_len+1)], axis=1, result_type="expand")
[Out]:
Birth_Year Name Name_1 Name_2 Name_3
0 1964 [X, Y] X Y
1 1965 [A, B] A B
2 1964 [F, G, H] F G H
3 1966 [P] P
4 1967 []
Upvotes: 0
Reputation: 21938
You can also try this:
df = pd.concat([df['Birth_Year'],
(df['Name'].apply(lambda x: pd.Series([i for i in x]))
.pipe(lambda df2: df2.set_axis([f'Name_{i}' for i in range(1, len(df2.columns)+1)], axis=1)))
], axis=1)
df
Birth_Year Name_1 Name_2 Name_3
0 1964 X Y NaN
1 1965 A B NaN
2 1964 F G H
Upvotes: 0
Reputation: 1016
To get close to your data example:
import pandas as pd
df = pd.DataFrame(index=[1964, 1965, 1964],
data={'Name': [['X', 'Y'], ['A', 'B'], ['F', 'G', 'H']]})
df.index.name = 'Birth Year'
Solution:
df = df.Name.str.join(',').str.split(',', expand=True).fillna('')
df = df.rename(columns=lambda c: f'Name_{c+1}')
I found relying on the string methods to make this pretty easy to follow. But I think mozway's answer should be preferred. I believe it to be more efficient and faster.
Output:
Name_1 Name_2 Name_3
Birth Year
1964 X Y
1965 A B
1964 F G H
Upvotes: 0
Reputation: 261850
Assuming you have real lists in "Name", you can use:
df = df.join(pd.DataFrame(df.pop('Name').tolist(), index=df.index)
.fillna('')
.rename(columns=lambda c: f'Name_{c+1}')
)
Output:
Birth_Year Name_1 Name_2 Name_3
0 1964 X Y
1 1965 A B
2 1964 F G H
Upvotes: 2