Reputation: 40879
In Pandas, is it possible to have a dataframe with a column that contains a varying number of subcolumns?
For example, suppose I have this CSV file:
transactionId, userName, date, itemList, totalCost
where the itemList column contains a variable number of itemId;itemPrice pairs, with the pairs separated by a pipe (|). There is no upper bound on the number of itemId;itemPrice pairs in the list:
itemId ; itemPrice | itemId ; itemPrice
Here are some examples of rows:
transactionId, userName, date, itemList, totalCost
123, Bob , 7/29/2017, ABC;10|XYZ;20, 30
234, Alice, 7/31/2017, CDE;20|QRS;15|KLM;10, 45
The first row has two itemId;itemPrice
pairs, while the second row has three pairs.
How can I create a dataframe to contain this information? Would I need a dataframe inside a dataframe?
There are other Stack Overflow posts about a variable number of columns, but they assume a maximum number of columns.
Upvotes: 2
Views: 633
Reputation: 294218
You can parse them with a list comprehension:
d1 = df.assign(
itemList=[[x.split(';') for x in y.split('|')] for y in df.itemList.tolist()]
)
d1
transactionId userName date itemList totalCost
0 123 Bob 7/29/2017 [[ABC, 10], [XYZ, 20]] 30
1 234 Alice 7/31/2017 [[CDE, 20], [QRS, 15], [KLM, 10]] 45
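A minimal, self-contained sketch of this approach (the DataFrame is built by hand from the sample rows in the question, since the original CSV isn't available):

```python
import pandas as pd

# Sample data matching the question's rows
df = pd.DataFrame({
    'transactionId': [123, 234],
    'userName': ['Bob', 'Alice'],
    'date': ['7/29/2017', '7/31/2017'],
    'itemList': ['ABC;10|XYZ;20', 'CDE;20|QRS;15|KLM;10'],
    'totalCost': [30, 45],
})

# Split each itemList on '|' into pairs, then each pair on ';'
d1 = df.assign(
    itemList=[[x.split(';') for x in y.split('|')] for y in df.itemList.tolist()]
)
print(d1.itemList[0])  # [['ABC', '10'], ['XYZ', '20']]
```

Each cell of itemList is now a plain Python list of [itemId, itemPrice] pairs, so rows of different lengths coexist in one column.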
Response to Comment
import numpy as np

# sum the price (second element) of each [itemId, itemPrice] pair
f = lambda x: np.array(x)[:, 1].astype(int).sum()
d1.assign(sumPrice=d1.itemList.apply(f))
transactionId userName date itemList totalCost sumPrice
0 123 Bob 7/29/2017 [[ABC, 10], [XYZ, 20]] 30 30
1 234 Alice 7/31/2017 [[CDE, 20], [QRS, 15], [KLM, 10]] 45 45
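To sanity-check that the recomputed sums match totalCost, here is a small standalone sketch (assuming a d1 frame with the parsed itemList column as above):

```python
import numpy as np
import pandas as pd

# Parsed itemList column, as produced by the list comprehension above
d1 = pd.DataFrame({
    'itemList': [[['ABC', '10'], ['XYZ', '20']],
                 [['CDE', '20'], ['QRS', '15'], ['KLM', '10']]],
    'totalCost': [30, 45],
})

# Take column 1 (the prices) of each nested array, cast to int, and sum
f = lambda x: np.array(x)[:, 1].astype(int).sum()
sums = d1.itemList.apply(f)
assert (sums == d1.totalCost).all()
```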
Upvotes: 2
Reputation: 210832
I'd try to normalize your data, as proposed by @DYZ in the comments:
In [145]: df = df.join(df.pop('itemList')
...: .str.extractall(r'(?P<item>\w+);(?P<price>\d+)')
...: .reset_index(level=1, drop=True))
...:
In [146]: df
Out[146]:
transactionId userName date totalCost item price
0 123 Bob 7/29/2017 30 ABC 10
0 123 Bob 7/29/2017 30 XYZ 20
1 234 Alice 7/31/2017 45 CDE 20
1 234 Alice 7/31/2017 45 QRS 15
1 234 Alice 7/31/2017 45 KLM 10
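As a runnable sketch of the normalization step (sample data reconstructed from the question; extractall returns one row per regex match with a MultiIndex of original row index and match number, and dropping the match level lets join() align each match back to its source row):

```python
import pandas as pd

df = pd.DataFrame({
    'transactionId': [123, 234],
    'itemList': ['ABC;10|XYZ;20', 'CDE;20|QRS;15|KLM;10'],
    'totalCost': [30, 45],
})

# pop() removes itemList from df and returns it; extractall pulls out
# every item;price match; reset_index drops the per-row match counter
long = df.join(
    df.pop('itemList')
      .str.extractall(r'(?P<item>\w+);(?P<price>\d+)')
      .reset_index(level=1, drop=True)
)
print(long)
```

The result has one row per item, with the transaction columns repeated, which is exactly the long/tidy shape shown above.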
Normalized data lets us apply pandas/NumPy/SciPy functions directly to columns containing scalar values.
Demo: checking totalCost
df.price = pd.to_numeric(df.price, errors='coerce')
In [151]: df.assign(tot2=df.groupby(level=0).price.transform('sum'))
Out[151]:
transactionId userName date totalCost item price tot2
0 123 Bob 7/29/2017 30 ABC 10 30
0 123 Bob 7/29/2017 30 XYZ 20 30
1 234 Alice 7/31/2017 45 CDE 20 45
1 234 Alice 7/31/2017 45 QRS 15 45
1 234 Alice 7/31/2017 45 KLM 10 45
In [152]: df.assign(tot2=df.groupby(level=0).price.transform('sum')).query("totalCost != tot2")
Out[152]:
Empty DataFrame
Columns: [transactionId, userName, date, totalCost, item, price, tot2]
Index: []
PS: the empty DataFrame above shows that there are no rows where totalCost != sum(price).
Upvotes: 6