Reputation: 40879
In Pandas, is it possible to have a dataframe with a column that contains a varying number of subcolumns?
For example, suppose I have this CSV file:
transactionId, userName, date, itemList, totalCost
where the itemList column contains a variable number of itemId;itemPrice pairs, with the pairs separated by a pipe (|). There is no upper bound on the number of itemId;itemPrice pairs in the list:
itemId ; itemPrice | itemId ; itemPrice
Here are some examples of rows:
transactionId, userName, date, itemList, totalCost
123, Bob , 7/29/2017, ABC;10|XYZ;20, 30
234, Alice, 7/31/2017, CDE;20|QRS;15|KLM;10, 45
The first row has two itemId;itemPrice
pairs, while the second row has three pairs.
How can I create a dataframe to contain this information? Would I need a dataframe inside a dataframe?
There are other Stack Overflow posts about a variable number of columns, but they assume a maximum number of columns.
Upvotes: 2
Views: 633
Reputation: 294218
You can parse them with a list comprehension:
d1 = df.assign(
itemList=[[x.split(';') for x in y.split('|')] for y in df.itemList.tolist()]
)
d1
transactionId userName date itemList totalCost
0 123 Bob 7/29/2017 [[ABC, 10], [XYZ, 20]] 30
1 234 Alice 7/31/2017 [[CDE, 20], [QRS, 15], [KLM, 10]] 45
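A minimal, self-contained sketch of this approach (the DataFrame is built by hand from the sample rows in the question, since the original CSV isn't available):

```python
import pandas as pd

# Sample data matching the question's rows
df = pd.DataFrame({
    'transactionId': [123, 234],
    'userName': ['Bob', 'Alice'],
    'date': ['7/29/2017', '7/31/2017'],
    'itemList': ['ABC;10|XYZ;20', 'CDE;20|QRS;15|KLM;10'],
    'totalCost': [30, 45],
})

# Split each itemList on '|' into pairs, then each pair on ';'
d1 = df.assign(
    itemList=[[x.split(';') for x in y.split('|')] for y in df.itemList.tolist()]
)
print(d1.itemList[0])  # [['ABC', '10'], ['XYZ', '20']]
```

Each cell of itemList is now a plain Python list of [itemId, itemPrice] pairs, so rows of different lengths coexist in one column.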
Response to Comment
import numpy as np

# sum the price (second element) of each [itemId, itemPrice] pair
f = lambda x: np.array(x)[:, 1].astype(int).sum()
d1.assign(sumPrice=d1.itemList.apply(f))
transactionId userName date itemList totalCost sumPrice
0 123 Bob 7/29/2017 [[ABC, 10], [XYZ, 20]] 30 30
1 234 Alice 7/31/2017 [[CDE, 20], [QRS, 15], [KLM, 10]] 45 45
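To sanity-check that the recomputed sums match totalCost, here is a small standalone sketch (assuming a d1 frame with the parsed itemList column as above):

```python
import numpy as np
import pandas as pd

# Parsed itemList column, as produced by the list comprehension above
d1 = pd.DataFrame({
    'itemList': [[['ABC', '10'], ['XYZ', '20']],
                 [['CDE', '20'], ['QRS', '15'], ['KLM', '10']]],
    'totalCost': [30, 45],
})

# Take column 1 (the prices) of each nested array, cast to int, and sum
f = lambda x: np.array(x)[:, 1].astype(int).sum()
sums = d1.itemList.apply(f)
assert (sums == d1.totalCost).all()
```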
Upvotes: 2
Reputation: 210832
I'd try to normalize your data, as proposed by @DYZ in the comments:
In [145]: df = df.join(df.pop('itemList')
...: .str.extractall(r'(?P<item>\w+);(?P<price>\d+)')
...: .reset_index(level=1, drop=True))
...:
In [146]: df
Out[146]:
transactionId userName date totalCost item price
0 123 Bob 7/29/2017 30 ABC 10
0 123 Bob 7/29/2017 30 XYZ 20
1 234 Alice 7/31/2017 45 CDE 20
1 234 Alice 7/31/2017 45 QRS 15
1 234 Alice 7/31/2017 45 KLM 10
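As a runnable sketch of the normalization step (sample data reconstructed from the question; extractall returns one row per regex match with a MultiIndex of original row index and match number, and dropping the match level lets join() align each match back to its source row):

```python
import pandas as pd

df = pd.DataFrame({
    'transactionId': [123, 234],
    'itemList': ['ABC;10|XYZ;20', 'CDE;20|QRS;15|KLM;10'],
    'totalCost': [30, 45],
})

# pop() removes itemList from df and returns it; extractall pulls out
# every item;price match; reset_index drops the per-row match counter
long = df.join(
    df.pop('itemList')
      .str.extractall(r'(?P<item>\w+);(?P<price>\d+)')
      .reset_index(level=1, drop=True)
)
print(long)
```

The result has one row per item, with the transaction columns repeated, which is exactly the long/tidy shape shown above.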
Normalized data lets us apply pandas/NumPy/SciPy functions directly to columns containing scalar values.
Demo: checking totalCost
df.price = pd.to_numeric(df.price, errors='coerce')
In [151]: df.assign(tot2=df.groupby(level=0).price.transform('sum'))
Out[151]:
transactionId userName date totalCost item price tot2
0 123 Bob 7/29/2017 30 ABC 10 30
0 123 Bob 7/29/2017 30 XYZ 20 30
1 234 Alice 7/31/2017 45 CDE 20 45
1 234 Alice 7/31/2017 45 QRS 15 45
1 234 Alice 7/31/2017 45 KLM 10 45
In [152]: df.assign(tot2=df.groupby(level=0).price.transform('sum')).query("totalCost != tot2")
Out[152]:
Empty DataFrame
Columns: [transactionId, userName, date, totalCost, item, price, tot2]
Index: []
PS: the empty DataFrame above shows that there are no rows where totalCost != sum(price).
Upvotes: 6