Harold Chaw
Harold Chaw

Reputation: 69

Partition a column based on keywords in Pandas

I have an excel file that I have read into python as a dataframe as shown.

   Store       List
   Ralphs      bag1   
               apple
               pear
   Ralphs      bag2
               toilet paper 
   Albertsons  bag3
               magazines
               paper

I am attempting to expand the second column into two columns as shown.

   Bag       Item
   bag1      Apple
   bag1      Pear
   bag2      Toilet Paper       
   bag3      Magazines
   bag3      paper

The store column has Na's whenever an item is listed in the original table. Is there some pandas function that has a seperate function based on the Na's? or would it be easier to create a Series with all the bags and use that as some cutting point?

Upvotes: 1

Views: 349

Answers (2)

Mohamed Ali JAMAOUI
Mohamed Ali JAMAOUI

Reputation: 14689

You can add a column that assign to each item its bag then groupby bag id as follows:

df['bag'] = "bag"+df.List.str.contains('bag').cumsum().astype(str)    
df[['bag','List']].groupby('bag').apply(lambda x: x[1:]).reset_index(drop=True)

or, instead of groupby, filter out the elements where df.List == df.bag (which will be bag1 == bag1, bag2 == bag2 .. )

df['bag'] = "bag"+df.List.str.contains('bag').cumsum().astype(str)    
df[df.List != df.bag].drop('Store', axis=1)

Overall execution

In [83]: df
Out[83]: 
        Store          List
0      Ralphs          bag1
1         NaN         apple
2         NaN          pear
3      Ralphs          bag2
4         NaN  toilet paper
5  Albertsons          bag3
6         NaN     magazines
7         Nan         paper

In [84]: df['bag'] = "bag"+df.List.str.contains('bag').cumsum().astype(str)

In [85]: df[['bag','List']].groupby('bag').apply(lambda x: x[1:]).reset_index(drop=True)
Out[85]: 
    bag          List
0  bag1         apple
1  bag1          pear
2  bag2  toilet paper
3  bag3     magazines
4  bag3         paper

Upvotes: 0

cs95
cs95

Reputation: 402553

Setup

df
        Store          List
0      Ralphs          bag1
1      Ralphs         apple
2      Ralphs          pear
3      Ralphs          bag2
4      Ralphs  toilet paper
5  Albertsons          bag3
6  Albertsons     magazines
7  Albertsons         paper

You can use df.where with str.contains, extracting all rows which start with bag, and then filter out spurious rows.

df = df.assign(Bag=df.List.\
       where(df.List.str.contains('bag')).ffill()).drop('Store', 1)
df[df.List != df.Bag]

           List   Bag
1         apple  bag1
2          pear  bag1
4  toilet paper  bag2
6     magazines  bag3
7         paper  bag3

Upvotes: 1

Related Questions