Will
Will

Reputation: 135

Reorganizing extraneous data in pandas

I have a dataframe organized like this...

**Name** | **Mealtime** | **Food**  
John | 8:00 am | cereal  
John | 1:00 pm | salad  
John | 6:00 pm | pasta  
Sara | 8:00 am | yogurt  
Sara | 1:00 pm | panini  
Sara | 6:00 pm | steak  
Jake | 8:00 am | coffee  
Jake | 1:00 pm | burrito  
Jake | 6:00 pm | salad

And I need it to be organized like this

____| 8:00 am | 1:00 pm | 6:00 pm  
John | cereal | salad | pasta  
Sara | yogurt | panini | steak  
Jake | coffee | burrito | salad

How would I go about doing that? Note that I am not looking for an aesthetic change. Eventually I would like to select rows of data to feed into a for loop in such a way as the name can be determined from the specific food entry. For example, having grabbed the "1:00 pm" column, "salad" would direct me to John. But if I had grabbed the "6:00 pm" column, "salad" would direct me to Jake.

Upvotes: 2

Views: 68

Answers (2)

piRSquared
piRSquared

Reputation: 294498

Option 0
numpy slicing with pd.factorize

I'm going to use pd.factorize to produce integer values for each unique 'Name' and 'Mealtime'. Along side it, I get the unique values as well. Each integer value ends up corresponding to the element in that integer position within the unique value array.

Now I know the number of unique names and unique mealtimes, so I know the size of my eventual pivoted array. So I create an empty one that will await to have values assigned.

Since I factorized and the factors are integer positions, I can use those to slice my empty array for assignment purposes... so I do. I finally, put it all together with a new shiny dataframe.

fn, un = pd.factorize(df.Name.values)
fm, um = pd.factorize(df.Mealtime.values)
v = np.empty((un.size, um.size), dtype=object)
v[fn, fm] = df.Food.values
pd.DataFrame(v, un, um)

     8:00 am  1:00 pm 6:00 pm
John  cereal    salad   pasta
Sara  yogurt   panini   steak
Jake  coffee  burrito   salad

Option 1
Using set_index/unstack

df.set_index(['Name', 'Mealtime']).Food.unstack()

Mealtime  1:00 pm 6:00 pm 8:00 am
Name                             
Jake      burrito   salad  coffee
John        salad   pasta  cereal
Sara       panini   steak  yogurt

Option 2
Using defaultdict

from collections import defaultdict

d = defaultdict(dict)
[d[m].setdefault(n, f) for n, m, f in df.values];
pd.DataFrame(d)

      1:00 pm 6:00 pm 8:00 am
Jake  burrito   salad  coffee
John    salad   pasta  cereal
Sara   panini   steak  yogurt

Option 3
Brute Force

out = pd.DataFrame(index=df.Name.unique(), columns=df.Mealtime.unique())
[out.set_value(n, m, f) for n, m, f in df.values]
out

     8:00 am  1:00 pm 6:00 pm
John  cereal    salad   pasta
Sara  yogurt   panini   steak
Jake  coffee  burrito   salad

Timing
Conclusion: use pivot for larger data with pure pandas. Hard to beat its simplicity and ability to get the job done in this scenario. But if you want to pivot with reckless speed, try Option 0.

%%timeit
fn, un = pd.factorize(df.Name.values)
fm, um = pd.factorize(df.Mealtime.values)
v = np.empty((un.size, um.size), dtype=object)
v[fn, fm] = df.Food.values
pd.DataFrame(v, un, um)

%timeit df.set_index(['Name', 'Mealtime']).Food.unstack()
%timeit df.pivot('Name', 'Mealtime', 'Food')
%timeit d = defaultdict(dict); [d[m].setdefault(n, f) for n, m, f in df.values]; pd.DataFrame(d)
%timeit out = pd.DataFrame(index=df.Name.unique(), columns=df.Mealtime.unique()); [out.set_value(n, m, f) for n, m, f in df.values]; out

Small data

1000 loops, best of 3: 300 µs per loop
1000 loops, best of 3: 1.82 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 541 µs per loop
1000 loops, best of 3: 656 µs per loop

Larger data

from string import ascii_letters

foods = np.array([
    'cereal', 'salad', 'pasta', 'yogurt',
    'panini', 'steak', 'coffee', 'burrito'
], dtype=object)
times = pd.date_range(
    pd.datetime.now().date(), periods=24, freq='H'
).strftime('%-I:00 %p')
names = list(ascii_letters)

idx = pd.MultiIndex.from_product([names, times], names=['Name', 'Mealtime'])
df = pd.DataFrame(dict(
    Food=np.random.choice(foods, idx.size),
), idx).reset_index()

1000 loops, best of 3: 383 µs per loop
1000 loops, best of 3: 1.99 ms per loop
1000 loops, best of 3: 1.34 ms per loop
100 loops, best of 3: 2.78 ms per loop
100 loops, best of 3: 6.6 ms per loop

Upvotes: 3

cs95
cs95

Reputation: 402852

You need df.pivot:

In [606]: df.pivot(index='Name', columns='Mealtime', values='Food')
Out[606]: 
Mealtime  1:00 pm 6:00 pm 8:00 am
Name                             
Jake      burrito   salad  coffee
John        salad   pasta  cereal
Sara       panini   steak  yogurt

Upvotes: 4

Related Questions