Reputation: 135
I have a dataframe organized like this...
**Name** | **Mealtime** | **Food**
John | 8:00 am | cereal
John | 1:00 pm | salad
John | 6:00 pm | pasta
Sara | 8:00 am | yogurt
Sara | 1:00 pm | panini
Sara | 6:00 pm | steak
Jake | 8:00 am | coffee
Jake | 1:00 pm | burrito
Jake | 6:00 pm | salad
And I need it to be organized like this
____| 8:00 am | 1:00 pm | 6:00 pm
John | cereal | salad | pasta
Sara | yogurt | panini | steak
Jake | coffee | burrito | salad
How would I go about doing that? Note that I am not looking for an aesthetic change. Eventually I would like to select rows of data to feed into a for loop in such a way as the name can be determined from the specific food entry. For example, having grabbed the "1:00 pm" column, "salad" would direct me to John. But if I had grabbed the "6:00 pm" column, "salad" would direct me to Jake.
Upvotes: 2
Views: 68
Reputation: 294498
Option 0
numpy
slicing with pd.factorize
I'm going to use pd.factorize
to produce integer values for each unique 'Name'
and 'Mealtime'
. Along side it, I get the unique values as well. Each integer value ends up corresponding to the element in that integer position within the unique value array.
Now I know the number of unique names and unique mealtimes, so I know the size of my eventual pivoted array. So I create an empty one that will await to have values assigned.
Since I factorized and the factors are integer positions, I can use those to slice my empty array for assignment purposes... so I do. I finally, put it all together with a new shiny dataframe.
fn, un = pd.factorize(df.Name.values)
fm, um = pd.factorize(df.Mealtime.values)
v = np.empty((un.size, um.size), dtype=object)
v[fn, fm] = df.Food.values
pd.DataFrame(v, un, um)
8:00 am 1:00 pm 6:00 pm
John cereal salad pasta
Sara yogurt panini steak
Jake coffee burrito salad
Option 1
Using set_index
/unstack
df.set_index(['Name', 'Mealtime']).Food.unstack()
Mealtime 1:00 pm 6:00 pm 8:00 am
Name
Jake burrito salad coffee
John salad pasta cereal
Sara panini steak yogurt
Option 2
Using defaultdict
from collections import defaultdict
d = defaultdict(dict)
[d[m].setdefault(n, f) for n, m, f in df.values];
pd.DataFrame(d)
1:00 pm 6:00 pm 8:00 am
Jake burrito salad coffee
John salad pasta cereal
Sara panini steak yogurt
Option 3
Brute Force
out = pd.DataFrame(index=df.Name.unique(), columns=df.Mealtime.unique())
[out.set_value(n, m, f) for n, m, f in df.values]
out
8:00 am 1:00 pm 6:00 pm
John cereal salad pasta
Sara yogurt panini steak
Jake coffee burrito salad
Timing
Conclusion: use pivot
for larger data with pure pandas. Hard to beat its simplicity and ability to get the job done in this scenario. But if you want to pivot with reckless speed, try Option 0.
%%timeit
fn, un = pd.factorize(df.Name.values)
fm, um = pd.factorize(df.Mealtime.values)
v = np.empty((un.size, um.size), dtype=object)
v[fn, fm] = df.Food.values
pd.DataFrame(v, un, um)
%timeit df.set_index(['Name', 'Mealtime']).Food.unstack()
%timeit df.pivot('Name', 'Mealtime', 'Food')
%timeit d = defaultdict(dict); [d[m].setdefault(n, f) for n, m, f in df.values]; pd.DataFrame(d)
%timeit out = pd.DataFrame(index=df.Name.unique(), columns=df.Mealtime.unique()); [out.set_value(n, m, f) for n, m, f in df.values]; out
Small data
1000 loops, best of 3: 300 µs per loop
1000 loops, best of 3: 1.82 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 541 µs per loop
1000 loops, best of 3: 656 µs per loop
Larger data
from string import ascii_letters
foods = np.array([
'cereal', 'salad', 'pasta', 'yogurt',
'panini', 'steak', 'coffee', 'burrito'
], dtype=object)
times = pd.date_range(
pd.datetime.now().date(), periods=24, freq='H'
).strftime('%-I:00 %p')
names = list(ascii_letters)
idx = pd.MultiIndex.from_product([names, times], names=['Name', 'Mealtime'])
df = pd.DataFrame(dict(
Food=np.random.choice(foods, idx.size),
), idx).reset_index()
1000 loops, best of 3: 383 µs per loop
1000 loops, best of 3: 1.99 ms per loop
1000 loops, best of 3: 1.34 ms per loop
100 loops, best of 3: 2.78 ms per loop
100 loops, best of 3: 6.6 ms per loop
Upvotes: 3