Jonas Palačionis
Jonas Palačionis

Reputation: 4842

Flatting out a multiindex dataframe

I have a df:

df = pd.DataFrame.from_dict({('group', ''): {0: 'A',
  1: 'A',
  2: 'A',
  3: 'A',
  4: 'A',
  5: 'A',
  6: 'A',
  7: 'A',
  8: 'A',
  9: 'B',
  10: 'B',
  11: 'B',
  12: 'B',
  13: 'B',
  14: 'B',
  15: 'B',
  16: 'B',
  17: 'B',
  18: 'all',
  19: 'all'},
 ('category', ''): {0: 'Amazon',
  1: 'Apple',
  2: 'Facebook',
  3: 'Google',
  4: 'Netflix',
  5: 'Tesla',
  6: 'Total',
  7: 'Uber',
  8: 'total',
  9: 'Amazon',
  10: 'Apple',
  11: 'Facebook',
  12: 'Google',
  13: 'Netflix',
  14: 'Tesla',
  15: 'Total',
  16: 'Uber',
  17: 'total',
  18: 'Total',
  19: 'total'},
 (pd.Timestamp('2020-06-29'), 'last_sales'): {0: 195.0,
  1: 61.0,
  2: 106.0,
  3: 61.0,
  4: 37.0,
  5: 13.0,
  6: 954.0,
  7: 4.0,
  8: 477.0,
  9: 50.0,
  10: 50.0,
  11: 75.0,
  12: 43.0,
  13: 17.0,
  14: 14.0,
  15: 504.0,
  16: 3.0,
  17: 252.0,
  18: 2916.0,
  19: 2916.0},
 (pd.Timestamp('2020-06-29'), 'sales'): {0: 1268.85,
  1: 18274.385000000002,
  2: 19722.65,
  3: 55547.255,
  4: 15323.800000000001,
  5: 1688.6749999999997,
  6: 227463.23,
  7: 1906.0,
  8: 113731.615,
  9: 3219.6499999999996,
  10: 15852.060000000001,
  11: 17743.7,
  12: 37795.15,
  13: 5918.5,
  14: 1708.75,
  15: 166349.64,
  16: 937.01,
  17: 83174.82,
  18: 787625.7400000001,
  19: 787625.7400000001},
 (pd.Timestamp('2020-06-29'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2020-07-06'), 'last_sales'): {0: 26.0,
  1: 39.0,
  2: 79.0,
  3: 49.0,
  4: 10.0,
  5: 10.0,
  6: 436.0,
  7: 5.0,
  8: 218.0,
  9: 89.0,
  10: 34.0,
  11: 133.0,
  12: 66.0,
  13: 21.0,
  14: 20.0,
  15: 732.0,
  16: 3.0,
  17: 366.0,
  18: 2336.0,
  19: 2336.0},
 (pd.Timestamp('2020-07-06'), 'sales'): {0: 3978.15,
  1: 12138.96,
  2: 19084.175,
  3: 40033.46000000001,
  4: 4280.15,
  5: 1495.1,
  6: 165548.29,
  7: 1764.15,
  8: 82774.145,
  9: 8314.92,
  10: 12776.649999999996,
  11: 28048.075,
  12: 55104.21000000002,
  13: 6962.844999999999,
  14: 3053.2000000000003,
  15: 231049.11000000002,
  16: 1264.655,
  17: 115524.55500000001,
  18: 793194.8000000002,
  19: 793194.8000000002},
 (pd.Timestamp('2020-07-06'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-06-28'), 'last_sales'): {0: 96.0,
  1: 56.0,
  2: 106.0,
  3: 44.0,
  4: 34.0,
  5: 13.0,
  6: 716.0,
  7: 9.0,
  8: 358.0,
  9: 101.0,
  10: 22.0,
  11: 120.0,
  12: 40.0,
  13: 13.0,
  14: 8.0,
  15: 610.0,
  16: 1.0,
  17: 305.0,
  18: 2652.0,
  19: 2652.0},
 (pd.Timestamp('2021-06-28'), 'sales'): {0: 5194.95,
  1: 19102.219999999994,
  2: 22796.420000000002,
  3: 30853.115,
  4: 11461.25,
  5: 992.6,
  6: 188143.41,
  7: 3671.15,
  8: 94071.705,
  9: 6022.299999999998,
  10: 7373.6,
  11: 33514.0,
  12: 35943.45,
  13: 4749.000000000001,
  14: 902.01,
  15: 177707.32,
  16: 349.3,
  17: 88853.66,
  18: 731701.46,
  19: 731701.46},
 (pd.Timestamp('2021-06-28'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0},
 (pd.Timestamp('2021-07-07'), 'last_sales'): {0: 45.0,
  1: 47.0,
  2: 87.0,
  3: 45.0,
  4: 13.0,
  5: 8.0,
  6: 494.0,
  7: 2.0,
  8: 247.0,
  9: 81.0,
  10: 36.0,
  11: 143.0,
  12: 56.0,
  13: 9.0,
  14: 9.0,
  15: 670.0,
  16: 1.0,
  17: 335.0,
  18: 2328.0,
  19: 2328.0},
 (pd.Timestamp('2021-07-07'), 'sales'): {0: 7556.414999999998,
  1: 14985.05,
  2: 16790.899999999998,
  3: 36202.729999999996,
  4: 4024.97,
  5: 1034.45,
  6: 163960.32999999996,
  7: 1385.65,
  8: 81980.16499999998,
  9: 5600.544999999999,
  10: 11209.92,
  11: 32832.61,
  12: 42137.44500000001,
  13: 3885.1499999999996,
  14: 1191.5,
  15: 194912.34000000003,
  16: 599.0,
  17: 97456.17000000001,
  18: 717745.3400000001,
  19: 717745.3400000001},
 (pd.Timestamp('2021-07-07'), 'difference'): {0: 0.0,
  1: 0.0,
  2: 0.0,
  3: 0.0,
  4: 0.0,
  5: 0.0,
  6: 0.0,
  7: 0.0,
  8: 0.0,
  9: 0.0,
  10: 0.0,
  11: 0.0,
  12: 0.0,
  13: 0.0,
  14: 0.0,
  15: 0.0,
  16: 0.0,
  17: 0.0,
  18: 0.0,
  19: 0.0}}).set_index(['group','category'])

I am trying to sort of flatten it so it would no longer be a multiindex df. As there are several dates I try to select one:

df.loc[:,'2020-06-29 00:00:00']

But this gives me an error :

KeyError: '2020-06-29 00:00:00'

I am trying to make it that the first week ( and my final output ) of 2020-06-29 would look like this :

group   category            last_sales  sales       difference                                              
A       Amazon              195.00      1,268.85    0.00    
A       Apple               61.00       18,274.39   0.00    
A       Facebook            106.00      19,722.65   0.00    
A       Google              61.00       55,547.25   0.00    
A       Netflix             37.00       15,323.80   0.00    
A       Tesla               13.00       1,688.67    0.00    
A       Total               954.00      227,463.23  0.00
A       Uber                4.00        1,906.00    0.00    
A       total               477.00      113,731.62  0.00    
B       Amazon              0.00        3,219.65    0.00    
B       Apple               50.00       15,852.06   0.00    
B        Facebook           75.00       17,743.70   0.00
B        Google             43.00       37,795.15   0.00    
B        Netflix            17.00       5,918.50    0.00    
B        Tesla              14.00       1,708.75    0.00    
B        Total              504.00      166,349.64  0.00    
B        Uber               3.00        937.01      0.00    
B        total              252.00      83,174.82   0.00    
all      Total              2,916.00    787,625.74  0.00    

Upvotes: 2

Views: 76

Answers (2)

Anurag Dabas
Anurag Dabas

Reputation: 24314

try via pd.to_dateime():

out=df.loc[:,pd.to_datetime('2020-06-29 00:00:00')]
#out=df.loc[:,pd.to_datetime('2020-06-29 00:00:00')].reset_index()

OR

try via pd.Timestamp()

out=df.loc[:,pd.Timestamp('2020-06-29 00:00:00')]
#out=df.loc[:,pd.Timestamp('2020-06-29 00:00:00')].reset_index()

The 0th level of your column is Timestamp and you can verify that by:

df.columns.to_numpy()

#output

array([(Timestamp('2020-06-29 00:00:00'), 'last_sales'),
       (Timestamp('2020-06-29 00:00:00'), 'sales'),
       (Timestamp('2020-06-29 00:00:00'), 'difference'),
       (Timestamp('2020-07-06 00:00:00'), 'last_sales'),
       (Timestamp('2020-07-06 00:00:00'), 'sales'),
       (Timestamp('2020-07-06 00:00:00'), 'difference'),
       (Timestamp('2021-06-28 00:00:00'), 'last_sales'),
       (Timestamp('2021-06-28 00:00:00'), 'sales'),
       (Timestamp('2021-06-28 00:00:00'), 'difference'),
       (Timestamp('2021-07-07 00:00:00'), 'last_sales'),
       (Timestamp('2021-07-07 00:00:00'), 'sales'),
       (Timestamp('2021-07-07 00:00:00'), 'difference')], dtype=object)

output of out:

                last_sales       sales  difference
group category                                    
A     Amazon         195.0    1268.850         0.0
      Apple           61.0   18274.385         0.0
      Facebook       106.0   19722.650         0.0
      Google          61.0   55547.255         0.0
      Netflix         37.0   15323.800         0.0
      Tesla           13.0    1688.675         0.0
      Total          954.0  227463.230         0.0
      Uber             4.0    1906.000         0.0
      total          477.0  113731.615         0.0
B     Amazon          50.0    3219.650         0.0
      Apple           50.0   15852.060         0.0
      Facebook        75.0   17743.700         0.0
      Google          43.0   37795.150         0.0
      Netflix         17.0    5918.500         0.0
      Tesla           14.0    1708.750         0.0
      Total          504.0  166349.640         0.0
      Uber             3.0     937.010         0.0
      total          252.0   83174.820         0.0
all   Total         2916.0  787625.740         0.0
      total         2916.0  787625.740         0.0

NOTE:

There is no need of providing a tuple in .loc[] because you are selecting the 0th level

Upvotes: 4

Cimbali
Cimbali

Reputation: 11395

I’m also getting a KeyError, but if you use a Timestamp object to index the first-level columns, it seems to work:

>>> df[pd.Timestamp('2020-06-29 00:00:00')]
                last_sales       sales  difference
group category                                    
A     Amazon         195.0    1268.850         0.0
      Apple           61.0   18274.385         0.0
      Facebook       106.0   19722.650         0.0
      Google          61.0   55547.255         0.0
      Netflix         37.0   15323.800         0.0
      Tesla           13.0    1688.675         0.0
      Total          954.0  227463.230         0.0
      Uber             4.0    1906.000         0.0
      total          477.0  113731.615         0.0
B     Amazon          50.0    3219.650         0.0
      Apple           50.0   15852.060         0.0
      Facebook        75.0   17743.700         0.0
      Google          43.0   37795.150         0.0
      Netflix         17.0    5918.500         0.0
      Tesla           14.0    1708.750         0.0
      Total          504.0  166349.640         0.0
      Uber             3.0     937.010         0.0
      total          252.0   83174.820         0.0
all   Total         2916.0  787625.740         0.0
      total         2916.0  787625.740         0.0

Otherwise you could use .xs which will then also allow you more flexibility, e.g. selecting in the second level of columns and so on:

>>> df.xs(pd.Timestamp('2020-06-29 00:00:00'), axis='columns', level=0)
                last_sales       sales  difference
group category                                    
A     Amazon         195.0    1268.850         0.0
      Apple           61.0   18274.385         0.0
      Facebook       106.0   19722.650         0.0
      Google          61.0   55547.255         0.0
      Netflix         37.0   15323.800         0.0
      Tesla           13.0    1688.675         0.0
      Total          954.0  227463.230         0.0
      Uber             4.0    1906.000         0.0
      total          477.0  113731.615         0.0
B     Amazon          50.0    3219.650         0.0
      Apple           50.0   15852.060         0.0
      Facebook        75.0   17743.700         0.0
      Google          43.0   37795.150         0.0
      Netflix         17.0    5918.500         0.0
      Tesla           14.0    1708.750         0.0
      Total          504.0  166349.640         0.0
      Uber             3.0     937.010         0.0
      total          252.0   83174.820         0.0
all   Total         2916.0  787625.740         0.0
      total         2916.0  787625.740         0.0

You can then add .drop(index=[('all', 'total')]) to remove the second total line, and possible .reset_index()


The way to do it with .loc[] is to provide a tuple, with the first item being a Timestamp object and the second an empty slice. However this will keep the 2 levels of indexing, so it is not what you want:

>>> df.loc[:, (pd.Timestamp('2020-06-29 00:00:00'), slice(None))].head(2)
               2020-06-29 00:00:00                      
                        last_sales      sales difference
group category                                          
A     Amazon                 195.0   1268.850        0.0
      Apple                   61.0  18274.385        0.0

Upvotes: 4

Related Questions