Reputation: 38711
Since I'm not sure of the exact terminology, let's say I have this file:
dataA.csv:
event,car,bike,bus
63175,,18,
65641,45,9,
65805,,,54
68388,,65,
68388,,,39
73041,7,,18
79336,,44,
79423,,,5
Reading this with dataA = pd.read_csv("dataA.csv", dtype='Int64'), we get a pandas DataFrame:
dataA:
event car bike bus
0 63175 <NA> 18 <NA>
1 65641 45 9 <NA>
2 65805 <NA> <NA> 54
3 68388 <NA> 65 <NA>
4 68388 <NA> <NA> 39
5 73041 7 <NA> 18
6 79336 <NA> 44 <NA>
7 79423 <NA> <NA> 5
There are two rows where the "event" column has the same value (what I call "duplicates"):
3 68388 <NA> 65 <NA>
4 68388 <NA> <NA> 39
... and I'd like them to be "compacted" (is this the right word?) into a single row, such that the actual values (where available) appear instead of <NA> (undefined) slots:
3 68388 <NA> 65 39
From How to compact a merge between two Pandas Dataframes with NaN and duplicate join keys? I got the answer that I should use .groupby(...).first()
- and indeed, it works; this script:
#!/usr/bin/env python3
import pandas as pd
print(pd.__version__) # 1.0.2 for me
dataA = pd.read_csv("dataA.csv", dtype='Int64')
print("dataA:")
print(dataA)
# make sure Pandas prints entirety of DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
dataCompact = dataA.groupby('event').first() ##***
print("\ndataCompact:")
print(dataCompact)
eventually prints:
dataCompact:
car bike bus
event
63175 <NA> 18 <NA>
65641 45 9 <NA>
65805 <NA> <NA> 54
68388 <NA> 65 39
73041 7 <NA> 18
79336 <NA> 44 <NA>
79423 <NA> <NA> 5
... which is what I wanted, so it works.
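(Side note, with data made up here to mirror the 68388 rows above: if the compacted result should keep "event" as a regular column rather than as the index - as in the printout above - one can reset the index afterwards, or pass as_index=False to groupby; a minimal sketch:)

```python
import pandas as pd

# Toy data: just the two "duplicate" rows for event 68388
dataA = pd.DataFrame({
    "event": pd.array([68388, 68388], dtype="Int64"),
    "car":  pd.array([pd.NA, pd.NA], dtype="Int64"),
    "bike": pd.array([65, pd.NA], dtype="Int64"),
    "bus":  pd.array([pd.NA, 39], dtype="Int64"),
})

# Compact, then move "event" back from the index into a column ...
print(dataA.groupby("event").first().reset_index())

# ... or tell groupby not to use the key as the index in the first place
print(dataA.groupby("event", as_index=False).first())  # same result
```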
However, upon a closer look, I realized I don't really understand how this works - that is, I cannot really tell what .first()
specifically does in this case; and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.first.html does not help me much, because it states "Method to subset initial periods of time series data based on a date offset." (and most other introductory pages online follow the dates example) - but I do not use dates here.
So, I did some experiments - basically changing the line marked with ##***
in the above code, and looking at printouts.
First, if I use this line instead:
dataCompact = dataA.groupby('event').apply(lambda x: "{} ({}): {}".format(type(x), len(x), x.values.tolist())) ##***
... I get this printout:
dataCompact:
event
63175 <class 'pandas.core.frame.DataFrame'> (1): [[63175, <NA>, 18, <NA>]]
65641 <class 'pandas.core.frame.DataFrame'> (1): [[65641, 45, 9, <NA>]]
65805 <class 'pandas.core.frame.DataFrame'> (1): [[65805, <NA>, <NA>, 54]]
68388 <class 'pandas.core.frame.DataFrame'> (2): [[68388, <NA>, 65, <NA>], [68388, <NA>, <NA>, 39]]
73041 <class 'pandas.core.frame.DataFrame'> (1): [[73041, 7, <NA>, 18]]
79336 <class 'pandas.core.frame.DataFrame'> (1): [[79336, <NA>, 44, <NA>]]
79423 <class 'pandas.core.frame.DataFrame'> (1): [[79423, <NA>, <NA>, 5]]
dtype: object
From this, I gather that groupby('event')
essentially provides a DataFrame for each unique value of the 'event' column.
So, .first()
must take a DataFrame with N>=1 rows as input, and return a single row.
However, this is where my confusion starts - I'd read .first()
as returning the first of the N>=1 input rows; but in that case, the values would not have been "compacted" (that is, slots with <NA>
(undefined values) replaced by slots with actual numbers) - instead, all rows but the first would simply have been dropped! And that is not what happens here ...
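This distinction can be checked directly on the duplicate group; a small sketch (data made up to match the 68388 rows above), comparing .first() against .head(1), which really does keep only the first row of each group:

```python
import pandas as pd

# The two "duplicate" rows for event 68388, reconstructed by hand
df = pd.DataFrame({
    "event": pd.array([68388, 68388], dtype="Int64"),
    "car":  pd.array([pd.NA, pd.NA], dtype="Int64"),
    "bike": pd.array([65, pd.NA], dtype="Int64"),
    "bus":  pd.array([pd.NA, 39], dtype="Int64"),
})

# .first() takes the first *non-NA* value per column: "bus" comes from row 1
compacted = df.groupby("event").first()
print(compacted)  # bike=65, bus=39, car=<NA>

# .head(1), by contrast, keeps only the first *row* - "bus" stays <NA>
first_row = df.groupby("event").head(1)
print(first_row)
```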
So, I tried to simulate what .first()
does, by writing my own lambda handler for .apply()
:
def proc_df_first(x):
    # here we get a DataFrame with a single row, if "event" (groupby arg) is a unique value;
    # or a DataFrame with as many rows as there are repeated rows with the same "event" value ("duplicates")
    if len(x) == 1:
        return x
    elif len(x) > 1:
        # create empty return DataFrame (eventually it will only have a single row)
        retdf = pd.DataFrame(columns = x.columns)
        #return retdf # is empty, so is skipped in final result of .groupby.apply
        # must populate rowdata first, then assign via .loc (SO:17091769)
        for icol in x.columns:
            coldata = x[icol] # is Series
            thisval = pd.NA # initialize the "compact" single value we want to set for this column (eventually derived from all the row values in this column)
            for idx, val in coldata.iteritems():
                #print("***", idx, val, val is pd.NA) # `val is None` is always False; `val==pd.NA` always prints `<NA>`; `val is pd.NA` is correct
                if thisval is pd.NA:
                    if val is not pd.NA:
                        # found the first non-NA value; save it, and stop looking further
                        thisval = val
                        break
            # store saved non-NA value into return DataFrame
            retdf.at[ x.index[0], icol ] = thisval # SO: 13842088
        return retdf
dataCompact = dataA.groupby('event').apply(lambda x: proc_df_first(x)) ##***
... which eventually prints:
dataCompact:
event car bike bus
event
63175 0 63175 <NA> 18 <NA>
65641 1 65641 45 9 <NA>
65805 2 65805 <NA> <NA> 54
68388 3 68388 NaN 65 39
73041 5 73041 7 <NA> 18
79336 6 79336 <NA> 44 <NA>
79423 7 79423 <NA> <NA> 5
... which is essentially the same result as .groupby('event').first()
produces (aside from the duplicated "event" column (would that be a hierarchical label?) and the extra index column).
So, here are my questions:
1. .first() returns the first non-NA per-column value in a set of rows, resulting in a single-row representation of the set of rows - is this correct?
2. Why, for crying out loud, did I get NaN in the "car" column of the output for "event"==68388 (instead of a pd.NA value)? I'm aware I can do dataA.groupby('event').apply(lambda x: proc_df_first(x)).astype('Int64')
and get pd.NA
everywhere - but given that I already iterate the whole table element by element, I wouldn't want yet another loop over the table just to get rid of a single float NaN. What could I do in my handler, to ensure that the return from proc_df_first()
in .apply()
always results in pd.NA
values where a value is undefined?
Upvotes: 1
Views: 128
Reputation: 863751
In pandas there are 2 different first
functions:
GroupBy.first
and DataFrame.first
.
Here the former is used - first
per group.
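The difference can be sketched with a toy example (data made up here): GroupBy.first picks the first non-NA value per column within each group, while DataFrame.first(offset) needs a date-like index and merely slices off the initial period (it does not skip NA values at all); the equivalent label slice is shown to illustrate the latter:

```python
import pandas as pd

# GroupBy.first: for each group, the first non-NA value in every column
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": pd.array([pd.NA, 10, 7], dtype="Int64"),
})
print(df.groupby("key").first())  # key "a" -> 10: the leading NA is skipped

# DataFrame.first("2D") would instead keep the rows falling within the first
# two days of a DatetimeIndex; the equivalent (inclusive) label slice:
df_ts = pd.DataFrame(
    {"val": [1, 2, 3, 4]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)
print(df_ts[: df_ts.index[0] + pd.Timedelta("1D")])  # rows for Jan 1 and Jan 2
```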
As per the above, I'd say that .first() returns the first non-NA per-column value in a set of rows, resulting in a single-row representation of the set of rows - is this correct?
Yes, you are right - though this behavior is something of a long-standing bug, see issues 6732 and 8427. It is still present in pandas 1.0.1.
Why, for crying out loud, did I get NaN in the "car" column of the output for "event"==68388
Unfortunately, I think it is because of that bug.
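One possible workaround (my own sketch, not something the answer prescribes): cast each group's result back to the input's dtypes inside the handler itself, so the object-dtype intermediate frame is converted back to nullable Int64 - and undefined slots back to pd.NA - before the groups are glued together:

```python
import pandas as pd

# Condensed variant of proc_df_first (a sketch; proc_df_first_keep_na is a
# made-up name): build the single-row frame as before, but restore the group's
# dtypes before returning, so missing slots stay pd.NA instead of float NaN.
def proc_df_first_keep_na(x):
    retdf = pd.DataFrame(columns=x.columns)
    for icol in x.columns:
        nonna = x[icol].dropna()  # all non-NA values of this column
        retdf.at[x.index[0], icol] = nonna.iloc[0] if len(nonna) else pd.NA
    # retdf was built with object-dtype columns; casting back to the input
    # dtypes (nullable Int64 here) turns any NaN/NA into a proper pd.NA
    return retdf.astype(x.dtypes.to_dict())

# Toy data mirroring the duplicate 68388 rows from the question
dataA = pd.DataFrame({
    "event": pd.array([68388, 68388], dtype="Int64"),
    "car":  pd.array([pd.NA, pd.NA], dtype="Int64"),
    "bike": pd.array([65, pd.NA], dtype="Int64"),
    "bus":  pd.array([pd.NA, 39], dtype="Int64"),
})
out = dataA.groupby("event").apply(lambda x: proc_df_first_keep_na(x))
print(out)  # the "car" column is <NA> (Int64), not NaN (float)
```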
Upvotes: 2