sdaau
sdaau

Reputation: 38711

Understanding .groupby().first() when compacting Pandas DataFrame?

Since I'm not sure of the exact terminology - let's say, I have this file:

dataA.csv:

event,car,bike,bus
63175,,18,
65641,45,9,
65805,,,54
68388,,65,
68388,,,39
73041,7,,18
79336,,44,
79423,,,5

Reading this with dataA = pd.read_csv("dataA.csv", dtype='Int64'), we get a pandas DataFrame:

dataA:
   event   car  bike   bus
0  63175  <NA>    18  <NA>
1  65641    45     9  <NA>
2  65805  <NA>  <NA>    54
3  68388  <NA>    65  <NA>
4  68388  <NA>  <NA>    39
5  73041     7  <NA>    18
6  79336  <NA>    44  <NA>
7  79423  <NA>  <NA>     5

There are two rows, where "event" column has the same value (what I call "duplicates"):

3  68388  <NA>    65  <NA>
4  68388  <NA>  <NA>    39

... and I'd like them to be "compacted" (is this the right word?) to a single row, such that actual values (where available) are there, instead of NaN (that is, NA):

3  68388  <NA>    65    39

From How to compact a merge between two Pandas Dataframes with NaN and duplicate join keys? I got the answer, that I should use .groupby(...).first() - and indeed, it works; this script:

#!/usr/bin/env python3

import pandas as pd
print(pd.__version__) # 1.0.2 for me

dataA = pd.read_csv("dataA.csv", dtype='Int64')
print("dataA:")
print(dataA)

# make sure Pandas prints entirety of DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

dataCompact = dataA.groupby('event').first() ##***
print("\ndataCompact:")
print(dataCompact)

eventually prints:

dataCompact:
        car  bike   bus
event
63175  <NA>    18  <NA>
65641    45     9  <NA>
65805  <NA>  <NA>    54
68388  <NA>    65    39
73041     7  <NA>    18
79336  <NA>    44  <NA>
79423  <NA>  <NA>     5

... which is what I wanted, so it works.

However, upon a closer look, I realized I don't really understand how this works - that is, I cannot really tell what .first() specifically refers to in this case; and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.first.html does not help me much, because it states "Method to subset initial periods of time series data based on a date offset." (and most other introductory pages online follow the dates example), - but here I do not use dates.

So, I did some experiments - basically changing the line marked with ##*** in the above code, and looking at printouts.

First, if I use this line instead:

dataCompact = dataA.groupby('event').apply(lambda x: "{} ({}): {}".format(type(x), len(x), x.values.tolist())) ##***

... I get this printout:

dataCompact:
event
63175                             <class 'pandas.core.frame.DataFrame'> (1): [[63175, <NA>, 18, <NA>]]
65641                                <class 'pandas.core.frame.DataFrame'> (1): [[65641, 45, 9, <NA>]]
65805                             <class 'pandas.core.frame.DataFrame'> (1): [[65805, <NA>, <NA>, 54]]
68388    <class 'pandas.core.frame.DataFrame'> (2): [[68388, <NA>, 65, <NA>], [68388, <NA>, <NA>, 39]]
73041                                <class 'pandas.core.frame.DataFrame'> (1): [[73041, 7, <NA>, 18]]
79336                             <class 'pandas.core.frame.DataFrame'> (1): [[79336, <NA>, 44, <NA>]]
79423                              <class 'pandas.core.frame.DataFrame'> (1): [[79423, <NA>, <NA>, 5]]
dtype: object

From this, I gather that - essentially, groupby('event') provides a DataFrame for each unique value of 'event' column:

So, .first() must take a DataFrame with N>=1 rows as input, and return a single row.

However, this is where my confusion starts - I'd read .first() to refer to returning the first of the N>=1 input rows; but in that case, the values would not have been "compacted" (that is, slots that have actual numbers replacing slots with <NA> (undefined values)); - instead, all other rows but the first would be dropped! And that is not what happens here ...

So, I tried to simulate what .first() does, by writing my own lambda handler for .apply():

def proc_df_first(x):
  # here we get a DataFrame with single row, if "event" (groupby arg) is a unique value;
  # or a DataFrame with as many rows, as there are repeated rows with "event" of same value ("duplicate")
  if len(x) == 1:
    return x
  elif len(x) > 1:
    # create empty return DataFrame (eventually it will only have a single row)
    retdf = pd.DataFrame(columns = x.columns)
    #return retdf # is empty, so is skipped in final result of .groupby.apply
    # must populate rowdata first, then assign via .loc (SO:17091769)
    for icol in x.columns:
      coldata = x[icol] # is Series
      thisval = pd.NA # initialize the "compact" single value we want to set for this column (eventually derived from all the row values in this column)
      for idx, val in coldata.iteritems():
        #print("***", idx, val, val is pd.NA) # `val is None` is always False; `val==pd.NA` always prints `<NA>`; `val is pd.NA` is correct
        if thisval is pd.NA:
          if val is not pd.NA:
            # found the first non-NA value; save it, and stop looking further
            thisval = val
            break
      # store saved non-NA value into return DataFrame
      retdf.at[ x.index[0], icol ] = thisval # SO: 13842088
    return retdf

dataCompact = dataA.groupby('event').apply(lambda x: proc_df_first(x)) ##***

... which eventually prints:

dataCompact:
         event   car  bike   bus
event
63175 0  63175  <NA>    18  <NA>
65641 1  65641    45     9  <NA>
65805 2  65805  <NA>  <NA>    54
68388 3  68388   NaN    65    39
73041 5  73041     7  <NA>    18
79336 6  79336  <NA>    44  <NA>
79423 7  79423  <NA>  <NA>     5

... which is essentially the same result that .groupby('event').first() does (aside from the duplicated "event" column (would that be a hierarchical label?), and the index column).

So, here are my questions:

Upvotes: 1

Views: 128

Answers (1)

jezrael
jezrael

Reputation: 863751

In pandas are 2 different first functions, which are different:

GroupBy.first and DataFrame.first.

Here is used first, first per groups.

As per the above, I'd say that .first() returns the first non-NA per-column value in a set of rows, resulting in a single-row representation of the set of rows - is this correct?

Yes, you are right, but it is something like long term bug, 6732 and 8427. Still in pandas 1.0.1.

Why, for crying out loud, did I get NaN in the "car" column of the output for "event"==68388

Unfortunately I think because bug.

Upvotes: 2

Related Questions