Reputation: 38711
Since I'm not sure of the exact terminology, let's say I have this file:
dataA.csv:
event,car,bike,bus
63175,,18,
65641,45,9,
65805,,,54
68388,,65,
68388,,,39
73041,7,,18
79336,,44,
79423,,,5
Reading this with dataA = pd.read_csv("dataA.csv", dtype='Int64'), we get a pandas DataFrame:
dataA:
event car bike bus
0 63175 <NA> 18 <NA>
1 65641 45 9 <NA>
2 65805 <NA> <NA> 54
3 68388 <NA> 65 <NA>
4 68388 <NA> <NA> 39
5 73041 7 <NA> 18
6 79336 <NA> 44 <NA>
7 79423 <NA> <NA> 5
There are two rows where the "event" column has the same value (what I call "duplicates"):
3 68388 <NA> 65 <NA>
4 68388 <NA> <NA> 39
... and I'd like them to be "compacted" (is this the right word?) into a single row, such that the actual values (where available) appear instead of <NA> (undefined) slots:
3 68388 <NA> 65 39
From How to compact a merge between two Pandas Dataframes with NaN and duplicate join keys? I got the answer that I should use .groupby(...).first()
- and indeed, it works; this script:
#!/usr/bin/env python3
import pandas as pd
print(pd.__version__) # 1.0.2 for me
dataA = pd.read_csv("dataA.csv", dtype='Int64')
print("dataA:")
print(dataA)
# make sure Pandas prints entirety of DataFrame
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
dataCompact = dataA.groupby('event').first() ##***
print("\ndataCompact:")
print(dataCompact)
eventually prints:
dataCompact:
car bike bus
event
63175 <NA> 18 <NA>
65641 45 9 <NA>
65805 <NA> <NA> 54
68388 <NA> 65 39
73041 7 <NA> 18
79336 <NA> 44 <NA>
79423 <NA> <NA> 5
... which is what I wanted, so it works.
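(Side note, with data made up here to mirror the 68388 rows above: if the compacted result should keep "event" as a regular column rather than as the index - as in the printout above - one can reset the index afterwards, or pass as_index=False to groupby; a minimal sketch:)

```python
import pandas as pd

# Toy data: just the two "duplicate" rows for event 68388
dataA = pd.DataFrame({
    "event": pd.array([68388, 68388], dtype="Int64"),
    "car":  pd.array([pd.NA, pd.NA], dtype="Int64"),
    "bike": pd.array([65, pd.NA], dtype="Int64"),
    "bus":  pd.array([pd.NA, 39], dtype="Int64"),
})

# Compact, then move "event" back from the index into a column ...
print(dataA.groupby("event").first().reset_index())

# ... or tell groupby not to use the key as the index in the first place
print(dataA.groupby("event", as_index=False).first())  # same result
```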
However, upon a closer look, I realized I don't really understand how this works - that is, I cannot really tell what .first()
specifically does in this case; and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.first.html does not help me much, because it states "Method to subset initial periods of time series data based on a date offset." (and most other introductory pages online follow the dates example) - but I do not use dates here.
So, I did some experiments - basically changing the line marked with ##***
in the above code, and looking at printouts.
First, if I use this line instead:
dataCompact = dataA.groupby('event').apply(lambda x: "{} ({}): {}".format(type(x), len(x), x.values.tolist())) ##***
... I get this printout:
dataCompact:
event
63175 <class 'pandas.core.frame.DataFrame'> (1): [[63175, <NA>, 18, <NA>]]
65641 <class 'pandas.core.frame.DataFrame'> (1): [[65641, 45, 9, <NA>]]
65805 <class 'pandas.core.frame.DataFrame'> (1): [[65805, <NA>, <NA>, 54]]
68388 <class 'pandas.core.frame.DataFrame'> (2): [[68388, <NA>, 65, <NA>], [68388, <NA>, <NA>, 39]]
73041 <class 'pandas.core.frame.DataFrame'> (1): [[73041, 7, <NA>, 18]]
79336 <class 'pandas.core.frame.DataFrame'> (1): [[79336, <NA>, 44, <NA>]]
79423 <class 'pandas.core.frame.DataFrame'> (1): [[79423, <NA>, <NA>, 5]]
dtype: object
From this, I gather that groupby('event')
essentially provides a DataFrame for each unique value of the 'event' column.
So, .first()
must take a DataFrame with N>=1 rows as input, and return a single row.
However, this is where my confusion starts - I'd read .first()
as returning the first of the N>=1 input rows; but in that case, the values would not have been "compacted" (that is, slots with <NA>
(undefined values) replaced by slots with actual numbers) - instead, all rows but the first would simply have been dropped! And that is not what happens here ...
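This distinction can be checked directly on the duplicate group; a small sketch (data made up to match the 68388 rows above), comparing .first() against .head(1), which really does keep only the first row of each group:

```python
import pandas as pd

# The two "duplicate" rows for event 68388, reconstructed by hand
df = pd.DataFrame({
    "event": pd.array([68388, 68388], dtype="Int64"),
    "car":  pd.array([pd.NA, pd.NA], dtype="Int64"),
    "bike": pd.array([65, pd.NA], dtype="Int64"),
    "bus":  pd.array([pd.NA, 39], dtype="Int64"),
})

# .first() takes the first *non-NA* value per column: "bus" comes from row 1
compacted = df.groupby("event").first()
print(compacted)  # bike=65, bus=39, car=<NA>

# .head(1), by contrast, keeps only the first *row* - "bus" stays <NA>
first_row = df.groupby("event").head(1)
print(first_row)
```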
So, I tried to simulate what .first()
does, by writing my own lambda handler for .apply()
:
def proc_df_first(x):
    # here we get a DataFrame with a single row, if "event" (groupby arg) is a unique value;
    # or a DataFrame with as many rows as there are repeated rows with the same "event" value ("duplicates")
    if len(x) == 1:
        return x
    elif len(x) > 1:
        # create empty return DataFrame (eventually it will only have a single row)
        retdf = pd.DataFrame(columns = x.columns)
        #return retdf # is empty, so is skipped in final result of .groupby.apply
        # must populate rowdata first, then assign via .loc (SO:17091769)
        for icol in x.columns:
            coldata = x[icol] # is Series
            thisval = pd.NA # initialize the "compact" single value we want to set for this column (eventually derived from all the row values in this column)
            for idx, val in coldata.iteritems():
                #print("***", idx, val, val is pd.NA) # `val is None` is always False; `val==pd.NA` always prints `<NA>`; `val is pd.NA` is correct
                if thisval is pd.NA:
                    if val is not pd.NA:
                        # found the first non-NA value; save it, and stop looking further
                        thisval = val
                        break
            # store saved non-NA value into return DataFrame
            retdf.at[ x.index[0], icol ] = thisval # SO: 13842088
        return retdf
dataCompact = dataA.groupby('event').apply(lambda x: proc_df_first(x)) ##***
... which eventually prints:
dataCompact:
event car bike bus
event
63175 0 63175 <NA> 18 <NA>
65641 1 65641 45 9 <NA>
65805 2 65805 <NA> <NA> 54
68388 3 68388 NaN 65 39
73041 5 73041 7 <NA> 18
79336 6 79336 <NA> 44 <NA>
79423 7 79423 <NA> <NA> 5
... which is essentially the same result as .groupby('event').first()
produces (aside from the duplicated "event" column (would that be a hierarchical label?) and the extra index column).
So, here are my questions:
1. .first() returns the first non-NA per-column value in a set of rows, resulting in a single-row representation of the set of rows - is this correct?
2. Why, for crying out loud, did I get NaN in the "car" column of the output for "event"==68388 (instead of a pd.NA value)? I'm aware I can do dataA.groupby('event').apply(lambda x: proc_df_first(x)).astype('Int64')
and get pd.NA
everywhere - but given that I already iterate the whole table element by element, I wouldn't want yet another loop over the table just to get rid of a single float NaN. What could I do in my handler, to ensure that the return from proc_df_first()
in .apply()
always results in pd.NA
values where a value is undefined?
Upvotes: 1
Views: 128
Reputation: 863751
In pandas there are 2 different first
functions:
GroupBy.first
and DataFrame.first
.
Here the former is used - first
per group.
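The difference can be sketched with a toy example (data made up here): GroupBy.first picks the first non-NA value per column within each group, while DataFrame.first(offset) needs a date-like index and merely slices off the initial period (it does not skip NA values at all); the equivalent label slice is shown to illustrate the latter:

```python
import pandas as pd

# GroupBy.first: for each group, the first non-NA value in every column
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": pd.array([pd.NA, 10, 7], dtype="Int64"),
})
print(df.groupby("key").first())  # key "a" -> 10: the leading NA is skipped

# DataFrame.first("2D") would instead keep the rows falling within the first
# two days of a DatetimeIndex; the equivalent (inclusive) label slice:
df_ts = pd.DataFrame(
    {"val": [1, 2, 3, 4]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)
print(df_ts[: df_ts.index[0] + pd.Timedelta("1D")])  # rows for Jan 1 and Jan 2
```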
As per the above, I'd say that .first() returns the first non-NA per-column value in a set of rows, resulting in a single-row representation of the set of rows - is this correct?
Yes, you are right - though this behavior is something of a long-standing bug, see issues 6732 and 8427. It is still present in pandas 1.0.1.
Why, for crying out loud, did I get NaN in the "car" column of the output for "event"==68388
Unfortunately, I think it is because of that bug.
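One possible workaround (my own sketch, not something the answer prescribes): cast each group's result back to the input's dtypes inside the handler itself, so the object-dtype intermediate frame is converted back to nullable Int64 - and undefined slots back to pd.NA - before the groups are glued together:

```python
import pandas as pd

# Condensed variant of proc_df_first (a sketch; proc_df_first_keep_na is a
# made-up name): build the single-row frame as before, but restore the group's
# dtypes before returning, so missing slots stay pd.NA instead of float NaN.
def proc_df_first_keep_na(x):
    retdf = pd.DataFrame(columns=x.columns)
    for icol in x.columns:
        nonna = x[icol].dropna()  # all non-NA values of this column
        retdf.at[x.index[0], icol] = nonna.iloc[0] if len(nonna) else pd.NA
    # retdf was built with object-dtype columns; casting back to the input
    # dtypes (nullable Int64 here) turns any NaN/NA into a proper pd.NA
    return retdf.astype(x.dtypes.to_dict())

# Toy data mirroring the duplicate 68388 rows from the question
dataA = pd.DataFrame({
    "event": pd.array([68388, 68388], dtype="Int64"),
    "car":  pd.array([pd.NA, pd.NA], dtype="Int64"),
    "bike": pd.array([65, pd.NA], dtype="Int64"),
    "bus":  pd.array([pd.NA, 39], dtype="Int64"),
})
out = dataA.groupby("event").apply(lambda x: proc_df_first_keep_na(x))
print(out)  # the "car" column is <NA> (Int64), not NaN (float)
```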
Upvotes: 2