scipio1551
scipio1551

Reputation: 53

Pandas Groupby: How to get distinct Column Values

Trying to get distinct values for columns after grouping; but am getting a total sum groupby is dropping object distinction and losing leading zeros for column

df = pd.read_csv("trial.txt",sep='|',converters={'zip':str},keep_default_na=True,low_memory=False)

Data:

Emp State   Zip      Jan feb mar 

Int  NY    11111      1   0   1

int  NY    11111      1   1   0

int  NC    09999      2   2   0

int  ON    NH443     2   2   2

after

df2 = df.groupby("Zip").count()

df2 for zip my output for zip = 11111 i'll have the output for all 12 months show 2 2 2. Were I would expect 2 1 1 and zip 09999 shows as 9999.

How what is wrong about the grouping to not get distinct column values. Have account for non-null values (there are no nulls). Column value is only (0 , 1 ,2)

Upvotes: 2

Views: 2052

Answers (2)

Valdi_Bo
Valdi_Bo

Reputation: 31011

Let's start from why Zip column is read as int.

Note the subtle difference:

  • Third column name in the source file is Zip, with capital Z.
  • In your code you put converters={'zip': str}, with column name starting with lower case z.

Change it to converters={'Zip': str} and the column in question will be read as string.

Another, maybe better solution is to pass dtype={'Zip': object} instead. I wrote better because no conversion is actually needed. It is enough just to specify the column type.

I performed also additional check, i.e. I changed Zip in the last row from NH443 to 88443.

So even if this column in the source file contains only digits (in all rows), its type will be object, meaning actually string.

And now let's look at what is actual result of your code:

When you run df2 = df.groupby('Zip').count(), the result is:

       Emp  State  Jan  Feb  Mar 
Zip                              
09999    1      1    1    1     1
11111    2      2    2    2     2
88443    1      1    1    1     1

Note the description of count function (GroupBy variant), which reads: Compute count of group, excluding missing values. As you didn't pass any column list, this count is computed for all columns.

So your code does'n generate distinct values, but instead it counts non-null values along each column in each group.

If you want what you wrote in the title, i.e. a list of distinct values, for each column (not their counts), run:

df.groupby('Zip').agg(lambda col: np.sort(col.unique()).tolist())

This time the result is:

         Emp State  Jan     Feb    Mar 
Zip                                    
09999  [Int]  [NC]  [2]     [2]     [0]
11111  [Int]  [NY]  [1]  [0, 1]  [0, 1]
88443  [Int]  [ON]  [2]     [2]     [2]

E.g. group 11111 contains in Feb and Mar columns two distinct values: 0 and 1.

In all other cases (for your sample data) each column in each group contains a single distinct value, so corresponding lists contain only a single element.

Upvotes: 1

Alexander
Alexander

Reputation: 109756

count returns the count of each group, excluding missing values. That means that a value of zero would also be included in the count. To only count positive values, you can apply a lambda function that sums the count of values greater than zero.

>>> df.groupby('Zip')[['Jan', 'feb', 'mar']].apply(lambda x: x.gt(0).sum())
       Jan  feb  mar
Zip                 
09999    1    1    0
11111    2    1    1
NH443    1    1    1

Upvotes: 0

Related Questions