Reputation: 53
Trying to get distinct values for columns after grouping; but am getting a total sum groupby is dropping object distinction and losing leading zeros for column
df = pd.read_csv("trial.txt",sep='|',converters={'zip':str},keep_default_na=True,low_memory=False)
Data:
Emp State Zip Jan feb mar
Int NY 11111 1 0 1
int NY 11111 1 1 0
int NC 09999 2 2 0
int ON NH443 2 2 2
after
df2 = df.groupby("Zip").count()
df2 for zip my output for zip = 11111 i'll have the output for all 12 months show 2 2 2. Were I would expect 2 1 1 and zip 09999 shows as 9999.
How what is wrong about the grouping to not get distinct column values. Have account for non-null values (there are no nulls). Column value is only (0 , 1 ,2)
Upvotes: 2
Views: 2052
Reputation: 31011
Let's start from why Zip column is read as int.
Note the subtle difference:
Change it to converters={'Zip': str} and the column in question will be read as string.
Another, maybe better solution is to pass dtype={'Zip': object} instead. I wrote better because no conversion is actually needed. It is enough just to specify the column type.
I performed also additional check, i.e. I changed Zip in the last row from NH443 to 88443.
So even if this column in the source file contains only digits (in all rows), its type will be object, meaning actually string.
And now let's look at what is actual result of your code:
When you run df2 = df.groupby('Zip').count()
, the result is:
Emp State Jan Feb Mar
Zip
09999 1 1 1 1 1
11111 2 2 2 2 2
88443 1 1 1 1 1
Note the description of count function (GroupBy variant), which reads: Compute count of group, excluding missing values. As you didn't pass any column list, this count is computed for all columns.
So your code does'n generate distinct values, but instead it counts non-null values along each column in each group.
If you want what you wrote in the title, i.e. a list of distinct values, for each column (not their counts), run:
df.groupby('Zip').agg(lambda col: np.sort(col.unique()).tolist())
This time the result is:
Emp State Jan Feb Mar
Zip
09999 [Int] [NC] [2] [2] [0]
11111 [Int] [NY] [1] [0, 1] [0, 1]
88443 [Int] [ON] [2] [2] [2]
E.g. group 11111 contains in Feb and Mar columns two distinct values: 0 and 1.
In all other cases (for your sample data) each column in each group contains a single distinct value, so corresponding lists contain only a single element.
Upvotes: 1
Reputation: 109756
count
returns the count of each group, excluding missing values. That means that a value of zero would also be included in the count. To only count positive values, you can apply a lambda function that sums the count of values greater than zero.
>>> df.groupby('Zip')[['Jan', 'feb', 'mar']].apply(lambda x: x.gt(0).sum())
Jan feb mar
Zip
09999 1 1 0
11111 2 1 1
NH443 1 1 1
Upvotes: 0