mother_chucker
mother_chucker

Reputation: 55

Python, Merging rows with the same id - date but different values in one column

I have selected the rows that are and aren't mentioning 'Korona', and counted them by date. Some of the dates don't have Korona True. The dataframe looks like:

TABLE 1

Published_date Korona Count
242 2020-06-01 False 13
243 2020-06-01 True 3
244 2020-06-02 False 7
245 2020-06-02 True 1
246 2020-06-03 False 11
247 2020-06-04 False 8
248 2020-06-04 True 1
249 2020-06-05 False 10
250 2020-06-06 False 5
251 2020-06-07 False 5
252 2020-06-08 False 14

What I'm trying to do is remove duplicate date rows, but transform the value to another column. So for example this:

Published_date Korona Count
242 2020-06-01 False 13
243 2020-06-01 True 3

Looks like this (after some renaming of columns and adding an Count-All column):

TABLE 2

Published_date Count-NoKorona Count-Korona Count-All
152 2020-06-01 13 3 16

And I'm doing that with this code (found it on Python, Merging rows with same value in one column ) :

df = df.set_index(['Published_date', df.groupby('Published_date').cumcount()])['Count'].unstack().add_prefix('Count').reset_index()

Problem: For some reason after that line of code, my data gets mixed up.

Before that line everything was fine. I had 782 Korona True rows (For a test I only took 'True' rows from the table 1, and summed up the Count of it, and it was correct --> 782). So 782/3443 True.

After the code line I get a sum of 1011/3443.

I'm guessing it takes some wrong values of dates, or it gets mixed up, but I don't know how to fix it, and the data table is too big to found the mistakes manualy to try and understand the problem better.

I would be grateful for any help. (Also sorry if the question doesn't look okay, it's my first :D)

Upvotes: 0

Views: 964

Answers (1)

Jonathan Leon
Jonathan Leon

Reputation: 5648

Give this a try using pivot table

d = ''' Published_date  Korona  Count
242 2020-06-01  False   13
243 2020-06-01  True    3
244 2020-06-02  False   7
245 2020-06-02  True    1
246 2020-06-03  False   11
247 2020-06-04  False   8
248 2020-06-04  True    1
249 2020-06-05  False   10
250 2020-06-06  False   5
251 2020-06-07  False   5
252 2020-06-08  False   14'''

df = pd.read_csv(io.StringIO(d), sep='\s+', engine='python')

# pivot the data and reset the index
df1 = pd.pivot_table(df, values='Count', index=['Published_date'],
                    columns=['Korona'], aggfunc=np.sum, fill_value=0).reset_index()
# rename the columns to what you want
df1.columns = ['Published_date', 'Count-NoKorona', 'Count-Korona']
# sum the values into a new column
df1['Count-All'] = df1[['Count-NoKorona', 'Count-Korona']].sum(axis=1)

Output:

  Published_date  Count-NoKorona  Count-Korona  Count-All
0     2020-06-01              13             3         16
1     2020-06-02               7             1          8
2     2020-06-03              11             0         11
3     2020-06-04               8             1          9
4     2020-06-05              10             0         10
5     2020-06-06               5             0          5
6     2020-06-07               5             0          5
7     2020-06-08              14             0         14

Upvotes: 1

Related Questions