Reputation: 55
I have selected the rows that are and aren't mentioning 'Korona', and counted them by date. Some of the dates don't have Korona True. The dataframe looks like:
TABLE 1
Published_date | Korona | Count | |
---|---|---|---|
242 | 2020-06-01 | False | 13 |
243 | 2020-06-01 | True | 3 |
244 | 2020-06-02 | False | 7 |
245 | 2020-06-02 | True | 1 |
246 | 2020-06-03 | False | 11 |
247 | 2020-06-04 | False | 8 |
248 | 2020-06-04 | True | 1 |
249 | 2020-06-05 | False | 10 |
250 | 2020-06-06 | False | 5 |
251 | 2020-06-07 | False | 5 |
252 | 2020-06-08 | False | 14 |
What I'm trying to do is remove duplicate date rows, but transform the value to another column. So for example this:
Published_date | Korona | Count | |
---|---|---|---|
242 | 2020-06-01 | False | 13 |
243 | 2020-06-01 | True | 3 |
Looks like this (after some renaming of columns and adding an Count-All column):
TABLE 2
Published_date | Count-NoKorona | Count-Korona | Count-All | |
---|---|---|---|---|
152 | 2020-06-01 | 13 | 3 | 16 |
And I'm doing that with this code (found it on Python, Merging rows with same value in one column ) :
df = df.set_index(['Published_date', df.groupby('Published_date').cumcount()])['Count'].unstack().add_prefix('Count').reset_index()
Problem: For some reason after that line of code, my data gets mixed up.
Before that line everything was fine. I had 782 Korona True rows (For a test I only took 'True' rows from the table 1, and summed up the Count of it, and it was correct --> 782). So 782/3443 True.
After the code line I get a sum of 1011/3443.
I'm guessing it takes some wrong values of dates, or it gets mixed up, but I don't know how to fix it, and the data table is too big to found the mistakes manualy to try and understand the problem better.
I would be grateful for any help. (Also sorry if the question doesn't look okay, it's my first :D)
Upvotes: 0
Views: 964
Reputation: 5648
Give this a try using pivot table
d = ''' Published_date Korona Count
242 2020-06-01 False 13
243 2020-06-01 True 3
244 2020-06-02 False 7
245 2020-06-02 True 1
246 2020-06-03 False 11
247 2020-06-04 False 8
248 2020-06-04 True 1
249 2020-06-05 False 10
250 2020-06-06 False 5
251 2020-06-07 False 5
252 2020-06-08 False 14'''
df = pd.read_csv(io.StringIO(d), sep='\s+', engine='python')
# pivot the data and reset the index
df1 = pd.pivot_table(df, values='Count', index=['Published_date'],
columns=['Korona'], aggfunc=np.sum, fill_value=0).reset_index()
# rename the columns to what you want
df1.columns = ['Published_date', 'Count-NoKorona', 'Count-Korona']
# sum the values into a new column
df1['Count-All'] = df1[['Count-NoKorona', 'Count-Korona']].sum(axis=1)
Output:
Published_date Count-NoKorona Count-Korona Count-All
0 2020-06-01 13 3 16
1 2020-06-02 7 1 8
2 2020-06-03 11 0 11
3 2020-06-04 8 1 9
4 2020-06-05 10 0 10
5 2020-06-06 5 0 5
6 2020-06-07 5 0 5
7 2020-06-08 14 0 14
Upvotes: 1