Reputation: 347
I have the following dataframe
USER_ID MONTH STATUS_ID
0 23026 2019-09-01 2
1 23026 2019-10-01 2
2 23026 2019-11-01 2
3 23026 2019-12-01 2
4 23027 2019-09-01 2
... ... ... ...
123 16546 2018-10-01 2
124 16622 2018-09-01 1
125 16622 2018-10-01 1
126 16622 2018-11-01 1
127 16622 2018-12-01 1
All users should have 4 rows (4 Months) but period can be different (01.02.2018 -01.05.2018 or 01.02.2019 - 01.05.2019 ...)
and I would like to transform like this
USER_ID MONTH_1 MONTH_2 MONTH_3 MONTH_4
0 23026 2 2 2 2
1 23027 2 2 2 2
... ... ... ...
123 16546 2 2 1 1
124 16622 1 1 1 1
and the last thing I should transform dataframe based on STATUS ID
USER_ID ID
0 23026 2
1 23027 2
... ... ... ...
123 16546 2
124 16622 1
That is a kind of boolean operation between ID's in different Months. Any idea of how to transform this? Or could this be solved more efficiently?
Upvotes: 2
Views: 79
Reputation: 863301
I think possible solution is create new column by counts with GroupBy.cumcount
and pass to DataFrame.pivot
:
print (df)
USER_ID MONTH STATUS_ID
0 23026 2019-09-01 2
1 23026 2019-10-01 2
2 23026 2019-11-01 2
3 23026 2019-12-01 2
123 16546 2018-09-01 2
123 16546 2018-10-01 2
123 16546 2018-11-01 1
123 16546 2018-12-01 1
124 16622 2018-09-01 1
125 16622 2018-10-01 1
126 16622 2018-11-01 1
127 16622 2018-12-01 1
df['MONTH1'] = 'MONTH_' + df.groupby('USER_ID').cumcount().add(1).astype(str)
df = df.pivot('USER_ID','MONTH1','STATUS_ID')
print (df)
MONTH1 MONTH_1 MONTH_2 MONTH_3 MONTH_4
USER_ID
16546 2 2 1 1
16622 1 1 1 1
23026 2 2 2 2
Then compare by 1
by DataFrame.eq
and test if at least one True
per rows by DataFrame.any
, last map
and convert to DataFrame
:
df1 = df1.eq(1).any(axis=1).map({True:1, False:2}).reset_index(name='ID')
print (df1)
USER_ID ID
0 16546 1
1 16622 1
2 23026 2
Detail:
print (df1.eq(1))
MONTH1 MONTH_1 MONTH_2 MONTH_3 MONTH_4
USER_ID
16546 False False True True
16622 True True True True
23026 False False False False
Upvotes: 2
Reputation: 1
Your question is not clear as to the format of the first column of your final output.
Adjust these two scripts as needed:
It is required, that each id has always 4 lines. Also first user / last user may be skipped.
#!/usr/bin/env python3
import csv
prev_id = 0
sc = 0
buf = [0, 0, 0, 0]
with open("test.csv") as f:
data = csv.reader(f, delimiter=' ', skipinitialspace=True)
for row in data:
if sc == 4:
print(int(row[0]) - 4, prev_id, buf[0], buf[1], buf[2], buf[3])
sc = 0
else:
prev_id = row[1]
buf[sc] = row[3]
sc += 1
#!/usr/bin/env python3
import csv
prev_id = 0
prev_stat_max = 0
counter = 0
with open('test.csv') as f:
data = csv.reader(f, delimiter=' ', skipinitialspace=True)
length = sum(1 for row in data)
with open('test.csv') as f:
data = csv.reader(f, delimiter=' ', skipinitialspace=True)
for row in data:
if prev_id == row[1]:
if row[3] > prev_stat_max:
priv_stat_max = row[3]
else:
print(counter, prev_id, prev_stat_max)
prev_id = row[1]
prev_stat_max = row[3]
counter += 1
if counter == length:
print(counter, prev_id, prev_stat_max)
Upvotes: 0
Reputation: 881
import pandas as pd
df = pd.DataFrame(
[
{"USER_ID": 23026, "MONTH": "2019-09-01", "STATUS_ID": 2},
{"USER_ID": 23026, "MONTH": "2019-10-01", "STATUS_ID": 2},
{"USER_ID": 23026, "MONTH": "2019-11-01", "STATUS_ID": 2},
{"USER_ID": 16622, "MONTH": "2019-09-01", "STATUS_ID": 1},
{"USER_ID": 16622, "MONTH": "2019-10-01", "STATUS_ID": 1},
{"USER_ID": 16622, "MONTH": "2019-11-01", "STATUS_ID": 1},
]
)
pivoted = df.pivot(index="USER_ID", columns="MONTH", values="STATUS_ID")
# Sort column names
pivoted = pivoted.reindex(sorted(pivoted.columns), axis=1)
pivoted.columns = ["MONTH_" + str(x) for x in range(1, len(pivoted.columns) + 1)]
print(pivoted)
# if you want USER_ID as a regular column use:
# pivoted = pivoted.reset_index(col_fill="USER_ID")
Upvotes: 0