how to split lines in dataframe pandas into new lines given a percentage

Question

I have two dataframes that I would like to turn into one dataframe.

dataframe 1:

   level_1 level_2  ...  2021-01-30 00:00:00  2021-01-31 00:00:00
0        A       X  ...            85.191861 (*)        61.387459
1        A       Y  ...            40.931960            76.803598
2        A       Z  ...            36.811113            44.834602
3        B       X  ...            60.707172             2.457655
4        B       Y  ...            53.990172            50.636714
5        B       Z  ...            52.347966            27.938671
6        C       X  ...            17.619055            89.028475
7        C       Y  ...            91.442144            56.108239
8        C       Z  ...             8.633474            14.716926
9        D       X  ...            41.690869            68.389570
10       D       Y  ...            37.896461            11.763265
11       D       Z  ...            12.492016            70.571018

dataframe 2:


   level_2  level_3  2020-02-01 00:00:00  2020-02-02 00:00:00
0        X     1060                 0.10                  0.0
1        X     1064                 0.20                  1.0
2        X     1065                 0.50                  0.0
3        X     1067                 0.00                  0.0
4        X     1068                 0.00                  0.0
5        X     1264                 0.20                  0.0
6        X     1061                 0.00                  0.0
7        X     1000                 0.00                  0.0
8        Y     1060                 0.05                  0.1
9        Y     1064                 0.05                  0.2
10       Y     1065                 0.10                  0.5
11       Y     1067                 0.20                  0.0
12       Y     1068                 0.20                  0.0
13       Y     1264                 0.10                  0.2
14       Y     1061                 0.15                  0.0
15       Y     1000                 0.15                  0.0
16       Z     1060                 0.00                  0.0
17       Z     1064                 0.00                  0.0
18       Z     1065                 0.00                  0.0
19       Z     1067                 0.90                  0.9
20       Z     1068                 0.10                  0.1
21       Z     1264                 0.00                  0.0
22       Z     1061                 0.00                  0.0
23       Z     1000                 0.00                  0.0

I would like to add a third column to the first dataframe, level 3, and split the original row across the percentage given in dataframe 2. The numbers aren't correct, but the structure would look something like this:

   level_1 level_2  level_3  2020-02-01 00:00:00
0        A       X     1060             2.374184
1        A       X     1064             4.748367
2        A       X     1065            11.870918
3        A       X     1067             0.000000
4        A       X     1068             0.000000
5        A       X     1264             4.748367
6        A       X     1061             0.000000
7        A       X     1000             0.000000
8        A       Y     1060             2.813819
9        A       Y     1064             2.813819
10       A       Y     1065             5.627637
11       A       Y     1067            11.255275
12       A       Y     1068            11.255275
13       A       Y     1264             5.627637
14       A       Y     1061             8.441456
15       A       Y     1000             8.441456
16       A       Z     1060             0.000000
17       A       Z     1064             0.000000
18       A       Z     1065             0.000000
19       A       Z     1067            24.890250
20       A       Z     1068             2.765583
21       A       Z     1264             0.000000
22       A       Z     1061             0.000000
23       A       Z     1000             0.000000

So every line dataframe would be split out across 8 factors (1060, 1064, 1065, 1067, 1068, 1264, 1061, 1000) So the original amount would 85.19 (*), would be split into 8 parts. The split is done according to the percentage in dataframe 2. I was thinking of stacking the dataframes and doing a merge, but currently haven't managed to make it work.

Thanks!

Rob Raymond · Accepted Answer

cleaned up your sample data - made date columns consistent across both dataframes
reshape them - make Date a column rather than a series of columns
now it's a straight forward merge
finally reshape back to your target structure with a pivot to make Date columns again

df1 = pd.read_csv(io.StringIO("""   level_1  level_2  ...  2021-01-30 00:00:00  2021-01-31 00:00:00
0        A       X  ...            85.191861         61.387459
1        A       Y  ...            40.931960            76.803598
2        A       Z  ...            36.811113            44.834602
3        B       X  ...            60.707172             2.457655
4        B       Y  ...            53.990172            50.636714
5        B       Z  ...            52.347966            27.938671
6        C       X  ...            17.619055            89.028475
7        C       Y  ...            91.442144            56.108239
8        C       Z  ...             8.633474            14.716926
9        D       X  ...            41.690869            68.389570
10       D       Y  ...            37.896461            11.763265
11       D       Z  ...            12.492016            70.571018
"""), sep="\s\s+", engine="python").drop(columns="...")

# cleanup DF1,  make date columns rows
df1 = df1.set_index(["level_1","level_2"]).rename_axis(columns="Date").stack().reset_index()
df1.Date = pd.to_datetime(df1.Date)

df2 = pd.read_csv(io.StringIO("""   level_2  level_3  2021-01-30 00:00:00  2021-01-31 00:00:00
0        X     1060                 0.10                  0.0
1        X     1064                 0.20                  1.0
2        X     1065                 0.50                  0.0
3        X     1067                 0.00                  0.0
4        X     1068                 0.00                  0.0
5        X     1264                 0.20                  0.0
6        X     1061                 0.00                  0.0
7        X     1000                 0.00                  0.0
8        Y     1060                 0.05                  0.1
9        Y     1064                 0.05                  0.2
10       Y     1065                 0.10                  0.5
11       Y     1067                 0.20                  0.0
12       Y     1068                 0.20                  0.0
13       Y     1264                 0.10                  0.2
14       Y     1061                 0.15                  0.0
15       Y     1000                 0.15                  0.0
16       Z     1060                 0.00                  0.0
17       Z     1064                 0.00                  0.0
18       Z     1065                 0.00                  0.0
19       Z     1067                 0.90                  0.9
20       Z     1068                 0.10                  0.1
21       Z     1264                 0.00                  0.0
22       Z     1061                 0.00                  0.0
23       Z     1000                 0.00                  0.0"""), sep="\s\s+", engine="python")

df2 = df2.set_index(["level_2","level_3"]).rename_axis(columns="Date").stack().reset_index()
df2.Date = pd.to_datetime(df2.Date)

dfm = (df1.merge(df2, on=["level_2","Date"])
 .assign(val=lambda dfa: dfa["0_x"]*dfa["0_y"])
 .drop(columns=["0_x","0_y"])
 .pivot(index=["level_1","level_2","level_3"], columns="Date", values="val")
 .reset_index()
)

print(dfm.head(10).to_markdown())

	level_1	level_2	level_3	2021-01-30 00:00:00	2021-01-31 00:00:00
0	A	X	1000	0	0
1	A	X	1060	8.51919	0
2	A	X	1061	0	0
3	A	X	1064	17.0384	61.3875
4	A	X	1065	42.5959	0
5	A	X	1067	0	0
6	A	X	1068	0	0
7	A	X	1264	17.0384	0
8	A	Y	1000	6.13979	0
9	A	Y	1060	2.0466	7.68036

how to split lines in dataframe pandas into new lines given a percentage

Answers (1)

Related Questions