ord_dav
ord_dav

Reputation: 53

how to split lines in dataframe pandas into new lines given a percentage

I have two dataframes that I would like to turn into one dataframe.

dataframe 1:

   level_1 level_2  ...  2021-01-30 00:00:00  2021-01-31 00:00:00
0        A       X  ...            85.191861 (*)        61.387459
1        A       Y  ...            40.931960            76.803598
2        A       Z  ...            36.811113            44.834602
3        B       X  ...            60.707172             2.457655
4        B       Y  ...            53.990172            50.636714
5        B       Z  ...            52.347966            27.938671
6        C       X  ...            17.619055            89.028475
7        C       Y  ...            91.442144            56.108239
8        C       Z  ...             8.633474            14.716926
9        D       X  ...            41.690869            68.389570
10       D       Y  ...            37.896461            11.763265
11       D       Z  ...            12.492016            70.571018

dataframe 2:


   level_2  level_3  2020-02-01 00:00:00  2020-02-02 00:00:00
0        X     1060                 0.10                  0.0
1        X     1064                 0.20                  1.0
2        X     1065                 0.50                  0.0
3        X     1067                 0.00                  0.0
4        X     1068                 0.00                  0.0
5        X     1264                 0.20                  0.0
6        X     1061                 0.00                  0.0
7        X     1000                 0.00                  0.0
8        Y     1060                 0.05                  0.1
9        Y     1064                 0.05                  0.2
10       Y     1065                 0.10                  0.5
11       Y     1067                 0.20                  0.0
12       Y     1068                 0.20                  0.0
13       Y     1264                 0.10                  0.2
14       Y     1061                 0.15                  0.0
15       Y     1000                 0.15                  0.0
16       Z     1060                 0.00                  0.0
17       Z     1064                 0.00                  0.0
18       Z     1065                 0.00                  0.0
19       Z     1067                 0.90                  0.9
20       Z     1068                 0.10                  0.1
21       Z     1264                 0.00                  0.0
22       Z     1061                 0.00                  0.0
23       Z     1000                 0.00                  0.0

I would like to add a third column to the first dataframe, level 3, and split the original row across the percentage given in dataframe 2. The numbers aren't correct, but the structure would look something like this:

   level_1 level_2  level_3  2020-02-01 00:00:00
0        A       X     1060             2.374184
1        A       X     1064             4.748367
2        A       X     1065            11.870918
3        A       X     1067             0.000000
4        A       X     1068             0.000000
5        A       X     1264             4.748367
6        A       X     1061             0.000000
7        A       X     1000             0.000000
8        A       Y     1060             2.813819
9        A       Y     1064             2.813819
10       A       Y     1065             5.627637
11       A       Y     1067            11.255275
12       A       Y     1068            11.255275
13       A       Y     1264             5.627637
14       A       Y     1061             8.441456
15       A       Y     1000             8.441456
16       A       Z     1060             0.000000
17       A       Z     1064             0.000000
18       A       Z     1065             0.000000
19       A       Z     1067            24.890250
20       A       Z     1068             2.765583
21       A       Z     1264             0.000000
22       A       Z     1061             0.000000
23       A       Z     1000             0.000000

So every line dataframe would be split out across 8 factors (1060, 1064, 1065, 1067, 1068, 1264, 1061, 1000) So the original amount would 85.19 (*), would be split into 8 parts. The split is done according to the percentage in dataframe 2. I was thinking of stacking the dataframes and doing a merge, but currently haven't managed to make it work.

Thanks!

Upvotes: 0

Views: 96

Answers (1)

Rob Raymond
Rob Raymond

Reputation: 31166

  • cleaned up your sample data - made date columns consistent across both dataframes
  • reshape them - make Date a column rather than a series of columns
  • now it's a straight forward merge
  • finally reshape back to your target structure with a pivot to make Date columns again
df1 = pd.read_csv(io.StringIO("""   level_1  level_2  ...  2021-01-30 00:00:00  2021-01-31 00:00:00
0        A       X  ...            85.191861         61.387459
1        A       Y  ...            40.931960            76.803598
2        A       Z  ...            36.811113            44.834602
3        B       X  ...            60.707172             2.457655
4        B       Y  ...            53.990172            50.636714
5        B       Z  ...            52.347966            27.938671
6        C       X  ...            17.619055            89.028475
7        C       Y  ...            91.442144            56.108239
8        C       Z  ...             8.633474            14.716926
9        D       X  ...            41.690869            68.389570
10       D       Y  ...            37.896461            11.763265
11       D       Z  ...            12.492016            70.571018
"""), sep="\s\s+", engine="python").drop(columns="...")

# cleanup DF1,  make date columns rows
df1 = df1.set_index(["level_1","level_2"]).rename_axis(columns="Date").stack().reset_index()
df1.Date = pd.to_datetime(df1.Date)

df2 = pd.read_csv(io.StringIO("""   level_2  level_3  2021-01-30 00:00:00  2021-01-31 00:00:00
0        X     1060                 0.10                  0.0
1        X     1064                 0.20                  1.0
2        X     1065                 0.50                  0.0
3        X     1067                 0.00                  0.0
4        X     1068                 0.00                  0.0
5        X     1264                 0.20                  0.0
6        X     1061                 0.00                  0.0
7        X     1000                 0.00                  0.0
8        Y     1060                 0.05                  0.1
9        Y     1064                 0.05                  0.2
10       Y     1065                 0.10                  0.5
11       Y     1067                 0.20                  0.0
12       Y     1068                 0.20                  0.0
13       Y     1264                 0.10                  0.2
14       Y     1061                 0.15                  0.0
15       Y     1000                 0.15                  0.0
16       Z     1060                 0.00                  0.0
17       Z     1064                 0.00                  0.0
18       Z     1065                 0.00                  0.0
19       Z     1067                 0.90                  0.9
20       Z     1068                 0.10                  0.1
21       Z     1264                 0.00                  0.0
22       Z     1061                 0.00                  0.0
23       Z     1000                 0.00                  0.0"""), sep="\s\s+", engine="python")

df2 = df2.set_index(["level_2","level_3"]).rename_axis(columns="Date").stack().reset_index()
df2.Date = pd.to_datetime(df2.Date)

dfm = (df1.merge(df2, on=["level_2","Date"])
 .assign(val=lambda dfa: dfa["0_x"]*dfa["0_y"])
 .drop(columns=["0_x","0_y"])
 .pivot(index=["level_1","level_2","level_3"], columns="Date", values="val")
 .reset_index()
)

print(dfm.head(10).to_markdown())
level_1 level_2 level_3 2021-01-30 00:00:00 2021-01-31 00:00:00
0 A X 1000 0 0
1 A X 1060 8.51919 0
2 A X 1061 0 0
3 A X 1064 17.0384 61.3875
4 A X 1065 42.5959 0
5 A X 1067 0 0
6 A X 1068 0 0
7 A X 1264 17.0384 0
8 A Y 1000 6.13979 0
9 A Y 1060 2.0466 7.68036

Upvotes: 1

Related Questions