BlandCorporation
BlandCorporation

Reputation: 1344

In a pandas DataFrame, how can "flattened" variables be "unflattened" into new columns using their index?

I have some data in pandas DataFrames that was accessing using root_pandas from a ROOT file. Most of the data are simple variables that can have various values. Some variables, however, are arrays of numbers. In order to load these arrays, there is the option of flattening the variable.

So, for instance, the array variable jet_tagWeightBin can have various numbers of values depending on the number of jets in a physics event. When "flattened", one can access its various values for each jet in a given physics event by using an index, __array_index.

Here is what it looks like to load three physics events. You can see that for each physics event there is one value of HT_jets but multiple values of jet_tagWeightBin which are accessible using their index:

|  |HT_jets|jet_tagWeightBin|__array_index|
|--|-------|----------------|-------------|
|0 |319676 |1               |0            |     |<---------- 1st event
|1 |319676 |5               |1            |     |
|2 |319676 |1               |2            |     |
|3 |319676 |5               |3            |     |
|4 |200476 |5               |0            |        |<------- 2nd event
|5 |200476 |2               |1            |        |
|6 |200476 |1               |2            |        |
|7 |200476 |1               |3            |        |
|8 |520111 |5               |0            |           |<---- 3rd event
|9 |520111 |1               |1            |           |
|10|520111 |2               |2            |           |
|11|520111 |5               |3            |           |
|12|520111 |5               |4            |           |
|13|520111 |2               |5            |           |

Here's that in code:

import pandas as pd

df = pd.DataFrame(
         [
             [319676, 1, 0],
             [319676, 5, 1],
             [319676, 1, 2],
             [319676, 5, 3],
             [200476, 5, 0],
             [200476, 2, 1],
             [200476, 1, 2],
             [200476, 1, 3],
             [520111, 5, 0],
             [520111, 1, 1],
             [520111, 2, 2],
             [520111, 5, 3],
             [520111, 5, 4],
             [520111, 2, 5],
         ],
         columns = [
             "HT_jets",
             "jet_tagWeightBin",
             "__array_index"
         ]
    )

Now, what I want to do is get rid of that __array_index and by adding a bunch of new single-value variables like jet_tagWeightBin_0, jet_tagWeightBin_1, jet_tagWeightBin_2, ..., up to as many as are needed. So, I want to get to something like this:

|  |HT_jets|jet_tagWeightBin_0|jet_tagWeightBin_1|jet_tagWeightBin_2|jet_tagWeightBin_3|jet_tagWeightBin_4|jet_tagWeightBin_5|
|--|-------|------------------|------------------|------------------|------------------|------------------|------------------|
|0 |319676 |1                 |5                 |1                 |5                 |NaN               |NaN               |
|1 |200476 |5                 |2                 |1                 |1                 |NaN               |NaN               |
|2 |520111 |5                 |1                 |2                 |5                 |5                 |2                 |

I'm not sure what this type of operation is, but I feel sure that this must be some straightforward thing to do. I just don't know how to do it.

Anyway, here's the beginnings of an attempt:

I can add a new column with the appropriate name like this:

df["new_name"] = df.apply(lambda row: "jet_tagWeightBin_" + str(row["__array_index"]), axis = 1)

That results in this:

|  |HT_jets|jet_tagWeightBin|__array_index|new_name          |
|--|-------|----------------|-------------|------------------|
|0 |319676 |1               |0            |jet_tagWeightBin_0|
|1 |319676 |5               |1            |jet_tagWeightBin_1|
|2 |319676 |1               |2            |jet_tagWeightBin_2|
|3 |319676 |5               |3            |jet_tagWeightBin_3|
|4 |200476 |5               |0            |jet_tagWeightBin_0|
|5 |200476 |2               |1            |jet_tagWeightBin_1|
|6 |200476 |1               |2            |jet_tagWeightBin_2|
|7 |200476 |1               |3            |jet_tagWeightBin_3|
|8 |520111 |5               |0            |jet_tagWeightBin_0|
|9 |520111 |1               |1            |jet_tagWeightBin_1|
|10|520111 |2               |2            |jet_tagWeightBin_2|
|11|520111 |5               |3            |jet_tagWeightBin_3|
|12|520111 |5               |4            |jet_tagWeightBin_4|
|13|520111 |2               |5            |jet_tagWeightBin_5|

That's where I'm at. I'd welcome guidance. :)


EDIT: For clarity, I'm dealing with many variables. Here are more columns in the data:

|  |eventNumber|Mjj_MindR   |HT_jets|jet_tagWeightBin|__array_index|
|--|-----------|------------|-------|----------------|-------------|
|0 |446427     |98896.421875|319676 |1               |0            |     |<---------- 1st event
|1 |446427     |98896.421875|319676 |5               |1            |     |
|2 |446427     |98896.421875|319676 |1               |2            |     |
|3 |446427     |98896.421875|319676 |5               |3            |     |
|4 |446650     |29691.271484|200476 |5               |0            |        |<------- 2nd event
|5 |446650     |29691.271484|200476 |2               |1            |        |
|6 |446650     |29691.271484|200476 |1               |2            |        |
|7 |446650     |29691.271484|200476 |1               |3            |        |
|8 |446707     |57697.246094|520111 |5               |0            |          |<---- 3rd event
|9 |446707     |57697.246094|520111 |1               |1            |          |
|10|446707     |57697.246094|520111 |2               |2            |          |
|11|446707     |57697.246094|520111 |5               |3            |          |
|12|446707     |57697.246094|520111 |5               |4            |          |
|13|446707     |57697.246094|520111 |2               |5            |          |

Upvotes: 0

Views: 412

Answers (1)

Woody Pride
Woody Pride

Reputation: 13965

This is a pivot problem

newDF = df.pivot(columns='array_index', values='jet_tagWeightBin', index='HT_jets')

Then just rename the columns

This gives:

array_index    0    1    2    3    4    5
HT_jets
200476       5.0  2.0  1.0  1.0  NaN  NaN
319676       1.0  5.0  1.0  5.0  NaN  NaN
520111       5.0  1.0  2.0  5.0  5.0  2.0

Upvotes: 2

Related Questions