Reputation: 1344
I have some data in pandas DataFrames that was accessing using root_pandas from a ROOT file. Most of the data are simple variables that can have various values. Some variables, however, are arrays of numbers. In order to load these arrays, there is the option of flattening the variable.
So, for instance, the array variable jet_tagWeightBin
can have various numbers of values depending on the number of jets in a physics event. When "flattened", one can access its various values for each jet in a given physics event by using an index, __array_index
.
Here is what it looks like to load three physics events. You can see that for each physics event there is one value of HT_jets
but multiple values of jet_tagWeightBin
which are accessible using their index:
| |HT_jets|jet_tagWeightBin|__array_index|
|--|-------|----------------|-------------|
|0 |319676 |1 |0 | |<---------- 1st event
|1 |319676 |5 |1 | |
|2 |319676 |1 |2 | |
|3 |319676 |5 |3 | |
|4 |200476 |5 |0 | |<------- 2nd event
|5 |200476 |2 |1 | |
|6 |200476 |1 |2 | |
|7 |200476 |1 |3 | |
|8 |520111 |5 |0 | |<---- 3rd event
|9 |520111 |1 |1 | |
|10|520111 |2 |2 | |
|11|520111 |5 |3 | |
|12|520111 |5 |4 | |
|13|520111 |2 |5 | |
Here's that in code:
import pandas as pd
df = pd.DataFrame(
[
[319676, 1, 0],
[319676, 5, 1],
[319676, 1, 2],
[319676, 5, 3],
[200476, 5, 0],
[200476, 2, 1],
[200476, 1, 2],
[200476, 1, 3],
[520111, 5, 0],
[520111, 1, 1],
[520111, 2, 2],
[520111, 5, 3],
[520111, 5, 4],
[520111, 2, 5],
],
columns = [
"HT_jets",
"jet_tagWeightBin",
"__array_index"
]
)
Now, what I want to do is get rid of that __array_index
and by adding a bunch of new single-value variables like jet_tagWeightBin_0
, jet_tagWeightBin_1
, jet_tagWeightBin_2
, ..., up to as many as are needed. So, I want to get to something like this:
| |HT_jets|jet_tagWeightBin_0|jet_tagWeightBin_1|jet_tagWeightBin_2|jet_tagWeightBin_3|jet_tagWeightBin_4|jet_tagWeightBin_5|
|--|-------|------------------|------------------|------------------|------------------|------------------|------------------|
|0 |319676 |1 |5 |1 |5 |NaN |NaN |
|1 |200476 |5 |2 |1 |1 |NaN |NaN |
|2 |520111 |5 |1 |2 |5 |5 |2 |
I'm not sure what this type of operation is, but I feel sure that this must be some straightforward thing to do. I just don't know how to do it.
Anyway, here's the beginnings of an attempt:
I can add a new column with the appropriate name like this:
df["new_name"] = df.apply(lambda row: "jet_tagWeightBin_" + str(row["__array_index"]), axis = 1)
That results in this:
| |HT_jets|jet_tagWeightBin|__array_index|new_name |
|--|-------|----------------|-------------|------------------|
|0 |319676 |1 |0 |jet_tagWeightBin_0|
|1 |319676 |5 |1 |jet_tagWeightBin_1|
|2 |319676 |1 |2 |jet_tagWeightBin_2|
|3 |319676 |5 |3 |jet_tagWeightBin_3|
|4 |200476 |5 |0 |jet_tagWeightBin_0|
|5 |200476 |2 |1 |jet_tagWeightBin_1|
|6 |200476 |1 |2 |jet_tagWeightBin_2|
|7 |200476 |1 |3 |jet_tagWeightBin_3|
|8 |520111 |5 |0 |jet_tagWeightBin_0|
|9 |520111 |1 |1 |jet_tagWeightBin_1|
|10|520111 |2 |2 |jet_tagWeightBin_2|
|11|520111 |5 |3 |jet_tagWeightBin_3|
|12|520111 |5 |4 |jet_tagWeightBin_4|
|13|520111 |2 |5 |jet_tagWeightBin_5|
That's where I'm at. I'd welcome guidance. :)
EDIT: For clarity, I'm dealing with many variables. Here are more columns in the data:
| |eventNumber|Mjj_MindR |HT_jets|jet_tagWeightBin|__array_index|
|--|-----------|------------|-------|----------------|-------------|
|0 |446427 |98896.421875|319676 |1 |0 | |<---------- 1st event
|1 |446427 |98896.421875|319676 |5 |1 | |
|2 |446427 |98896.421875|319676 |1 |2 | |
|3 |446427 |98896.421875|319676 |5 |3 | |
|4 |446650 |29691.271484|200476 |5 |0 | |<------- 2nd event
|5 |446650 |29691.271484|200476 |2 |1 | |
|6 |446650 |29691.271484|200476 |1 |2 | |
|7 |446650 |29691.271484|200476 |1 |3 | |
|8 |446707 |57697.246094|520111 |5 |0 | |<---- 3rd event
|9 |446707 |57697.246094|520111 |1 |1 | |
|10|446707 |57697.246094|520111 |2 |2 | |
|11|446707 |57697.246094|520111 |5 |3 | |
|12|446707 |57697.246094|520111 |5 |4 | |
|13|446707 |57697.246094|520111 |2 |5 | |
Upvotes: 0
Views: 412
Reputation: 13965
This is a pivot problem
newDF = df.pivot(columns='array_index', values='jet_tagWeightBin', index='HT_jets')
Then just rename the columns
This gives:
array_index 0 1 2 3 4 5
HT_jets
200476 5.0 2.0 1.0 1.0 NaN NaN
319676 1.0 5.0 1.0 5.0 NaN NaN
520111 5.0 1.0 2.0 5.0 5.0 2.0
Upvotes: 2