Beth Long

Reputation: 403

Adding Data to Pandas DataFrame

I want to use machine learning techniques to categorise "images" of energy released in an electromagnetic calorimeter, using a Keras CNN. To import the data I'm using a Pandas DataFrame, but the data isn't formatted in the way I need.

The calorimeter can be treated as a 28x28 square of crystals. However, the data I receive only show the energy in crystals that have been triggered, on average about 10-15 crystals per event.

   Event X  Y  Energy
   0     22 13 203.49
   0     23 12 73.1848
   ...
   ...
   1     23 16 55.1652
   1     24 16 0
   1     25 16 20.4953

That means I want to add a row to the data frame for every crystal (X,Y) that doesn't already have an energy assigned, and assign 0 energy to it.

I've tried the following:

newdf=pd.DataFrame()

for event in range(0,2):#999):
  for xi in range(0,28):
    for yi in range(0,28):
      arr=np.array([event,xi,yi,0])
      newdf=newdf.append(pd.DataFrame(arr))
      print('newdf = ',newdf)

But the arrays get appended into column data in some strange way.

Can anyone tell me an efficient way of doing this?

Thank you.

Upvotes: 0

Views: 517

Answers (2)

Stef

Reputation: 30579

First we create a dataframe with a MultiIndex covering all events and crystals and set the Energy to 0. Then we add our original dataframe, set to the same index, so that the measured energies replace the zeros.

Example:

import pandas as pd

df = pd.DataFrame({'Event': [0,0], 'X': [1,1], 'Y': [0,2], 'Energy': [203.49,73.1848]})
#   Event  X  Y    Energy
#0      0  1  0  203.4900
#1      0  1  2   73.1848

n_crystals = 3  # 28 in your case
n_events = 2

# every (Event, X, Y) combination, with Energy initialised to 0
idx = pd.MultiIndex.from_product((range(n_events), range(n_crystals), range(n_crystals)), names=['Event','X','Y'])
newdf = pd.DataFrame(index=idx).assign(Energy=0)
# the addition aligns on the index; combinations missing from df become NaN, which fillna(0) turns back into 0
newdf = (newdf + df.set_index(['Event','X','Y'])).fillna(0).reset_index()

Result:

    Event  X  Y    Energy
0       0  0  0    0.0000
1       0  0  1    0.0000
2       0  0  2    0.0000
3       0  1  0  203.4900
4       0  1  1    0.0000
5       0  1  2   73.1848
6       0  2  0    0.0000
7       0  2  1    0.0000
8       0  2  2    0.0000
9       1  0  0    0.0000
10      1  0  1    0.0000
11      1  0  2    0.0000
12      1  1  0    0.0000
13      1  1  1    0.0000
14      1  1  2    0.0000
15      1  2  0    0.0000
16      1  2  1    0.0000
17      1  2  2    0.0000

For 28x28 crystals and 1000 events (newdf with 784000 rows), this takes 1.5 s on my machine.
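
If the end goal is one 2D array per event for the CNN, a possible variant is to reindex onto the full grid and reshape the result. This is only a sketch on the same toy data: it assumes each (Event, X, Y) combination occurs at most once in df, and the names dense and images are made up for illustration. Depending on the model's input shape, a channel axis may still need to be added afterwards.

```python
import numpy as np
import pandas as pd

# Toy input in the same layout as the example above.
df = pd.DataFrame({'Event': [0, 0], 'X': [1, 1], 'Y': [0, 2],
                   'Energy': [203.49, 73.1848]})

n_crystals = 3  # 28 in the question
n_events = 2

# Full (Event, X, Y) grid; from_product yields it in lexicographic order.
idx = pd.MultiIndex.from_product(
    (range(n_events), range(n_crystals), range(n_crystals)),
    names=['Event', 'X', 'Y'])

# reindex inserts the missing grid cells and fills them with 0 directly.
# Assumption: the (Event, X, Y) index of df has no duplicates.
dense = df.set_index(['Event', 'X', 'Y'])['Energy'].reindex(idx, fill_value=0.0)

# One image per event, addressed as images[event, x, y].
images = dense.to_numpy().reshape(n_events, n_crystals, n_crystals)
print(images.shape)  # (2, 3, 3)
```

The reshape relies on the grid order produced by from_product, so Event has to be the first level of the index.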

Upvotes: 1

Mikycid

Reputation: 111

Your arr has shape (4,), and what you want is an array of shape (1,4), if I haven't misunderstood. You could do arr=np.array([[event,xi,yi,0]]) to get the right shape.
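
A small sketch of the difference, plus one way to build the zero-filled frame without appending inside the loop (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list here instead). The names flat, row and rows and the small sizes are just for illustration.

```python
import numpy as np
import pandas as pd

flat = np.array([0, 1, 2, 0])    # shape (4,)
row = np.array([[0, 1, 2, 0]])   # shape (1, 4)

# A 1-D array becomes a single column with four rows...
print(pd.DataFrame(flat).shape)  # (4, 1)
# ...while a 2-D array with one row becomes one row with four columns.
print(pd.DataFrame(row).shape)   # (1, 4)

# Collect all rows first, then build the frame once.
n_events, n_crystals = 2, 3  # illustrative sizes; 28 crystals in the question
rows = [(event, xi, yi, 0.0)
        for event in range(n_events)
        for xi in range(n_crystals)
        for yi in range(n_crystals)]
newdf = pd.DataFrame(rows, columns=['Event', 'X', 'Y', 'Energy'])
print(newdf.shape)               # (18, 4)
```

Building the frame once from a list also avoids copying the whole frame on every iteration, which is what makes appending in a triple loop slow.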

Upvotes: 1
