JNevens

Reputation: 11982

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.

Here is my code:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

trains = np.arange(1, 101)
# The above are example values; it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))

index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)

metrics = dict()
for i in trains:
    m = binary_metric_train(True, i) 
    #Above function returns a binary array of length 35
    #Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan

My problem is that when this code finally finishes, my DataFrame df still contains nothing but zeros; even the NaN values are not inserted. I think my indexing is correct. I have also tested my binary_metric_train function separately, and it does return an array of length 35.

Can anyone spot what I am missing here?

EDIT: For clarity, this DataFrame looks like this:

                    1   2   3   4   5   ...
trains  tresholds
     1         10
               20
               30
               40
               50
               60
     2         10
               20
               30
               40
               50
               60
   ...

Upvotes: 0

Views: 253

Answers (1)

Matt

Reputation: 17629

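As @EdChum noted, you should take a look at pandas indexing. Chained indexing such as df[k][i][j] = corr can end up assigning to a temporary copy rather than to df itself, which is why the zeros never change; select the row and the column in a single indexer call instead. Here's some test data for illustration, which should clear things up.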

import numpy as np
import pandas as pd

trains     = [ 1,  1,  1,  2,  2,  2]
thresholds = [10, 20, 30, 10, 20, 30]
data       = [ 1,  0,  1,  0,  1,  0]
df = pd.DataFrame({
    'trains'     : trains,
    'thresholds' : thresholds,
    'C1'         : data,
    'C2'         : data
}).set_index(['trains', 'thresholds'])

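print(df)
# .ix was removed in pandas 1.0. On older versions the next assignment
# could be written as df.ix[(2, 30), 0] = 3 or df.ix[(2, 30), 'C1'] = 3.
df.iloc[df.index.get_loc((2, 30)), 0] = 3  # using column position
# or...
df.loc[(2, 30), 'C1'] = 3  # using column name
# but not...
df.loc[(2, 30), 1] = 3  # creates a new column labelled 1
print(df)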

Which outputs the DataFrame before and after modification:

                   C1  C2
trains thresholds        
1      10           1   1
       20           0   0
       30           1   1
2      10           0   0
       20           1   1
       30           0   0
                   C1  C2   1
trains thresholds            
1      10           1   1 NaN
       20           0   0 NaN
       30           1   1 NaN
2      10           0   0 NaN
       20           1   1 NaN
       30           3   0   3
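
Applied to a frame shaped like the one in the question, the same idea might look like the sketch below. It is only an illustration: the arrays are shrunk to three trains and three thresholds, the constant 0.5 is a placeholder for the abs(pearsonr(trA, trB)[0]) value that the question computes, and pd.MultiIndex.from_product is swapped in for the nested tuple-building loops.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the question's `trains` and `tresholds` arrays.
trains = np.arange(1, 4)
tresholds = np.arange(10, 40, 10)

# from_product builds the same (train, treshold) pairs as the nested loops.
index = pd.MultiIndex.from_product([trains, tresholds],
                                   names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))),
                  index=index, columns=trains, dtype=float)

for i in trains:
    for j in tresholds:
        for k in trains:
            # Placeholder value; the question computes a correlation here.
            value = np.nan if k == i else 0.5
            # One .loc call: the row key is the (i, j) tuple, the column key is k.
            df.loc[(i, j), k] = value

print(df)
```

Because the row tuple and the column label go through a single .loc call, the assignment is made on df itself rather than on an intermediate object.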

Upvotes: 2
