Anu
Anu

Reputation: 197

Looping over columns and column values for subsetting pandas dataframe

I have a dataframe as follows df:

 ID color   finish  duration
    A1  black   smooth  12
    A2  white   matte   8
    A3  blue    smooth  20
    A4  green   matte   10
    B1  black   smooth  12
    B2  white   matte   8
    B3  blue    smooth   
    B4  green       10
    C1  black   smooth   
    C2  white   matte   8
    C3  blue    smooth   
    C4  green       10

I want to generate subsets of this dataframe based on certain conditions. For example, color= black, finish = smooth, duration = 12, I get the following dataframe.

ID  color   finish  duration    score
A1  black   smooth  12  1
B1  black   smooth  12  1

color= blue, finish = smooth, duration = 20, I get the following dataframe.

ID  color   finish  duration    score
A3  blue    smooth  20  1
B3  blue    smooth      0.666667
C3  blue    smooth      0.666667

Score is calculated as number of columns populated/total number of columns. I want to loop this in pandas dataframe. Following code is working for me for 2 columns.

list2 = list(df['color'].unique())
list3 = list(df['finish'].unique())

df_final = pd.DataFrame()


for i in range(len(list2)):
   for j in range(len(list3)):
       print 'Current Attribute Value:',list2[i],list3[j]

       gbl["df_"+list2[i]] = df[df.color == list2[i]]
       gbl["df_" + list2[i] + list3[j]] =          
       gbl["df_"+list2[i]].loc[gbl["df_"+list2[i]].finish == list3[j]]
       gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
       df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)

However, I am not able to loop this over column names. What I would like to do is,

lista = ['color','finish']

df_final = pd.DataFrame()
for a in range(len(lista)):
  for i in range(len(list2)):
    for j in range(len(list3)):
       print 'Current Attribute Value:',lista[a],list2[i],lista[a+1],list3[j]
       gbl["df_"+list2[i]] = df[df.lista[a] == list2[i]]
       gbl["df_" + list2[i] + list3[j]] = gbl["df_"+list2[i]].loc[gbl["df_"+list2[i]].lista[a+1] == list3[j]]
       gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
       df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)

I get the obvious error -

AttributeError: 'DataFrame' object has no attribute 'lista'.

Anyone would know how to loop over column names and values. Thanks much in advance!

Upvotes: 0

Views: 1574

Answers (1)

Parfait
Parfait

Reputation: 107587

Not quite sure of your needs, but consider permuting your lists with a list comprehension to avoid the nested loops and use a dictionary of data frames. Possibly the scorecalc() apply function can be adjusted to fit your needs:

colorlist = list(df['color'].unique())
finishlist = list(df['finish'].unique())
durationlist = list(df['duration'].unique())

# ALL COMBINATIONS BETWEEN LISTS
allList = [(c,f, d) for c in colorlist for f in finishlist for d in durationlist]

def scorecalc(row):    
    row['score'] = row['duration'].count()
    return(row)

dfList = []; dfDict = {}
for i in allList:    
    # SUBSET DFS
    tempdf = df[(df['color'] == i[0]) & (df['finish']==i[1]) & (df['duration']==i[2])]

    if len(tempdf) > 0:  # FOR NON-EMPTY DFS
        print('Current Attribute Value:', i[0], i[1], i[2])
        tempdf = tempdf.groupby(['color','finish']).apply(scorecalc)        
        tempdf['score'] = tempdf['score'] / len(tempdf)
        print(tempdf)

        key = str(i[0]) + str(i[1]) + str(i[2])
        dfDict[key] = tempdf    # DICTIONARY OF DFS (USE pd.DataFrame(list(...)) FOR FINAL)
        dfList.append(tempdf)   # LIST OF DFS (USE pd.concat() FOR FINAL DF)

# Current Attribute Value: black smooth 12.0
#   ID  color  finish  duration  score
#0  A1  black  smooth      12.0    1.0
#4  B1  black  smooth      12.0    1.0
#Current Attribute Value: white matte 8.0
#   ID  color finish  duration  score
#1  A2  white  matte       8.0    1.0
#5  B2  white  matte       8.0    1.0
#9  C2  white  matte       8.0    1.0
#Current Attribute Value: blue smooth 20.0
#   ID color  finish  duration  score
#2  A3  blue  smooth      20.0    1.0
#Current Attribute Value: green matte 10.0
#   ID  color finish  duration  score
#3  A4  green  matte      10.0    1.0

Upvotes: 1

Related Questions