Bob McBobson

Reputation: 904

How to create a new column in a dataframe whose values represent the ranges that values from a certain column fall into?

I have read a .csv file to create a dict that, for each given sequence, contains names as keys and a list with one DNA sequence and one fluorescence measurement as values. After these sequences are processed for a while by a variety of other functions, I will be making a new dataframe that contains the fluorescence values and other various values that are the products of the said functions.

I now want to create a new column that sorts each row into a class representing the range in which its fluorescence measurement falls. For example, if a certain DNA sequence is associated with a fluorescence measurement of 240, it should fall into a class labeled "200-300" or "100-400", depending on the ranges chosen. As I have not yet decided what sizes my ranges should be, just assume that I will have three classes for the sake of simplicity: "<100", "100-200", and ">200".

I have the following code that works fine for making a new dataframe with the new values, but I don't know how to set it up so that it also adds the "class" into which each fluorescence measurement falls.

from os import path
from pandas import DataFrame

def data_assembler(folder_contents):
    df = DataFrame(columns=['Column1', 'Column2', 'Column3'])
    # list(...) so that the first 50 keys can be sliced
    for candidate in list(folder_contents.keys())[:50]:
        fluorescence = folder_contents[candidate][0]
        score0 = fluorescence
        if score0 < 100:
            class1 = "<100"
        elif score0 > 100 and score0 < 200:
            class2 = "100-200"
        elif score0 > 200:
            class3 = ">200"
        score1 = calculate_complex_mfe(folder_contents[candidate][1])
        score2 = calculate_complex_ensemble_defect(folder_contents[candidate][1])
        score3 = calculate_GC_content(folder_contents[candidate][1])
        ### note: the following line is not correct because I'm not sure how to add the class to the particular cell
        df.loc[candidate] = [class1 or class2 or class3 or score0, score1, score2, score3]
    df = df.sort_values(['score3'], ascending=False)
    df.to_csv(path.join(output, "DNAScoring.csv"))

How can I improve my code so that it ultimately produces a dataframe that looks something like this:

[Image: example of the desired dataframe]

Upvotes: 2

Views: 980

Answers (1)

jezrael

Reputation: 862641

I think you need pd.cut:

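import numpy as np
import pandas as pd

df = pd.DataFrame({'Fluorescence': [0, 100, 200, 300]})
# With the default right=True, the bins are (-inf, 99], (99, 200], (200, inf)
bins = [-np.inf, 99, 200, np.inf]
labels = ['<100', '100-200', '>200']
df['Class'] = pd.cut(df['Fluorescence'], bins=bins, labels=labels)
print(df)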
   Fluorescence    Class
0             0     <100
1           100  100-200
2           200  100-200
3           300     >200
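
To connect this to the dict from the question, here is a minimal sketch of how the whole thing might fit together. The sample data, the `gc_content` helper and the column names are hypothetical stand-ins, since the real scoring functions are not shown. Note that the edge of 99 above only works for integer measurements (99.5 would land in "100-200"). For float values, one option is to pass `right=False` so that each bin includes its left edge; under that assumption a value of exactly 200 falls in the top class, which is why the last label here is ">=200".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's dict: name -> [fluorescence, sequence]
folder_contents = {
    "seq_a": [42.0, "ATGC"],
    "seq_b": [150.5, "GGCC"],
    "seq_c": [240.0, "ATAT"],
}


def gc_content(sequence):
    """Fraction of G and C bases; a simple placeholder for the real scoring functions."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Collect all rows first, then build the dataframe in one step
# instead of assigning to df.loc row by row inside the loop.
rows = {
    name: {"Fluorescence": values[0], "GC": gc_content(values[1])}
    for name, values in folder_contents.items()
}
df = pd.DataFrame.from_dict(rows, orient="index")

# right=False gives left-closed bins: [-inf, 100), [100, 200), [200, inf)
bins = [-np.inf, 100, 200, np.inf]
labels = ["<100", "100-200", ">=200"]
df["Class"] = pd.cut(df["Fluorescence"], bins=bins, labels=labels, right=False)

# sort_values replaces the old DataFrame.sort method
df = df.sort_values("GC", ascending=False)
print(df)
```

Because the class is computed for the whole column at once, the if/elif chain is no longer needed, and changing the ranges later only means editing `bins` and `labels`.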

Upvotes: 3
