Sasihci
Sasihci

Reputation: 143

concat the strings of one column based on condition on other column

I have a data frame that I want to remove duplicates on column named "sample" and the add string information in gene and status columns to new column as shown in the attached pics.

Thank you so much in advance

enter image description here

below is the modified version of data frame.where gene in rows are replaced by actual gene namesenter image description here

Upvotes: 3

Views: 4013

Answers (2)

lightalchemist
lightalchemist

Reputation: 10211

Here, df is your Pandas DataFrame.

def new_1(g):
    return ','.join(g.gene)

def new_2(g):
    return ','.join(g.gene + '-' + g.status)    

new_1_data = df.groupby("sample").apply(new_1).to_frame(name="new_1")
new_2_data = df.groupby("sample").apply(new_2).to_frame(name="new_2")

new_data = pd.merge(new_1_data, new_2_data, on="sample")
new_df = pd.merge(df, new_data, on="sample").drop_duplicates("sample")

If you wish to have "sample" as a column instead of an index, then add

new_df = new_df.reset_index(drop=True)

Lastly, as you did not specify which of the original rows of duplicates to retain, I simply use the default behavior of Pandas and drop all but the first occurrence.

Edit

I converted your example to the following CSV file (delimited by ',') which I will call "data.csv".

sample,gene,status
ppar,p53,gain
ppar,gata,gain
ppar,nb,loss
srty,nf1,gain
srty,cat,gain
srty,cd23,gain
tygd,brac1,loss
tygd,brac2,gain
tygd,ras,loss

I load this data as

# Default delimiter is ','. Pass `sep` argument to specify delimiter.
df = pd.read_csv("data.csv")  

Running the code above and printing the dataframe produces the output

  sample   gene status            new_1                           new_2
0   ppar    p53   gain      p53,gata,nb      p53-gain,gata-gain,nb-loss
3   srty    nf1   gain     nf1,cat,cd23     nf1-gain,cat-gain,cd23-gain
6   tygd  brac1   loss  brac1,brac2,ras  brac1-loss,brac2-gain,ras-loss

This is exactly the expected output given in your example.

Note that the left-most column of numbers (0, 3, 6) are the remnants of the index of the original dataframes produced after the merges. When you write this dataframe to file you can exclude it by setting index=False for df.to_csv(...).

Edit 2

I checked the CSV file you emailed me. You have a space after the word "gene" in the header of your CSV file.

Change the first line of your CSV file from

sample,gene ,status

to

sample,gene,status

Also, there are spaces in your entries. If you wish to remove them, you can

# Strip spaces from entries. Only works for string entries
df = df.applymap(lambda x: x.strip()) 

Upvotes: 5

CGul
CGul

Reputation: 167

Might not be the most efficient solution but this should get you there:

        samples = []
        genes= []
        statuses = []

        for s in set(df["sample"]):
            #grab unique samples
            samples.append(s)

            #get the genes for each sample and concatenate them
            g = df["gene"][df["sample"]==s].str.cat(sep=",")
            genes.append(g)

            #loop through the genes for the sample and get the statuses
            status = ''
            for gene in g.split(","):
                gene_status = df["status"][(df["sample"] == s) & (df["gene"] == gene)].to_string(index=False)

                status += gene
                status += "-"
                status += gene_status
                status += ','

            statuses.append(status)

        #create new df
        new_df = pd.DataFrame({'sample': samples,
                               'new': genes,
                               'new1': statuses})

Upvotes: 1

Related Questions