Reputation: 43
I'm trying to create a column of microsatellite motifs in a pandas dataframe. I have one column that gives the length of the motif and another that has the whole microsatellite.
Here's an example of the columns of interest.
motif_len sequence
0 3 ATTATTATTATT
1 4 ATCTATCTATCT
2 3 ATCATCATCATC
I would like to slice the values in sequence using the values in motif_len to give a single repeat (motif) of each microsatellite. I'd then like to add all these motifs as a third column in the data frame, to give something like this.
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
I've tried a few things with no luck.
>>> df['motif'] = df.sequence.str[:df.motif_len]
>>> df['motif'] = df.sequence.str[:df.motif_len.values]
Both make the motif column but all the values are NaN.
I think I understand why these don't work: I'm passing a Series/array as the upper bound of the slice rather than a single value from the motif_len column.
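To illustrate the point, here is a minimal sketch (the DataFrame is rebuilt from the example above) showing that the .str slice accepts one scalar bound, which is then applied to every row. That is why it cannot pick up a different length per row on its own.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "motif_len": [3, 4, 3],
    "sequence": ["ATTATTATTATT", "ATCTATCTATCT", "ATCATCATCATC"],
})

# A scalar upper bound works, but the same bound is used for every row,
# so the 4-base motif in row 1 is cut short.
first_three = df["sequence"].str[:3]
print(first_three.tolist())
```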
I also tried to create a series by iterating through each row. Any ideas?
Upvotes: 4
Views: 2568
Reputation: 393893
You can call apply on the df and pass axis=1 to apply the function row-wise, using each row's column values to slice the string:
In [5]:
df['motif'] = df.apply(lambda x: x['sequence'][:x['motif_len']], axis=1)
df
Out[5]:
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
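As a possible alternative (not part of the original answer), the same result can be sketched with a plain list comprehension over the two columns, which avoids building a row object for every call to the lambda. The DataFrame below is rebuilt from the question's example; whether this is faster on your data is an assumption worth timing yourself.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "motif_len": [3, 4, 3],
    "sequence": ["ATTATTATTATT", "ATCTATCTATCT", "ATCATCATCATC"],
})

# Pair each sequence with its own motif length and slice in plain Python.
df["motif"] = [seq[:n] for seq, n in zip(df["sequence"], df["motif_len"])]
print(df)
```

Because the list is built in row order, it lines up with the existing index when assigned as a new column.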
Upvotes: 4