Reputation: 2250
I have a pandas df that looks like so:
import pandas as pd

df = pd.DataFrame({'index': {0: 34, 1: 35, 2: 36, 3: 37, 4: 38},
'lane': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'project': {0: 'default',
1: 'default',
2: 'default',
3: 'default',
4: 'default'},
'sample': {0: 'None-BORD1778',
1: 'None-BORD1779',
2: 'None-BORD1780',
3: 'None-BORD1782',
4: 'None-BORD1783'},
'barcode_sequence': {0: 'AACCTACG',
1: 'TTGCGAGA',
2: 'TTGCTTGG',
3: 'TACACACG',
4: 'TTCGGCTA'},
'pf_clusters': {0: '"1,018,468"',
1: '"750,563"',
2: '"752,191"',
3: '"876,957"',
4: '"695,347"'},
'%_of_the_lane': {0: 0.28, 1: 0.21, 2: 0.21, 3: 0.24, 4: 0.19},
'%_perfect_barcode': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
'yield_(mbases)': {0: '511', 1: '377', 2: '378', 3: '440', 4: '349'},
'%_pf_clusters': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
'%_>=_q30_bases': {0: 89.74, 1: 89.9, 2: 89.0, 3: 89.31, 4: 88.69},
'mean_quality_score': {0: 35.13, 1: 35.15, 2: 34.98, 3: 35.04, 4: 34.92}})
I am now trying to do the following. For each of the values under the column barcode_sequence, I want to compare, character by character, how similar they are to all of the other values under that same column.
For that I have defined the following function:
def compare(s1, s2):
    # fraction of positions where both strings have the same character
    return len([x for x in range(len(s1)) if s1[x] == s2[x]]) / len(s1)
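As a quick sanity check of what the function returns, here is a minimal sketch using the first two barcodes from the frame above. The compare_zip variant is a hypothetical rewrite added only for illustration; it assumes both strings have the same length, which holds for these 8-character barcodes.

```python
def compare(s1, s2):
    # Fraction of positions at which both strings carry the same character.
    return len([x for x in range(len(s1)) if s1[x] == s2[x]]) / len(s1)


def compare_zip(s1, s2):
    # Hypothetical equivalent for equal-length strings: pair the characters
    # up with zip instead of indexing into both strings.
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


# Only two of the eight positions agree (the second 'C' and the second 'A').
print(compare('AACCTACG', 'TTGCGAGA'))      # 0.25
print(compare_zip('AACCTACG', 'TTGCGAGA'))  # 0.25
```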
Now I want to apply this function to each value under df['barcode_sequence']. This means that, in my first iteration (where s1 is AACCTACG) I would apply the function compare to all other values under the same column, i.e. AACCTACG with TTGCGAGA, TTGCTTGG, TACACACG and TTCGGCTA. Then I would do the same for the second row, TTGCGAGA (which is now my new value of s1), and so on, until I reach the final entry under df['barcode_sequence'].
So far I have got the number of iterations that I need for each entry under df['barcode_sequence'], which can be achieved by combining a nested for loop with the iterrows() method. So if I do:
for index, row in df.iterrows():
    for sample in list(range(len(df.index))):
        print(index, row['sample'], row['barcode_sequence'])
I get at least which string I am comparing (my s1 in compare) and the number of comparisons I will do for each s1. However, I am stuck at extracting all the s2 values for each s1.
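One possible way to pull out every s2 for each s1 is sketched below with itertools.permutations from the standard library instead of iterrows(). The barcodes list is typed out here only to keep the snippet self-contained; in practice it would presumably come from df['barcode_sequence'].tolist(). The pairs name is an assumption made for this sketch.

```python
from itertools import permutations


def compare(s1, s2):
    # Fraction of positions at which both strings carry the same character.
    return len([x for x in range(len(s1)) if s1[x] == s2[x]]) / len(s1)


# Stand-in for df['barcode_sequence'].tolist()
barcodes = ['AACCTACG', 'TTGCGAGA', 'TTGCTTGG', 'TACACACG', 'TTCGGCTA']

# permutations(barcodes, 2) yields every ordered pair (s1, s2) taken from
# two different positions in the list, so each barcode is paired with all
# of the others exactly once as s1.
pairs = [(s1, s2, compare(s1, s2)) for s1, s2 in permutations(barcodes, 2)]

for s1, s2, score in pairs:
    print(s1, s2, score)
```

With five barcodes this gives 5 * 4 = 20 comparisons, four for each s1.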
Upvotes: 2
Views: 490
Reputation: 21709
Here's a way to do it using a cross join (no explicit for loops required):
# do a cross join
df1 = df[['barcode_sequence']].copy()
df1['barcode_un'] = [df1['barcode_sequence'].unique().tolist() for _ in range(df1.shape[0])]
# remove duplicate rows
df1 = df1.explode('barcode_un').query("barcode_sequence != barcode_un").reset_index(drop=True)
# calculate the score
df1['score'] = df1.apply(lambda x: compare(x['barcode_sequence'], x['barcode_un']), 1)
print(df1)
barcode_sequence barcode_un score
0 AACCTACG TTGCGAGA 0.250
1 AACCTACG TTGCTTGG 0.375
2 AACCTACG TACACACG 0.625
3 AACCTACG TTCGGCTA 0.125
4 TTGCGAGA AACCTACG 0.250
5 TTGCGAGA TTGCTTGG 0.625
6 TTGCGAGA TACACACG 0.250
7 TTGCGAGA TTCGGCTA 0.500
8 TTGCTTGG AACCTACG 0.375
9 TTGCTTGG TTGCGAGA 0.625
10 TTGCTTGG TACACACG 0.250
11 TTGCTTGG TTCGGCTA 0.250
12 TACACACG AACCTACG 0.625
13 TACACACG TTGCGAGA 0.250
14 TACACACG TTGCTTGG 0.250
15 TACACACG TTCGGCTA 0.250
16 TTCGGCTA AACCTACG 0.125
17 TTCGGCTA TTGCGAGA 0.500
18 TTCGGCTA TTGCTTGG 0.250
19 TTCGGCTA TACACACG 0.250
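As an alternative sketch of the same cross join idea, DataFrame.merge accepts how='cross' (available from pandas 1.2 onwards), which builds the pairs directly. The small frame below holds only the barcode column to keep the snippet self-contained, and the pairs name is an assumption made for this example. Note that the filter drops pairs by value, so duplicated barcodes would not be compared with each other.

```python
import pandas as pd


def compare(s1, s2):
    # Fraction of positions at which both strings carry the same character.
    return len([x for x in range(len(s1)) if s1[x] == s2[x]]) / len(s1)


# Stand-in for the barcode column of the original frame
df = pd.DataFrame({'barcode_sequence': ['AACCTACG', 'TTGCGAGA', 'TTGCTTGG',
                                        'TACACACG', 'TTCGGCTA']})

left = df[['barcode_sequence']]
right = left.rename(columns={'barcode_sequence': 'barcode_un'})

# Cartesian product of the column with itself (requires pandas >= 1.2)
pairs = left.merge(right, how='cross')

# Drop the rows where a barcode is paired with itself
pairs = pairs[pairs['barcode_sequence'] != pairs['barcode_un']].reset_index(drop=True)

# Score every remaining pair
pairs['score'] = [compare(a, b)
                  for a, b in zip(pairs['barcode_sequence'], pairs['barcode_un'])]
print(pairs)
```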
Upvotes: 1