Comparing one string with all the others in the same column in Python

Question

I have a pandas DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame({'index': {0: 34, 1: 35, 2: 36, 3: 37, 4: 38},
 'lane': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
 'project': {0: 'default',
  1: 'default',
  2: 'default',
  3: 'default',
  4: 'default'},
 'sample': {0: 'None-BORD1778',
  1: 'None-BORD1779',
  2: 'None-BORD1780',
  3: 'None-BORD1782',
  4: 'None-BORD1783'},
 'barcode_sequence': {0: 'AACCTACG',
  1: 'TTGCGAGA',
  2: 'TTGCTTGG',
  3: 'TACACACG',
  4: 'TTCGGCTA'},
 'pf_clusters': {0: '"1,018,468"',
  1: '"750,563"',
  2: '"752,191"',
  3: '"876,957"',
  4: '"695,347"'},
 '%_of_the_lane': {0: 0.28, 1: 0.21, 2: 0.21, 3: 0.24, 4: 0.19},
 '%_perfect_barcode': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
 'yield_(mbases)': {0: '511', 1: '377', 2: '378', 3: '440', 4: '349'},
 '%_pf_clusters': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
 '%_>=_q30_bases': {0: 89.74, 1: 89.9, 2: 89.0, 3: 89.31, 4: 88.69},
 'mean_quality_score': {0: 35.13, 1: 35.15, 2: 34.98, 3: 35.04, 4: 34.92}})

For each value in the column barcode_sequence, I want to compare it, character by character, with every other value in that same column to measure how similar they are.

For that I have defined the following function:

def compare(s1, s2):
    # fraction of positions at which s1 and s2 hold the same character
    return len([x for x in range(len(s1)) if s1[x] == s2[x]]) / len(s1)

Now I want to apply this function to each value in df['barcode_sequence']. In the first iteration (where s1 is AACCTACG), I would apply compare to all the other values in the same column, i.e. AACCTACG with TTGCGAGA, TTGCTTGG, TACACACG and TTCGGCTA. Then I would do the same for the second row, TTGCGAGA (which becomes the new s1), and so on, until I reach the final entry in df['barcode_sequence'].

So far I have worked out the number of iterations I need for each entry in df['barcode_sequence'], using a nested for loop combined with the iterrows() method. So if I do:

for index, row in df.iterrows():
    for sample in list(range(len(df.index))):
        print(index, row['sample'],row['barcode_sequence'])

This at least gives me the string I am comparing (my s1 in compare) and the number of comparisons I will make for each s1.

However, I am stuck on extracting every s2 for each s1.

YOLO · Accepted Answer

Here's a way to do it using a cross join (no explicit for loops required):

# cross join: give every row the full list of unique barcodes
df1 = df[['barcode_sequence']].copy()
df1['barcode_un'] = [df1['barcode_sequence'].unique().tolist() for _ in range(df1.shape[0])]

# one row per pair, then drop self-comparisons (a barcode paired with itself)
df1 = df1.explode('barcode_un').query("barcode_sequence != barcode_un").reset_index(drop=True)

# calculate the score
df1['score'] = df1.apply(lambda x: compare(x['barcode_sequence'], x['barcode_un']), axis=1)

print(df1)

   barcode_sequence barcode_un  score
0          AACCTACG   TTGCGAGA  0.250
1          AACCTACG   TTGCTTGG  0.375
2          AACCTACG   TACACACG  0.625
3          AACCTACG   TTCGGCTA  0.125
4          TTGCGAGA   AACCTACG  0.250
5          TTGCGAGA   TTGCTTGG  0.625
6          TTGCGAGA   TACACACG  0.250
7          TTGCGAGA   TTCGGCTA  0.500
8          TTGCTTGG   AACCTACG  0.375
9          TTGCTTGG   TTGCGAGA  0.625
10         TTGCTTGG   TACACACG  0.250
11         TTGCTTGG   TTCGGCTA  0.250
12         TACACACG   AACCTACG  0.625
13         TACACACG   TTGCGAGA  0.250
14         TACACACG   TTGCTTGG  0.250
15         TACACACG   TTCGGCTA  0.250
16         TTCGGCTA   AACCTACG  0.125
17         TTCGGCTA   TTGCGAGA  0.500
18         TTCGGCTA   TTGCTTGG  0.250
19         TTCGGCTA   TACACACG  0.250
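
For readers who prefer to see the pairing spelled out, the same ordered pairs can also be generated without pandas. The following is only a minimal sketch, not part of the original answer: it assumes the barcodes are available as a plain list (copied here from the question), and it restates compare with zip, which gives the same result as the question's version only when both strings have the same length.

```python
from itertools import permutations


def compare(s1, s2):
    """Fraction of positions at which s1 and s2 hold the same character.

    Assumes s1 and s2 have equal length, as the barcodes in the question do.
    """
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)


# Barcodes copied from the question; with the DataFrame at hand this list
# could instead come from df['barcode_sequence'].tolist().
barcodes = ['AACCTACG', 'TTGCGAGA', 'TTGCTTGG', 'TACACACG', 'TTCGGCTA']

# permutations(barcodes, 2) yields every ordered pair (s1, s2) taken from
# two different positions of the list, so no entry is paired with itself.
scores = [(s1, s2, compare(s1, s2)) for s1, s2 in permutations(barcodes, 2)]

for s1, s2, score in scores:
    print(s1, s2, score)
```

The list of tuples can be turned back into a table with pd.DataFrame(scores, columns=['barcode_sequence', 'barcode_un', 'score']). Note that permutations pairs entries by position, so if the column contained the same barcode twice, those two entries would be compared with each other, whereas the unique() and query() steps above would drop such pairs.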
