Removing duplicate entries and extracting desired information

Question

I have a whitespace-delimited table that looks like this:

DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16  44  23  49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2   121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2   96  5   95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20   3   115 133 260
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.3e+03 3   21  277 295
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+03 14  29  345 360
DNA_pol3_beta   121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.9e-18 1   121 1   121
DNA_pol3_beta   121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 4.1e+02 30  80  157 209
DNA_pol3_beta   121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 0.94    2   101 273 369
SMC_N   220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3   199 19  351
AAA_21  303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1   32  40  68
AAA_21  303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0015  231 300 279 352
AAA_15  369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 4e-05   4   53  19  67
AAA_15  369 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 8.8e+02 347 363 332 348
AAA_23  200 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.0014  3   41  22  60

I want to filter the results. For example, the item "DNA_pol3_beta_3" has 2 entries. Of these two entries, I want to extract only the row whose value in the 5th column is the lowest. That means, of the two entries:

DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2   121 264 383

The one above should be in the result. Similarly, for "DNA_pol3_beta_2" there are 4 entries, and the program should extract only

DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20   3   115 133 260

because it has the lowest value in the 5th column among the 4. Also, the program should ignore entries whose value in the 5th column is less than 1E-5.

I tried the following code:

for i in lines:
    if lines[i+1] == lines [i]:
        if lines[i+1][4] > lines [i][4]:
            evalue = lines[i][4]
        else:
            evalue = lines[i+1][4]

IoaTzimas · Accepted Answer

You'd be better off using pandas for this. See below:

import pandas as pd

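# Read the space-separated file; the columns are labelled 0..8
df = pd.read_csv('yourfile.txt', sep=' ', skipinitialspace=True, names=range(9))

# Keep only rows whose 5th column (label 4) is at least 1e-5
df = df[df[4] >= 0.00001]

# For each name in column 0, keep the row with the lowest value in column 4,
# then restore the original file order
result = df.loc[df.groupby(0)[4].idxmin()].sort_index().reset_index(drop=True)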

Output:

>>> print(result)
                 0    1                                        2    3           4   5    6    7    8
0  DNA_pol3_beta_3  121  Paja_0001_peg_[locus_tag=BCY86_RS00005]  384  1200.00000  16   44   23   49
1  DNA_pol3_beta_2  116  Paja_0001_peg_[locus_tag=BCY86_RS00005]  384     3.70000   2   96    5   95
2    DNA_pol3_beta  121  Paja_0001_peg_[locus_tag=BCY86_RS00005]  384     0.94000   2  101  273  369
3           AAA_21  303  Paja_0002_peg_[locus_tag=BCY86_RS00010]  378     0.00011   1   32   40   68
4           AAA_15  369  Paja_0002_peg_[locus_tag=BCY86_RS00010]  378     0.00004   4   53   19   67
5           AAA_23  200  Paja_0002_peg_[locus_tag=BCY86_RS00010]  378     0.00140   3   41   22   60
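
Note that the filter keeps rows whose 5th column is at least 1e-5, following the literal wording of the question, which is why the 6.3e-27 and 5e-20 rows named in the question do not appear in the output. If the intent was the opposite (keep only values at or below 1e-5), the comparison needs to be flipped. Below is a minimal sketch without pandas that shows both readings side by side. The `best_rows` helper and the shortened `SAMPLE` data are made up for illustration, and the "opposite" reading is only an assumption about what the asker meant.

```python
# A shortened copy of the rows from the question, used as sample input.
SAMPLE = """\
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 1.2e+03 16  44  23  49
DNA_pol3_beta_3 121 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 6.3e-27 2   121 264 383
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 3.7 2   96  5   95
DNA_pol3_beta_2 116 Paja_0001_peg_[locus_tag=BCY86_RS00005] 384 5e-20   3   115 133 260
SMC_N   220 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 1.2e-14 3   199 19  351
AAA_21  303 Paja_0002_peg_[locus_tag=BCY86_RS00010] 378 0.00011 1   32  40  68
"""


def best_rows(text, keep):
    """Return, per name in column 1, the row with the lowest column-5 value.

    `keep` is a predicate on the column-5 value; rows that fail it are
    dropped before the minimum is taken.
    """
    best = {}  # name -> (value, fields); dicts preserve first-seen order
    for line in text.splitlines():
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # skip blank lines
        value = float(fields[4])
        if not keep(value):
            continue
        name = fields[0]
        if name not in best or value < best[name][0]:
            best[name] = (value, fields)
    return [fields for _, fields in best.values()]


# Literal reading of the question (same as the pandas filter): drop values < 1e-5.
literal = best_rows(SAMPLE, keep=lambda v: v >= 1e-5)

# Assumed alternative reading: keep only values at or below 1e-5.
significant = best_rows(SAMPLE, keep=lambda v: v <= 1e-5)

for row in significant:
    print(" ".join(row))
```

In the pandas version, the same switch would be a one-character change in the filter line (`<=` instead of `>=`).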

If you want to write the result back to a file, you can save it with result.to_csv().
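
As a rough sketch of that last step: the filtered rows live in `result`, and `to_csv` writes a header row and the index by default, so both probably need to be turned off to get back the original layout. The two-row `result` frame below is a stand-in built by hand for illustration, and an in-memory buffer is used instead of a real file so the example is self-contained.

```python
import io

import pandas as pd

# Stand-in for the `result` DataFrame from the answer (two rows, for illustration).
result = pd.DataFrame([
    ["DNA_pol3_beta_3", 121, "Paja_0001_peg_[locus_tag=BCY86_RS00005]", 384, 1200.0, 16, 44, 23, 49],
    ["AAA_21", 303, "Paja_0002_peg_[locus_tag=BCY86_RS00010]", 378, 0.00011, 1, 32, 40, 68],
])

# Write space-separated rows with no header line and no index column.
# Pass a file path instead of the buffer to write to disk.
buffer = io.StringIO()
result.to_csv(buffer, sep=" ", header=False, index=False)

written = buffer.getvalue()
print(written)
```

Each output line then has the same nine space-separated fields as the input, although the numeric formatting of column 5 may differ from the original file.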
