Reputation: 1185
I have a CSV file with 6 columns. I load it into memory and process it with several methods. The result is a data frame with 4 columns that looks like this:
name number Allele Allele
aaa 111 A B
aab 112 A A
aac 113 A B
But now I have a CSV in a different format (not Illumina), and I need to convert it to the format above.
This is what I get from it:
name number Allele1 Allele2
aaa 111 A C
aab 112 A G
aac 113 G G
I know how the formats correspond, for example AG == AB, GG == AA, CC == AA (too), etc. But is there a better way to do this than a for loop?
Let's say:
for line in range(len(dataframe)):
    if dataframe.Allele1[line] == 'A' and dataframe.Allele2[line] == 'G':
        dataframe.loc[line, 'Allele1'] = 'A'
        dataframe.loc[line, 'Allele2'] = 'B'
    # elif ...: and so on for every other allele pair
I feel that this is not the best way to accomplish this task. Maybe there is a better way in pandas or just plain Python?
I need to convert that format to the Illumina format because the database works with Illumina.
And: in Illumina AA = AA, CC, GG; AB = AC, AG, AT, CT, GT; BB = CG, TT etc.
So if a row has A in column Allele1 and T in column Allele2, the edited row will be: Allele1 = A, Allele2 = B.
The expected result is:
name number Allele1 Allele2
aaa 111 A B
aab 112 A B
aac 113 A A
The result MUST have 4 columns.
Upvotes: 0
Views: 64
Reputation: 2032
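You can try this (to convert AG to AB). Each comparison needs its own parentheses because & binds more tightly than ==, and the second condition has to test Allele2:
mask = (df['Allele1'] == 'A') & (df['Allele2'] == 'G')
df.loc[mask, 'Allele1'] = 'A'
df.loc[mask, 'Allele2'] = 'B'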
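The same boolean-mask idea can be extended to several allele pairs at once. Below is a minimal sketch, assuming the sample rows from the question and a small, hypothetical rules table built only from the examples given there (it is not a complete Illumina mapping). All masks are computed before any row is changed, so an earlier rule cannot affect which rows a later rule matches.

```python
import pandas as pd

# Sample data mirroring the question's second table.
df = pd.DataFrame({
    "name": ["aaa", "aab", "aac"],
    "number": [111, 112, 113],
    "Allele1": ["A", "A", "G"],
    "Allele2": ["C", "G", "G"],
})

# (Allele1, Allele2) -> (new Allele1, new Allele2).
# Assumed, partial table taken from the examples in the question.
rules = {
    ("A", "C"): ("A", "B"),
    ("A", "G"): ("A", "B"),
    ("A", "T"): ("A", "B"),
    ("C", "C"): ("A", "A"),
    ("G", "G"): ("A", "A"),
}

# Build every mask from the untouched input first.
masks = {
    pair: (df["Allele1"] == pair[0]) & (df["Allele2"] == pair[1])
    for pair in rules
}

# Apply each rule to the rows its mask selected.
for pair, (new1, new2) in rules.items():
    df.loc[masks[pair], "Allele1"] = new1
    df.loc[masks[pair], "Allele2"] = new2

print(df)
```

This still loops, but over the handful of rules rather than over every row of the data frame.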
Upvotes: 0
Reputation: 72
Have you tried using pandas.DataFrame.replace? For instance:
df['Allele1'] = df['Allele1'].replace(['GC', 'CC'], 'AA')
That line replaces the values GC and CC in the column "Allele1" with the one you are looking for, AA. Note that replace matches whole cell values and returns a new object, so assign the result back. You can apply the same logic to all the substitutions you need, and if you want to do it across the whole dataframe, just don't specify the column:
df = df.replace(['GC', 'CC'], 'AA')
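Because the alleles in the question live in two separate one-letter columns, a value-for-value substitution needs the two letters joined first. Here is a rough sketch of that idea, assuming the sample rows from the question and a genotype table copied from the asker's list (the asker ends it with "etc.", so it is incomplete). It uses Series.map instead of replace, so that any genotype missing from the table shows up as a missing value instead of passing through unchanged.

```python
import pandas as pd

# Sample data mirroring the question's second table.
df = pd.DataFrame({
    "name": ["aaa", "aab", "aac"],
    "number": [111, 112, 113],
    "Allele1": ["A", "A", "G"],
    "Allele2": ["C", "G", "G"],
})

# Genotype table as listed in the question; assumed incomplete.
genotype_map = {
    "AA": "AA", "CC": "AA", "GG": "AA",
    "AC": "AB", "AG": "AB", "AT": "AB", "CT": "AB", "GT": "AB",
    "CG": "BB", "TT": "BB",
}

# Join the two allele columns into one two-letter genotype per row.
genotype = df["Allele1"] + df["Allele2"]

# Look up each genotype; anything not in the table becomes NaN.
converted = genotype.map(genotype_map)

# Fail loudly instead of writing NaN into the result.
if converted.isna().any():
    unknown = sorted(genotype[converted.isna()].unique())
    raise ValueError(f"No mapping for genotypes: {unknown}")

# Split the converted genotype back into the two original columns.
df["Allele1"] = converted.str[0]
df["Allele2"] = converted.str[1]

print(df)
```

The data frame keeps its 4 columns. The lookup is order-sensitive, so a pair stored as CA rather than AC would need its own entry in the table.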
Upvotes: 1