Reputation: 23
I know this is a repeated question, but I tried the answers to other questions and could not solve the issue.
In summary, I want to replace 0 with 'A A ', 1 with 'A B ', 2 with 'B B ', and 5 with '0 0 '.
My input data file (datafile.txt) has the format shown below, and I want to replace values only in the "Geno" column (the real dataset has a million lines).
Sample | Geno |
---|---|
ID1 | 11010111151 |
ID2 | 12000120022 |
ID3 | 12055520022 |
ID4 | 12000120022 |
The pipeline I am using is:
import pandas as pd
#input file
fin = pd.read_table('dataframe.txt',sep = ' ', header=None)
df = pd.DataFrame(fin)
geno = (df.iloc[: , 1:])
id = (df.iloc[: , 0])
geno = pd.DataFrame(geno)
geno2 = geno.replace("0","A A ").replace("1","A B ").replace("2","B B ").replace("5","0 0 ")
I appreciate your help! I was doing this in bash (using awk), but it was taking a long time, so I decided to try Python, which I believe would be faster. PS: I am a beginner in Python. Thank you again.
Upvotes: 2
Views: 57
Reputation: 35686
Series.replace with a dict is also an option:
import pandas as pd
df = pd.DataFrame({
'Sample': ['ID1', 'ID2', 'ID3', 'ID4'],
'Geno': [11010111151, 12000120022, 12055520022, 12000120022]
})
df['Geno'] = df['Geno'].astype(str).replace({
'0': ' A A',
'1': ' A B',
'2': ' B B',
'5': ' 0 0'
}, regex=True).str.lstrip()
print(df)
df:
Sample Geno
0 ID1 A B A B A A A B A A A B A B A B A B 0 0 A B
1 ID2 A B B B A A A A A A A B B B A A A A B B B B
2 ID3 A B B B A A 0 0 0 0 0 0 B B A A A A B B B B
3 ID4 A B B B A A A A A A A B B B A A A A B B B B
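One caveat with astype(str): if pandas already parsed the Geno column as integers when reading the file, any leading zeros are gone before the conversion happens. A minimal sketch of reading the column as text instead, assuming the file is space-separated with no header row (the in-memory sample and the column names below are stand-ins for datafile.txt, not the real data):

```python
import io
import pandas as pd

# Hypothetical stand-in for datafile.txt: space-separated, no header row.
# The first genotype deliberately starts with a 0.
text = "ID1 01010111151\nID2 12000120022\n"

# Default parsing infers an integer column, which drops the leading zero.
df_default = pd.read_csv(io.StringIO(text), sep=" ", header=None,
                         names=["Sample", "Geno"])

# dtype=str keeps every genotype exactly as written in the file.
df = pd.read_csv(io.StringIO(text), sep=" ", header=None,
                 names=["Sample", "Geno"], dtype=str)

print(df_default["Geno"].tolist())
print(df["Geno"].tolist())
```

With the column read as text, the astype(str) call becomes unnecessary and the replacement can be applied directly.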
Upvotes: 2
Reputation: 14949
TRY:
df.Geno = df.Geno.astype(str).str.replace("0","A A ").str.replace("1","A B ").str.replace("2","B B ").str.replace("5","0 0 ")
OUTPUT:
Sample Geno
0 ID1 A B A B A A A B A A A B A B A B A B 0 0 A B
1 ID2 A B B B A A A A A A A B B B A A A A B B B B
2 ID3 A B B B A A 0 0 0 0 0 0 B B A A A A B B B B
3 ID4 A B B B A A A A A A A B B B A A A A B B B B
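Note that the order of the chained calls matters here: the replacement for 5 ('0 0 ') itself contains 0, so replacing 5 before 0 would turn those zeros into 'A A ' as well. A possible alternative that avoids the ordering issue is a single-pass character translation with Series.str.translate. This is only a sketch, assuming Geno is already a string column; the trailing rstrip() is optional and just trims the final space:

```python
import pandas as pd

df = pd.DataFrame({
    "Sample": ["ID1", "ID3"],
    "Geno": ["11010111151", "12055520022"],
})

# Translation table: each single-character code maps to its genotype string.
table = str.maketrans({"0": "A A ", "1": "A B ", "2": "B B ", "5": "0 0 "})

# Every character is looked up exactly once, so the "0" inside "0 0 "
# is never replaced a second time, whatever order the codes are listed in.
df["Geno"] = df["Geno"].str.translate(table).str.rstrip()

print(df)
```

Because the whole mapping is applied in one pass over each string, this may also be worth timing against the chained version on the full million-line file.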
Upvotes: 3