Python - Encoding Genomic Data in dataframe

Question

Hi I'm trying to encode a Genome, stored as a string inside a dataframe read from a CSV.

Right now I'm looking to split each string in the dataframe under the column 'Genome' into a list of it's base pairs i.e. from ('acgt...') to ('a','c','g','t'...) then convert each base pair into a float (0.25,0.50,0.75,1.00) respectively.

I thought I was looking for a split function to split each string into characters but none seem to work on the data in the dataframe even when transformed to string using .tostring

Here's my most recent code:

import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))

def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded



dfpath = 'C:\Users\CAAVR\Desktop\Ison.csv'
dataframe = pd.read_csv(dfpath)

df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)

I've tried making my own function but I don't have any clue how they work. Everything I try points to not being able to process data when it's in a numpy array and nothing is working to transform the data to another type.

Thanks for the tips!

Edit: Here is the print of the dataframe-

 Antibiotic  ...                                             Genome
0       isoniazid  ...  ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1       isoniazid  ...  gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2       isoniazid  ...  aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3       isoniazid  ...  gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4       isoniazid  ...  ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...

There are 5 columns 'Genome' being the 5th in the list I don't know why 1. .head() will not work and 2. why print() doesn't give me all columns...

Dave · Accepted Answer

I don't think LabelEncoder is what you want. This is a simple transformation, I recommend doing it directly. Start with a lookup your base pair mapping:

lookup = {
  'a': 0.25,
  'g': 0.50,
  'c': 0.75,
  't': 1.00
  # z: 0.00
}

Then apply the lookup to value of the "Genome" column. The values attribute will return the resulting dataframe as an ndarray.

dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values

Python - Encoding Genomic Data in dataframe

Answers (1)

Related Questions