Tina

Reputation: 49

DNA sequence into feature

I have a dataset that contains DNA sequences, and I want to convert them into a numerical representation, as in this document:

DNA to Binary

Upvotes: 1

Views: 1275

Answers (1)

brinebroker

Reputation: 46

I believe the process you're referring to is one-hot encoding. You'll first want to transform your DNA sequence into a sequence of 3 bp words using a sliding window of width 3. See here: Generate a list of strings with a sliding window using itertools, yield, and iter() in Python 2.7.1?
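
A minimal sketch of the sliding-window step, assuming overlapping windows that advance one base at a time and lowercase output. The function name `sliding_kmers` is made up for illustration and is not taken from the linked question:

```python
def sliding_kmers(sequence, k=3):
    """Return every overlapping k-letter word in the sequence, left to right."""
    # Assumption: normalise to lowercase so 'AAA' and 'aaa' are the same word.
    sequence = sequence.lower()
    # A sequence shorter than k produces an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = sliding_kmers("AAATGC")
print(words)
```

If you want non-overlapping words instead (codon-style), step the range by `k` rather than by 1.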

So you should have something like a list of DNA "words" (e.g. ['aaa', 'tgc']). Then you'll want to convert each of the words into a vector. One way to do this is to create a dictionary whose keys are all possible words (4^3 = 64 of them for 3 bp words over a, c, g, t) and whose values are the corresponding one-hot vectors. Then you can simply convert each word to its vector using a list comprehension and a dictionary look-up. That might not be the most efficient way to do it, but it's a start. sklearn has OneHotEncoder, but in older versions it only works on integers; since scikit-learn 0.20 it also accepts string categories.
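
Here is a rough sketch of the dictionary approach, assuming the alphabet is just a, c, g, t (ambiguous bases such as n are not handled and would raise a KeyError). The names `ALPHABET` and `build_one_hot_table` are hypothetical, chosen only for this example:

```python
from itertools import product

# Assumption: only the four standard bases occur, in lowercase.
ALPHABET = "acgt"


def build_one_hot_table(k=3):
    """Map every possible k-letter word to a one-hot list of length 4**k."""
    all_words = ["".join(letters) for letters in product(ALPHABET, repeat=k)]
    table = {}
    for index, word in enumerate(all_words):
        vector = [0] * len(all_words)
        vector[index] = 1  # exactly one position is set per word
        table[word] = vector
    return table


table = build_one_hot_table(3)
words = ["aaa", "tgc"]
# Convert each word with a list comprehension and a dictionary look-up.
encoded = [table[word] for word in words]
print(len(encoded), len(encoded[0]))
```

Each word becomes a 64-element vector, so a long sequence yields a large, sparse matrix. That is the inefficiency mentioned above; a sparse representation or integer indices may be preferable for bigger datasets.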

See also https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

Upvotes: 2
