Reputation: 49
I have a dataset contains DNA sequences, and I want to convert them into a numerical representation. As in this document:
Upvotes: 1
Views: 1275
Reputation: 46
I believe the process you're referring to is one-hot encoding. You'll first want to transform your DNA sequence into a sequence of 3bp words using a sliding window of width 3. see here: Generate a list of strings with a sliding window using itertools, yield, and iter() in Python 2.7.1?
So you should have something like a list of DNA "words" (e.g. ['aaa', 'tgc']
)Then you'll want to convert each of the words into a vector. One way to do this is to create a dictionary with keys corresponding to all possible words and values with the one-hot representation. Then you can simply convert each word to its corresponding vector using a list comprehension and dictionary look-up. That might not be the most efficient way to do it, but it's a start. sklearn has OneHotEncoder, but it only works on integers.
See also https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
Upvotes: 2