Lukas

Reputation: 59

Binary Classification with char/string features

I'm currently working on a binary classification problem with proteins. The goal is to predict whether a mutation will change the protein's function from active to inactive. The mutation can occur at 4 different but fixed positions in the amino acid chain that makes up the protein, so my feature vector is a 4-character string in which each character represents the amino acid at one of those 4 positions. In total there are 21 possible amino acids.

My question is: how do I turn this 4-character string into something numerical for my classifier? What I have tried so far is converting each char to the ASCII decimal code of its capital letter (e.g. A -> 65), but this gave only mediocre results.

I found something about one-hot encoding, but I don't know how to use it here: besides which 4 of the 21 amino acids occur in the mutation, the position at which each one occurs also matters in my case.

This is a sample of the training data:

[image: sample of the training data]

Upvotes: 0

Views: 354

Answers (1)

user1808924

Reputation: 4926

my feature vector consists of a char code of length 4.. .. the position at which they occur is important in my case

Split your four-character string into four one-character strings, so that there is one feature per site (let's call them "S1", "S2", "S3" and "S4").

Each site then becomes its own categorical feature, so the same amino acid at two different sites is treated as two different values, and the position information is kept.
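As a minimal sketch of that split in plain Python (the example strings below are made up for illustration, and only the site names come from the text above):

```python
# Hypothetical variants, invented for illustration; each string holds the
# amino acids at the four mutation sites, in order.
variants = ["AKLV", "GKLV", "AKMV"]

SITES = ["S1", "S2", "S3", "S4"]

# One row per protein, one named feature per site.
rows = [dict(zip(SITES, variant)) for variant in variants]

for row in rows:
    print(row)
```

A list of dicts like this can be passed straight to pandas.DataFrame(rows) to get a table with one column per site.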

how would I turn this string of 4 chars into something numerical for my classification.

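Simply apply one-hot encoding to each of those one-character features. Each site then expands into one binary column per amino acid observed at that site (at most 21 per site, so at most 84 columns in total), and, unlike ASCII codes, this does not impose an arbitrary numeric ordering on the amino acids. Assuming you're working in a scikit-learn environment, you could use sklearn_pandas.DataFrameMapper or sklearn.compose.ColumnTransformer to perform this mapping: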

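from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper

# X is a pandas DataFrame with columns "S1".."S4"; y holds the active/inactive labels
mapper = DataFrameMapper([
  (["S1", "S2", "S3", "S4"], OneHotEncoder())
])
classifier = LogisticRegression()
pipeline = Pipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(X, y)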
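
For the ColumnTransformer route mentioned above, a rough equivalent might look like the following sketch. The toy variants and labels are invented purely so the snippet runs on its own, and handle_unknown="ignore" is an added assumption so that amino acids not seen at a site during training are encoded as all zeros rather than raising an error at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data for illustration only.
variants = ["AKLV", "GKLV", "AKMV", "GKMV"]
y = [1, 0, 1, 0]

sites = ["S1", "S2", "S3", "S4"]
# Split each 4-character string into one column per site.
X = pd.DataFrame([list(v) for v in variants], columns=sites)

# One-hot encode every site column; categories are learned per column.
encoder = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), sites),
])
pipeline = Pipeline([
    ("encoder", encoder),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X, y)

# Number of binary columns produced by the encoder for this toy data.
n_encoded_columns = pipeline.named_steps["encoder"].transform(X).shape[1]
print(n_encoded_columns)
```

Because the encoder lives inside the pipeline, the same per-site categories learned during fit are reused when calling pipeline.predict on new variants.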

Upvotes: 1
