Reputation: 11
I am using R for a research project that requires me to take a sequence of the numbers 1 to 5, of varying length, and calculate a score from that sequence.
The data frame I have stores the sequences as a factor. If I take a single entry and convert it to a numeric vector, I can input it into the formula. But if I try to do this for all rows I run into errors.
I have searched SO and other sources but only found information on converting factors to numeric when each cell contains a single value. My data contains a comma-separated sequence of numbers per cell. If I take the input from one cell and apply as.character, strsplit and as.numeric to it, it works. But I don't want to do this for every cell manually. How can I solve this?
This is what I did:
df <- read.csv2("example_seq_logs.csv", na.strings = "n/a")
df$seqtext <- as.character(df$hmm)
This is what the data frame looks like:
head(df)
lesson hmm
1 A 1,2,3,3,3,4,3,4,5,4,4,5,5,2,2,1,2,3,4,2,3
2 B 2,2,3,4,1,1,3,3,3,5,5,4,4,4,2,1
3 C 1,3,1,3,2,3,2,2,3,3,4,1,3,2,3,3,5,4,4,3,3
4 D 1,3,2,2,3,3,2,3,1,4,4,5,5,2,4,4,4,3
5 E 1,4,2,5,1,3,1,3,1,4,3,4,4
str(df)
'data.frame': 5 obs. of 2 variables:
$ lesson: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ hmm : Factor w/ 5 levels "1,2,3,3,3,4,3,4,5,4,4,5,5,2,2,1,2,3,4,2,3",..: 1 5 2 3 4
sapply(df, mode)
lesson hmm
"numeric" "numeric"
Now if I take a single entry I can do this:
testseq <- as.numeric(strsplit(df$seqtext[1], ",")[[1]])
str(testseq)
num [1:21] 1 2 3 3 3 4 3 4 5 4 ...
and then I can input the testseq sequence into the function I need. But when I try the same for the whole column it results in an error
df$seq <- as.numeric(strsplit(df$seqtext, ","))[[1:58]]
Error: (list) object cannot be coerced to type 'double'
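A likely reason for this error: strsplit() returns a list with one character vector per row, and as.numeric() cannot coerce a whole list at once. A minimal sketch of a per-row conversion follows. The toy data frame is an assumption that copies rows 2 and 5 from the output above, and storing the result as a list column is one option among several.

```r
# Toy data mirroring the question's layout (values copied from rows 2 and 5).
df <- data.frame(
  lesson = c("B", "E"),
  hmm = c("2,2,3,4,1,1,3,3,3,5,5,4,4,4,2,1", "1,4,2,5,1,3,1,3,1,4,3,4,4"),
  stringsAsFactors = TRUE
)
df$seqtext <- as.character(df$hmm)

# strsplit() gives a list with one character vector per row;
# lapply() converts each element separately, so rows stay separate.
df$seq <- lapply(strsplit(df$seqtext, ","), as.numeric)

# Each row now holds its own numeric vector.
str(df$seq[[1]])
```

Each entry is then reached with df$seq[[i]], which gives the same kind of vector as testseq above.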
Thank you for your help!
Edit: The first suggestion yields this error:
df$seq <- as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
Error in `$<-.data.frame`(`*tmp*`, seq, value = c(1, 2, 3, 3, 3, 4, 3, :
replacement has 89 rows, data has 5
It seems it turns the entire column into one long string.
a <- as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
print(a)
[1] 1 2 3 3 3 4 3 4 5 4 4 5 5 2 2 1 2 3 4 2 3 2 2 3 4 1 1 3 3 3 5 5 4 4 4 2 1 1 3 1 3 2 3 2 2 3 3 4 1 3 2 3
[53] 3 5 4 4 3 3 1 3 2 2 3 3 2 3 1 4 4 5 5 2 4 4 4 3 1 4 2 5 1 3 1 3 1 4 3 4 4
But I need each sequence to turn up in the right row as a string.
Edit: I found that the function I need to calculate results with doesn't need numerics so now I've solved the issue using a for loop:
df$score <- 0
for (i in 1:nrow(df)) {
  seq <- as.array(strsplit(as.character(df$hmm),","))
  session_seq <- seq[i]
  res = computehmm(session_seq)
  df$score[i] <- res$score
}
But now it stops calculating once it reaches an empty df$hmm field.
I understand sapply would be better but I don't understand how to get it to work.
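A rough sketch of how the loop might be written with vapply(), a stricter relative of sapply(), while skipping empty fields. The computehmm() defined here is only a hypothetical stand-in so the example runs (the real function is not shown in the question), and lesson "F" with its empty field is made-up test data.

```r
# Hypothetical stand-in for the real computehmm(): it only needs to return
# a list with a $score element. Here the "score" is just the sequence length.
computehmm <- function(session_seq) {
  list(score = as.numeric(length(session_seq[[1]])))
}

# Rows D and E are copied from the question; row F is an invented empty field.
df <- data.frame(
  lesson = c("D", "E", "F"),
  hmm = c("1,3,2,2,3,3,2,3,1,4,4,5,5,2,4,4,4,3", "1,4,2,5,1,3,1,3,1,4,3,4,4", ""),
  stringsAsFactors = FALSE
)

# Split once, outside any loop.
seqs <- strsplit(as.character(df$hmm), ",")

# vapply() requires exactly one numeric value per row.
df$score <- vapply(seq_along(seqs), function(i) {
  # Empty or missing fields get NA instead of being passed to computehmm().
  if (length(seqs[[i]]) == 0 || anyNA(seqs[[i]])) return(NA_real_)
  computehmm(seqs[i])$score
}, numeric(1))
```

Rows with an empty or NA field end up with NA in df$score, so the remaining rows are still scored.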
Upvotes: 1
Views: 111
Reputation: 3183
You can use paste with collapse to join all the sequences into one string, then split and convert it:
as.numeric(unlist(strsplit(paste(df$seqtext, collapse = ","), ",")))
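As the question's edit points out, collapsing everything gives one long vector and loses track of which row each value came from. One possible variation, assuming a long format with one row per value is acceptable, keeps the lesson label next to each number. The toy data frame and the names parts and long are illustrative, not from the original post.

```r
# Toy data: rows 2 and 5 from the question.
df <- data.frame(
  lesson = c("B", "E"),
  seqtext = c("2,2,3,4,1,1,3,3,3,5,5,4,4,4,2,1", "1,4,2,5,1,3,1,3,1,4,3,4,4"),
  stringsAsFactors = FALSE
)

# One character vector per row.
parts <- strsplit(df$seqtext, ",")

# Repeat each lesson label once per value so every number keeps its row.
long <- data.frame(
  lesson = rep(df$lesson, lengths(parts)),
  value = as.numeric(unlist(parts))
)
```

From here, split(long$value, long$lesson) gives the numeric sequence for each lesson again.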
Upvotes: 1