Reputation: 101
I am trying to import this text data into R:
+1 4:1 10:1 18:1 22:1 36:1 40:1 59:1 63:1 67:1 73:1 74:1 76:1 80:1 83:1 -1 3:1 6:1 17:1 29:1 39:1 40:1 52:1 63:1 67:1 73:1 74:1 76:1 82:1 83:1 -1 2:1 6:1 14:1 19:1 39:1 42:1 52:1 64:1 68:1 72:1 74:1 76:1 80:1 98:1
The format is:
<label> <feature>:<value> <feature>:<value>...
The data stores only the features that are non-zero. So the first observation has Y = +1, and the 4th, 10th, 18th, ..., 83rd features of X are 1.
I want to store the labels in one vector and the values in a matrix. scan and read.table do not seem to work here, so I need some help figuring out a way to do this.
Upvotes: 0
Views: 56
Reputation: 1481
Let's say you have your data in a single line txt file called test.txt. You can load it into R as a string with:
library(dplyr)
library(tidyr)
file = "~/Desktop/test.txt"
# trimws() drops a trailing newline, if the file ends with one
read_line = trimws(readChar(file, nchars = file.info(file)$size))
From what you posted, the data looks space-separated, so you can split it with:
space_separated = strsplit(read_line, " ", fixed = TRUE)[[1]]
The labels appear to be the tokens that do not contain the character ':', so you can find their positions with:
find_labels = which(!grepl("\\:", space_separated))
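To see what that line returns, here is a small sketch on a made-up, shortened input (the vector tokens and the name label_pos are only for illustration, not part of the data above):

```r
# Hypothetical short input: two observations, already split on spaces
tokens <- c("+1", "4:1", "10:1", "-1", "3:1")

# Tokens without a ':' are labels; which() gives their positions
label_pos <- which(!grepl(":", tokens, fixed = TRUE))
label_pos

# The labels themselves; everything between one position and the
# next belongs to the label at the earlier position
tokens[label_pos]
```

Each label position marks the start of one observation, which is what the next step relies on.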
Now the tricky bit is to split the character vector into one chunk per label, which you can do like this:
all_res = lapply(seq_along(find_labels), function(i){
# Create indexes that identify one label
if(i == length(find_labels))
label_subset = seq(find_labels[i], length(space_separated))
else
label_subset = seq(find_labels[i], find_labels[i + 1] - 1)
# Name of the label
the_label = space_separated[label_subset[1]]
# Value
the_subset = space_separated[label_subset[-1]]
tibble(label = the_label, value = the_subset)
})
This will return a list of data frames that you can bind together with:
all_res = bind_rows(all_res) %>%
separate(value, c("feature", "value"))
So the output will be a data frame looking like this:
label | feature | value
+1 | 4 | 1
+1 | 10 | 1
+1 | 18 | 1
.......
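If what you need in the end is the label vector and the matrix from the question, the same label-detection idea can be taken a step further. This is only a minimal base R sketch on a shortened, made-up input. The names is_label, obs, y and X are mine, and it assumes that labels parse as integers and that feature indices start at 1. It uses cumsum() to number the observations, because consecutive observations with the same label cannot be told apart from the label column alone.

```r
# Hypothetical short input in the same <label> <feature>:<value> format
txt <- "+1 4:1 10:1 -1 3:1 6:1 -1 2:1"
tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Labels are the tokens without ':'; cumsum() turns the label flags
# into an observation id for every token
is_label <- !grepl(":", tokens, fixed = TRUE)
obs <- cumsum(is_label)

# Label vector: "+1" -> 1, "-1" -> -1
y <- as.integer(tokens[is_label])

# Split each "feature:value" token into a two-column character matrix
pairs <- do.call(rbind, strsplit(tokens[!is_label], ":", fixed = TRUE))
feature <- as.integer(pairs[, 1])
value <- as.numeric(pairs[, 2])

# Dense matrix: one row per observation, one column per feature index
X <- matrix(0, nrow = length(y), ncol = max(feature))
X[cbind(obs[!is_label], feature)] <- value
```

A dense matrix is fine for a few dozen features like here, but it may get too large if the feature indices go very high.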
Upvotes: 0
Reputation: 59335
Another approach.
txt <- "+1 4:1 10:1 18:1 22:1 36:1 40:1 59:1 63:1 67:1 73:1 74:1 76:1 80:1 83:1 -1 3:1 6:1 17:1 29:1 39:1 40:1 52:1 63:1 67:1 73:1 74:1 76:1 82:1 83:1 -1 2:1 6:1 14:1 19:1 39:1 42:1 52:1 64:1 68:1 72:1 74:1 76:1 80:1 98:1"
txt <- gsub(" ([+-]1) ", "\n\\1 ", txt)
lines <- readLines(textConnection(txt))
parse.line <- function(line) {
lst <- strsplit(line, " ")[[1]]
mat <- do.call(rbind,lapply(lst[-1],function(x)strsplit(as.character(x),split=":")[[1]]))
data.frame(label=lst[1],mat)
}
result <- do.call(rbind, lapply(lines,parse.line))
So this takes your string (txt) and starts a new line at each +/-1 label, then reads the result using readLines(...). Then we parse each line using parse.line(...) into a matrix of feature/value pairs and a label (+/-1), and combine these into a data.frame. The last line binds the data.frames together row-wise.
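To make the intermediate steps visible, here is a sketch of the same idea on a shortened, made-up string. The function parse_line, the obs column and the column names feature and value are my own additions for illustration, and the pattern is just one way to write the split (it replaces the space in front of each label with a newline).

```r
# Hypothetical short input with two observations
txt <- "+1 4:1 10:1 -1 3:1 6:1"

# Replace the space in front of each label with a newline
txt <- gsub(" ([+-]1) ", "\n\\1 ", txt)

# Read the string back as one line per observation
con <- textConnection(txt)
lines <- readLines(con)
close(con)
lines

# Parse one line into label + feature/value pairs, keeping the
# observation number so rows can be grouped later
parse_line <- function(line, obs) {
  parts <- strsplit(line, " ", fixed = TRUE)[[1]]
  pairs <- do.call(rbind, strsplit(parts[-1], ":", fixed = TRUE))
  data.frame(obs = obs, label = parts[1],
             feature = as.integer(pairs[, 1]),
             value = as.numeric(pairs[, 2]))
}

result <- do.call(rbind, lapply(seq_along(lines),
                                function(i) parse_line(lines[i], i)))
result
```

Keeping obs in the result matters when two neighbouring observations share the same label, as the two -1 rows in your data do.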
This might be similar to the other answer but I'm not really sure.
Upvotes: 1