user46262

Reputation: 101

How to read this type of data in R

I am trying to import this text data into R:

+1 4:1 10:1 18:1 22:1 36:1 40:1 59:1 63:1 67:1 73:1 74:1 76:1 80:1 83:1 -1 3:1 6:1 17:1 29:1 39:1 40:1 52:1 63:1 67:1 73:1 74:1 76:1 82:1 83:1 -1 2:1 6:1 14:1 19:1 39:1 42:1 52:1 64:1 68:1 72:1 74:1 76:1 80:1 98:1

The format is:

<label> <feature>:<value> <feature>:<value>...

The data stores only the features that are non-zero. So the first observation has Y = +1, and the 4th, 10th, 18th, ..., 83rd features of X are 1.

I am trying to store the labels in one vector and the values in a matrix. Neither scan nor read.table seems to work here, so I need some help figuring out a way to do this.

Upvotes: 0

Views: 56

Answers (2)

Lorenzo Rossi

Reputation: 1481

Let's say you have your data in a single line txt file called test.txt. You can load it into R as a string with:

library(dplyr)
library(tidyr)
file = "~/Desktop/test.txt"
# readChar() keeps any trailing newline, so trim it before splitting
read_line = trimws(readChar(file, nchars = file.info(file)$size))

From what you posted, the data looks space-separated, so you can do:

space_separated = strsplit(read_line, " ", fixed = TRUE)[[1]]

The labels appear to be the tokens that do not contain the character ':', so you can find their positions with:

find_labels = which(!grepl("\\:", space_separated)) 

Now the tricky bit is to split the character vector into one chunk per label, which you can do like this:

all_res = lapply(seq_along(find_labels), function(i){
  # Create indexes that identify one label
  if(i == length(find_labels))
    label_subset = seq(find_labels[i], length(space_separated))
  else
    label_subset = seq(find_labels[i], find_labels[i + 1] - 1)
  # Name of the label
  the_label = space_separated[label_subset[1]]
  # Value
  the_subset = space_separated[label_subset[-1]]
  tibble(label = the_label, value = the_subset)
})

This will return a list of data frames that you can bind together with:

all_res = bind_rows(all_res) %>% 
  separate(value, c("feature", "value"), sep = ":")

So the output will be a data frame looking like this:

label | feature | value
  +1  |   4     |   1
  +1  |   10    |   1     
  +1  |   18    |   1
        .......
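
Since the question asks for a label vector and a matrix, here is a minimal sketch of how a long table like this could be reshaped. It assumes an extra obs column identifying the observation, which the code above does not produce; without it, the two -1 observations cannot be told apart. The small long data frame is hypothetical sample data, not output from the code above.

```r
# Hypothetical long-format input mirroring the table above, plus an
# `obs` column (an assumption: the code above does not create it).
long <- data.frame(
  obs     = c(1, 1, 1, 2, 2),
  label   = c("+1", "+1", "+1", "-1", "-1"),
  feature = c("4", "10", "18", "3", "6"),
  value   = c("1", "1", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# One label per observation (first row of each obs)
y <- as.integer(long$label[!duplicated(long$obs)])

# Dense matrix: rows are observations, columns are feature indices;
# every feature not listed stays 0
feat <- as.integer(long$feature)
X <- matrix(0, nrow = length(y), ncol = max(feat))
X[cbind(long$obs, feat)] <- as.numeric(long$value)
```

Indexing X with a two-column matrix sets one (row, column) cell per row of long, so no loop is needed.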

Upvotes: 0

jlhoward

Reputation: 59335

Another approach.

txt   <- "+1 4:1 10:1 18:1 22:1 36:1 40:1 59:1 63:1 67:1 73:1 74:1 76:1 80:1 83:1 -1 3:1 6:1 17:1 29:1 39:1 40:1 52:1 63:1 67:1 73:1 74:1 76:1 82:1 83:1 -1 2:1 6:1 14:1 19:1 39:1 42:1 52:1 64:1 68:1 72:1 74:1 76:1 80:1 98:1"
txt   <- gsub(" ([+-]1) ", "\n\\1 ", txt)  # start a new line at every label that follows a space
lines <- readLines(textConnection(txt))
parse.line <- function(line) {
  lst <- strsplit(line, " ")[[1]]
  mat <- do.call(rbind,lapply(lst[-1],function(x)strsplit(as.character(x),split=":")[[1]]))
  data.frame(label=lst[1],mat)
}
result <- do.call(rbind, lapply(lines,parse.line))

So this takes your string (txt) and inserts a newline before each label (+1 or -1), then reads the result using readLines(...). Then we parse each line using parse.line(...) into a label (+/-1) and a matrix of feature/value pairs, and combine these into a data.frame. The last line binds the data.frames together row-wise.

This might be similar to the other answer but I'm not really sure.
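
If the goal is the label vector and feature matrix from the question rather than a long data.frame, a loop-free variant could look like the sketch below. It is only an illustration under some assumptions: the sample string is shortened, labels are taken to be the tokens without a ':', and the names tokens, is_label, obs, feat and val are made up for this example.

```r
# Shortened sample in the same <label> <feature>:<value> ... layout
txt <- "+1 4:1 10:1 18:1 -1 3:1 6:1 17:1 -1 2:1 6:1 14:1"

tokens   <- strsplit(txt, " ", fixed = TRUE)[[1]]
is_label <- !grepl(":", tokens, fixed = TRUE)  # tokens without ':' are labels

# Label vector, one entry per observation
y <- as.integer(tokens[is_label])

# cumsum() over the label flags numbers the observations; keep that
# number only for the feature:value tokens
obs   <- cumsum(is_label)[!is_label]
pairs <- do.call(rbind, strsplit(tokens[!is_label], ":", fixed = TRUE))
feat  <- as.integer(pairs[, 1])
val   <- as.numeric(pairs[, 2])

# Dense matrix; features that are not listed stay 0
X <- matrix(0, nrow = length(y), ncol = max(feat))
X[cbind(obs, feat)] <- val
```

The (obs, feat, val) triplets are also the form sparse-matrix constructors usually expect, which may matter if the real data has many features.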

Upvotes: 1
