Omar Reyes

Reputation: 13

R Split string data delimited by spaces into columns

I have a large data frame with a single column. Each row holds several numeric values separated by spaces, and I need to extract them and organize them into columns. A sample row:

<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 36621.09375 36621.09375 36621.09375 36621.09375 Intensity=-14.902734633213136 Periodicity=0.853448275862069 Shape=- CallType=cf-n Species=Pipistrellus kuhlii (77%), Pipistrellus nathusii (77%) Custom=false /> 

Here is more information about my data:

'data.frame':	39 obs. of  1 variable:
 $ x1: Factor w/ 120 levels "<double>25.318181818181806</double>",..: 66 67 68 69 70 71 72 73 74 75 ...

I need something like this:

           call_begin            call_end         maxfrec         minfrec        peakfrec
1 0.59170816000000048 0.60006400000000049 531.005.859.375 433.349.609.375 482.177.734.375
2  0.7636582400000006 0.77135872000000061 531.005.859.375  42.724.609.375 469.970.703.125

I have some ideas for how to achieve this: first split the string into columns with strsplit, then use substr to extract the numbers, and finally rbind the pieces into a table. I found some threads on related topics, but I could not replicate them with my data.

I'd appreciate any help. Please let me know if anything is unclear.

Upvotes: 1

Views: 1784

Answers (3)

Rohit Das

Reputation: 2042

It all depends on how strictly your data follows the pattern. For the data you have given, you can split on " " and "=" in one go and then pick out the relevant elements by position.

result <- do.call(rbind,lapply(strList,function(s) {strsplit(s,split = "[ =]")[[1]][c(3,5,7,9,11)]}))

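The result is a character matrix, so you can then name the columns whatever you want using the colnames() function, and convert the values with as.numeric() if you need numbers.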
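To make the one-liner above easier to try, here is a minimal end-to-end sketch. It assumes strList is a character vector holding lines in the short five-attribute form used elsewhere in this thread; the column names are only illustrative, and the fixed positions would not hold for lines that carry extra attributes.

```r
# Two sample lines copied from the other answers in this thread (assumed input).
strList <- c(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# Split each line on spaces and "=", then keep the value tokens:
# token 1 is "<Call", token 2 is "Begin", token 3 is its value, and so on.
result <- do.call(rbind, lapply(strList, function(s) {
  strsplit(s, split = "[ =]")[[1]][c(3, 5, 7, 9, 11)]
}))

# strsplit() returns character data; switch the storage mode to double
# so the matrix keeps its dimensions while becoming numeric.
storage.mode(result) <- "double"

# Illustrative column names, then convert to a data frame.
colnames(result) <- c("call_begin", "call_end", "maxfreq", "minfreq", "peakfreq")
result <- as.data.frame(result)
print(result)
```

Because the values are picked by position, it is worth checking a few rows against the raw text before trusting the whole table.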

Upvotes: 0

romants

Reputation: 3648

This is similar to the solution you described, but a bit more generic: it does not depend on the number of columns.

text <- '<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375
<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125'

process_line <- function(line) {
    sp <- strsplit(line, ' ')[[1]][-1]
    cn <- sapply(sp, function(x) strsplit(x, "=")[[1]][1])
    data <- sapply(sp, function(x) as.numeric(strsplit(x, "=")[[1]][2]))
    names(data) <- cn
    data
}

t(sapply(strsplit(text, "\n")[[1]], process_line, USE.NAMES = FALSE))
         Begin       End  MaxFreq  MinFreq PeakFreq
[1,] 0.5917082 0.6000640 53100.59 43334.96 48217.77
[2,] 0.7636582 0.7713587 53100.59 42724.61 46997.07

This assumes that text is a single string that has not yet been split into lines; if you already have a vector of lines, replace strsplit(text, "\n")[[1]] with text. There is no need for a regex, since each value can be obtained by splitting the space-separated chunks on =.
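The sample row in the question also contains a PeakFreqs attribute with many space-separated values, so splitting on every space would break such lines apart. As a hedged alternative, the sketch below looks each attribute up by name instead. get_attr is a hypothetical helper written for this illustration, and the first sample line is the question's row with the PeakFreqs list shortened to three values.

```r
# Assumed input: one line in the question's long form (PeakFreqs shortened
# for readability) and one line in the short form used in this answer.
call_lines <- c(
  "<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39062.5 38452.1484375 Intensity=-14.902734633213136 />",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# Hypothetical helper: return the number that directly follows " name=" in
# each line, or NA when the attribute is missing. The leading space and the
# trailing "=" keep "PeakFreq" from matching "PeakFreqs".
get_attr <- function(lines, name) {
  pattern <- paste0(" ", name, "=(-?[0-9.]+)")
  m <- regmatches(lines, regexec(pattern, lines))
  value <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
                  character(1))
  as.numeric(value)
}

# Build one numeric column per attribute of interest.
wanted <- c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
calls <- as.data.frame(lapply(setNames(wanted, wanted),
                              function(n) get_attr(call_lines, n)))
print(calls)
```

Looking attributes up by name means extra or reordered attributes do not shift the columns, at the cost of one regex pass per attribute.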

Upvotes: 1

adamzjw

Reputation: 79

gsub is my favorite.

strList = list("<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375", "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125")

dataExtract <- function(str){
  str = gsub("^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+)", "\\1 \\2 \\3 \\4 \\5", str)

  str = unlist(strsplit(str, " "))

  return(sapply(str, FUN=as.numeric, USE.NAMES=F))
}

#dataExtract(strList[[1]])

# unlist() yields the five values of line 1, then line 2, so fill row by row
res = matrix(unlist(lapply(strList, FUN=dataExtract)), ncol=5, byrow=TRUE)
colnames(res) = c("Call Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
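A variation on the same idea, offered only as a sketch: since sub() is vectorised, the pattern can be applied to all lines at once, and read.table(text = ...) can then parse the space-separated result into numeric columns. The trailing .*$ is an addition here so that anything after PeakFreq is dropped; the sample lines are the ones used above.

```r
# Assumed input: the two sample lines from this answer, as a character vector.
strList <- c(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# Capture the five values and discard everything else on the line.
pattern <- "^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+).*$"
cleaned <- sub(pattern, "\\1 \\2 \\3 \\4 \\5", strList)

# Each cleaned element is read as one row of whitespace-separated numbers.
res <- read.table(text = cleaned,
                  col.names = c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq"))
print(res)
```

Lines that do not match the pattern are returned unchanged by sub(), so read.table() would fail or misparse on them; filtering with grepl(pattern, strList) first is one way to guard against that.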

Upvotes: 0
