Reputation: 13
I have a large data frame with one column containing different numeric values separated by spaces. I need to extract these values and organize them in columns. A sample row:
<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39672.8515625 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 39062.5 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 38452.1484375 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37841.796875 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 37231.4453125 36621.09375 36621.09375 36621.09375 36621.09375 Intensity=-14.902734633213136 Periodicity=0.853448275862069 Shape=- CallType=cf-n Species=Pipistrellus kuhlii (77%), Pipistrellus nathusii (77%) Custom=false />
Here is more information about my data:
'data.frame': 39 obs. of 1 variable:
 $ x1: Factor w/ 120 levels "<double>25.318181818181806</double>",..: 66 67 68 69 70 71 72 73 74 75 ...
I need something like this:
  call_begin          call_end            maxfrec       minfrec       peakfrec
1 0.59170816000000048 0.60006400000000049 53100.5859375 43334.9609375 48217.7734375
2 0.7636582400000006  0.77135872000000061 53100.5859375 42724.609375  46997.0703125
I have some ideas for how to achieve this: first split the string into columns using strsplit, then use the substr function to extract the numbers, and finally use rbind to build a table. I found some threads on related topics, but I could not replicate their solutions on my data.
I'd appreciate any help. Please let me know if anything is unclear.
Upvotes: 1
Views: 1784
Reputation: 2042
It all depends on how strictly your data follows the pattern. For the data you have given, you can split on " " and "=" in one go and then pick out the relevant elements by position.
result <- do.call(rbind,lapply(strList,function(s) {strsplit(s,split = "[ =]")[[1]][c(3,5,7,9,11)]}))
You can then name the columns whatever you want using the colnames() function (the result is a matrix, so names() would not set column names).
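A minimal runnable sketch of this approach, assuming the input is a character vector called strList (the sample strings here are illustrative, and the column names are the ones from the question). Since strsplit returns text, the values are converted with as.numeric before the columns are labelled:

```r
# Illustrative input; in the question the strings sit in a factor column,
# so as.character() is applied before splitting.
strList <- c(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

# Split each string on spaces and "=", keeping the positions that hold values.
parts <- do.call(rbind, lapply(as.character(strList), function(s) {
  strsplit(s, split = "[ =]")[[1]][c(3, 5, 7, 9, 11)]
}))

# Convert the character matrix to numeric, then label the columns.
result <- matrix(as.numeric(parts), nrow = nrow(parts))
colnames(result) <- c("call_begin", "call_end", "maxfrec", "minfrec", "peakfrec")
result <- as.data.frame(result)
```

This relies on the five fields always appearing in the same order at the start of each string.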
Upvotes: 0
Reputation: 3648
This is similar to what you described, but a bit more generic, and it doesn't depend on the number of columns:
text <- '<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375
<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125'
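process_line <- function(line) {
  # Split on spaces and drop the leading "<Call" token
  sp <- strsplit(line, ' ')[[1]][-1]
  # Part before "=" is the column name, part after it is the value
  cn <- sapply(sp, function(x) strsplit(x, "=")[[1]][1])
  data <- sapply(sp, function(x) as.numeric(strsplit(x, "=")[[1]][2]))
  names(data) <- cn
  data
}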
t(sapply(strsplit(text, "\n")[[1]], process_line, USE.NAMES = FALSE))
Begin End MaxFreq MinFreq PeakFreq
[1,] 0.5917082 0.6000640 53100.59 43334.96 48217.77
[2,] 0.7636582 0.7713587 53100.59 42724.61 46997.07
This assumes that text is a single string that has not yet been split into lines. If it is already a vector of lines, replace strsplit(text, "\n")[[1]] with text.
There is no need to use regex, since the data can be obtained by splitting each chunk on =.
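Note that the full line in the question also contains fields that are not plain key=value pairs (PeakFreqs holds many space-separated values, and Species contains spaces), so splitting every token on = would probably not give a rectangular result there. A hedged alternative is to look up only the wanted keys by name. In this sketch, extract_field is a hypothetical helper and the two sample lines are shortened, illustrative versions of the question's data:

```r
# Shortened, illustrative lines with extra fields after PeakFreq.
lines <- c(
  "<Call Begin=6.0982886400000051 End=6.1078732800000051 MaxFreq=40893.5546875 MinFreq=35400.390625 PeakFreq=39672.8515625 PeakFreqs=39672.8515625 39062.5 Intensity=-14.902734633213136 Shape=- CallType=cf-n Custom=false />",
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375 PeakFreqs=48217.7734375 Custom=false />"
)

# Hypothetical helper: return the number that follows " key=" in each line,
# or NA when the key is absent.
extract_field <- function(lines, key) {
  pattern <- paste0("^.* ", key, "=([-0-9.]+).*$")
  value <- sub(pattern, "\\1", lines)
  value[!grepl(pattern, lines)] <- NA
  as.numeric(value)
}

keys <- c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
# One column per key; setNames() makes lapply() return a named list.
calls <- as.data.frame(lapply(setNames(keys, keys), function(k) extract_field(lines, k)))
```

Because each pattern requires "=" directly after the key, PeakFreq does not pick up the PeakFreqs field. For a factor column such as x1, as.character(df$x1) would be passed as lines.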
Upvotes: 1
Reputation: 79
gsub is my favorite.
strList = list("<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375", "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125")
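dataExtract <- function(str){
  # Keep only the five captured numbers; the trailing .* drops anything after PeakFreq
  str = gsub("^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+).*", "\\1 \\2 \\3 \\4 \\5", str)
  str = unlist(strsplit(str, " "))
  return(sapply(str, FUN=as.numeric, USE.NAMES=F))
}
#dataExtract(strList[[1]])
# Iterate over strList (not str) and fill by row, so that each call becomes one row
res = matrix(unlist(lapply(strList, FUN=dataExtract)), ncol=5, byrow=TRUE)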
colnames(res) = c("Call Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
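As a variation on the same pattern, the capture groups can be read directly with base R's regexec and regmatches, which avoids rebuilding and re-splitting a string. This is only a sketch, assuming every line matches the pattern; the names lines, matches and res2 are illustrative:

```r
lines <- c(
  "<Call Begin=0.59170816000000048 End=0.60006400000000049 MaxFreq=53100.5859375 MinFreq=43334.9609375 PeakFreq=48217.7734375",
  "<Call Begin=0.7636582400000006 End=0.77135872000000061 MaxFreq=53100.5859375 MinFreq=42724.609375 PeakFreq=46997.0703125"
)

pattern <- "^<Call Begin=([0-9.]+) End=([0-9.]+) MaxFreq=([0-9.]+) MinFreq=([0-9.]+) PeakFreq=([0-9.]+)"

# Each list element holds the full match followed by the five captured groups.
matches <- regmatches(lines, regexec(pattern, lines))

# Drop the full match, convert the groups to numbers, and stack one row per line.
res2 <- do.call(rbind, lapply(matches, function(m) as.numeric(m[-1])))
colnames(res2) <- c("Begin", "End", "MaxFreq", "MinFreq", "PeakFreq")
```

Stacking with rbind keeps each call on its own row, so there is no fill order to get wrong.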
Upvotes: 0