user2280549

Reputation: 1234

R - extract all strings matching pattern and create relational table

I am looking for a shorter and more elegant solution (possibly in the tidyverse) to the following problem. I have a data.frame "data":

  id            string
1  A 1.001 xxx 123.123
2  B 23,45 lorem ipsum
3  C      donald trump
4  D    ssss 134, 1,45

What I want to do is extract all numbers (no matter whether the decimal delimiter is "." or ","; in this case I assume that the string "134, 1,45" can be extracted into two numbers: 134 and 1.45) and create a data.frame "output" looking similar to this:

  id  string
1  A   1.001
2  A 123.123
3  B   23.45
4  C    <NA>
5  D     134
6  D    1.45

I managed to do this (code below), but the solution seems pretty ugly to me and not very efficient (two for-loops). Could someone suggest a better way to do this (preferably using dplyr)?

# data
data <- data.frame(id = c("A", "B", "C", "D"), 
                  string = c("1.001 xxx 123.123", 
                             "23,45 lorem ipsum", 
                             "donald trump", 
                             "ssss 134, 1,45"),
                  stringsAsFactors = FALSE)

# creating empty data.frame                     
len <- length(unlist(sapply(data$string, function(x) gregexpr("[0-9]+[,|.]?[0-9]*", x))))
output <- data.frame(id = rep(NA, len), string = rep(NA, len))

# main solution
start = 0

for(i in 1:dim(data)[1]){
  tmp_len <- length(unlist(gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i])))
  for(j in (start+1):(start+tmp_len)){
    output[j,1] <- data$id[i]
    output[j,2] <- regmatches(data$string[i], gregexpr("[0-9]+[,|.]?[0-9]*", data$string[i]))[[1]][j-start]
  }
  start = start + tmp_len
}

# further modifications
output$string <- gsub(",", ".", output$string)
output$string <- as.numeric(ifelse(substring(output$string, nchar(output$string), nchar(output$string)) == ".",
                                   substring(output$string, 1, nchar(output$string) - 1),
                                   output$string))

output

Upvotes: 2

Views: 2113

Answers (3)

G. Grothendieck

Reputation: 269526

1) Base R This uses relatively simple regular expressions and no packages.

In the first 2 lines of code, replace any comma followed by a space with a space, and then replace all remaining commas with a dot. After these two lines s will be: c("1.001 xxx 123.123", "23.45 lorem ipsum", "donald trump", "ssss 134 1.45")

In the next 4 lines of code, trim whitespace from the beginning and end of each string field and split the string field on whitespace, producing a list. Then grep out those elements consisting only of digits and dots. (The regular expression ^[0-9.]*$ matches the start of a word followed by zero or more digits or dots followed by the end of the word, so only words containing just those characters are matched.) Replace any zero-length components with NA. Finally, add data$id as the names. After these 4 lines are run the list L will be list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45")).

In the last line of code, convert the list L to a data frame with the appropriate names (a short illustration of the stack() step follows the output below).

s <- gsub(", ", " ", data$string)
s <- gsub(",", ".", s)

L <- strsplit(trimws(s), "\\s+")
L <- lapply(L, grep, pattern = "^[0-9.]*$", value = TRUE)
L <- ifelse(lengths(L), L, NA)
names(L) <- data$id

with(stack(L), data.frame(id = ind, string = values))

giving:

  id  string
1  A   1.001
2  A 123.123
3  B   23.45
4  C    <NA>
5  D     134
6  D    1.45
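For reference, stack() is the step that reshapes the named list into a two-column data frame (values and ind), which the last line then renames to string and id. A minimal illustration using the list L shown above:

L <- list(A = c("1.001", "123.123"), B = "23.45", C = NA, D = c("134", "1.45"))
stack(L)
#    values ind
# 1   1.001   A
# 2 123.123   A
# 3   23.45   B
# 4    <NA>   C
# 5     134   D
# 6    1.45   D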

2) magrittr This variation of (1) writes it as a magrittr pipeline.

library(magrittr)

data %>%
     transform(string = gsub(", ", " ", string)) %>%
     transform(string = gsub(",", ".", string)) %>%
     transform(string = trimws(string)) %>%
     with(setNames(strsplit(string, "\\s+"), id)) %>%
     lapply(grep, pattern = "^[0-9.]*$", value = TRUE) %>%
     replace(lengths(.) == 0, NA) %>%
     stack() %>%
     with(data.frame(id = ind, string = values))

3) dplyr/tidyr This is an alternate pipeline solution using dplyr and tidyr. unnest converts to long form; id is made a factor so that we can later use complete to recover ids that are removed by the subsequent filtering; the filter removes junk rows; and complete inserts NA rows for each id that would otherwise not appear (a small illustration of the complete step follows the code below).

library(dplyr)
library(tidyr)

data %>%
  mutate(string = gsub(", ", " ", string)) %>%
  mutate(string = gsub(",", ".", string)) %>%
  mutate(string = trimws(string)) %>%
  mutate(string = strsplit(string, "\\s+")) %>%
  unnest() %>%
  mutate(id = factor(id)) %>%
  filter(grepl("^[0-9.]*$", string)) %>%
  complete(id)
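As a small, self-contained sketch of the complete() step (the toy tibble below is hypothetical, not part of the question's data): because id is a factor, complete() expands over all of its levels, so a level whose rows were all filtered out comes back as a single NA row.

library(dplyr)
library(tidyr)

# hypothetical example: level "C" exists but has no rows left after filtering
toy <- tibble(id = factor(c("A", "A", "B"), levels = c("A", "B", "C")),
              string = c("1.001", "123.123", "23.45"))

complete(toy, id)
# id "C" is restored as a row with string = NA, since complete()
# uses the full set of factor levels, not only those present in the data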

4) data.table

library(data.table)

DT <- as.data.table(data)
DT[, string := gsub(", ", " ", string)][, 
     string := gsub(",", ".", string)][,
     string := trimws(string)][,
     string := setNames(strsplit(string, "\\s+"), id)][,
     list(string = list(grep("^[0-9.]*$", unlist(string), value = TRUE))), by = id][,
     list(string = if (length(unlist(string))) unlist(string) else NA_character_), by = id]
DT

Update: Removed the assumption that junk words do not contain digits or dots. Also added (2), (3) and (4) and some improvements.

Upvotes: 6

Frank

Reputation: 66819

Same idea as Gabor's. I had hoped to use R's built-in parsing of strings (type.convert, used in read.table) rather than writing custom regex substitutions:

sp = setNames(strsplit(data$string, " "), data$id)

spc = lapply(sp, function(x) {
  x = x[grep("[^0-9.,]$", x, invert=TRUE)]

  if (!length(x))
    NA_real_
  else 
    mapply(type.convert, x, dec=gsub("[^.,]", "", x), USE.NAMES=FALSE)
})

setNames(rev(stack(spc)), names(data))

  id  string
1  A   1.001
2  A 123.123
3  B   23.45
4  C    <NA>
5  D     134
6  D    1.45

Unfortunately, type.convert is not robust enough to consider both decimal delimiters at once, so we need this mapply malarkey instead of type.convert(x, dec = "[.,]").
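As a small illustration (as.is = TRUE is added here only so the result stays atomic rather than becoming a factor): dec takes a single decimal-mark character, which is why each element gets its own call via mapply above.

type.convert("23,45", dec = ",", as.is = TRUE)    # 23.45
type.convert("123.123", dec = ".", as.is = TRUE)  # 123.123
# a character class like "[.,]" cannot serve as the decimal mark,
# so both separators cannot be handled in a single call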

Upvotes: 1

akrun

Reputation: 887088

We can replace the , between the numbers with . (using gsub), extract the numbers with str_extract_all (from stringr) into a list, replace the list elements that have length 0 with NA, set the names of the list with the 'id' column, stack to convert the list to a data.frame, and rename the columns (the intermediate result of the gsub step is shown after the output below).

library(stringr)
setNames(stack(setNames(lapply(str_extract_all(gsub("(?<=[0-9]),(?=[0-9])", ".", 
      data$string, perl = TRUE), "[0-9.]+"), function(x) 
     if(length(x)==0) NA else as.numeric(x)), data$id))[2:1], c("id", "string"))
#  id  string
#1  A   1.001
#2  A 123.123
#3  B   23.45
#4  C      NA
#5  D     134
#6  D    1.45
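For clarity, the intermediate result of the lookaround gsub alone: only commas with a digit on both sides become dots, so the comma after 134 (which is followed by a space) is left untouched.

gsub("(?<=[0-9]),(?=[0-9])", ".", data$string, perl = TRUE)
# [1] "1.001 xxx 123.123" "23.45 lorem ipsum" "donald trump"      "ssss 134, 1.45"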

Upvotes: 2
