warship
warship

Reputation: 3024

R regex expressions in tidyr and dplyr?

I have a file composed of thousands of lines of this type:

1   number  entry   size1   size2   value   size5   value2  my_id1k "AJKJjsdfe76r55"; my_label “1900”; my_idk2 "49354ytu866"; you_digit "some"; my_copy “jkl”;
1   number  entry   size3   size4   value   size6   value2  my_id1k "xyz804"; my_id2k “FI71"; my_id3k “Sk9000”; my_id4k “ldv”;

I would like to find a way to extract what's enclosed within the my_id1k and my_id2k entries (without the double quotes), as well as extract some of the other columns (my code is provided below).

I’d like to use the separate() and select() functions in the tidyr and dplyr packages for this purpose because they're very fast (and I'm concerned about performance), so I’ve been studying: http://rpackages.ianhowson.com/cran/tidyr/man/separate.html

However, I’m not sure how to specify the into and sep options in this kind of heterogeneous case (where my last column has varying length) in order to get my desired output. I clearly have some lines that have more info than others, so I’m wondering how I could write some high-performance tidyr and dplyr code that extracts the desired entries as quickly as possible.

Here's my work so far:

> library(dplyr)
> library(tidyr)
> library(data.table)
> x <- fread("myfile_MWE.txt")
> x
   V1     V2    V3    V4    V5    V6    V7     V8                                                                                                 V9
1:  1 number entry size1 size2 value size5 value2 my_id1k "AJKJjsdfe76r55"; my_label “1900”; my_idk2 "49354ytu866"; you_digit "some"; my_copy “jkl”;
2:  1 number entry size3 size4 value size6 value2                                 my_id1k "xyz804"; my_id2k “FI71"; my_id3k “Sk9000”; my_id4k “ldv”;
> y <- separate(x, V9, into = paste("V", 1:15, sep = "_"))
> y
   V1     V2    V3    V4    V5    V6    V7     V8 V_1  V_2            V_3 V_4   V_5  V_6 V_7  V_8         V_9 V_10  V_11 V_12 V_13 V_14 V_15
1:  1 number entry size1 size2 value size5 value2  my id1k AJKJjsdfe76r55  my label 1900  my idk2 49354ytu866  you digit some   my copy  jkl
2:  1 number entry size3 size4 value size6 value2  my id1k         xyz804  my  id2k FI71  my id3k      Sk9000   my  id4k  ldv        NA   NA

Clearly, due to the differing length of the last column (V9), some entries show up as NA, and I can't manage to successfully extract what's enclosed within the my_id1k and my_id2k entries:

> a <- select(y, V1, V7, V_3, V_9)
> a
   V1    V7            V_3         V_9
1:  1 size5 AJKJjsdfe76r55 49354ytu866
2:  1 size6         xyz804      Sk9000
> b <- select(y, V1, V7, V_3, V_6)
> b
   V1    V7            V_3  V_6
1:  1 size5 AJKJjsdfe76r55 1900
2:  1 size6         xyz804 FI71

So clearly, in one case I need V_9 and in the other case I need V_6. My desired output would be:

1 size5 AJKJjsdfe76r55 49354ytu866
1 size6         xyz804 FI71

Is there anyway that I can specify the usage of V_9 and V_6 in a conditional manner so that the my code is smart enough to recognize that I want to pull down what's enclosed within the my_id1k and my_id2k entries, e.g., via regular expressions?

Upvotes: 1

Views: 101

Answers (2)

alistaire
alistaire

Reputation: 43334

tidyr::extract is a better option than separate or spread, as there's a lot of gunk you don't care about.

extract(df, V9, c('my_id1k', 'my_id2k'), 'my_id1k .(\\S+).;.*my_id(?:2k|k2) .(\\S+).;')
#   V1     V2    V3    V4    V5    V6    V7     V8        my_id1k     my_id2k
# 1  1 number entry size1 size2 value size5 value2 AJKJjsdfe76r55 49354ytu866
# 2  1 number entry size3 size4 value size6 value2         xyz804        FI71

Note this assumes my_id2k and my_idk2 are the same, as you seem to assume as much in the question; my_id1k doesn't vary, so the regex doesn't either. It also assumes my_id1k comes before my_id2k. Be aware of the possibilities when extending it to new data, and adjust the regex accordingly.


Data:

df <- structure(list(V1 = c(1L, 1L), V2 = structure(c(1L, 1L), .Label = "number", class = "factor"), 
    V3 = structure(c(1L, 1L), .Label = "entry", class = "factor"), 
    V4 = structure(1:2, .Label = c("size1", "size3"), class = "factor"), 
    V5 = structure(1:2, .Label = c("size2", "size4"), class = "factor"), 
    V6 = structure(c(1L, 1L), .Label = "value", class = "factor"), 
    V7 = structure(1:2, .Label = c("size5", "size6"), class = "factor"), 
    V8 = structure(c(1L, 1L), .Label = "value2", class = "factor"), 
    V9 = c("my_id1k \"AJKJjsdfe76r55\"; my_label “1900”; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy “jkl”;", 
    "my_id1k \"xyz804\"; my_id2k “FI71\"; my_id3k “Sk9000”; my_id4k “ldv”;"
    )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", 
"V8", "V9"), row.names = c(NA, -2L), class = "data.frame")

Upvotes: 1

bramtayl
bramtayl

Reputation: 4024

Here's the data I used:

data = structure(list(V1 = c(1L, 1L), V2 = c("number", "number"), V3 = c("entry", 
"entry"), V4 = c("size1", "size3"), V5 = c("size2", "size4"), 
    V6 = c("value", "value"), V7 = c("size5", "size6"), V8 = c("value2", 
    "value2"), V9 = c("my_id1k \"AJKJjsdfe76r55\"; my_label “1900”; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy “jkl”;", 
    "my_id1k \"xyz804\"; my_id2k “FI71\"; my_id3k “Sk9000”; my_id4k “ldv”;"
    )), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", 
"V8", "V9"), class = "data.frame", row.names = c(NA, -2L))

And here's the code

library(dplyr)
library(stringi)
library(tidyr)

result = 
  data %>%
  group_by(V9) %>%
  do(.$V9 %>%
       first %>%
       stri_replace_all_fixed("; ", "\n") %>%
       read.table(text = ., stringsAsFactors = FALSE) ) %>%
  spread(V1, V2) %>%
  left_join(data)

Upvotes: 2

Related Questions