Reputation: 3024
I have a file composed of thousands of lines of this type:
1 number entry size1 size2 value size5 value2 my_id1k "AJKJjsdfe76r55"; my_label “1900”; my_idk2 "49354ytu866"; you_digit "some"; my_copy “jkl”;
1 number entry size3 size4 value size6 value2 my_id1k "xyz804"; my_id2k “FI71"; my_id3k “Sk9000”; my_id4k “ldv”;
I would like to find a way to extract what's enclosed within the my_id1k
and my_id2k
entries (without the double quotes), as well as extract some of the other columns (my code is provided below).
I’d like to use the separate()
and select()
functions in the tidyr
and dplyr
packages for this purpose because they're very fast (and I'm concerned about performance), so I’ve been studying: http://rpackages.ianhowson.com/cran/tidyr/man/separate.html
However, I’m not sure how to specify the into
and sep
options in this kind of heterogeneous case (where my last column has varying length) in order to get my desired output. I clearly have some lines that have more info than others, so I’m wondering how I could write some high-performance tidyr
and dplyr
code that extracts the desired entries as quickly as possible.
Here's my work so far:
> library(dplyr)
> library(tidyr)
> library(data.table)
> x <- fread("myfile_MWE.txt")
> x
V1 V2 V3 V4 V5 V6 V7 V8 V9
1: 1 number entry size1 size2 value size5 value2 my_id1k "AJKJjsdfe76r55"; my_label “1900”; my_idk2 "49354ytu866"; you_digit "some"; my_copy “jkl”;
2: 1 number entry size3 size4 value size6 value2 my_id1k "xyz804"; my_id2k “FI71"; my_id3k “Sk9000”; my_id4k “ldv”;
> y <- separate(x, V9, into = paste("V", 1:15, sep = "_"))
> y
V1 V2 V3 V4 V5 V6 V7 V8 V_1 V_2 V_3 V_4 V_5 V_6 V_7 V_8 V_9 V_10 V_11 V_12 V_13 V_14 V_15
1: 1 number entry size1 size2 value size5 value2 my id1k AJKJjsdfe76r55 my label 1900 my idk2 49354ytu866 you digit some my copy jkl
2: 1 number entry size3 size4 value size6 value2 my id1k xyz804 my id2k FI71 my id3k Sk9000 my id4k ldv NA NA
Clearly, due to the differing length of the last column (V9
), some entries show up as NA
, and I can't manage to successfully extract what's enclosed within the my_id1k
and my_id2k
entries:
> a <- select(y, V1, V7, V_3, V_9)
> a
V1 V7 V_3 V_9
1: 1 size5 AJKJjsdfe76r55 49354ytu866
2: 1 size6 xyz804 Sk9000
> b <- select(y, V1, V7, V_3, V_6)
> b
V1 V7 V_3 V_6
1: 1 size5 AJKJjsdfe76r55 1900
2: 1 size6 xyz804 FI71
So clearly, in one case I need V_9
and in the other case I need V_6
. My desired output would be:
1 size5 AJKJjsdfe76r55 49354ytu866
1 size6 xyz804 FI71
Is there anyway that I can specify the usage of V_9
and V_6
in a conditional manner so that the my code is smart enough to recognize that I want to pull down what's enclosed within the my_id1k
and my_id2k
entries, e.g., via regular expressions?
Upvotes: 1
Views: 101
Reputation: 43334
tidyr::extract
is a better option than separate
or spread
, as there's a lot of gunk you don't care about.
extract(df, V9, c('my_id1k', 'my_id2k'), 'my_id1k .(\\S+).;.*my_id(?:2k|k2) .(\\S+).;')
# V1 V2 V3 V4 V5 V6 V7 V8 my_id1k my_id2k
# 1 1 number entry size1 size2 value size5 value2 AJKJjsdfe76r55 49354ytu866
# 2 1 number entry size3 size4 value size6 value2 xyz804 FI71
Note this assumes my_id2k
and my_idk2
are the same, as you seem to assume as much in the question; my_id1k
doesn't vary, so the regex doesn't either. It also assumes my_id1k
comes before my_id2k
. Be aware of the possibilities when extending it to new data, and adjust the regex accordingly.
Data:
df <- structure(list(V1 = c(1L, 1L), V2 = structure(c(1L, 1L), .Label = "number", class = "factor"),
V3 = structure(c(1L, 1L), .Label = "entry", class = "factor"),
V4 = structure(1:2, .Label = c("size1", "size3"), class = "factor"),
V5 = structure(1:2, .Label = c("size2", "size4"), class = "factor"),
V6 = structure(c(1L, 1L), .Label = "value", class = "factor"),
V7 = structure(1:2, .Label = c("size5", "size6"), class = "factor"),
V8 = structure(c(1L, 1L), .Label = "value2", class = "factor"),
V9 = c("my_id1k \"AJKJjsdfe76r55\"; my_label “1900”; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy “jkl”;",
"my_id1k \"xyz804\"; my_id2k “FI71\"; my_id3k “Sk9000”; my_id4k “ldv”;"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "V9"), row.names = c(NA, -2L), class = "data.frame")
Upvotes: 1
Reputation: 4024
Here's the data I used:
data = structure(list(V1 = c(1L, 1L), V2 = c("number", "number"), V3 = c("entry",
"entry"), V4 = c("size1", "size3"), V5 = c("size2", "size4"),
V6 = c("value", "value"), V7 = c("size5", "size6"), V8 = c("value2",
"value2"), V9 = c("my_id1k \"AJKJjsdfe76r55\"; my_label “1900”; my_idk2 \"49354ytu866\"; you_digit \"some\"; my_copy “jkl”;",
"my_id1k \"xyz804\"; my_id2k “FI71\"; my_id3k “Sk9000”; my_id4k “ldv”;"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7",
"V8", "V9"), class = "data.frame", row.names = c(NA, -2L))
And here's the code
library(dplyr)
library(stringi)
library(tidyr)
result =
data %>%
group_by(V9) %>%
do(.$V9 %>%
first %>%
stri_replace_all_fixed("; ", "\n") %>%
read.table(text = ., stringsAsFactors = FALSE) ) %>%
spread(V1, V2) %>%
left_join(data)
Upvotes: 2