Reputation: 77
I need to parse a file in R which looks like this:
Acc1 "product"="A","product"="B","product"="C"
Acc2 "product"="C","product"="D"
The above is a txt file, and there is a tab between Acc1 and "product".
The output should look like:
Column1 Column2
Acc1 A
Acc1 B
Acc1 C
Acc2 C
Acc2 D
Can someone help please?
Upvotes: 0
Views: 100
Reputation: 161145
I'm going to suggest you look at a tidyverse solution for this. It can certainly be handled with base-R and data.table (as others might suggest in comments or answers), but this is a good start; a rough base-R sketch is also included after the output below for comparison.
First, faking the data.
txt <- readLines(textConnection('Acc1 "product"="A A","product"="B","product"="C"
Acc2 "product"="C","product"="D"'))
In your case, you'd probably just do readLines(filename).
This next block splits the "Acc" stuff from the rest.
txtsplit <- strsplit(gsub("^(\\S+)\\s+", "\\1|", txt), "\\|")
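Each element of txtsplit is now a two-element character vector (the account ID, then the rest of the line), which is what the sapply calls below pull apart; roughly:

txtsplit[[1]]
# [1] "Acc1"
# [2] "\"product\"=\"A A\",\"product\"=\"B\",\"product\"=\"C\""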
And finally, the rest of the processing.
library(dplyr)
library(tidyr)
data_frame(                                # tibble() in current dplyr
  Col1 = sapply(txtsplit, `[[`, 1),
  Col2 = sapply(txtsplit, `[[`, 2)
) %>%
  mutate(
    Col2 = gsub('"product"=', '', Col2),   # drop the repeated key
    Col2 = strsplit(Col2, ",")             # one list element per value
  ) %>%
  unnest() %>%                             # newer tidyr prefers unnest(Col2)
  mutate(
    Col2 = gsub('"', '', Col2)             # strip the remaining quotes
  )
# # A tibble: 5 x 2
#   Col1  Col2
#   <chr> <chr>
# 1 Acc1  A A
# 2 Acc1  B
# 3 Acc1  C
# 4 Acc2  C
# 5 Acc2  D
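As mentioned at the top, base-R can do this too. A rough sketch reusing txtsplit from above (assuming the quoted values never contain commas):

acc  <- sapply(txtsplit, `[[`, 1)
vals <- lapply(txtsplit, function(x) {
  # split the value list on commas, then drop the key and the quotes
  gsub('"product"=|"', "", strsplit(x[2], ",")[[1]])
})
data.frame(
  Column1 = rep(acc, lengths(vals)),
  Column2 = unlist(vals),
  stringsAsFactors = FALSE
)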
There are several good tutorials on using dplyr and tidyr; a quick search will find better/newer ones than I can post here.
BTW: I separated removal of the quotes into a separate mutate, but it could easily have been handled in the initial gsub. I chose to keep it separate in case you had more than just single letters inside the quotes, where removing them too early might cause parsing problems later; a one-pass version is sketched below.
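If you did want it in one pass, something like this would work (assuming quotes never appear inside the values themselves):

data_frame(
  Col1 = sapply(txtsplit, `[[`, 1),
  Col2 = sapply(txtsplit, `[[`, 2)
) %>%
  # one gsub drops both the "product"= key and the quotes before splitting
  mutate(Col2 = strsplit(gsub('"product"=|"', "", Col2), ",")) %>%
  unnest()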
Upvotes: 1