Mili

Reputation: 77

Parse a file in R

I need to parse a file in R that looks like this:

Acc1    "product"="A","product"="B","product"="C"
Acc2    "product"="C","product"="D"

The above is a text file, with a tab between Acc1 and "product".

The output should look like:

Column1 Column2
Acc1    A
Acc1    B
Acc1    C
Acc2    C
Acc2    D

Can someone help please?

Upvotes: 0

Views: 100

Answers (1)

r2evans

Reputation: 161145

I'm going to suggest you look at a tidyverse solution for this. It can certainly be handled with base-R and data.table (as others might suggest in comments or answers), but this is a good start.

First, faking the data.

txt <- readLines(textConnection('Acc1    "product"="A A","product"="B","product"="C"
Acc2    "product"="C","product"="D"'))

In your case, you'd probably just do readLines(filename).
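
To make the readLines(filename) step concrete, here is a minimal sketch that writes the sample to a temporary file and reads it back. The temporary file merely stands in for your real file, and the tab separator is taken from the question's description.

```r
# Hypothetical stand-in for the real input file: a temp file holding the
# two sample lines, with a literal tab after the accession.
tmp <- tempfile(fileext = ".txt")
writeLines(c('Acc1\t"product"="A","product"="B","product"="C"',
             'Acc2\t"product"="C","product"="D"'),
           tmp)

# readLines() returns one character string per line of the file.
txt <- readLines(tmp)
length(txt)
```

With a real file you would replace tmp with the path to your file.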

This next block splits the "Acc" stuff from the rest.

txtsplit <- strsplit(gsub("^(\\S+)\\s+", "\\1|", txt), "\\|")
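
If it helps to see what that line produces, the sketch below runs it on sample lines that contain a literal tab (an assumption based on the question's description) and inspects the result. It also shows a possible alternative, splitting directly on the tab, which avoids the temporary "|" marker and so would not misbehave if a value ever contained a "|" itself.

```r
# Sample lines with a literal tab after the accession, as the question describes.
txt <- c('Acc1\t"product"="A A","product"="B","product"="C"',
         'Acc2\t"product"="C","product"="D"')

# Replace the leading run of whitespace with "|", then split on that "|".
txtsplit <- strsplit(gsub("^(\\S+)\\s+", "\\1|", txt), "\\|")

# Each list element is a length-2 character vector: accession, then the rest.
str(txtsplit)

# Assumed alternative for tab-delimited input: split on the tab directly.
tabsplit <- strsplit(txt, "\t", fixed = TRUE)
identical(txtsplit, tabsplit)
```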

And finally, the rest of the processing.

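library(dplyr)
library(tidyr)
# tibble() replaces the deprecated data_frame()
tibble(
  Col1 = sapply(txtsplit, `[[`, 1),
  Col2 = sapply(txtsplit, `[[`, 2)
) %>%
  mutate(
    # drop the "product"= prefix, then split on commas into a list-column
    Col2 = gsub('"product"=', '', Col2),
    Col2 = strsplit(Col2, ",")
  ) %>%
  # one row per list element; recent tidyr versions want the column named
  unnest(Col2) %>%
  mutate(
    # strip the remaining double quotes
    Col2 = gsub('"', '', Col2)
  )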
# # A tibble: 5 x 2
#   Col1  Col2 
#   <chr> <chr>
# 1 Acc1  A A
# 2 Acc1  B    
# 3 Acc1  C    
# 4 Acc2  C    
# 5 Acc2  D    

There are several good tutorials on using dplyr and tidyr; a quick search will find better or newer ones than I can post here.

BTW: I separated removal of the quotes into a separate mutate, but it could easily have been handled in the initial gsub. I chose to keep it separate in case you had more than just single letters in the quotes, where removing them might cause parsing problems later.
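
For what it's worth, here is a rough sketch of that single-gsub variant, written in base R so it needs no packages. The names parts, acc, rest, vals and out are just illustrative, and the input is assumed to be tab-delimited as in the question. It is only safe when the quoted values contain no commas, since the split on "," happens after the quotes are gone.

```r
# Sample lines with a literal tab after the accession (assumed from the question).
txt <- c('Acc1\t"product"="A A","product"="B","product"="C"',
         'Acc2\t"product"="C","product"="D"')

# Split each line into the accession and the remainder.
parts <- strsplit(txt, "\t", fixed = TRUE)
acc  <- vapply(parts, `[[`, character(1), 1)
rest <- vapply(parts, `[[`, character(1), 2)

# One gsub removes both the "product"= prefix and any leftover quotes,
# then the cleaned string is split on commas.
vals <- strsplit(gsub('"product"=|"', "", rest), ",", fixed = TRUE)

# Repeat each accession once per value to get the long format.
out <- data.frame(Column1 = rep(acc, lengths(vals)),
                  Column2 = unlist(vals),
                  stringsAsFactors = FALSE)
out
```

If values can contain commas or quotes, keeping the steps separate (as in the answer above) leaves more room to handle them before the quotes are stripped.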

Upvotes: 1
