Reputation: 43

How manage columns with (values ; separated) by taking the columns name from values in R

I have a trick problem that i am not able to solve easily in R.

I have this data frame that comes from gtf genome hg19.

The code of the dataframe

data <- data.frame(
    chr.hg19v105 = c("chr9", "chr9", "chr9"),
    start.hg19v105 = c(133589268, 133710641, 133710834),
    end.hg19v105 = c(133763062, 133710833, 133710912),
    region.hg = c(
        "gene_id ABL1; transcript_id NM_007313.2; gene_name ABL1;",
        "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;",
        "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;"
    )
)

I would like to separate the region column taking the gene_name id - transcipt_id - eson_number ecc as columns.

WARNING not all rows have the same elements.

In addition, I would like to remove the version from the transcript_id column.

It is easy to import it with the dedicated package obtaining the perfect columns separation.

gtf <- rtracklayer::import('./hg19.gtf')

gtf_df=as.data.frame(gtf)

But what I need it is to have the region column as a matching column with the other columns separated.

I have tried with the following code:

library(dplyr) 

library(tidyr)

df_separeted <- df %>% separate(region, into = c("V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26", "V30"), sep = ";")

But I am not able to manage the different position of the fields as well as the name columns.

It seems like a simple problem, but I am not able to figure out how to solve it.

Upvotes: 3

Answers (2)

jay.sf

Reputation: 73832

You could make a fields data frame first, then merge it to data frames from region.hg. The latter can be obtained using gsub for cleaning from version numbers (can be adapted if you only want clean transcript_id) then strsplit, rbind, make 1st column to rownames, transpose, coerce as.data.frame. cbind this to data w/o the region.hg column.

> fields <- gregexpr("\\b[a-zA-Z_]+(?=\\s)", data$region.hg, perl=TRUE) |> 
+   regmatches(x=data$region.hg) |> unlist() |> unique()
> fields_df <- as.data.frame(array(,c(0, length(fields)), list(NULL, fields)))
> 
> data[1:3] |> 
+   cbind(
+     data$region.hg |> 
+       strsplit('; ') |> 
+       lapply(gsub, pat='\\..*|;', rep='') |> 
+       lapply(strsplit, ' ') |> 
+       lapply(do.call, what='rbind') |> 
+       lapply(\(x) as.data.frame(t(`rownames<-`(x[, -1, drop=FALSE], x[, 1])))) |> 
+       lapply(merge, fields_df, all=TRUE) |> 
+       do.call(what='rbind')
+   )
  chr.hg19v105 start.hg19v105 end.hg19v105 gene_id transcript_id gene_name exon_number   exon_id
1         chr9      133589268    133763062    ABL1     NM_007313      ABL1        <NA>      <NA>
2         chr9      133710641    133710833    ABL1     NM_005157      ABL1           1 NM_005157
3         chr9      133710834    133710912    ABL1     NM_005157      ABL1           1 NM_005157

Upvotes: 0

yuk

Reputation: 19880

Something like this:

library(tidyverse)

data <- data.frame(
  chr.hg19v105 = c("chr9", "chr9", "chr9"),
  start.hg19v105 = c(133589268, 133710641, 133710834),
  end.hg19v105 = c(133763062, 133710833, 133710912),
  region.hg = c(
    "gene_id ABL1; transcript_id NM_007313.2; gene_name ABL1;",
    "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;",
    "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;"
  )
)

data_new = data %>% 
  separate_longer_delim(region.hg, delim = "; ") %>% 
  mutate(region.hg = str_remove(region.hg, ";$")) %>% 
  filter(region.hg != "") %>% 
  separate(region.hg, c("name", "value"), sep = " ") %>% 
  pivot_wider(names_from = name, values_from = value) %>% 
  separate(transcript_id, c("transcript_id", "version"), sep = "\\.")

knitr::kable(data_new)

chr.hg19v105	start.hg19v105	end.hg19v105	gene_id	transcript_id	version	gene_name	exon_number	exon_id
chr9	133589268	133763062	ABL1	NM_007313	2	ABL1	NA	NA
chr9	133710641	133710833	ABL1	NM_005157	6	ABL1	1	NM_005157.6.1
chr9	133710834	133710912	ABL1	NM_005157	6	ABL1	1	NM_005157.6.1

^{Created on 2024-12-20 with reprex v2.0.2}

Upvotes: 2

How manage columns with (values ; separated) by taking the columns name from values in R

Answers (2)

Related Questions