Scheggia
Scheggia

Reputation: 43

How manage columns with (values ; separated) by taking the columns name from values in R

I have a trick problem that i am not able to solve easily in R.

I have this data frame that comes from gtf genome hg19.

enter image description here

The code of the dataframe

data <- data.frame(
    chr.hg19v105 = c("chr9", "chr9", "chr9"),
    start.hg19v105 = c(133589268, 133710641, 133710834),
    end.hg19v105 = c(133763062, 133710833, 133710912),
    region.hg = c(
        "gene_id ABL1; transcript_id NM_007313.2; gene_name ABL1;",
        "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;",
        "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;"
    )
)

I would like to separate the region column taking the gene_name id - transcipt_id - eson_number ecc as columns.

WARNING not all rows have the same elements.

enter image description here

In addition, I would like to remove the version from the transcript_id column.

enter image description here

It is easy to import it with the dedicated package obtaining the perfect columns separation.

gtf <- rtracklayer::import('./hg19.gtf')

gtf_df=as.data.frame(gtf)

But what I need it is to have the region column as a matching column with the other columns separated.

I have tried with the following code:

library(dplyr) 

library(tidyr)

df_separeted <- df %>% separate(region, into = c("V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26", "V30"), sep = ";")

But I am not able to manage the different position of the fields as well as the name columns.

It seems like a simple problem, but I am not able to figure out how to solve it.

Upvotes: 3

Views: 83

Answers (2)

jay.sf
jay.sf

Reputation: 73832

You could make a fields data frame first, then merge it to data frames from region.hg. The latter can be obtained using gsub for cleaning from version numbers (can be adapted if you only want clean transcript_id) then strsplit, rbind, make 1st column to rownames, transpose, coerce as.data.frame. cbind this to data w/o the region.hg column.

> fields <- gregexpr("\\b[a-zA-Z_]+(?=\\s)", data$region.hg, perl=TRUE) |> 
+   regmatches(x=data$region.hg) |> unlist() |> unique()
> fields_df <- as.data.frame(array(,c(0, length(fields)), list(NULL, fields)))
> 
> data[1:3] |> 
+   cbind(
+     data$region.hg |> 
+       strsplit('; ') |> 
+       lapply(gsub, pat='\\..*|;', rep='') |> 
+       lapply(strsplit, ' ') |> 
+       lapply(do.call, what='rbind') |> 
+       lapply(\(x) as.data.frame(t(`rownames<-`(x[, -1, drop=FALSE], x[, 1])))) |> 
+       lapply(merge, fields_df, all=TRUE) |> 
+       do.call(what='rbind')
+   )
  chr.hg19v105 start.hg19v105 end.hg19v105 gene_id transcript_id gene_name exon_number   exon_id
1         chr9      133589268    133763062    ABL1     NM_007313      ABL1        <NA>      <NA>
2         chr9      133710641    133710833    ABL1     NM_005157      ABL1           1 NM_005157
3         chr9      133710834    133710912    ABL1     NM_005157      ABL1           1 NM_005157

Upvotes: 0

yuk
yuk

Reputation: 19880

Something like this:

library(tidyverse)

data <- data.frame(
  chr.hg19v105 = c("chr9", "chr9", "chr9"),
  start.hg19v105 = c(133589268, 133710641, 133710834),
  end.hg19v105 = c(133763062, 133710833, 133710912),
  region.hg = c(
    "gene_id ABL1; transcript_id NM_007313.2; gene_name ABL1;",
    "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;",
    "gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;"
  )
)

data_new = data %>% 
  separate_longer_delim(region.hg, delim = "; ") %>% 
  mutate(region.hg = str_remove(region.hg, ";$")) %>% 
  filter(region.hg != "") %>% 
  separate(region.hg, c("name", "value"), sep = " ") %>% 
  pivot_wider(names_from = name, values_from = value) %>% 
  separate(transcript_id, c("transcript_id", "version"), sep = "\\.")

knitr::kable(data_new)
chr.hg19v105 start.hg19v105 end.hg19v105 gene_id transcript_id version gene_name exon_number exon_id
chr9 133589268 133763062 ABL1 NM_007313 2 ABL1 NA NA
chr9 133710641 133710833 ABL1 NM_005157 6 ABL1 1 NM_005157.6.1
chr9 133710834 133710912 ABL1 NM_005157 6 ABL1 1 NM_005157.6.1

Created on 2024-12-20 with reprex v2.0.2

Upvotes: 2

Related Questions