Reputation: 43
I have a trick problem that i am not able to solve easily in R.
I have this data frame that comes from gtf genome hg19
.
The code of the dataframe
data <- data.frame(
chr.hg19v105 = c("chr9", "chr9", "chr9"),
start.hg19v105 = c(133589268, 133710641, 133710834),
end.hg19v105 = c(133763062, 133710833, 133710912),
region.hg = c(
"gene_id ABL1; transcript_id NM_007313.2; gene_name ABL1;",
"gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;",
"gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;"
)
)
I would like to separate the region column taking the
gene_name id - transcipt_id - eson_number ecc
as columns.
WARNING not all rows have the same elements.
In addition, I would like to remove the version from the transcript_id
column.
It is easy to import it with the dedicated package obtaining the perfect columns separation.
gtf <- rtracklayer::import('./hg19.gtf')
gtf_df=as.data.frame(gtf)
But what I need it is to have the region
column as a matching column with the other columns separated.
I have tried with the following code:
library(dplyr)
library(tidyr)
df_separeted <- df %>% separate(region, into = c("V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26", "V30"), sep = ";")
But I am not able to manage the different position of the fields as well as the name columns.
It seems like a simple problem, but I am not able to figure out how to solve it.
Upvotes: 3
Views: 83
Reputation: 73832
You could make a fields data frame first, then merge
it to data frames from region.hg. The latter can be obtained using gsub
for cleaning from version numbers (can be adapted if you only want clean transcript_id) then strsplit
, rbind
, make 1st column to rownames
, t
ranspose, coerce as.data.frame
. cbind
this to data
w/o the region.hg column.
> fields <- gregexpr("\\b[a-zA-Z_]+(?=\\s)", data$region.hg, perl=TRUE) |>
+ regmatches(x=data$region.hg) |> unlist() |> unique()
> fields_df <- as.data.frame(array(,c(0, length(fields)), list(NULL, fields)))
>
> data[1:3] |>
+ cbind(
+ data$region.hg |>
+ strsplit('; ') |>
+ lapply(gsub, pat='\\..*|;', rep='') |>
+ lapply(strsplit, ' ') |>
+ lapply(do.call, what='rbind') |>
+ lapply(\(x) as.data.frame(t(`rownames<-`(x[, -1, drop=FALSE], x[, 1])))) |>
+ lapply(merge, fields_df, all=TRUE) |>
+ do.call(what='rbind')
+ )
chr.hg19v105 start.hg19v105 end.hg19v105 gene_id transcript_id gene_name exon_number exon_id
1 chr9 133589268 133763062 ABL1 NM_007313 ABL1 <NA> <NA>
2 chr9 133710641 133710833 ABL1 NM_005157 ABL1 1 NM_005157
3 chr9 133710834 133710912 ABL1 NM_005157 ABL1 1 NM_005157
Upvotes: 0
Reputation: 19880
Something like this:
library(tidyverse)
data <- data.frame(
chr.hg19v105 = c("chr9", "chr9", "chr9"),
start.hg19v105 = c(133589268, 133710641, 133710834),
end.hg19v105 = c(133763062, 133710833, 133710912),
region.hg = c(
"gene_id ABL1; transcript_id NM_007313.2; gene_name ABL1;",
"gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;",
"gene_id ABL1; transcript_id NM_005157.6; exon_number 1; exon_id NM_005157.6.1; gene_name ABL1;"
)
)
data_new = data %>%
separate_longer_delim(region.hg, delim = "; ") %>%
mutate(region.hg = str_remove(region.hg, ";$")) %>%
filter(region.hg != "") %>%
separate(region.hg, c("name", "value"), sep = " ") %>%
pivot_wider(names_from = name, values_from = value) %>%
separate(transcript_id, c("transcript_id", "version"), sep = "\\.")
knitr::kable(data_new)
chr.hg19v105 | start.hg19v105 | end.hg19v105 | gene_id | transcript_id | version | gene_name | exon_number | exon_id |
---|---|---|---|---|---|---|---|---|
chr9 | 133589268 | 133763062 | ABL1 | NM_007313 | 2 | ABL1 | NA | NA |
chr9 | 133710641 | 133710833 | ABL1 | NM_005157 | 6 | ABL1 | 1 | NM_005157.6.1 |
chr9 | 133710834 | 133710912 | ABL1 | NM_005157 | 6 | ABL1 | 1 | NM_005157.6.1 |
Created on 2024-12-20 with reprex v2.0.2
Upvotes: 2