Reputation: 603
I want to split the rows into groups based on the gene_id column. For instance, one group would be the IDs matching xxxxx4XX and another the IDs matching xxxxx9XX. My data set is here: https://github.com/learnseq/learning/blob/main/RNASeq_post-processing%20(1).csv
I want to create these two groups so that I can compare them.
This is the head of my data:
gene_id expr
<fct> <int>
1 ENSG00000000005 6
2 ENSG00000000419 754
3 ENSG00000000457 447
4 ENSG00000000460 426
5 ENSG00000000938 5
6 ENSG00000000971 1
Upvotes: 0
Views: 63
Reputation: 887118
An option with group_split, grouping on the first non-zero digit of the numeric part of gene_id:
library(dplyr)
df %>%
group_split(grp = substr(as.character(readr::parse_number(gene_id)),
1, 1), .keep = FALSE)
-output
#[[1]]
# A tibble: 3 x 2
# gene_id expr
# <chr> <int>
#1 ENSG00000000419 754
#2 ENSG00000000457 447
#3 ENSG00000000460 426
#[[2]]
# A tibble: 1 x 2
# gene_id expr
# <chr> <int>
#1 ENSG00000000005 6
#[[3]]
# A tibble: 2 x 2
# gene_id expr
# <chr> <int>
#1 ENSG00000000938 5
#2 ENSG00000000971 1
Or, to create 'grp' as a column:
library(stringr)
df %>%
mutate(grp = str_replace(gene_id, '^\\D+0*([1-9]).*', 'xxxxx\\1XX'))
-output
# gene_id expr grp
#1 ENSG00000000005 6 xxxxx5XX
#2 ENSG00000000419 754 xxxxx4XX
#3 ENSG00000000457 447 xxxxx4XX
#4 ENSG00000000460 426 xxxxx4XX
#5 ENSG00000000938 5 xxxxx9XX
#6 ENSG00000000971 1 xxxxx9XX
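Note that both snippets above key on the first non-zero digit of the numeric part. That matches the third digit from the right only when the number has exactly three digits once leading zeros are dropped, which is why ENSG00000000005 lands in group 5 here rather than group 0. A minimal sketch of the difference in base R; the second ID is a made-up example, not taken from the question's data:

```r
# First ID is from the question; the second is hypothetical, for illustration only
ids <- c("ENSG00000000419", "ENSG00000123456")

# Rule used above: first non-zero digit of the numeric part
first_nonzero <- sub("^[^0-9]+0*([1-9]).*", "\\1", ids)

# Alternative rule: third character from the end (the hundreds digit)
hundreds <- substr(ids, nchar(ids) - 2, nchar(ids) - 2)

# Side-by-side view of the two grouping keys
print(data.frame(ids, first_nonzero, hundreds))
```

Which rule is right depends on what the xxxxx4XX pattern in the question is meant to match, so it is worth checking a few longer IDs before relying on either one.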
Upvotes: 1
Reputation: 1261
A simple approach is to use str_sub() to extract the third character from the end of each value in the first column and use it as the group name. Each row is then assigned to the group named after that digit.
Here is the code:
# load environment
library(stringr)
# load data
data_url = 'https://raw.githubusercontent.com/learnseq/learning/main/RNASeq_post-processing%20(1).csv'
df = read.csv(data_url, header = FALSE, stringsAsFactors = FALSE)
# define groups
df$group = as.numeric(str_sub(df$V1, -3, -3))
# print results
head(df)
Here is the output:
V1 V2 group
1 ENSG00000000003 1138 0
2 ENSG00000000005 6 0
3 ENSG00000000419 754 4
4 ENSG00000000457 447 4
5 ENSG00000000460 426 4
6 ENSG00000000938 5 9
Let us know if it solves your problem.
Upvotes: 1
Reputation: 101373
What about the code below?
> split(df,with(df,gsub(".*(\\d)\\d{2}$","\\1",gene_id)))
$`0`
gene_id expr
1 ENSG00000000005 6
$`4`
gene_id expr
2 ENSG00000000419 754
3 ENSG00000000457 447
4 ENSG00000000460 426
$`9`
gene_id expr
5 ENSG00000000938 5
6 ENSG00000000971 1
Data
> dput(df)
structure(list(gene_id = c("ENSG00000000005", "ENSG00000000419",
"ENSG00000000457", "ENSG00000000460", "ENSG00000000938", "ENSG00000000971"
), expr = c(6L, 754L, 447L, 426L, 5L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
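Since split() returns a named list, the two groups of interest can then be pulled out by name. A small sketch, assuming the six-row df from the dput() output above; the names groups, two and mean_expr are only illustrative, and the mean is just one possible summary:

```r
# Six sample rows, as in the dput() output above
df <- data.frame(
  gene_id = c("ENSG00000000005", "ENSG00000000419", "ENSG00000000457",
              "ENSG00000000460", "ENSG00000000938", "ENSG00000000971"),
  expr = c(6L, 754L, 447L, 426L, 5L, 1L)
)

# Split on the third digit from the right; the list names are the digits
groups <- split(df, gsub(".*(\\d)\\d{2}$", "\\1", df$gene_id))

# Keep only the two groups to compare
two <- groups[c("4", "9")]

# One simple summary per group: number of rows and mean expression
n_rows <- sapply(two, nrow)
mean_expr <- sapply(two, function(g) mean(g$expr))
print(n_rows)
print(mean_expr)
```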
Upvotes: 1
Reputation: 39595
You can extract the third digit from the right and then build a variable that distinguishes the groups:
# Data
df <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/RNASeq_post-processing%20(1).csv', stringsAsFactors = FALSE, header = FALSE)
# Trim whitespace, then extract the third character from the right
df$V1 <- trimws(df$V1)
df$Var <- substr(df$V1, nchar(df$V1) - 2, nchar(df$V1) - 2)
# Create groups
df$Group <- ifelse(df$Var == '4', 'Group4', ifelse(df$Var == '9', 'Group9', 'Other'))
Output:
head(df)
V1 V2 Var Group
1 ENSG00000000003 1138 0 Other
2 ENSG00000000005 6 0 Other
3 ENSG00000000419 754 4 Group4
4 ENSG00000000457 447 4 Group4
5 ENSG00000000460 426 4 Group4
6 ENSG00000000938 5 9 Group9
Also:
table(df$Group,exclude = NULL)
Group4 Group9 Other
5934 5867 46939
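Once the Group column exists, the comparison itself could look something like the sketch below. It uses only the six rows shown in the question so that it runs without downloading the file, and the Wilcoxon rank-sum test is just one possible choice; whether it suits the data is an assumption to check, not something stated in the question:

```r
# Toy data: the six rows from the question, in the V1/V2 layout used above
df <- data.frame(
  V1 = c("ENSG00000000005", "ENSG00000000419", "ENSG00000000457",
         "ENSG00000000460", "ENSG00000000938", "ENSG00000000971"),
  V2 = c(6L, 754L, 447L, 426L, 5L, 1L),
  stringsAsFactors = FALSE
)
df$Var <- substr(df$V1, nchar(df$V1) - 2, nchar(df$V1) - 2)
df$Group <- ifelse(df$Var == "4", "Group4",
                   ifelse(df$Var == "9", "Group9", "Other"))

# Keep only the two groups of interest
two <- subset(df, Group %in% c("Group4", "Group9"))

# Median expression per group
medians <- tapply(two$V2, two$Group, median)
print(medians)

# Two-sample comparison of the expression values (illustrative choice of test)
res <- wilcox.test(V2 ~ Group, data = two)
print(res$p.value)
```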
Upvotes: 1