user432797

Reputation: 603

generate two groups out of one column

I want to create two groups from the gene_id column: for instance, one group for IDs matching xxxxx4XX and another for IDs matching xxxxx9XX. My data set is here: https://github.com/learnseq/learning/blob/main/RNASeq_post-processing%20(1).csv

I want to make the two groups so that I can compare them.

This is the head of my data:

       gene_id         expr
           <fct>       <int>
1   ENSG00000000005     6
2   ENSG00000000419     754
3   ENSG00000000457     447
4   ENSG00000000460     426
5   ENSG00000000938     5
6   ENSG00000000971     1

Upvotes: 0

Views: 63

Answers (4)

akrun

Reputation: 887118

An option with group_split, grouping on the first non-zero digit after the prefix:

library(dplyr)
df %>% 
  group_split(grp = substr(as.character(readr::parse_number(gene_id)), 
          1, 1), .keep = FALSE)

-output

#[[1]]
# A tibble: 3 x 2
#  gene_id          expr
#  <chr>           <int>
#1 ENSG00000000419   754
#2 ENSG00000000457   447
#3 ENSG00000000460   426

#[[2]]
# A tibble: 1 x 2
#  gene_id          expr
#  <chr>           <int>
#1 ENSG00000000005     6

#[[3]]
# A tibble: 2 x 2
#  gene_id          expr
#  <chr>           <int>
#1 ENSG00000000938     5
#2 ENSG00000000971     1

Or, to create 'grp' as a column:

library(stringr)
df %>% 
   mutate(grp = str_replace(gene_id, '^\\D+0*([1-9]).*', 'xxxxx\\1XX'))

-output

#          gene_id expr      grp
#1 ENSG00000000005    6 xxxxx5XX
#2 ENSG00000000419  754 xxxxx4XX
#3 ENSG00000000457  447 xxxxx4XX
#4 ENSG00000000460  426 xxxxx4XX
#5 ENSG00000000938    5 xxxxx9XX
#6 ENSG00000000971    1 xxxxx9XX
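
One caveat worth checking on the full data: both snippets above key on the first non-zero digit after the ENSG prefix, which is not always the third digit from the right that the xxxxx4XX pattern in the question suggests. A minimal base R sketch of the difference, assuming three IDs picked only for illustration (the names first_nonzero and third_from_right are mine, not from the answer):

```r
# Three illustrative IDs; the last one has four significant digits
ids <- c("ENSG00000000005", "ENSG00000000419", "ENSG00000001036")

# First non-zero digit after the prefix (same regex as in str_replace above)
first_nonzero <- sub("^\\D+0*([1-9]).*", "\\1", ids)

# Third character from the right (the xxxxx4XX reading of the question)
third_from_right <- substr(ids, nchar(ids) - 2, nchar(ids) - 2)

# Side-by-side view: the two rules disagree on the first and last ID
data.frame(ids, first_nonzero, third_from_right)
```

If the third digit from the right is what you need, the substr() line is the one to group on.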

Upvotes: 1

rodolfoksveiga

Reputation: 1261

A simple approach is to use str_sub() to extract the third character from the end of each value in the first column and use it as the group name. Each row then falls into the group named after that digit.

Here is the code:

# load environment
library(stringr)
# load data
data_url = 'https://raw.githubusercontent.com/learnseq/learning/main/RNASeq_post-processing%20(1).csv'
df = read.csv(data_url, header = FALSE, stringsAsFactors = FALSE)
# define groups
df$group = as.numeric(str_sub(df$V1, -3, -3))
# print results
head(df)

Here is the output:

               V1   V2 group
1 ENSG00000000003 1138     0
2 ENSG00000000005    6     0
3 ENSG00000000419  754     4
4 ENSG00000000457  447     4
5 ENSG00000000460  426     4
6 ENSG00000000938    5     9

Let us know if it solves your problem.
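
Since the goal is to compare two of these groups, a possible next step is to keep only groups 4 and 9 and summarise each. This is a sketch rather than part of the answer above: it assumes a six-row stand-in for the downloaded file (values copied from the output above), uses base substr() in place of str_sub() so it runs without extra packages, and the names two_groups and group_means are hypothetical.

```r
# Six-row stand-in for the downloaded file (values from the output above)
df <- data.frame(
  V1 = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
         "ENSG00000000457", "ENSG00000000460", "ENSG00000000938"),
  V2 = c(1138, 6, 754, 447, 426, 5),
  stringsAsFactors = FALSE
)

# Third character from the end, as in the answer (base R equivalent of str_sub)
df$group <- as.numeric(substr(df$V1, nchar(df$V1) - 2, nchar(df$V1) - 2))

# Keep only the two groups of interest
two_groups <- df[df$group %in% c(4, 9), ]

# Mean expression per group
group_means <- tapply(two_groups$V2, two_groups$group, mean)
group_means
```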

Upvotes: 1

ThomasIsCoding

Reputation: 101373

What about the code below?

> split(df,with(df,gsub(".*(\\d)\\d{2}$","\\1",gene_id)))
$`0`
          gene_id expr
1 ENSG00000000005    6

$`4`
          gene_id expr
2 ENSG00000000419  754
3 ENSG00000000457  447
4 ENSG00000000460  426

$`9`
          gene_id expr
5 ENSG00000000938    5
6 ENSG00000000971    1
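
Because split() returns a named list, the two groups can be pulled out by name and compared directly. A minimal sketch, assuming the six-row df from this answer (the names groups, g4, g9 and med are hypothetical):

```r
# Same six rows as the dput() data in this answer
df <- data.frame(
  gene_id = c("ENSG00000000005", "ENSG00000000419", "ENSG00000000457",
              "ENSG00000000460", "ENSG00000000938", "ENSG00000000971"),
  expr = c(6L, 754L, 447L, 426L, 5L, 1L),
  stringsAsFactors = FALSE
)

# Split on the third digit from the right
groups <- split(df, sub(".*(\\d)\\d{2}$", "\\1", df$gene_id))

# Pull out the two groups of interest by name
g4 <- groups[["4"]]
g9 <- groups[["9"]]

# Median expression in each of the two groups
med <- sapply(groups[c("4", "9")], function(g) median(g$expr))
med
```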

Data

> dput(df)
structure(list(gene_id = c("ENSG00000000005", "ENSG00000000419", 
"ENSG00000000457", "ENSG00000000460", "ENSG00000000938", "ENSG00000000971"
), expr = c(6L, 754L, 447L, 426L, 5L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Upvotes: 1

Duck

Reputation: 39595

You can try extracting the third digit from the right and then building a variable that distinguishes the groups:

#Data
df <- read.csv('https://raw.githubusercontent.com/learnseq/learning/main/RNASeq_post-processing%20(1).csv', stringsAsFactors = FALSE, header = FALSE)
#Extract
df$V1 <- trimws(df$V1)
df$Var <- substr(df$V1,nchar(df$V1)-2,nchar(df$V1)-2)
#Create groups
df$Group <- ifelse(df$Var == '4', 'Group4', ifelse(df$Var == '9', 'Group9', 'Other'))

Output:

head(df)
               V1   V2 Var  Group
1 ENSG00000000003 1138   0  Other
2 ENSG00000000005    6   0  Other
3 ENSG00000000419  754   4 Group4
4 ENSG00000000457  447   4 Group4
5 ENSG00000000460  426   4 Group4
6 ENSG00000000938    5   9 Group9

Also:

table(df$Group,exclude = NULL)

Group4 Group9  Other 
  5934   5867  46939 
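
With the Group column in place, one way to compare the two groups is a rank-based test on the expression values. The sketch below is only an illustration: it assumes a six-row toy data frame built from rows shown in this thread, and whether a Wilcoxon rank-sum test suits your data is an assumption you would need to check. The names cmp and res are hypothetical.

```r
# Toy rows taken from this thread (not the full data set)
df <- data.frame(
  V2 = c(754, 447, 426, 5, 1, 1138),
  Group = c("Group4", "Group4", "Group4", "Group9", "Group9", "Other"),
  stringsAsFactors = FALSE
)

# Drop everything that is not in one of the two groups of interest
cmp <- subset(df, Group != "Other")
cmp$Group <- factor(cmp$Group)

# Wilcoxon rank-sum test of expression between Group4 and Group9
res <- wilcox.test(V2 ~ Group, data = cmp)
res$statistic
res$p.value
```

On the full data you would run the same two steps on df after creating Group as above.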

Upvotes: 1
