elsherbini

Reputation: 1626

How to tidy up a character column?

What I have:

test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),label=c(1,1,1,1,2,2,2,2,2),alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"))

> test_df
  isolate label alignment
1       1     1   --at
2       2     1   at--
3       3     1   --at
4       4     1   --at
5       1     2   a--
6       2     2   acg
7       3     2   a--
8       4     2   a--
9       5     2   agg

What I want:

I'd like to explode the alignment field into two columns, position and character:

> test_df
  isolate label aln_pos  aln_char
1       1     1       1  -
2       1     1       2  -
3       1     1       3  a
4       1     1       4  t
...

Not all alignments are the same length, but all alignments with the same label have the same length.

What I've tried:

I was thinking I could use separate to first give each position its own column, then use gather to turn those columns into key-value pairs. However, I haven't been able to get the separate part right.

Upvotes: 0

Views: 135

Answers (2)

Seth Wenchel

Reputation: 41

Since you mentioned tidyr::gather, you could try this:

test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),
                      label=c(1,1,1,1,2,2,2,2,2),
                      alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"), 
                      stringsAsFactors = FALSE)

library(tidyverse)

test_df %>% 
  mutate(alignment = strsplit(alignment,"")) %>% 
  unnest(alignment)
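The pipeline above gives one row per character but no position column. A minimal sketch of one way to add it, assuming that isolate and label together identify a single original row (true for the sample data, but an assumption in general) and that dplyr and tidyr are installed; `long_df` is a name made up for illustration:

```r
library(dplyr)
library(tidyr)

test_df <- data.frame(
  isolate = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  label = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  alignment = c("--at", "at--", "--at", "--at", "a--", "acg", "a--", "a--", "agg"),
  stringsAsFactors = FALSE
)

long_df <- test_df %>%
  # split each alignment into a list of single characters
  mutate(aln_char = strsplit(alignment, "")) %>%
  select(-alignment) %>%
  # expand the list column to one row per character
  unnest(aln_char) %>%
  # assumption: isolate + label uniquely identifies an original row
  group_by(isolate, label) %>%
  # number the characters within each original row
  mutate(aln_pos = row_number()) %>%
  ungroup() %>%
  select(isolate, label, aln_pos, aln_char)
```

If isolate and label do not uniquely identify rows in your real data, adding an explicit row id column before unnesting and grouping by that instead should be safer.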

Upvotes: 1

lmo

Reputation: 38520

In base R, you can combine row indexing with a list created by strsplit, like this:

# make variable a character vector
test_df$alignment <- as.character(test_df$alignment)
# get list of individual characters
myList <- strsplit(test_df$alignment, split="")

Then build the data.frame:

# construct data.frame
final_df <- cbind(test_df[rep(seq_len(nrow(test_df)), lengths(myList)),
                          c("isolate", "label")],
                  aln_pos=sequence(lengths(myList)),
                  aln_char=unlist(myList))

Here, we take the first two columns of the original data.frame and repeat its rows using rep, whose second argument is a vector giving how many times to repeat each corresponding element of the first argument. Those counts come from lengths, which returns the number of characters in each split alignment. The second argument of cbind is a call to sequence on the same lengths output; for each row, this produces the counts from 1 up to the corresponding length. The third argument is the unlisted character values.
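To see what each piece contributes, here is a small sketch on a two-element toy input (the variable names `chars`, `n`, `row_idx`, `pos` and `vals` are made up for illustration):

```r
# toy input: two alignments of different lengths
chars <- strsplit(c("--at", "acg"), split = "")

# number of characters in each alignment
n <- lengths(chars)                  # 4 3

# row index repeated once per character: used to replicate the rows
row_idx <- rep(seq_along(chars), n)  # 1 1 1 1 2 2 2

# position counter that restarts at 1 for each alignment
pos <- sequence(n)                   # 1 2 3 4 1 2 3

# all characters flattened into one vector
vals <- unlist(chars)                # "-" "-" "a" "t" "a" "c" "g"
```

All three vectors have the same length, which is why they can be bound together column-wise.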

This returns:

head(final_df, 10)
    isolate label aln_pos aln_char
1         1     1       1        -
1.1       1     1       2        -
1.2       1     1       3        a
1.3       1     1       4        t
2         2     1       1        a
2.1       2     1       2        t
2.2       2     1       3        -
2.3       2     1       4        -
3         3     1       1        -
3.1       3     1       2        -
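The row names such as 1.1 and 1.2 arise because the original rows are repeated through indexing. If plain sequential row names are preferred, one option is to reset them, sketched here on a shortened two-row version of the data:

```r
# shortened example data (two rows only, for illustration)
test_df <- data.frame(isolate = c(1, 2), label = c(1, 2),
                      alignment = c("--at", "acg"),
                      stringsAsFactors = FALSE)
myList <- strsplit(test_df$alignment, split = "")

final_df <- cbind(test_df[rep(seq_len(nrow(test_df)), lengths(myList)),
                          c("isolate", "label")],
                  aln_pos = sequence(lengths(myList)),
                  aln_char = unlist(myList))

# drop the 1, 1.1, 1.2, ... row names in favour of 1, 2, 3, ...
rownames(final_df) <- NULL
```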

Upvotes: 1
