user13528575
user13528575

Reputation:

How to use separate in tidyverse to split a column?

I use separate() function to split column: Enterdateofexam2, it is character format, the value is like "25.07","13.09","16.06"... My aim is splitting it into day(25) and month(07), then use convert = true to turn them into numeric for filter in next step.

My code is:

jimma3n <- jimma3 %>%
        select(Enterdateofexam2, Enterdayofexam, UniqueKey,MEDICALRECORD)%>%
        separate(Enterdateofexam2,into=c("day", "month"), sep=".", convert = TRUE)
view (jimma3n)

but the R keeps warning me like:

Expected 2 pieces. Additional pieces discarded in 4088 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

So could anyone help to figure out which part is wrong about my code? Thanks~~!

Upvotes: 1

Views: 11078

Answers (2)

TarJae
TarJae

Reputation: 79311

The main issue is that you have to define the number of columns you will separate to. If you define 2 columns let's say a and b and you have 3 elements to separate: let's say x y z, then z will be discarded.

With separate you must define the new columns, which is difficult if you don't know how many columns you will need after separate

Consider this example: in row 3 you have 3 elements:

df <- data.frame(x = c("x", "x y", "x y z", NA))
      x
1     x
2   x y
3 x y z
4  <NA>

With this code you define 2 columns to separate to

df %>% separate(x, c("a", "b"))

     a    b
1    x <NA>
2    x    y
3    x    y
4 <NA> <NA>

In row 3 z is discarded because we have define only 2 columns a and b

If we define 3 columns like

df %>% separate(x, c("a", "b", "c"))

The warning for discarding will disappear.

On the other hand you will get a warning for x with fewer then 3 elements, those will be filled with NA.

Upvotes: 1

akrun
akrun

Reputation: 887951

We can use extra argument. Also, by default, the sep is in regex mode - according to ?separate documentation

sep - If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.

and . is a metacharacter which can match any character. Therefore, we may need to either escape (\\.) or place it in square brackets ([.]). Also, based on the dput, the column is a list, which should be unnested first before doing the separate

library(dplyr)
library(tidyr)
jimma3 %>%
      select(Enterdateofexam2, Enterdayofexam, UniqueKey,MEDICALRECORD)%>%
      unnest(Enterdateofexam2) %>%
      separate(Enterdateofexam2,into=c("day", "month"), 
              sep="\\.", convert = TRUE, extra = "merge") %>% 
      na.omit

-output

# A tibble: 6 x 5
    day month Enterdayofexam UniqueKey MEDICALRECORD
  <int> <int> <chr>          <chr>     <chr>        
1     7     6 1              530       577207       
2     8     6 2              530       577207       
3     9     6 3              530       577207       
4     2    12 1              531       575333       
5     3    12 2              531       575333       
6     4    12 3              531       575333       

Basically, with sep = ".", it is splitting at every character element and thus the warning popped up

data

jimma3 <- structure(list(Enterdateofexam2 = list(c("", "7.06"), c("", "8.06"
), c("", "9.06"), c("", "2.12"), c("", "3.12"), c("", "4.12")), 
    Enterdayofexam = c("1", "2", "3", "1", "2", "3"), UniqueKey = c("530", 
    "530", "530", "531", "531", "531"), MEDICALRECORD = c("577207", 
    "577207", "577207", "575333", "575333", "575333")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 5

Related Questions