Daniel
Daniel

Reputation: 1262

Split an string by number of characters in a column of a data frame to create multiple columns in R?

In r there is data frame (df) as below which has ID string of 10 characters.

df:
        ID         Var1 Var2
    1   0334KLM001  aa  xx
    2   1334HDM002  zvv rr
    3   2334WEM003  qetr qwe
    4   3334OKT004  ff  sdf
    5   4334WER005  ee  sdf
    6   5334BBC006  qly ssg
    7   6334QQQ007  kk  htj
    8   7334AAA008  uu  yjy
    9   8334CBU009  ww  wttt
    10  9334MLO010  aa  dg

Intention:
Using r commands, I'd like to split the ID into two additional columns only by the numbers of Characters. For instance, I'd like to add two Columns (Spl_1 & Spl_2) to df which the Spl_1 has the first four characters of the ID column and the Spl_2 has the rest of ID character (i.e. 5:10).

The expected output would be like this:

output:

    ID         Var1 Var2    Spl_1   Spl_2
1   0334KLM001  aa  xx       0334   KLM001
2   1334HDM002  zvv rr       1334   HDM002
3   2334WEM003  qetr qwe     2334   WEM003
4   3334OKT004  ff  sdf     3334    OKT004
5   4334WER005  ee  sdf     4334    WER005
6   5334BBC006  qly ssg     5334    BBC006
7   6334QQQ007  kk  htj     6334    QQQ007
8   7334AAA008  uu  yjy     7334    AAA008
9   8334CBU009  ww  wttt    8334    CBU009
10  9334MLO010  aa  dg      9334    MLO010

I read the following questions : Q1 & Q2.Although very informative but still I did not get that how to split the sting by character numbers. I found the ‘stringr’ package, which I learn using str_sub command, I can split by number. see below:

 library(stringr)
ID <- "0334KLM001"
str_sub(ID, c(1,5), c(4,10))
[1] "0334"   "KLM001"

But I do not know how to combine with other r command(s) to get output as provided above?

Upvotes: 1

Views: 3971

Answers (1)

akrun
akrun

Reputation: 886928

We can use separate

library(tidyr)
separate(df, ID, into = c("Spl_1", "Spl_2"), sep = 4, remove = FALSE)
#           ID Spl_1  Spl_2 Var1 Var2
#1  0334KLM001  0334 KLM001   aa   xx
#2  1334HDM002  1334 HDM002  zvv   rr
#3  2334WEM003  2334 WEM003 qetr  qwe
#4  3334OKT004  3334 OKT004   ff  sdf
#5  4334WER005  4334 WER005   ee  sdf
#6  5334BBC006  5334 BBC006  qly  ssg
#7  6334QQQ007  6334 QQQ007   kk  htj
#8  7334AAA008  7334 AAA008   uu  yjy
#9  8334CBU009  8334 CBU009   ww wttt
#10 9334MLO010  9334 MLO010   aa   dg

If we want 3 columns, we can pass a vector in sep

separate(df, ID, into = c("Spl_1", "Spl_2", "Spl_3"), sep = c(4,8), remove = FALSE)
#           ID Spl_1 Spl_2 Spl_3 Var1 Var2
#1  0334KLM001  0334  KLM0    01   aa   xx
#2  1334HDM002  1334  HDM0    02  zvv   rr
#3  2334WEM003  2334  WEM0    03 qetr  qwe
#4  3334OKT004  3334  OKT0    04   ff  sdf
#5  4334WER005  4334  WER0    05   ee  sdf
#6  5334BBC006  5334  BBC0    06  qly  ssg
#7  6334QQQ007  6334  QQQ0    07   kk  htj
#8  7334AAA008  7334  AAA0    08   uu  yjy
#9  8334CBU009  8334  CBU0    09   ww wttt
#10 9334MLO010  9334  MLO0    10   aa   dg

If the numbers at the beginning are not of fixed length, use extract

extract(df, ID, into = c("Spl_1", "Spl_2"), "^([0-9]+)(.*)", remove = FALSE)

and for 3 columns,

extract(df, ID, into = c("Spl_1", "Spl_2", "Spl_3"), "(.{4})(.{4})(.*)", remove = FALSE)

Upvotes: 3

Related Questions