user8831872
user8831872

Reputation: 383

Two conditions for split a column

I have a dataframe like this:

I would like to split the second column into many columns based on "?". However it is not easy because there are other question marker in the sting. So the only additional common it that every substring contains the 'http'.

How is it possible to split them? The number of columns in the output example is just an example I don't know exactly how many could be generated.

Example of input data:

 df_in <- data.frame(x = c('x1','x2','x3','x4'),
                     y = c('http://example1.com?https://example2.com', 'NA', 'http://example3.com?id=1234?https://example4/com?http://example6.com', 'http://example5.com'))

the dataframe as printed in console:

 df_in
  x                                                                    y
 x1                             http://example1.com?https://example2.com
 x2                                                                   NA
 x3 http://example3.com?id=1234?https://example4/com?http://example6.com
 x4                                                  http://example5.com

Example of expected output:

df_out <- data.frame(x = c('x1','x2','x3','x4'),
                     col1 = c('http://example1.com', 'NA', 'http://example3.com?id=1234', 'http://example5.com'),
                     col2 = c('https://example2.com', 'NA', 'https://example4/com', 'NA'),
                     col3 = c('NA', 'NA', 'https://example6/com', 'NA'))

The output as printed in the console:

 x                        col1                 col2                 col3
 x1         http://example1.com https://example2.com                   NA
 x2                          NA                   NA                   NA
 x3 http://example3.com?id=1234 https://example4/com https://example6/com
 x4         http://example5.com                   NA                   NA

Upvotes: 5

Views: 795

Answers (2)

Sotos
Sotos

Reputation: 51582

If you have an arbitrary number of domains to split, hence not knowing the number of columns to be produced, you can use cSplit function from splitstackshape package. However, before doing that, we need to add a delimeter right before ?http, i.e.

library(splitstackshape)

df_in$y <-  gsub('(\\w)(\\?h)', '\\1_\\2', df_in$y)
cSplit(df_in 'y', '_?')

#Or all in one line,
cSplit(transform(df_in, y = gsub('(\\w)(\\?h)', '\\1_\\2', y)), 'y', '_?')

which gives,

   x                         y_1                   y_2                  y_3
1: x1         http://example1.com  https://example2.com                   NA
2: x2                          NA                    NA                   NA
3: x3 http://example3.com?id=1234  https://example4/com  http://example6.com
4: x4         http://example5.com                    NA                   NA

Upvotes: 4

akrun
akrun

Reputation: 887048

We can use separate from tidyr to separate the column 'y' into multiple columns by separating at the ? that is before the http

library(tidyr)
df_in %>%
     separate(y, into = paste0("col", 1:3), sep="[?](?=http)")
#   x                        col1                 col2                col3
#1 x1         http://example1.com https://example2.com                <NA>
#2 x2                          NA                 <NA>                <NA>
#3 x3 http://example3.com?id=1234 https://example4/com http://example6.com
#4 x4         http://example5.com                 <NA>                <NA>

Upvotes: 5

Related Questions