Fred12

Reputation: 23

R split uneven strings with uneven number of spaces

I'm trying to split strings whose fields are separated by runs of spaces. However, the number of spaces between fields varies from line to line, e.g.

 "abc          20"
 "csd   10"
 "eds     10     30"

and I'm trying to obtain the following:

"abc" " " "20"
"csd" "10" " "
"eds" "10" "30"

Any idea how to do this? Note that splitting on a fixed number of spaces is not possible because the counts vary a bit. I was thinking about splitting on a single space that is either preceded or followed by a letter or a digit, but I have no clue how to do that.

Upvotes: 2

Views: 363

Answers (2)

Sola Cong Mou

Reputation: 11

I got another solution that saves the labor of counting the spaces :>

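# Split each row on single spaces, drop the empty pieces, and stack the results.
# Needs stringr >= 1.5.0 (for str_split_1) and dplyr; df is the toy data below.
s_split = data.frame()
for (i in seq_len(nrow(df))) {
    s = as.character(df[i, 1])
    pieces = stringr::str_split_1(s, ' ')
    # one-row data frame holding only the non-empty pieces
    temp = as.data.frame(t(pieces[pieces != '']))
    # bind_rows() pads rows that have fewer pieces with NA
    s_split = dplyr::bind_rows(s_split, temp)
}
s_split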

Here is the toy data based on the posts above:

a = "abc          20"
b = "csd   10"
c =  "eds     10     30"
df = as.data.frame(rbind(a,b,c))
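
The same idea as the loop above can probably be written without growing a data frame row by row. This is only a minimal base R sketch, assuming the input is a plain character vector (called s here) rather than the first column of df; the names parts, n and padded are made up for illustration. Like the loop, it packs values to the left, so the 20 in the first string ends up in the second column rather than the third.

```r
# Toy data from the question, as a character vector.
s <- c("abc          20", "csd   10", "eds     10     30")

# Split every string on runs of one or more spaces.
parts <- strsplit(s, " +")

# Pad the shorter rows with NA so that all rows have the same length.
n <- max(lengths(parts))
padded <- lapply(parts, function(p) c(p, rep(NA_character_, n - length(p))))

# Stack the padded rows into a data frame with columns V1, V2, V3.
s_split <- as.data.frame(do.call(rbind, padded))
s_split
```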

Upvotes: 0

G. Grothendieck

Reputation: 269634

1) read.fwf Try read.fwf. Adjust the widths as needed.

s <- c("abc          20", "csd   10", "eds     10     30")  # test data
read.fwf(textConnection(s), widths = c(3, 7, 7))

giving:

   V1 V2 V3
1 abc NA 20
2 csd 10 NA
3 eds 10 30

2) kmeans This approach finds, for each line, the position g of the space immediately before each of fields 2 and 3, and clusters those positions into two groups using kmeans. It assumes that field 1 is always present, since that seems to be the case in the question. If a line has only two fields, its second field is assigned to the column whose cluster center is closest to it.

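g <- gregexpr(" \\S", s)      # per line: position of the space before each later field
km <- kmeans(unlist(g), 2)    # cluster those positions into two column groups
centers <- sort(km$centers)   # left center = field 2, right center = field 3
spl <- strsplit(s, " +")      # split each line on runs of spaces
# Rebuild each line as CSV. A two-field line gets one comma if its second field
# is nearest the left center and two commas if it is nearest the right center.
f <- function(s, g) {
  if (length(s) == 2) paste0(s[1], strrep(",", which.min(abs(g - centers))), s[2])
  else paste(s, collapse = ",")
}
read.table(text = mapply(f, spl, g), sep = ",", fill = TRUE, as.is = TRUE)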

giving:

   V1 V2 V3
1 abc NA 20
2 csd 10 NA
3 eds 10 30
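
To see what the clustering step in (2) is working with, the intermediate values can be printed. This is just an illustrative sketch on the same test data; pos and col_index are names introduced here, and the seed is arbitrary and only there because kmeans picks random starting centers.

```r
# Same test data as above.
s <- c("abc          20", "csd   10", "eds     10     30")

# Position of the space immediately before each of fields 2 and 3, per line.
g <- gregexpr(" \\S", s)
pos <- unlist(g)

# Fix the seed so repeated runs give the same clustering (value is arbitrary).
set.seed(1)
km <- kmeans(pos, centers = 2)
centers <- sort(km$centers)

# Index of the nearest center for each position: 1 = field 2, 2 = field 3.
col_index <- vapply(pos, function(p) which.min(abs(p - centers)), integer(1))

# One row per detected field: source line, position, and assigned column.
data.frame(line = rep(seq_along(s), lengths(g)), pos = pos, column = col_index + 1L)
```

With well-separated columns the two centers should land near the typical start of each column, which is why a line with a single extra field can be placed by nearest center.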

Upvotes: 3
