Reputation: 83
I have a string that looks like this:
# character string
string <- "lambs: cows: 281 chickens: 20 goats: 3 trees: 13"
I want to create a dataframe that looks like this:
# structure
lambs <- NA
cows <- 281
chickens <- 20
goats <- 3
trees <- 13
# dataframe
df <-
  cbind(lambs, cows, chickens, goats, trees) %>%
  as.data.frame()
This is what I have tried so far:
# split string
test <- strsplit(string, " ")
test
The data is quite unclean, so the spacing isn't always consistent, and sometimes a count is missing (compare "lamb: 5 cow: 50" with "lamb: cow: 40"). What is the easiest way to do this using tidyverse?
Upvotes: 1
Views: 185
Reputation: 73702
You can try read.table. The "no lambs" issue can be solved by putting in a zero with gsub.
r <- na.omit(unlist(read.table(text=gsub(": ", " 0", string), sep=" ")))
r <- replace(r, r == 0, NA)
## long format
type.convert(as.data.frame(matrix(r, ncol=2, byrow=TRUE)), as.is=TRUE)
# V1 V2
# 1 lambs NA
# 2 cows 281
# 3 chickens 20
# 4 goats 3
# 5 trees 13
## wide format
setNames(type.convert(r[seq(r) %% 2 == 0], as.is=TRUE), r[seq(r) %% 2 == 1])
# lambs cows chickens goats trees
# NA 281 20 3 13
Upvotes: 0
Reputation: 389275
You can use str_match_all and pass it the pattern to extract.
tmp <- stringr::str_match_all(string, '\\s*(.*?):\\s*(\\d+)?')[[1]][, -1]
data <- type.convert(data.frame(tmp), as.is = TRUE)
# X1 X2
#1 lambs NA
#2 cows 281
#3 chickens 20
#4 goats 3
#5 trees 13
This divides the data into two columns: the first is everything before the colon (:), excluding leading whitespace, and the second is the number that follows it. The number part is optional, to accommodate cases like 'lambs' that have no number.
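For the one-row layout shown in the question, the two-column result can be pivoted. This is only a minimal sketch, assuming the stringr and tidyr packages are installed; the column names animal and n are illustrative and not part of the code above.

```r
library(stringr)
library(tidyr)

string <- "lambs: cows: 281 chickens: 20 goats: 3 trees: 13"

# Same pattern as above: label before each colon, then an optional run of digits.
# drop = FALSE keeps a matrix even if only one pair is found.
tmp <- str_match_all(string, "\\s*(.*?):\\s*(\\d+)?")[[1]][, -1, drop = FALSE]

# Long format; a missing number comes back as NA and stays NA after conversion
long <- data.frame(animal = tmp[, 1], n = as.integer(tmp[, 2]))

# One column per label, one row for the string
wide <- pivot_wider(long, names_from = animal, values_from = n)
```

Here wide should be a one-row tibble with columns lambs, cows, chickens, goats and trees, with NA under lambs.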
Upvotes: 2
Reputation: 160952
Try this:
gre <- gregexpr("\\b([A-Za-z]+:\\s*[0-9]*)\\b", string)
regmatches(string, gre)
# [[1]]
# [1] "lambs: " "cows: 281" "chickens: 20" "goats: 3" "trees: 13"
lapply(regmatches(string, gre), strcapture, pattern = "(.*):(.*)", proto = list(anim = character(0), n = character(0)))
# [[1]]
# anim n
# 1 lambs
# 2 cows 281
# 3 chickens 20
# 4 goats 3
# 5 trees 13
frames <- lapply(regmatches(string, gre), strcapture,
                 pattern = "(.*):(.*)", proto = list(anim = character(0), n = character(0)))
If you have multiple strings (and not just one), this ensures that each string is processed and that all of the data is then combined.
alldat <- do.call(rbind, frames)
alldat$n <- as.integer(alldat$n)
alldat
# anim n
# 1 lambs NA
# 2 cows 281
# 3 chickens 20
# 4 goats 3
# 5 trees 13
If you instead really need the data in a "wide" format, then:
do.call(rbind, lapply(frames, function(z) do.call(data.frame, setNames(as.list(as.integer(z$n)), z$anim))))
# lambs cows chickens goats trees
# 1 NA 281 20 3 13
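As a rough illustration of the multi-string case, here is the same idea applied to the two example strings from the question. It is a sketch in base R only; the name strings is made up for the example, the pattern drops the \\b anchors for simplicity, and it assumes every string carries the same labels in the same order (otherwise rbind will complain).

```r
strings <- c("lamb: 5 cow: 50", "lamb: cow: 40")

# Find every "label: digits" chunk in each string; the digits are optional
gre <- gregexpr("[A-Za-z]+:\\s*[0-9]*", strings)
chunks <- regmatches(strings, gre)

# One long data frame per input string
frames <- lapply(chunks, strcapture, pattern = "(.*):(.*)",
                 proto = list(anim = character(0), n = character(0)))

# One wide row per input string; a blank count becomes NA
wide <- do.call(rbind, lapply(frames, function(z) {
  as.data.frame(setNames(as.list(as.integer(trimws(z$n))), z$anim))
}))
```

Each row of wide corresponds to one input string, so the second row has NA for lamb and 40 for cow.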
Upvotes: 1