Reputation: 559
I have a messy bulk of data that I would like to extract information from, and I have not yet found a convenient way to do it. I hope you can help. My data looks like this:
"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
n\r\nDates\r\nSeptember 25th 2016 To September 26th
2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited
States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited
States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"
Now, what I would like to get out of this is:
Channels -
Dates September 25th 2016 To September 26th 2016
Platform Idea
Country United States
Restricted Countries United States
Initial Price $0.0692
I will need to perform this task for a larger number of observations and then store each variable as a vector of all observations. Thus, I do not really need to store the name of the variable (i.e. "Platform"), but only the result ("Idea"). I assume I still need the variable name ("Platform") as an identifier, though, because the position of a variable in the text varies across observations (as does the number of variables, though only slightly).
I think the stringr package would be a good fit for this, but I have not found a convenient way to apply it.
Upvotes: 4
Views: 204
Reputation: 2210
With a being your input string, the result will be a single data frame with one variable per keyword (missing values for unused keywords), one row for each input:
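# collapse each run of line breaks (and surrounding tabs) into a single "/"
a <- gsub("\\t*(\\r\\n)+\\t*", "/", a)
# drop the leading and trailing "/"
a <- gsub("(^/|/$)", "", a)
# mark the keywords so they can be told apart from the values
a <- gsub("(Channels|Dates|Platform|Country|Restricted Countries|Initial Price)", "<\\1>", a)
# two adjacent keywords mean an empty value: insert an empty field between them
a <- gsub(">/<", ">//<", a)
b <- strsplit(a, "/")
# build one single-row data frame per input, with one column per keyword
c <- purrr::map(b, function(x) {
  dim(x) <- c(2, length(x) / 2)
  tidyr::spread(as.data.frame(t(x), stringsAsFactors = FALSE), V1, V2)
})
# stack the rows into one data frame
replyr::replyr_bind_rows(c)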
Upvotes: 1
Reputation: 455
Base R solution:
yourstring1 <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
n\r\nDates\r\nSeptember 25th 2016 To September 26th
2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited
States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited
States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"
# make a placeholder (makes the later regex easier to write)
# note the "\nn" alternative: it is needed because a newline character is added when copying the string from here into R
yourstring2 <- gsub("\r|\t|\nn|\n", "@", yourstring1, perl = TRUE)
# split on placeholder if it appears twice or more
yourstring2 <- unlist(strsplit(yourstring2, split = "@{2,}"))
# little cleaning needed
yourstring2 <- gsub(" @", " ", yourstring2)
# hard-coded fix that works for this particular example; if you have many strings
# with arbitrarily missing values, you may want to add a condition for that
yourstring2[1:2] <- c(yourstring2[2], "-")
# prepare your columns by indexing the character vector
variables <- yourstring2[seq(from = 1, to = length(yourstring2), by = 2)]
values <- yourstring2[seq(from = 2, to = length(yourstring2), by = 2)]
# bind them to dataframe
df <- data.frame(variables, values)
Resulting df:
> df
             variables                                     values
1             Channels                                          -
2                Dates September 25th 2016 To September 26th 2016
3             Platform                                       Idea
4              Country                              United States
5 Restricted Countries                              United States
6        Initial Price                                    $0.0692
EDIT: only now did I properly read that the desired result may be a vector of the values rather than a data frame. Here is a two-line solution for that:
yourstring2 <- gsub("\r|\t|\nn|\n", "", yourstring1, perl = T) #clean the original string (see above yourstring1)
yourvector <- unlist(strsplit(yourstring2, split = "Channels|Dates|Platform|Country|Restricted Countries|Initial Price", perl = T))[-1] # extract
Resulting vector:
> yourvector
[1] ""
[2] "September 25th 2016 To September 26th 2016"
[3] "Idea"
[4] "United States"
[5] "United States"
[6] "$0.0692"
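Because the question says that the position and number of variables can differ between observations, a purely positional vector may end up misaligned. The following is a minimal base R sketch of one way around that, assuming every value sits on the first non-empty line after its keyword. The helper name `parse_record` and the second record in `records` are made up for illustration and are not part of the original data.

```r
keywords <- c("Channels", "Dates", "Platform", "Country",
              "Restricted Countries", "Initial Price")

# Hypothetical helper: return one named value per keyword (NA if absent or empty).
parse_record <- function(x, keywords) {
  # split on any run of line breaks or tabs and drop the empty pieces
  tokens <- trimws(unlist(strsplit(x, "[\r\n\t]+")))
  tokens <- tokens[nzchar(tokens)]
  pos <- match(keywords, tokens)   # position of each keyword, NA if it is absent
  nxt <- tokens[pos + 1]           # the token that follows each keyword
  nxt[nxt %in% keywords] <- NA     # followed by another keyword: the value is empty
  setNames(nxt, keywords)
}

records <- c(
  "\r\n\r\nChannels\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\tUnited States\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n",
  # made-up second observation: different order, some variables missing
  "\r\nCountry\r\nCanada\r\n\r\nPlatform\r\nIdea\r\n"
)

# one row per observation, one column per variable
result <- do.call(rbind, lapply(records, parse_record, keywords = keywords))
```

Each column of `result` (for example `result[, "Platform"]`) is then the vector of that variable across all observations, with `NA` where a variable is missing or has no value.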
Upvotes: 2
Reputation: 5670
The following regex extracts the values you want. The values are stored in columns 2-7 of the resulting matrix. The code also works with an input vector (each entry forms a new row in the matrix).
library(stringr)
input <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"
str_match(input, paste0("[[:space:]]*Channels[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Dates[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Platform[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Country[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Restricted Countries[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Initial Price[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*"))
Edit: Sorry, I overlooked that the position of the variables within the text can change between different inputs. In that case you cannot easily extract all variables at once with this method. However, you can still extract them one by one by using just the appropriate line of the regex above. If a variable has no value (like "Channels" in your example), that is not a problem: it will appear as NA.
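Extracting one keyword at a time could look roughly like the sketch below. It uses base R's `regexec` and `regmatches` instead of `str_match` so that it runs without extra packages, and the helper name `extract_field` is made up for illustration. One caveat when a single line of the pattern is used on its own: for an empty field the capture group picks up the next keyword (for "Channels" it would capture "Dates"), so the sketch treats a captured keyword as a missing value. It also assumes the keywords contain no regex metacharacters.

```r
keywords <- c("Channels", "Dates", "Platform", "Country",
              "Restricted Countries", "Initial Price")

input <- "\r\n\r\nChannels\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\tUnited States\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

# Hypothetical helper: value following one keyword, NA if absent or empty.
extract_field <- function(input, keyword, keywords) {
  # keyword, then control characters, then an optional run of printable characters
  pattern <- paste0(keyword, "[[:cntrl:]]+([[:print:]]+)?")
  m <- regmatches(input, regexec(pattern, input))
  out <- vapply(m, function(x) {
    # no match, empty capture, or the capture is just the next keyword
    if (length(x) < 2 || is.na(x[2]) || !nzchar(x[2]) || x[2] %in% keywords) {
      NA_character_
    } else {
      trimws(x[2])
    }
  }, character(1))
  unname(out)
}

# all variables of one input as a named character vector
values <- vapply(keywords, function(k) extract_field(input, k, keywords),
                 character(1))
```

Since `extract_field` is vectorised over `input`, calling it with a character vector of observations returns the vector of that variable across all of them.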
Upvotes: 3