Reputation: 559

Extracting values from a messy bulk of data

I have a messy block of text that I would like to extract information from. I have not found a convenient way to do this, and I hope you can help. My data looks like this:

"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
       n\r\nDates\r\nSeptember 25th 2016 To September 26th 
         2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited 
         States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited 
         States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

Now, what I would like to get out of this is:

Channels                - 
Dates                   September 25th 2016 To September 26th 2016
Platform                Idea
Country                 United States
Restricted Countries    United States
Initial Price           $0.0692

I will need to perform this task for a large number of observations and then store each variable as a vector across all observations. So I do not really need to store the name of the variable (e.g. "Platform"), only its value ("Idea"). I assume I still need the variable name as an identifier, though, because the position of each variable in the text varies across observations (as does the number of variables, though only slightly).

I think the stringr package is a good fit for this, but I have not worked out how to do it with stringr.

Upvotes: 4

Answers (3)

Nicolas2

Reputation: 2210

With a being your input string, the result will be a single data frame with one variable per keyword (missing values for unused keywords), one row for each input:

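# collapse each run of line breaks (and surrounding tabs) into a single "/"
a <- gsub("\\t*(\\r\\n)+\\t*","/",a)
# drop the leading and trailing "/"
a <- gsub("(^/|/$)","",a)
# mark the keywords with <> so they can be told apart from values
a <- gsub("(Channels|Dates|Platform|Country|Restricted Countries|Initial Price)","<\\1>",a)
# a keyword directly followed by another keyword has no value: insert an empty one
a <- gsub(">/<",">//<",a)
b <- strsplit(a,"/")
# reshape each keyword/value sequence into a one-row data frame
c <- purrr::map(b,
   function(x) {
        dim(x) <-  c(2,length(x)/2)
        tidyr::spread(as.data.frame(t(x),stringsAsFactors=FALSE),V1,V2)
    })
# combine the per-input data frames into one
replyr::replyr_bind_rows(c)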

Upvotes: 1

Radim

Reputation: 455

Base R solution:

yourstring1 <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\
n\r\nDates\r\nSeptember 25th 2016 To September 26th 
2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited 
States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited 
States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

# make a placeholder (useful when manipulating strings for easier regex)
yourstring2 <- gsub("\r|\t|\nn|\n", "@", yourstring1, perl = TRUE) # please note the double nn - this is needed because a newline character is added when copying from here to R
# split on placeholder if it appears twice or more
yourstring2 <- unlist(strsplit(yourstring2, split = "@{2,}"))
# little cleaning needed
yourstring2 <- gsub(" @", " ", yourstring2)
yourstring2[1:2] <- c(yourstring2[2], "-") # this hard-coded solution works for the particular example, if you have many strings with arbitrarily missing values, you may want to make a little condition for that
# prepare your columns by indexing the character vector
variables <- yourstring2[seq(from = 1, to = length(yourstring2), by = 2)]
values <- yourstring2[seq(from = 2, to = length(yourstring2), by = 2)]
# bind them to dataframe
df <- data.frame(variables, values)

Resulting df:

df
             variables                                     values
1             Channels                                          -
2                Dates September 25th 2016 To September 26th 2016
3             Platform                                       Idea
4              Country                              United States
5 Restricted Countries                              United States
6        Initial Price                                    $0.0692
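
The hard-coded step above could be replaced by a small condition, roughly as in the base R sketch below. It is only an illustration under some assumptions: parse_record is a hypothetical helper name, the keyword list is the one from the question, each value is assumed to sit on a single line, and the second record is made up to show a different order with missing variables.

```r
# Keywords taken from the question; extend this vector if other fields occur.
keywords <- c("Channels", "Dates", "Platform", "Country",
              "Restricted Countries", "Initial Price")

# Hypothetical helper: returns a named character vector with NA where a
# keyword is absent or has no value after it.
parse_record <- function(x, keys = keywords) {
  # split on runs of line breaks and tabs, trim blanks, drop empty tokens
  tokens <- trimws(strsplit(x, "[\r\n\t]+")[[1]])
  tokens <- tokens[nzchar(tokens)]
  out <- setNames(rep(NA_character_, length(keys)), keys)
  for (i in which(tokens %in% keys)) {
    nxt <- tokens[i + 1]  # NA when the keyword is the last token
    # the next token is the value unless it is itself a keyword
    if (!is.na(nxt) && !(nxt %in% keys)) out[[tokens[i]]] <- nxt
  }
  out
}

# Shortened version of the record from the question.
rec1 <- "\r\n\r\nChannels\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\nPlatform\r\nIdea\r\n\r\nCountry\r\nUnited States\r\n\r\nRestricted Countries\r\n\r\n\t\t\tUnited States\t\t\t\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n"
# Made-up record: different order, several variables missing.
rec2 <- "\r\nPlatform\r\nIdea\r\n\r\nDates\r\nOctober 1st 2016\r\n"

# One row per observation, one column per keyword.
parsed <- do.call(rbind, lapply(c(rec1, rec2), parse_record))
platform <- parsed[, "Platform"]  # one vector per variable, as asked
```

Because every value is looked up by its keyword, the order of the variables within a record no longer matters, and each column of parsed is the per-variable vector the question asks for.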

EDIT: Only now did I read properly that the desired result may be a vector of values rather than a dataframe. Here is a two-line solution for that:

yourstring2 <- gsub("\r|\t|\nn|\n", "", yourstring1, perl = TRUE) # clean the original string (see yourstring1 above)
yourvector <- unlist(strsplit(yourstring2, split = "Channels|Dates|Platform|Country|Restricted Countries|Initial Price", perl = TRUE))[-1]  # split on the variable names and drop the leading empty element

Resulting vector:

> yourvector
[1] ""                                          
[2] "September 25th 2016 To September 26th 2016"
[3] "Idea"                                      
[4] "United States"                             
[5] "United States"                             
[6] "$0.0692"

Upvotes: 2

AEF

Reputation: 5670

The following regex extracts the values you want. The values are stored in columns 2-7 of the resulting matrix. The code also works with an input vector: each entry produces one row of the matrix.

library(stringr)

input <- "\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nChannels\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\n\r\nPlatform\r\nIdea\r\n\r\n\r\nCountry\r\nUnited States\r\n\r\n\r\nRestricted Countries\r\n\r\n\t\t\t\t\t\t\t\t\tUnited States\t\t\t\t\t\t\t\t\r\n\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n\r\n"

str_match(input, paste0("[[:space:]]*Channels[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Dates[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Platform[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Country[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Restricted Countries[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*Initial Price[[:cntrl:]]+([[:print:]]+)?",
                        "[[:space:]]*"))

Edit: Sorry, I overlooked that the position of the variables within the text can change between inputs. In that case you cannot easily extract all variables at once with this method. However, you can still extract them one by one, using just the corresponding line of the regex above. If a variable is not present (like "Channels" in your example), that is not a problem: it will appear as NA.
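
A minimal sketch of the one-by-one extraction, assuming the keyword list from the question (extract_one is a hypothetical helper name, not part of stringr). One caveat it tries to handle: when a keyword is present but has no value, a single-keyword pattern captures the next keyword instead, so such captures are reset to NA.

```r
library(stringr)

keywords <- c("Channels", "Dates", "Platform", "Country",
              "Restricted Countries", "Initial Price")

# Hypothetical helper: value following one keyword, for every element of x.
extract_one <- function(x, keyword, keys = keywords) {
  pattern <- paste0(keyword, "[[:cntrl:]]+([[:print:]]+)")
  value <- str_trim(str_match(x, pattern)[, 2])
  # an empty field makes the pattern capture the following keyword
  value[value %in% keys] <- NA_character_
  value
}

# Shortened version of the record from the question.
input <- "\r\n\r\nChannels\r\n\r\n\r\nDates\r\nSeptember 25th 2016 To September 26th 2016\r\n\r\nPlatform\r\nIdea\r\n\r\nCountry\r\nUnited States\r\n\r\nRestricted Countries\r\n\r\n\t\t\tUnited States\t\t\t\r\n\r\nInitial Price\r\n$0.0692\r\n\r\n"

# One call per variable; with a vector of inputs each call returns a vector.
platform <- extract_one(input, "Platform")
channels <- extract_one(input, "Channels")
all_values <- sapply(keywords, function(k) extract_one(input, k))
```

This assumes that a keyword never appears inside another variable's value; if that can happen, the patterns would need anchoring to the start of a line.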

Upvotes: 3
