CLK
CLK

Reputation: 27

cleaning data - expanding one column to multiple columns in a dataframe

The textsample below is in one column. Using R, I hope to separate it into 5 columns with the following headings: "Name" , "Location", "Date", "Time", "Warning" . I have tried separate() and strsplit() and haven't succeeded yet. I hope someone here can help.

textsample <- "Name : York-APC-UPS\r\n
    Location : York SCATS Zigzag Road\r\n
    Contact  : Mechanical  services\r\n
    \r\n
    http://York-APC-UPS.domain25.minortracks.wa.gov.au\r\n
    http://192.168.70.56\r\n
    http://FE81::3C0:B8FF:FE6D:8065\r\n
    Serial Number : 5A1149T24253\r\n
    Device Serial Number : 5A1149T24253\r\n
    Date : 12/06/2018\r\n
    Time : 08:45:46\r\n
    Code : 0x0125\r\n
    \r\n
    Warning : A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor 
   (Port 1 Temp 1 at Port 1) reporting over 50%CD.\r\n"

Upvotes: 2

Views: 263

Answers (3)

s_baldur
s_baldur

Reputation: 33643

Base R (assuming values stay on same line):

extract_value <- function(text, key) {
  regmatches(text, gregexpr(sprintf("(?<=%s : )[^\\r]+", key), text, perl = TRUE))
}
cols <- c("Name" , "Location", "Date", "Time", "Warning")
as.data.frame(sapply(cols, \(x) extract_value(textsample, x)))

#           Name               Location       Date     Time
# 1 York-APC-UPS York SCATS Zigzag Road 12/06/2018 08:45:46
#                                                                                                                                          Warning
# 1 A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor \n   (Port 1 Temp 1 at Port 1) reporting over 50%CD.

Upvotes: 1

Adriano Mello
Adriano Mello

Reputation: 2132

I hope you have resolved your problem by now.

Anyway, here's an alternative take based on regex separator instead of declaring all columns. Bit cleaning and formatting.

library(tidyverse)
library(hms)

# ----------------------
textsample <- textsample %>% 
  str_split_1("\r+\n+") %>% 
  as_tibble_col("text") %>% 
  transmute(
    label = str_extract(text, "^.*?(?=\\s:\\s)"),
    text = str_remove(text, "^.*?\\s:\\s"),
    across(where(is.character), \(x) str_squish(x)),
    label = str_replace_all(str_to_lower(label), "\\s", "_")) %>% 
  filter(!is.na(label)) %>% 
  pivot_wider(names_from = label, values_from = text) %>% 
  mutate(date =  dmy(date), time = hms::as_hms(time))

The output:

> textsample
# A tibble: 1 × 9
  name         location               contact             serial_number device_serial_number date       time     code   warning                                                                                                    
  <chr>        <chr>                  <chr>               <chr>         <chr>                <date>     <time>   <chr>  <chr>                                                                                                      
1 York-APC-UPS York SCATS Zigzag Road Mechanical services 5A1149T24253  5A1149T24253         2018-06-12 08:45:46 0x0125 A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor (Port 1 Temp 1 a…

Upvotes: 0

Ian Campbell
Ian Campbell

Reputation: 24878

Here's an approach that should at least get you started:

We can use extract from tidyr extract the text of interest with regular expressions.

Then we can use mutate_all to apply the same str_replace to get rid of the labels.

library(dplyr)
library(tidyr)
library(stringr)
as.data.frame(extsample) %>% 
  extract(1, into=c("Name","Location","Date","Time","Warning"),
             regex = "(Name : .+)[^$]*(Location : .+)[^$]*(Date : .+)[^$]*(Time : .+)[^$]*(Warning : .+)[^$]*") %>%
  mutate_all(list(~str_replace(.,"^\\w+ : ","")))
#          Name               Location       Date     Time
#1 York-APC-UPS York SCATS Zigzag Road 12/06/2018 08:45:46
#                                                                                     Warning
#1 A high humidity threshold violation exists for integrated Environmental Monitor TH Sensor 

This relies on capturing groups with (), see help(tidyr::extract) for details. We use [^$]* to match anything other than the end of the string 0 or more times between the groups.

Note the first argument to extract is 1, which indicates the first (and only) column of the data.frame I made from your example data. Change this as necessary.

Upvotes: 1

Related Questions