user113156
user113156

Reputation: 7107

Splitting text by "|" and populating a table

I have some data which looks like:

                          cardCharacteristics
1        2 habs.|2 baños|72 m²|Bajos|Ascensor
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción
3                     3 habs.|70 m²|2ª Planta
4       2 habs.|2 baños|160 m²|Terraza|Balcón
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor
6            3 habs.|2 baños|187 m²|4ª Planta

I am trying to split the column into an unspecified number of columns by |. Using the following cSplit_e(., split.col = "cardCharacteristics", sep = "|", type = "character") doesn't get me the result since it splits on all unique values and returns a binary output.

The expected output would be:

tibble(
  "habs" = c(2, 3, 3, 2, 5, 3),
  "baños" = c(2, 2, NA, 2, 2, 2),
  "m^2" = c(72, 110, 70, 160, 176, 187),
  "Floor" = c("Bajos", NA, "2ª Planta", NA, "7ª Planta", "4ª Planta"),
  "Lift" = c("Ascensor", "Ascensor", NA, NA, "Ascensor", NA),
  "Heating" = c(NA, "Calefacción", NA, NA, NA, NA),
  "Terraza" = c(NA, NA, NA, "Terraza", NA, NA),
  "Balcón" = c(NA, NA, NA, "Balcón", NA, NA)
)

or:

   habs baños `m^2` Floor     Lift     Heating     Terraza Balcón
  <dbl> <dbl> <dbl> <chr>     <chr>    <chr>       <chr>   <chr> 
1     2     2    72 Bajos     Ascensor NA          NA      NA    
2     3     2   110 NA        Ascensor Calefacción NA      NA    
3     3    NA    70 2ª Planta NA       NA          NA      NA    
4     2     2   160 NA        NA       NA          Terraza Balcón
5     5     2   176 7ª Planta Ascensor NA          NA      NA    
6     3     2   187 4ª Planta NA       NA          NA      NA

Data:

data = structure(list(cardCharacteristics = c("2 habs.|2 baños|72 m²|Bajos|Ascensor", 
"3 habs.|2 baños|110 m²|Ascensor|Calefacción", "3 habs.|70 m²|2ª Planta", 
"2 habs.|2 baños|160 m²|Terraza|Balcón", "5 habs.|2 baños|176 m²|7ª Planta|Ascensor", 
"3 habs.|2 baños|187 m²|4ª Planta")), row.names = c(NA, 6L
), class = "data.frame")

EDIT:

My progress, is to do the following:

data %>%
mutate(
    habs = str_extract(cardCharacteristics, "(\\d)+(?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d)+(?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d)+(?= m²)"),
    floor = str_extract(cardCharacteristics, "(\\d)+(?= 4ª Planta)")
  )

EDIT 2:

The following:

  mutate(
    habs = str_extract(cardCharacteristics, "(\\d)+(?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d)+(?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d)+(?= m²)"),
    Terraza = str_extract(cardCharacteristics, "Terraza"),
    Calefacción = str_extract(cardCharacteristics, "Calefacción"),
    Floor = str_extract(cardCharacteristics, "(\\d)+(?=ª Planta)|Bajos"),
  )

Gets me:

                          cardCharacteristics habs baños mts2 Terraza Calefacción Floor
1        2 habs.|2 baños|72 m²|Bajos|Ascensor    2     2   72    <NA>        <NA> Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción    3     2  110    <NA> Calefacción  <NA>
3                     3 habs.|70 m²|2ª Planta    3  <NA>   70    <NA>        <NA>     2
4       2 habs.|2 baños|160 m²|Terraza|Balcón    2     2  160 Terraza        <NA>  <NA>
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor    5     2  176    <NA>        <NA>     7
6            3 habs.|2 baños|187 m²|4ª Planta    3     2  187    <NA>        <NA>     4

Which is almost what I need.

Upvotes: 1

Views: 81

Answers (3)

Chris
Chris

Reputation: 2286

Returning to your dictionary idea, you might want a tag:value approach as might be had with write.dcf. My system is a different encoding that yours that make some things hard to test, or with your encoding lead to undesired results, though will likely work on your system. I make the assumption that 'habs.', 'banos' 'm2' are the basics, further entries are additional amenities, and each 'record' starts with 'habs.' Using your data:

# split1 <- strsplit(...
strsplit(unname(unlist(data)), '|', fixed = TRUE)
[[1]]
[1] "2 habs."  "2 baños"  "72 m²"    "Bajos"    "Ascensor"

[[2]]
[1] "3 habs."     "2 baños"     "110 m²"      "Ascensor"    "Calefacción"

These are our records and we want to append \n\n to each as record separator for .dcf

# split1 <- lapply(...
lapply(split1, function(x) c(x, '\n\n'))
[[1]]
[1] "2 habs."  "2 baños"  "72 m²"    "Bajos"    "Ascensor" "\n\n"    

[[2]]
[1] "3 habs."     "2 baños"     "110 m²"      "Ascensor"    "Calefacción"
[6] "\n\n"       

At the moment, the 'tags' are after the 'values', so we have to swap them:

# split1 <- sub(...
sub('(.*) (.*)', '\\2 \\1', unlist(split1))
 [1] "habs. 2"     "baños 2"     "m² 72"       "Bajos"       "Ascensor"   
 [6] "\n\n"        "habs. 3"     "baños 2"     "m² 110"      "Ascensor"   
[11] "Calefacción" "\n\n"        "habs. 3"     "m² 70"       "Planta 2ª"  
[16] "\n\n"

.dcf tag:value so replace ' ' with ':'

# split1 <- gsub(
gsub(' ', ':', split1)
 [1] "habs.:2"     "baños:2"     "m²:72"       "Bajos"       "Ascensor"   
 [6] "\n\n"        "habs.:3"     "baños:2"     "m²:110"      "Ascensor"   
[11] "Calefacción" "\n\n"

We're close at this point, but I'm pretty sure that our un-terminated amenities (Bajos, Ascensor & etc) will be seen as 'malformed', though appears to work with cat

cat(split1)
habs.:2
baños:2
m²:72
Bajos
Ascensor
# but
c(read.dcf(textConnection(split1)))
Error in read.dcf(textConnection(split1)) : 
  Line starting 'Bajos ...' is malformed! # first non ':' terminated

Append ':', plays poorly with my encoding, but should work for yours, my bad results below

cat(gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", '\\1:', split1, perl = TRUE))
habs.:2
 bañ:os:2
 m:²:72
 Bajos:
 Ascensor:

Bad things happen to 'banos' and 'm2'. And perhaps it's all just a bad idea... But, changing the enyey in banos and exponent in m2, things start to get better, but the amenities need either a number or double listing (Ascensor Ascensor)

data$cardCharacteristics[6] <- "3 habs.|2 banos|187 m2|4ª Planta"
split3 <- strsplit(unname(unlist(data)), '|', fixed = TRUE)
split3 <- lapply(split3, function(x) c(x, '\n\n'))
split3 <- sub('(.*) (.*)', '\\2 \\1', unlist(split3))
split3 <- gsub(' ', ':', split3)
split3 <- gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", '\\1:', split3, perl = TRUE)> split3
 [1] "habs.:2"      "banos:2"      "m2:72"        "Bajos:"       "Ascensor:"   
 [6] "\n\n"         "habs.:3"      "banos:2"      "m2:110"       "Ascensor:"   
[11] "Calefacción:" "\n\n"       

A different gsub/regex, also suggested Append ':', ignores internal to string encoding mismatches:

split5 <- strsplit(unname(unlist(data3)), '|', fixed = TRUE)
split5 <- lapply(split5, function(x) c(x, '\n\n'))
split5 <- sub('(.*) (.*)', '\\2 \\1', unlist(split5))
split5 <- gsub(' ', ':', split5)
nocolon <- !grepl(':', split5)
> split5[nocolon] <- paste0(split5[nocolon], ':')
> split5
 [1] "habs.:2"      "baños:2"      "m²:72"        "Bajos:"       "Ascensor:"   
 [6] "\n\n:"

We just have to gsub \n\n to \n\n

split5 <- gsub('\n\n:', '\n\n', split5)
split5_df <- data.frame(read.dcf(textConnection(split5)))
split5_df
  habs. baños  m. Bajos Ascensor Calefacción Planta Terraza Balcón
1     2     2  72                       <NA>   <NA>    <NA>   <NA>
2     3     2 110  <NA>                        <NA>    <NA>   <NA>
3     3  <NA>  70  <NA>     <NA>        <NA>     2ª    <NA>   <NA>
4     2     2 160  <NA>     <NA>        <NA>   <NA>               
5     5     2 176  <NA>                 <NA>     7ª    <NA>   <NA>
6     3     2 187  <NA>     <NA>        <NA>     4ª    <NA>   <NA>

split5_df$Ascensor[which(split5_df$Ascensor == '')] <- c('old','new','scary')

split5_df$Calefacción[which(split5_df$Calefacción == '')] <- 'elec'

And we haven't seen piscina yet, but .dcf will know.

Upvotes: 2

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21410

You were very close indeed, just the way you extracted Floorwas contrary to your aim (the lookahead prevented the extraction of the ª Plantasubstring!):

data %>% 
  mutate(
    habs = str_extract(cardCharacteristics, "(\\d)+(?= habs.)"),
    baños = str_extract(cardCharacteristics, "(\\d)+(?= baños)"),
    mts2 = str_extract(cardCharacteristics, "(\\d)+(?= m²)"),
    Terraza = str_extract(cardCharacteristics, "Terraza"),
    Calefacción = str_extract(cardCharacteristics, "Calefacción"),
    Floor = str_extract(cardCharacteristics, "Bajos|\\d+ª Planta"),  # <--- Corrected here
  )
                          cardCharacteristics habs baños mts2 Terraza Calefacción     Floor
1        2 habs.|2 baños|72 m²|Bajos|Ascensor    2     2   72    <NA>        <NA>     Bajos
2 3 habs.|2 baños|110 m²|Ascensor|Calefacción    3     2  110    <NA> Calefacción      <NA>
3                     3 habs.|70 m²|2ª Planta    3  <NA>   70    <NA>        <NA> 2ª Planta
4       2 habs.|2 baños|160 m²|Terraza|Balcón    2     2  160 Terraza        <NA>      <NA>
5   5 habs.|2 baños|176 m²|7ª Planta|Ascensor    5     2  176    <NA>        <NA> 7ª Planta
6            3 habs.|2 baños|187 m²|4ª Planta    3     2  187    <NA>        <NA> 4ª Planta

Upvotes: 3

Julian
Julian

Reputation: 9260

This is probably solvable using the tidyr package. This could be a starting point for you. sep accepts regex, so you could maybe utilize your regex from the str_extract above.

  library(dplyr)
  library(tidyr)
  data %>% separate(cardCharacteristics,
   sep = "\\|",
   into = c(
     "habs", "baños", "m^2", "Floor",
     "Lift"
   )
 )  

Results in

habs   baños       m^2     Floor        Lift
1 2 habs. 2 baños     72 m²     Bajos    Ascensor
2 3 habs. 2 baños    110 m²  Ascensor Calefacción
3 3 habs.   70 m² 2ª Planta      <NA>        <NA>
4 2 habs. 2 baños    160 m²   Terraza      Balcón
5 5 habs. 2 baños    176 m² 7ª Planta    Ascensor
6 3 habs. 2 baños    187 m² 4ª Planta        <NA>

Upvotes: 1

Related Questions