Ziroque
Ziroque

Reputation: 47

How to separate rows by comma outside brackets in R

How can I separate rows by comma outside brackets in R?

I know that disasters <- disasters %>% separate_rows(Location, sep = ",") can be used to separate rows by comma, but how to make a condition on "out of the brackets only"?

I have a data table (disasters) that looks like:

<!-- +------------+------------+--------------------------+--------------------------------------------------------------------------------------------------------------------------+--+--+ -->
<!-- | Start.date |  End.date  |         Country          |                                                          Region                                                          |  |  | -->
<!-- +------------+------------+--------------------------+--------------------------------------------------------------------------------------------------------------------------+--+--+ -->
<!-- | 2005-10-01 | 2005-10-16 | Costa Rica               | Quepos (Aguirre district, Puntarenas province), Guanacaste province                                                      |  |  | -->
<!-- | 2016-11-23 | 2016-11-25 | Costa Rica               | Upala, Bagaces                                                                                                           |  |  | -->
<!-- | ...        | ...        | ...                      | ...                                                                                                                      |  |  | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Bonao city (Monsenor Nouel district, Monsenor Nouel province), Nagua district (Maria Trinidad Sanches province), Arenoso |  |  | -->
<!-- | ...        | ...        | ...                      | ...                                                                                                                      |  |  | -->
<!-- +------------+------------+--------------------------+--------------------------------------------------------------------------------------------------------------------------+--+--+ -->

It should be transformed to be like:

<!-- +------------+------------+--------------------------+---------------------------------------------------------------+--+--+ -->
<!-- | Start.date |  End.date  |         Country          |                            Region                             |  |  | -->
<!-- +------------+------------+--------------------------+---------------------------------------------------------------+--+--+ -->
<!-- | 2005-10-01 | 2005-10-16 | Costa Rica               | Quepos (Aguirre district, Puntarenas province)                |  |  | -->
<!-- | 2005-10-01 | 2005-10-16 | Costa Rica               | Guanacaste province                                           |  |  | -->
<!-- | 2016-11-23 | 2016-11-25 | Costa Rica               | Upala                                                         |  |  | -->
<!-- | 2016-11-23 | 2016-11-25 | Costa Rica               | Bagaces                                                       |  |  | -->
<!-- | ...        | ...        | ...                      | ...                                                           |  |  | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Bonao city (Monsenor Nouel district, Monsenor Nouel province) |  |  | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Nagua district (Maria Trinidad Sanches province)              |  |  | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Arenoso                                                       |  |  | -->
<!-- | ...        | ...        | ...                      | ...                                                           |  |  | -->
<!-- +------------+------------+--------------------------+---------------------------------------------------------------+--+--+ -->

Your kind help is appreciated. The following is part of the data.table:

   Start.date   End.date                  Country ISO
1: 2004-01-05 2004-01-05           American Samoa ASM
2: 2004-09-02 2004-09-03            Bahamas (the) BHS
3: 2004-08-25 2004-09-08            Bahamas (the) BHS
4: 2005-10-19 2005-10-25            Bahamas (the) BHS
5: 2007-10-28 2007-11-02 Dominican Republic (the) DOM
                                                                                                                                                                                                                                                                                                                                                                                                                                                     Location
1:                                                                                                                                                                                                                                                                                                                                                                                                                                             American Samoa
2:                                                                                                                                                                                                                                                                                                                                 Abacos, Andros, Berry Islands, Bimini, Eleuthera, Grand Bahama, New Providence islands (Administrative unit not available)
3: Acklins, Berry Islands, Biminis, Black Point, Cat Island, Central Abaco, Central Andros, Central Eleuthera, City of Freeport, Crooked Island, East Grand Bahama, Exuma, Grand Cay, Harbour Island, Hope Town, Inagua, Long Island, Mangrove Cay, Mayaguana, Moore's Island, New Providence, North Abaco, North Andros, North Eleuthera, Ragged Island, Rum Cay, San Salvador, South Abaco, South Andros, South Eleuthera, Spanish Wells, West Grand Bahama
4: Acklins, Berry Islands, Biminis, Black Point, Cat Island, Central Abaco, Central Andros, Central Eleuthera, City of Freeport, Crooked Island, East Grand Bahama, Exuma, Grand Cay, Harbour Island, Hope Town, Inagua, Long Island, Mangrove Cay, Mayaguana, Moore's Island, New Providence, North Abaco, North Andros, North Eleuthera, Ragged Island, Rum Cay, San Salvador, South Abaco, South Andros, South Eleuthera, Spanish Wells, West Grand Bahama
5:                                                                                                                           Santo Domingo, Distrito Nacional, San Cristobal, Peravia, Azua, Barahona, Pedernales, Independencia, Baoruco, San Juan, Santiago, Puerto Plata, Espaillat, Salcedo, Duarte, La Vega, Monte Plata, Monsenor Nouel, Hato Mayor, El Seibo, Dajabon, Monte Cristi, Santiago Rodriguez, La Altagracia, San Pedro de Macoris provinces
   Latitude Longitude Magnitude.value Magnitude.scale Disaster.type
1:       NA        NA             310             Kph         Storm
2:       NA        NA              NA             Kph         Storm
3:       NA        NA              NA             Kph         Storm
4:       NA        NA              NA             Kph         Storm
5:    18.53    -70.06              NA             Kph         Storm
   Disaster.subtype Associated.disaster          Associated.disaster2 Total.deaths
1: Tropical cyclone                  --                            --            0
2: Tropical cyclone                  --                            --            2
3: Tropical cyclone                  --                            --            1
4: Tropical cyclone                  --                            --            1
5: Tropical cyclone               Flood Slide (land, mud, snow, rock)          129
   Total.affected Total.damage...000.US.. Insured.losses...000.US..
1:          23060                  150000                         0
2:           8000                 1000000                    230000
3:              0                       0                         0
4:           1500                       0                         0
5:          79728                   77700                         0
       Disaster.name Disaster.No.
1:              Heta    2004-0004
2:           Frances    2004-0455
3:              Ivan    2004-0462
4: Hurricane "Wilma"    2005-0585
5:              Noel    2007-0523

Upvotes: 1

Views: 159

Answers (2)

akrun
akrun

Reputation: 887148

One option is to specify the sep as a regex lookaround

library(tidyr)
library(dplyr)
disasters %>%
      separate_rows(C, sep = "(?<=Region.),\\s*|(?<=\\)),\\s*")
# A tibble: 5 x 3
#  A        B        C                    
#  <chr>    <chr>    <chr>                
#1 Country1 Oct 2000 Region1              
#2 Country1 Oct 2000 Region2              
#3 Country1 Dec 2016 Region1              
#4 Country1 Dec 2016 Region2(City1, City2)
#5 Country1 Dec 2016 Region5  

Using the updated dataset, one option is

lst1 <- strsplit(disasters2$Region, "\\(.*?\\)(*SKIP)(*F)|,\\s*", perl = TRUE)     
nm1 <- setdiff(names(disasters2), "Region")
out <- transform(disasters2[rep(seq_len(nrow(disasters2)), 
     lengths(lst1)),nm1], Region = unlist(lst1))    


row.names(out) <- NULL
out
#  Start.date   End.date    Country                                         Region
#1 2005-10-10 2005-10-16 Costa Rica Quepos (Aguirre district, Puntarenas province)
#2 2005-10-10 2005-10-16 Costa Rica                            Guanacaste province
#3 2016-11-23 2016-11-25 Costa Rica                                          Upala
#4 2016-11-23 2016-11-25 Costa Rica                                        Bagaces

data

disasters <- structure(list(A = c("Country1", "Country1"), B = c("Oct 2000", 
"Dec 2016"), C = c("Region1, Region2", "Region1, Region2(City1, City2), Region5"
)), class = "data.frame", row.names = c(NA, -2L))

disasters2 <-   structure(list(Start.date = c("2005-10-10", "2016-11-23"), End.date = c("2005-10-16", 
"2016-11-25"), Country = c("Costa Rica", "Costa Rica"), 
Region = c("Quepos (Aguirre district, Puntarenas province), Guanacaste province", 
"Upala, Bagaces")), class = "data.frame", row.names = c(NA, -2L
))

Upvotes: 2

Eyayaw
Eyayaw

Reputation: 1081

This might serve your cause.

txt <- c("Region1, Region2",
 "Region1, Region2(City1, City2), Region5",
 "RegionA, RegionB (CityA, CityB, CityC)",
"RegionF, RegionG" )   

#split at `,` followed by a `space` but not by `City` then a `number` or a `capital letter`
strsplit(txt, "(,\\s(?!City[[:upper:]]|\\d))", perl = TRUE)
[[1]]
[1] "Region1" "Region2" 

[[2]]
[1] "Region1"       "Region2(City1" "City2)"        "Region5"      

[[3]]
[1] "RegionA"                       "RegionB (CityA, CityB, CityC)"

[[4]]
[1] "RegionF" "RegionG" 

Upvotes: 0

Related Questions