Reputation: 47
How can I separate rows by comma outside brackets in R?
I know that disasters <- disasters %>% separate_rows(Location, sep = ",")
can be used to separate rows by comma, but how to make a condition on "out of the brackets only"?
I have a data table (disasters) that looks like:
<!-- +------------+------------+--------------------------+--------------------------------------------------------------------------------------------------------------------------+--+--+ -->
<!-- | Start.date | End.date | Country | Region | | | -->
<!-- +------------+------------+--------------------------+--------------------------------------------------------------------------------------------------------------------------+--+--+ -->
<!-- | 2005-10-01 | 2005-10-16 | Costa Rica | Quepos (Aguirre district, Puntarenas province), Guanacaste province | | | -->
<!-- | 2016-11-23 | 2016-11-25 | Costa Rica | Upala, Bagaces | | | -->
<!-- | ... | ... | ... | ... | | | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Bonao city (Monsenor Nouel district, Monsenor Nouel province), Nagua district (Maria Trinidad Sanches province), Arenoso | | | -->
<!-- | ... | ... | ... | ... | | | -->
<!-- +------------+------------+--------------------------+--------------------------------------------------------------------------------------------------------------------------+--+--+ -->
It should be transformed to be like:
<!-- +------------+------------+--------------------------+---------------------------------------------------------------+--+--+ -->
<!-- | Start.date | End.date | Country | Region | | | -->
<!-- +------------+------------+--------------------------+---------------------------------------------------------------+--+--+ -->
<!-- | 2005-10-01 | 2005-10-16 | Costa Rica | Quepos (Aguirre district, Puntarenas province) | | | -->
<!-- | 2005-10-01 | 2005-10-16 | Costa Rica | Guanacaste province | | | -->
<!-- | 2016-11-23 | 2016-11-25 | Costa Rica | Upala | | | -->
<!-- | 2016-11-23 | 2016-11-25 | Costa Rica | Bagaces | | | -->
<!-- | ... | ... | ... | ... | | | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Bonao city (Monsenor Nouel district, Monsenor Nouel province) | | | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Nagua district (Maria Trinidad Sanches province) | | | -->
<!-- | 2007-12-11 | 2007-12-17 | Dominican Republic (the) | Arenoso | | | -->
<!-- | ... | ... | ... | ... | | | -->
<!-- +------------+------------+--------------------------+---------------------------------------------------------------+--+--+ -->
Your kind help is appreciated. The following is part of the data.table:
Start.date End.date Country ISO
1: 2004-01-05 2004-01-05 American Samoa ASM
2: 2004-09-02 2004-09-03 Bahamas (the) BHS
3: 2004-08-25 2004-09-08 Bahamas (the) BHS
4: 2005-10-19 2005-10-25 Bahamas (the) BHS
5: 2007-10-28 2007-11-02 Dominican Republic (the) DOM
Location
1: American Samoa
2: Abacos, Andros, Berry Islands, Bimini, Eleuthera, Grand Bahama, New Providence islands (Administrative unit not available)
3: Acklins, Berry Islands, Biminis, Black Point, Cat Island, Central Abaco, Central Andros, Central Eleuthera, City of Freeport, Crooked Island, East Grand Bahama, Exuma, Grand Cay, Harbour Island, Hope Town, Inagua, Long Island, Mangrove Cay, Mayaguana, Moore's Island, New Providence, North Abaco, North Andros, North Eleuthera, Ragged Island, Rum Cay, San Salvador, South Abaco, South Andros, South Eleuthera, Spanish Wells, West Grand Bahama
4: Acklins, Berry Islands, Biminis, Black Point, Cat Island, Central Abaco, Central Andros, Central Eleuthera, City of Freeport, Crooked Island, East Grand Bahama, Exuma, Grand Cay, Harbour Island, Hope Town, Inagua, Long Island, Mangrove Cay, Mayaguana, Moore's Island, New Providence, North Abaco, North Andros, North Eleuthera, Ragged Island, Rum Cay, San Salvador, South Abaco, South Andros, South Eleuthera, Spanish Wells, West Grand Bahama
5: Santo Domingo, Distrito Nacional, San Cristobal, Peravia, Azua, Barahona, Pedernales, Independencia, Baoruco, San Juan, Santiago, Puerto Plata, Espaillat, Salcedo, Duarte, La Vega, Monte Plata, Monsenor Nouel, Hato Mayor, El Seibo, Dajabon, Monte Cristi, Santiago Rodriguez, La Altagracia, San Pedro de Macoris provinces
Latitude Longitude Magnitude.value Magnitude.scale Disaster.type
1: NA NA 310 Kph Storm
2: NA NA NA Kph Storm
3: NA NA NA Kph Storm
4: NA NA NA Kph Storm
5: 18.53 -70.06 NA Kph Storm
Disaster.subtype Associated.disaster Associated.disaster2 Total.deaths
1: Tropical cyclone -- -- 0
2: Tropical cyclone -- -- 2
3: Tropical cyclone -- -- 1
4: Tropical cyclone -- -- 1
5: Tropical cyclone Flood Slide (land, mud, snow, rock) 129
Total.affected Total.damage...000.US.. Insured.losses...000.US..
1: 23060 150000 0
2: 8000 1000000 230000
3: 0 0 0
4: 1500 0 0
5: 79728 77700 0
Disaster.name Disaster.No.
1: Heta 2004-0004
2: Frances 2004-0455
3: Ivan 2004-0462
4: Hurricane "Wilma" 2005-0585
5: Noel 2007-0523
Upvotes: 1
Views: 159
Reputation: 887148
One option is to specify the sep
as a regex lookaround
library(tidyr)
library(dplyr)
disasters %>%
separate_rows(C, sep = "(?<=Region.),\\s*|(?<=\\)),\\s*")
# A tibble: 5 x 3
# A B C
# <chr> <chr> <chr>
#1 Country1 Oct 2000 Region1
#2 Country1 Oct 2000 Region2
#3 Country1 Dec 2016 Region1
#4 Country1 Dec 2016 Region2(City1, City2)
#5 Country1 Dec 2016 Region5
Using the updated dataset, one option is
lst1 <- strsplit(disasters2$Region, "\\(.*?\\)(*SKIP)(*F)|,\\s*", perl = TRUE)
nm1 <- setdiff(names(disasters2), "Region")
out <- transform(disasters2[rep(seq_len(nrow(disasters2)),
lengths(lst1)),nm1], Region = unlist(lst1))
row.names(out) <- NULL
out
# Start.date End.date Country Region
#1 2005-10-10 2005-10-16 Costa Rica Quepos (Aguirre district, Puntarenas province)
#2 2005-10-10 2005-10-16 Costa Rica Guanacaste province
#3 2016-11-23 2016-11-25 Costa Rica Upala
#4 2016-11-23 2016-11-25 Costa Rica Bagaces
disasters <- structure(list(A = c("Country1", "Country1"), B = c("Oct 2000",
"Dec 2016"), C = c("Region1, Region2", "Region1, Region2(City1, City2), Region5"
)), class = "data.frame", row.names = c(NA, -2L))
disasters2 <- structure(list(Start.date = c("2005-10-10", "2016-11-23"), End.date = c("2005-10-16",
"2016-11-25"), Country = c("Costa Rica", "Costa Rica"),
Region = c("Quepos (Aguirre district, Puntarenas province), Guanacaste province",
"Upala, Bagaces")), class = "data.frame", row.names = c(NA, -2L
))
Upvotes: 2
Reputation: 1081
This might serve your cause.
txt <- c("Region1, Region2",
"Region1, Region2(City1, City2), Region5",
"RegionA, RegionB (CityA, CityB, CityC)",
"RegionF, RegionG" )
#split at `,` followed by a `space` but not by `City` then a `number` or a `capital letter`
strsplit(txt, "(,\\s(?!City[[:upper:]]|\\d))", perl = TRUE)
[[1]]
[1] "Region1" "Region2"
[[2]]
[1] "Region1" "Region2(City1" "City2)" "Region5"
[[3]]
[1] "RegionA" "RegionB (CityA, CityB, CityC)"
[[4]]
[1] "RegionF" "RegionG"
Upvotes: 0