Antoni Parellada
Antoni Parellada

Reputation: 4791

How to select rows of a dataframe in R when there are multiple duplicates?

I am trying to take a look and plot the number of cases of COVID19 each date in the US and China using the dataset in this Johns Hopkins website. I got a great answer here as to how to get subset the data for a similar plot sampling three different countries, thanks to Ben Bolker:

require(RCurl)
require(foreign)
require(tidyverse) # To tip the df from long row of dates to cols (pivot_longer())
x = getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")

corona = (read_csv(x)
          %>% pivot_longer(cols = -c(`Province/State`, `Country/Region`, Lat, Long),
                           names_to = "date",
                           values_to = "cases")
          %>% select(`Country/Region`, date, cases)
          %>% mutate(date=as.Date(date,format="%m/%d/%y"))
          %>% drop_na(cases)
          %>% rename(country="Country/Region")
)

cc <- (corona
       %>% filter(country %in% c("Italy","Spain", "Korea, South"))
)

ccw <- (cc
        %>% pivot_wider(names_from="country",values_from="cases")
        %>% filter(cumsum(Italy>0 | Spain>0)>=5)
)

Unfortunately, this is much more complicated in the US, and possibly China, because there are many entries for the United States, depending on the column Province/State, which in the US would correspond to county and state. Hence, there is overcounting (on the one hand the cases in each state are entered in different rows, and in addition, the same numbers reappear in other rows, broken down by county).

I just managed to get the data for the US without duplicates using the function state.name as below, but I don't have a similar dataset with the names of the pertinent political geographic partitions of China.

Here is an example of the problem:

enter image description here

How can avoid duplicating counts each day due to these overlapping rows - the fact that they didn't split in two columns the state and counties?

Here is the working code for the US (China still pending):

require(RCurl)
require(foreign)
require(tidyverse) # To tip the df from long row of dates to cols (pivot_longer())



x = getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")

corona = (read_csv(x)
          %>% pivot_longer(cols = -c(`Province/State`, `Country/Region`, Lat, Long),
                           names_to = "date",
                           values_to = "cases")
          %>% select(`Province/State`, `Country/Region`, date, cases)
          %>% mutate(date=as.Date(date,format="%m/%d/%y"))
          %>% drop_na(cases)
          %>% rename(country="Country/Region")
          %>% rename(state="Province/State")
)

cc_with_states <- corona[is.element(corona$state,state.name),]
cc <- cc_with_states[,2:4]
us <- aggregate(cc[ ,3], FUN="sum", by=list(as.Date(cc$date)))
cc[,2:3] <- us 
cc <- cc[1:nrow(us),]

ccw <- (cc
        %>% pivot_wider(names_from="country",values_from="cases")
        %>% filter(US>1)
)

Upvotes: 0

Views: 207

Answers (2)

Antoni Parellada
Antoni Parellada

Reputation: 4791

There was some uncertainty as to what I was actually asking, but I got what I wanted:

For the United states cases, there is the very handy state.name allowing subsetting on a Boolean operation that leaves out all the political demarcations (counties) that are not States, and that amounted to massive over-counting of cases.

In the case of China, I have to use raw force, since I couldn't fine a similar vector of provinces, which after some domain familiarization seem to be what I needed. Once I got that, I was able to exclude rows with overlapping counts for cities and municipalities in China.

This is the vector of provinces and autonomous administrations in China:

provinces <- c('Hubei', 'Guangdong', 'Henan', 'Zhejiang', 'Hunan', 'Anhui',
               'Jiangxi','Shandong','Jiangsu','Sichuan','Heilongjiang','Hebei',
               'Fujian','Guangxi','Shaanxi','Yunnan','Hainan','Guizhou','Shanxi',
               'Gansu','Hong Kong','Liaoning','Jilin','Xinjiang','Inner Mongolia',
               'Ningxia','Qinghai','Macau','Tibet')

Evidently the problem stems from the raw data in the Github repository using a slashed column for Province/State.

After that it was just using a modification on Ben Bolker's answer to a recent related question:

require(RCurl)
require(foreign)
require(tidyverse) # To tip the df from long row of dates to cols (pivot_longer())


x = getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")

corona = (read_csv(x)
          %>% pivot_longer(cols = -c(`Province/State`, `Country/Region`, Lat, Long),
                           names_to = "date",
                           values_to = "cases")
          %>% select(`Province/State`, `Country/Region`, date, cases)
          %>% mutate(date=as.Date(date,format="%m/%d/%y"))
          %>% drop_na(cases)
)

united <- corona[corona$`Country/Region`=='US',]
cc_no_states <- united[!united$`Province/State`%in%state.name,]
ccn <- cc_no_states[,2:4]

cc_with_states <- corona[is.element(corona$`Province/State`,state.name),]
cc <- cc_with_states[,2:4]

mix <- rbind(ccn,cc)
mix <- aggregate(mix[,3], FUN="sum", by=list(as.Date(mix$date)))


cc[,2:3] <- mix
cc <- cc[1:nrow(mix),]

china <- corona[corona$`Country/Region`=='China',]
provinces <- c('Hubei', 'Guangdong', 'Henan', 'Zhejiang', 'Hunan', 'Anhui',
               'Jiangxi','Shandong','Jiangsu','Sichuan','Heilongjiang','Hebei',
               'Fujian','Guangxi','Shaanxi','Yunnan','Hainan','Guizhou','Shanxi',
               'Gansu','Hong Kong','Liaoning','Jilin','Xinjiang','Inner Mongolia',
               'Ningxia','Qinghai','Macau','Tibet')
china_prov <- china[is.element(china$`Province/State`,provinces),]
ccchina <- china_prov[,2:4]
temp <- aggregate(ccchina[ ,3], FUN="sum", by=list(as.Date(ccchina$date)))
ccchina[,2:3] <- temp 
ccchina <- ccchina[1:nrow(temp),]

ccw <- (cc
        %>% pivot_wider(names_from=`Country/Region`,values_from="cases")
        %>% filter(US>1)
)

ccw_china <- (ccchina
        %>% pivot_wider(names_from=`Country/Region`,values_from="cases")
        %>% filter(China>1)
)


plot(ccw_china$date, ccw_china$China, type="l", lwd=3, lty=3,
     ylab='', 
     xlab='',
     log='y',
     col=5,
     axes=FALSE,
     main = "Log-lin cumulative COVID-19 cases in US v China",
     cex.main=0.9)

at1 <- seq(min(ccw_china$date), max(ccw_china$date)+1, by=2);
axis.Date(1, at=at1, format="%b %d", las=2, cex.axis=0.7)


at2 <- 2^seq(1,30,by=1)
axis(side=2, at2, cex.axis=0.7)

abline(h=at2, lty=2, col="grey90")  # Add faint grid lines
abline(v=at1, lty=2, col="grey90")  # Add faint grid lines

lines(ccw$date, ccw$US, lwd=3, col=4)

legend(ccw_china$date[1], 70000, legend=c("China", "US"),
       col=c(5, 4), lty=c(3,1), lwd=3, cex=0.8,
       box.lty=0)


plot(ccw_china$date, ccw_china$China, type="l", lwd=3, lty=3,
     ylab='', 
     xlab='',
     xaxt="n",
     col=5,
     cex.axis=0.7,
     las=2,
     main = "COVID-19 cumulative cases in the US versus China",
     cex.main=0.9)

at1 <- seq(min(ccw_china$date), max(ccw_china$date)+1, by=2);
axis.Date(1, at=at1, format="%b %d", las=2, cex.axis=0.7)

lines(ccw$date, ccw$US, lwd=3, col=4)

legend(ccw_china$date[1], 70000, legend=c("China", "US"),
       col=c(5, 4), lty=c(3,1), lwd=3, cex=0.8,
       box.lty=0)

enter image description here

Upvotes: 1

rg255
rg255

Reputation: 4169

It sounds like (from the comments) that you simply want to sum each day into a single row per country (e.g. all separate state measurements in the US into one single row for the US). It is simple and fast to get the cases per day per country using data.table:

library(data.table)
library(RCurl)
# Read in data as a data.table
dt1 <- fread(getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"))

# Data.table solution: summing each column (specified with .SDcols and .SD), grouping by country/region
dt2 <- dt1[, lapply(.SD, sum), 
  .SDcols = names(dt1[, -c("Country/Region", "Province/State", "Lat", "Long")]), 
  by = "Country/Region"]

Since you've now updated your own answer, rather than just answering one of my comments, to show that you are indeed trying to get a single row per country... This plot recreates yours, showing you that this data.table method works in creating a single row per country. In full:

dt2 <- melt(dt1[, lapply(.SD, sum), 
  .SDcols = names(dt1[, -c("Country/Region", "Province/State", "Lat", "Long")]), 
  by = "Country/Region"], id.vars = "Country/Region")
names(dt2)[1] <- "Country"


    # Plot it
 dt2[Country == "China", plot(log(value), typ = "l", col = "red", 
  xlab = "Days Since January 21st", xlim = c(0,70), xaxs = "i", xaxt = "n",
  ylab = "Number of Cases", ylim = c(6,12), yaxs = "i", yaxt = "n"
  )]
dt2[Country == "US", points(log(value), typ = "l", col = "blue")]
axis(1, at = seq(0, 100, 10))
axis(2, at = seq(0, 100, 2), labels = round(exp(seq(0, 100, 2)), 0))

enter image description here

Upvotes: 0

Related Questions