Why does dplyr fail to aggregate my data?

I am trying to merge a data frame df0 with a geographical object. Previously, I used dplyr to add a column of interest to my geographical data, following the approach suggested [here][1]. It works fine with my big dataset, but when I try the same approach on a simpler dataset I cannot reproduce the result. Here is an overview of the problem.

  1. df0 is a data frame that contains two columns: "Country" and "PF". It looks like this:
                              Country PF
1                        Afghanistan   3
2                            Albania   3
3                            Algeria   3
4                     American Samoa   0
5                            Andorra   3
6                             Angola   3
7                           Anguilla   0
8                  Antigua & Barbuda   0
9                          Argentina   1
10                           Armenia   3
11                             Aruba   0
  2. The geographical object is defined using the rnaturalearth package as follows:
library(rnaturalearth)
library(rnaturalearthdata)
world <- ne_countries(scale = "medium", returnclass = "sf")
world$Country<-noquote(world$name)

This is what the resulting world$Country looks like:

  [1] Aruba                     Afghanistan               Angola                   
  [4] Anguilla                  Albania                   Aland                    
  [7] Andorra                   United Arab Emirates      Argentina                
 [10] Armenia                   American Samoa            Antarctica               
 [13] Ashmore and Cartier Is.   Fr. S. Antarctic Lands    Antigua and Barb.        
 [16] Australia                 Austria                   Azerbaijan               
 [19] Burundi                   Belgium                   Benin                    
 [22] Burkina Faso              Bangladesh                Bulgaria   

The idea is to attach the column "PF" to the object world. To do this, I use the following code:

library(dplyr)
df_sum <- df0%>% 
  filter(Country %in% world$Country) %>%
  group_by(Country) %>%
  summarise(PF= mean(PF))

world$PF<- df_sum$PF[match(world$Country, df_sum$Country)]

Normally, this does the job. However, for some reason it is not working this time. I have noticed that df_sum contains zero observations after running the code, which means that the first part (the filter/summarise pipeline) is the one failing. As an amateur programmer, I suspect I am missing something very basic. Could you help me out?

Edit in response to the answer provided

Indeed I suspect that the problem comes from df0. This is how I treat it:

df0<-read.csv("C:/Users/public_funding.csv",sep=",")
df0$X<-NULL
colnames(df0)<-c("Country","PF")
#df0$Country<-levels(droplevels(df0$Country))
#df0$Country<-unlist(df0$Country)
head(df0)
nrow(df0)

This is what the data looks like: [![df0$Country][2]][2]

[![df0$Country][3]][3]

I thought that my problems were caused by the list structure that can be seen in the images. That is why my code shows that I tried both df0$Country<-levels(droplevels(df0$Country)) and df0$Country<-unlist(df0$Country), but neither worked.

[1]: Merging a Shapefile and a dataframe
[2]: https://i.sstatic.net/cBva8.png
[3]: https://i.sstatic.net/QYz2N.png

Upvotes: 0

Views: 83

Answers (2)

It turns out that the problem was indeed in df0. After carefully going through it, I realized there was a blank space after each country name for some reason, so none of the names matched world$Country exactly. My code was fixed by simply applying:

df0$Country<-trimws(df0$Country, "r")
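To illustrate why a single trailing space is enough to empty df_sum, here is a minimal sketch. The two vectors are made-up stand-ins for df0$Country and world$Country, not the real data; the point is only that %in% compares strings exactly.

```r
# Hypothetical miniature of the situation: names read from the CSV carry a trailing space
csv_names   <- c("Afghanistan ", "Albania ", "Angola ")
# Hypothetical stand-in for world$Country
world_names <- c("Aruba", "Afghanistan", "Angola", "Albania")

# Exact string comparison: "Afghanistan " is not the same string as "Afghanistan"
csv_names %in% world_names

# Remove whitespace on the right-hand side only, then compare again
fixed <- trimws(csv_names, which = "right")
fixed %in% world_names
```

If leading whitespace is also possible, trimws(x) with its default which = "both" strips both sides.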

Upvotes: 0

NovaEthos

Reputation: 500

I recreated df0, ran the rest of your code, and it worked fine for me:

library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
library(dplyr)

df0 <- data.frame(Country = c("Afghanistan", "Albania", "Algeria", "American Samoa",
                              "Andorra", "Angola", "Anguilla", "Antigua & Barbuda",
                              "Argentina", "Armenia", "Aruba"), 
                  PF = c(3,3,3,0,3,3,0,0,1,3,0), stringsAsFactors = FALSE)
world <- ne_countries(scale = "medium", returnclass = "sf")
world$Country<-noquote(world$name)

df_sum <- df0 %>% 
  filter(Country %in% world$Country) %>%
  group_by(Country) %>%
  summarise(PF= mean(PF))

world$PF<- df_sum$PF[match(world$Country, df_sum$Country)]
> world$PF
  [1]  0  3  3  0  3 NA  3 NA  1  3  0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [35] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA  3 NA NA NA NA NA
 [69] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[103] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[137] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[171] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[205] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[239] NA NA NA

> df_sum
# A tibble: 10 x 2
   Country           PF
   <chr>          <dbl>
 1 Afghanistan        3
 2 Albania            3
 3 Algeria            3
 4 American Samoa     0
 5 Andorra            3
 6 Angola             3
 7 Anguilla           0
 8 Argentina          1
 9 Armenia            3
10 Aruba              0

Since you said that df_sum contains zero observations after running the code, I wonder if the problem is with df0. Try recreating df0 from scratch like I did; if you then get the same output as above, the problem is likely coming from how you are reading in df0.
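One way to check how df0 is being read in, sketched on a made-up three-row CSV (the lines and the world_names vector below are assumptions standing in for public_funding.csv and world$Country): look for padded names, and optionally let read.csv strip the padding itself via strip.white = TRUE.

```r
# Made-up CSV standing in for public_funding.csv (note the space before each comma)
csv_lines <- c("Country,PF", "Afghanistan ,3", "Albania ,3", "Antigua & Barbuda ,0")
# Made-up stand-in for world$Country
world_names <- c("Afghanistan", "Albania", "Antigua and Barb.")

df0_raw <- read.csv(text = csv_lines, stringsAsFactors = FALSE)

# Flag names that start or end with whitespace
has_padding <- grepl("^\\s|\\s$", df0_raw$Country)

# strip.white = TRUE drops leading/trailing whitespace from unquoted fields at read time
df0_clean <- read.csv(text = csv_lines, stringsAsFactors = FALSE, strip.white = TRUE)

# Names still unmatched after cleaning are genuine spelling differences
unmatched <- setdiff(df0_clean$Country, world_names)
```

Anything left in unmatched (here "Antigua & Barbuda" versus "Antigua and Barb.") has to be recoded by hand, which is also why that country is missing from df_sum above.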

Upvotes: 1
