Reputation: 553
Suppose I have a dataframe such as this:
df <- data.frame(First_Name=c("John","Alex","James","John","Alex","John","James"),
Last_Name = c("Smith","Smith","Jones","Jones","Johnson","Ryan","Murphy"),
Frequency = c(4,7,8,9,3,10,4))
First_Name Last_Name Frequency
1 John Smith 4
2 Alex Smith 7
3 James Jones 5
4 John Jones 9
5 Alex Johnson 3
6 John Ryan 10
7 James Murphy 4
and I would like to collapse the first column (First Name) into unique values only and sort the name based on their cumulative frequency. For this example, if we do this and sort in descending order, the order would be : John, James, Alex since their respective cumulative frequencies are 23, 10, and 12.
Coming from a python background, I was thinking of storing the values in a tuple, but I'm not sure if there's an equivalent in R (at the moment, I'm only familiar with vectors).
Additionally, if I try to pull out only the unique values from the "First_Name" column and store it in a vector:
unique(c(df$First_Name))
R outputs this:
[1] 3 1 2
which seems to indicate they converted the names into integers.
So my questions are:
1) How would I go about getting only the unique first names and sorting them based on their cumulative values?
2) Why does R convert the names into integers if I try to store them in a vector?
Upvotes: 1
Views: 80
Reputation: 17648
You can try a tidyverse
solution
library(tidyverse)
df %>%
group_by(First_Name) %>%
summarise(Sum=sum(Frequency))
# A tibble: 3 x 2
First_Name Sum
<fct> <dbl>
1 Alex 10
2 James 12
3 John 23
Upvotes: 0
Reputation: 933
R changes characters to factors by default, which can be irritating. You can check this by the command:
str(df)
Which shows
'data.frame': 7 obs. of 3 variables:
$ First_Name: Factor w/ 3 levels "Alex","James",..: 3 1 2 3 1 3 2
$ Last_Name : Factor w/ 5 levels "Johnson","Jones",..: 5 5 2 2 1 4 3
$ Frequency : num 4 7 8 9 3 10 4
If you add the stringsAsFactors flag, you'll avoid this:
df <- data.frame(First_Name=c("John","Alex","James","John","Alex","John","James"),
Last_Name = c("Smith","Smith","Jones","Jones","Johnson","Ryan","Murphy"),
Frequency = c(4,7,8,9,3,10,4), stringsAsFactors = FALSE)
str(df)
'data.frame': 7 obs. of 3 variables:
$ First_Name: chr "John" "Alex" "James" "John" ...
$ Last_Name : chr "Smith" "Smith" "Jones" "Jones" ...
$ Frequency : num 4 7 8 9 3 10 4
But either way, you can group by unique and take any function of that subset using aggregate:
aggregate(Frequency ~ First_Name, data=df, FUN="sum")
First_Name Frequency
1 Alex 10
2 James 12
3 John 23
The table can be reordered using the order function
out <- out[rev(order(out$Frequency)),]
First_Name Frequency
3 John 23
2 James 12
1 Alex 10
If you want to reorder the first table by the cumulative summary, keeping it as it stands, you'll then need to merge.
out$rank <- 1:nrow(out)
new <- merge(df, out, by.x='First_Name', by.y='First_Name')
final <- new[order(new$rank),]
First_Name Last_Name Frequency.x Frequency.y rank
1 Alex Smith 7 10 1
2 Alex Johnson 3 10 1
3 James Jones 8 12 2
4 James Murphy 4 12 2
5 John Smith 4 23 3
6 John Jones 9 23 3
7 John Ryan 10 23 3
Then just drop the columns you don't need.
Upvotes: 1