R: Sorting Columns in Dataframe based on Associated values?

Question

Suppose I have a dataframe such as this:

df <- data.frame(First_Name=c("John","Alex","James","John","Alex","John","James"),
             Last_Name = c("Smith","Smith","Jones","Jones","Johnson","Ryan","Murphy"),
             Frequency = c(4,7,8,9,3,10,4))

  First_Name Last_Name Frequency
1       John     Smith         4
2       Alex     Smith         7
3      James     Jones         5
4       John     Jones         9
5       Alex   Johnson         3
6       John      Ryan        10
7      James    Murphy         4

and I would like to collapse the first column (First Name) into unique values only and sort the name based on their cumulative frequency. For this example, if we do this and sort in descending order, the order would be : John, James, Alex since their respective cumulative frequencies are 23, 10, and 12.

Coming from a python background, I was thinking of storing the values in a tuple, but I'm not sure if there's an equivalent in R (at the moment, I'm only familiar with vectors).

Additionally, if I try to pull out only the unique values from the "First_Name" column and store it in a vector:

unique(c(df$First_Name))

R outputs this:

[1] 3 1 2

which seems to indicate they converted the names into integers.

So my questions are:

1) How would I go about getting only the unique first names and sorting them based on their cumulative values?

2) Why does R convert the names into integers if I try to store them in a vector?

lilster · Accepted Answer

R changes characters to factors by default, which can be irritating. You can check this by the command:

str(df)

Which shows

'data.frame':   7 obs. of  3 variables:
 $ First_Name: Factor w/ 3 levels "Alex","James",..: 3 1 2 3 1 3 2
 $ Last_Name : Factor w/ 5 levels "Johnson","Jones",..: 5 5 2 2 1 4 3
 $ Frequency : num  4 7 8 9 3 10 4

If you add the stringsAsFactors flag, you'll avoid this:

df <- data.frame(First_Name=c("John","Alex","James","John","Alex","John","James"),
             Last_Name = c("Smith","Smith","Jones","Jones","Johnson","Ryan","Murphy"),
             Frequency = c(4,7,8,9,3,10,4), stringsAsFactors = FALSE)
str(df)
'data.frame':   7 obs. of  3 variables:
 $ First_Name: chr  "John" "Alex" "James" "John" ...
 $ Last_Name : chr  "Smith" "Smith" "Jones" "Jones" ...
 $ Frequency : num  4 7 8 9 3 10 4

But either way, you can group by unique and take any function of that subset using aggregate:

aggregate(Frequency ~ First_Name, data=df, FUN="sum")
  First_Name Frequency
1       Alex        10
2      James        12
3       John        23

The table can be reordered using the order function

out <- out[rev(order(out$Frequency)),]
  First_Name Frequency
3       John        23
2      James        12
1       Alex        10

If you want to reorder the first table by the cumulative summary, keeping it as it stands, you'll then need to merge.

out$rank <- 1:nrow(out)
new <- merge(df, out, by.x='First_Name', by.y='First_Name')
final <- new[order(new$rank),]

  First_Name Last_Name Frequency.x Frequency.y rank
1       Alex     Smith           7          10    1
2       Alex   Johnson           3          10    1
3      James     Jones           8          12    2
4      James    Murphy           4          12    2
5       John     Smith           4          23    3
6       John     Jones           9          23    3
7       John      Ryan          10          23    3

Then just drop the columns you don't need.

R: Sorting Columns in Dataframe based on Associated values?

Answers (2)

Related Questions