Reputation: 171
I am new to R and am looking for a way to calculate the h index.
The h index is a popular measure of scientific productivity.
Formally, if f is the function that gives the number of citations for each publication, we compute the h index as follows:
First, we order the values of f from largest to smallest. Then we look for the last position at which f is greater than or equal to the position; we call this position h.
For example, if we have a researcher with 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h index is equal to 4 because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, then the index is 3 because the fourth paper has only 3 citations.
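To make the definition concrete, here is a minimal base R sketch that lays out position against citation count for both examples. The helper name rank_table is hypothetical and only used for illustration.

```r
# Hypothetical helper: tabulate each position against its citation count.
rank_table <- function(citations) {
  ranked <- sort(citations, decreasing = TRUE)
  data.frame(
    position  = seq_along(ranked),
    citations = ranked,
    meets     = ranked >= seq_along(ranked)  # TRUE while citations >= position
  )
}

first  <- rank_table(c(10, 8, 5, 4, 3))
second <- rank_table(c(25, 8, 5, 3, 3))

# h is the last position where the condition still holds (0 if it never does).
h_first  <- max(c(0, first$position[first$meets]))
h_second <- max(c(0, second$position[second$meets]))

print(first)
h_first   # 4
h_second  # 3
```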
Can anyone suggest a smarter way to solve this?
a <- c(10,8,5,4,3)
I expect the h index for this input to be 4.
Upvotes: 3
Views: 1969
Reputation: 164
The most efficient way to perform this calculation is with the eddington package's E_num function. The Hirsch index for authors is essentially identical to a metric used in cycling called the Eddington number. The eddington package uses an efficient algorithm implemented in Rcpp and does not require the data to be pre-sorted.
install.packages("eddington")
citation_counts <- c(10,8,5,4,3)
eddington::E_num(citation_counts)
## [1] 4
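For intuition on how the index can be computed without sorting, here is a base R sketch of a counting approach. This is not taken from the eddington source and may differ from what E_num actually does; the function name h_index_counting is hypothetical, and the sketch assumes non-negative integer citation counts.

```r
# Hypothetical sort-free sketch; assumes non-negative integer counts.
h_index_counting <- function(citations) {
  n <- length(citations)
  if (n == 0) return(0L)
  # Cap each count at n: the index can never exceed the number of papers.
  capped <- pmin(citations, n)
  # buckets[k + 1] = number of papers with exactly k (capped) citations.
  buckets <- tabulate(capped + 1, nbins = n + 1)
  # at_least[k + 1] = number of papers with at least k citations.
  at_least <- rev(cumsum(rev(buckets)))
  # h is the largest k such that at least k papers have k or more citations.
  k <- 0:n
  max(k[at_least >= k])
}

h_index_counting(c(3, 10, 4, 8, 5))   # unsorted input, gives 4
h_index_counting(c(25, 8, 5, 3, 3))   # gives 3
```

The work is linear in the number of papers, at the cost of one extra vector of length n + 1.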
As an aside, if one prefers to use base R, the answer provided by @oelshie can be further simplified into a one-liner.
get_h_index <- function(citations) {
sum(sort(citations, decreasing = TRUE) >= seq_along(citations))
}
get_h_index(citation_counts)
## [1] 4
Abstracting the computation into a function also simplifies the dplyr code provided by @aterhorst. Note that the function also eliminates the need for the arrange step, since sorting is performed within the function call.
tibble::tibble(citation_counts) |>
dplyr::summarize(h_index = get_h_index(citation_counts))
## # A tibble: 1 × 1
## h_index
## <int>
## 1 4
Please note that the base R solution is slower than the eddington package by a factor of about 14. If the workflow requires computing a large number of Hirsch indices, I highly recommend installing and using the eddington package.
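When many indices are needed, one per author, the same function can be applied per group. Below is a minimal base R sketch; the data frame papers and its columns author and cites are hypothetical, and get_h_index is repeated so the block runs on its own.

```r
get_h_index <- function(citations) {
  sum(sort(citations, decreasing = TRUE) >= seq_along(citations))
}

# Hypothetical input: one row per paper, unsorted within each author.
papers <- data.frame(
  author = c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B"),
  cites  = c(3, 25, 10, 3, 4, 8, 8, 3, 5, 5)
)

# tapply splits cites by author and applies the function to each group.
h_by_author <- tapply(papers$cites, papers$author, get_h_index)
h_by_author
## A B
## 4 3
```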
Upvotes: 1
Reputation: 685
A dplyr version, for when the citation data is in a data frame (thanks to https://stackoverflow.com/users/5313511/oelshie):
library(dplyr)  # provides %>%, arrange, desc and summarise
a <- data.frame(cites = c(10,8,5,4,3))
b <- a %>%
  arrange(desc(cites)) %>%
  summarise(h_index = sum(cites >= seq_along(cites)))
b
#   h_index
# 1       4
Upvotes: 0
Reputation: 43
I propose a shorter and more flexible function that accepts any numeric vector of citations (sorted or unsorted, with or without zeros, only zeros, etc.):
hindex <- function(x) {
  tx <- sort(x, decreasing = TRUE)
  # return the value rather than printing it, so the result can be reused
  sum(tx >= seq_along(tx))
}
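A few quick checks of the cases listed above, as a sketch. The function is repeated here, returning the value rather than printing it, so the block runs on its own.

```r
hindex <- function(x) {
  tx <- sort(x, decreasing = TRUE)
  sum(tx >= seq_along(tx))
}

hindex(c(3, 10, 4, 8, 5))  # unsorted input: 4
hindex(c(10, 0, 0))        # with zeros: 1
hindex(c(0, 0, 0))         # only zeros: 0
hindex(numeric(0))         # no publications: 0
hindex(c(5, NA, 3))        # sort() drops NA by default: 2
```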
Upvotes: 3
Reputation: 145785
Assuming the input is already sorted in decreasing order, I would use this:
tail(which(a >= seq_along(a)), 1)
# [1] 4
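The sorting assumption matters. As a small sketch, the same expression on an unsorted copy of the data (the names a_unsorted and a_sorted are just for illustration) returns the wrong value:

```r
a_unsorted <- c(3, 4, 5, 8, 10)

# Every element happens to be >= its position, so the last TRUE is at 5.
naive <- tail(which(a_unsorted >= seq_along(a_unsorted)), 1)
naive
# [1] 5

# Sorting in decreasing order first gives the expected answer.
a_sorted <- sort(a_unsorted, decreasing = TRUE)
correct <- tail(which(a_sorted >= seq_along(a_sorted)), 1)
correct
# [1] 4
```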
You could, of course, put this in a little function:
h_index = function(cites) {
  if(max(cites) == 0) return(0) # assuming this is reasonable
  cites = cites[order(cites, decreasing = TRUE)]
  tail(which(cites >= seq_along(cites)), 1)
}
a1 = c(10, 8, 5, 4, 3)
a2 = c(10, 9, 7, 1, 1)
h_index(a1)
# [1] 4
h_index(a2)
# [1] 3
h_index(1)
# [1] 1
## set this to be 0, not sure if that's what you want
h_index(0)
# [1] 0
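One case the function above does not cover is an empty vector: max() on a zero-length input warns and returns -Inf, and the function then returns integer(0). Here is a sketch of a guarded variant, assuming an author with no publications should get 0; the name h_index_safe is hypothetical.

```r
h_index_safe = function(cites) {
  # Assumption: no publications, or no citations at all, means an index of 0.
  if(length(cites) == 0 || max(cites) == 0) return(0)
  cites = cites[order(cites, decreasing = TRUE)]
  tail(which(cites >= seq_along(cites)), 1)
}

h_index_safe(numeric(0))
# [1] 0
h_index_safe(c(10, 9, 7, 1, 1))
# [1] 3
```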
Upvotes: 7