djMohit

Reputation: 171

How to write a function to calculate the h-index in R?

I am new to R and want to calculate the h-index.

The h-index is a popular measure of scientific productivity. Formally, if f is the function that gives the number of citations for each publication, the h-index is computed as follows:

First, we order the values of f from largest to smallest. Then we look for the last position at which f is greater than or equal to the position itself; that position is h.

For example, if a researcher has 5 publications A, B, C, D, and E with 10, 8, 5, 4, and 3 citations, respectively, the h-index is 4, because the 4th publication has 4 citations and the 5th has only 3. In contrast, if the same publications have 25, 8, 5, 3, and 3 citations, the h-index is 3, because the 4th paper has only 3 citations.
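To make the definition concrete, here is a minimal sketch that walks through the second example step by step. The names `cites`, `ranked`, `position`, `meets`, and `steps` are illustrative only and do not come from any package.

```r
# Citation counts from the second example above (any order works)
cites <- c(25, 8, 5, 3, 3)

# Step 1: order the counts from largest to smallest
ranked <- sort(cites, decreasing = TRUE)

# Step 2: compare each count with its position in the ranking
position <- seq_along(ranked)
meets <- ranked >= position

# Lay the comparison out as a table for inspection
steps <- data.frame(position, citations = ranked, meets)
print(steps)

# Step 3: the h-index is the last position where the test holds
# (0 if it never holds)
h <- if (any(meets)) max(which(meets)) else 0
h
## [1] 3
```

Positions 1 to 3 pass the test (25 >= 1, 8 >= 2, 5 >= 3) and position 4 fails (3 < 4), so h is 3.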

Can anyone suggest a smarter way to solve this?

a <- c(10,8,5,4,3)

I expect an h-index of 4 as the output.

Upvotes: 3

Views: 1969

Answers (4)

user8617947

Reputation: 164

The most efficient way to perform this calculation is with the eddington package's E_num function. The Hirsch index for authors is essentially identical to a metric used in cycling called the Eddington number. The eddington package uses an efficient algorithm implemented in Rcpp and does not require the data to be pre-sorted.

install.packages("eddington")

citation_counts <- c(10,8,5,4,3)

eddington::E_num(citation_counts)
## [1] 4

As an aside, if one prefers to use base R, the answer provided by @oelshie can be further simplified into a one-liner.

get_h_index <- function(citations) {
  sum(sort(citations, decreasing = TRUE) >= seq_along(citations))
}

get_h_index(citation_counts)
## [1] 4

Abstracting the computation into a function also simplifies the dplyr code provided by @aterhorst. Note the function also eliminates the need for the arrange step since sorting is performed within the function call.

tibble::tibble(citation_counts) |> 
  dplyr::summarize(h_index = get_h_index(citation_counts))
## # A tibble: 1 × 1
##   h_index
##     <int>
## 1       4

Please note that the base R solution is slower than the eddington package by a factor of about 14. If the workflow requires computation of a large number of Hirsch indices, I highly recommend installing and using the eddington package.
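When many indices are needed, for example one per author, the base R helper can also be applied group-wise. This is a minimal sketch, assuming a hypothetical data frame `pubs` with columns `author` and `cites`; neither the data nor the name `h_by_author` appears in the answers here.

```r
# Base R helper from above, repeated so this snippet runs on its own
get_h_index <- function(citations) {
  sum(sort(citations, decreasing = TRUE) >= seq_along(citations))
}

# Hypothetical data: two authors with five publications each
pubs <- data.frame(
  author = rep(c("A", "B"), each = 5),
  cites  = c(10, 8, 5, 4, 3, 25, 8, 5, 3, 3)
)

# Apply the helper to each author's citation counts
h_by_author <- tapply(pubs$cites, pubs$author, get_h_index)
h_by_author
```

Author A reuses the counts from the question (h-index 4) and author B has the counts 25, 8, 5, 3, 3 (h-index 3).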

Upvotes: 1

aterhorst

Reputation: 685

A dplyr version, for when the citation data is in a data frame (thanks to https://stackoverflow.com/users/5313511/oelshie):

library(dplyr)

a <- data.frame(cites = c(10, 8, 5, 4, 3))

# Sort descending, then count the positions where citations >= rank
b <- a %>%
  arrange(desc(cites)) %>%
  summarise(h_index = sum(cites >= seq_along(cites)))
b

  h_index
1       4

Upvotes: 0

theRman

Reputation: 43

I propose a shorter and more flexible function that accepts any numeric vector of citations (sorted or unsorted, with or without zeros, only zeros, etc.):

hindex <- function(x) {
    tx <- sort(x, decreasing = TRUE)  # rank citations from highest to lowest
    sum(tx >= seq_along(tx))          # count the ranks whose citations reach the rank
}
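A quick check of the edge cases listed above. The function is repeated, in a form that returns the value without printing it, so that the snippet runs on its own; the input vectors are made up for illustration.

```r
# Same logic as the function above, repeated so this snippet is self-contained
hindex <- function(x) {
  tx <- sort(x, decreasing = TRUE)
  sum(tx >= seq_along(tx))
}

hindex(c(3, 10, 4, 8, 5))  # unsorted input
## [1] 4
hindex(c(10, 8, 0, 0))     # some uncited papers
## [1] 2
hindex(c(0, 0, 0))         # only zeros
## [1] 0
hindex(numeric(0))         # no publications at all
## [1] 0
```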

Upvotes: 3

Gregor Thomas

Reputation: 145785

Assuming the input is already sorted in decreasing order, I would use this:

tail(which(a >= seq_along(a)), 1)
# [1] 4

You could, of course, put this in a little function:

h_index = function(cites) {
  if(max(cites) == 0) return(0) # assuming this is reasonable
  cites = cites[order(cites, decreasing = TRUE)]
  tail(which(cites >= seq_along(cites)), 1)
}

a1 = c(10,8, 5, 4, 3)
a2 = c(10, 9, 7, 1, 1)

h_index(a1)
# [1] 4

h_index(a2)
# [1] 3

h_index(1)
# [1] 1

## set this to be 0, not sure if that's what you want
h_index(0)
# [1] 0
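The explicit zero check in `h_index` matters because `which()` finds no qualifying position when every count is zero, so `tail()` has nothing to return. Here is a minimal sketch of what happens without the guard, with `zeros`, `unguarded`, and `counted` as illustrative names:

```r
zeros <- c(0, 0, 0)

# Without the guard: no position qualifies, so the result is an empty vector
unguarded <- tail(which(zeros >= seq_along(zeros)), 1)
length(unguarded)
# [1] 0

# Counting the TRUE values instead gives 0 directly, with no special case
counted <- sum(sort(zeros, decreasing = TRUE) >= seq_along(zeros))
counted
# [1] 0
```

An empty result can silently break downstream code that expects a single number, which is one reason to either keep the guard or count with `sum()`.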

Upvotes: 7
