Leetcode H-Index

 

Problem Description

Given an array of citations (each citation is a non-negative integer) of a researcher, write a function to compute the researcher’s h-index.
According to the definition of h-index on Wikipedia: “A scientist has index h if h of his/her N papers have at least h citations each, and the other N − h papers have no more than h citations each.”
For example, given citations = [3, 0, 6, 1, 5], which means the researcher has 5 papers in total and each of them had received 3, 0, 6, 1, 5 citations respectively. Since the researcher has 3 papers with at least 3 citations each and the remaining two with no more than 3 citations each, his h-index is 3.
Note: If there are several possible values for h, the maximum one is taken as the h-index.

 

There are many blogs to describe this problem. I found the following two articles are easiest to understand. Thanks for the author.

The following articles are from:

http://traceformula.blogspot.com/2015/09/h-index-leetcode.html

http://traceformula.blogspot.com/2015/09/h-index-ii-leetcode.html

Solution

As we know, H-Index is a measure of productivity and citation impact of a researcher. It is also called Hirsch index or Hirsch number after the name of the physicist Jorge Hirsch.

Picture 1: H-Index for decreasing citations array (Credit: Wiki)
 
Supposed that citations is an array of numbers of citations of a researcher. We can easily see that the following formula holds if citations is sorted in decreasing order:
 
h-index of citations = max (min (citations[i], i) for all i from 1 to number of papers)

Therefore, we can easily implement the calculation of h-index by sorting the array in decreasing order then applying the above formula. Below is the 2-line Python code in O(nlogn) time.

By sorting, it takes O(nlogn) time complexity.

However, we can do much better by an O(n) time algorithm. Pause yourself for a few minutes to think about it before continue reading ^_^.

We have some simple observation here, but it really helps to improve the performance. For each number h from 0 to n where n is the number of papers, we need to find out how many citations of a paper equal h called equal_h. Based on this, we can find the number of citations values that is at least h, and no more than h. To find the number of citations value that is at least h, we take sum of equal_h[i] for i from h to n!!! And similarly, to find the number of citations values that is no more than h citations each, we can sum up equal_h[i] for ifrom 0 to i. And we can find the h-index by simple running from the back of the array equal_h.
So if running from the back of the array equal_h , and h is the first index that satisfies the following conditions:
equal_h[h] + equal_h[h+1] + … + equal_h[n] >= h (*)
equal_h[0] + equal_h[1] + … + equal_h[h] >= N-h (**)
Then h is what we are looking for.
However, we have:
equal_h[0] + equal_h[1] + … + equal_h[n] = n
Therefore:
equal_h[0] + equal_h[1] + … + equal_h[h] = n- (equal_h[h+1] + … + equal_h[n])
Another note is that since h is the first element satisfies the 2 above conditions, then h+1 does not satisfies one of them, meaning that either
(1): { equal_h[h+1] + equal_h[h+2] + … + equal_h[n] <= h
or (2): { equal_h[h+1] + equal_h[h+2] + … + equal_h[n] >= h+1
and equal_h[0] + equal_h[1] + … + equal_h[h+1] < N-(h+1) }
(1) suggests that:
equal_h[0] + equal_h[1] + … + equal_h[h] >=N-h which is (**)
(2) suggests that:
equal_h[h+2] + equal_h[h+3] + … + equal_h[n] >= h+2
This inequality will be repeated until equal_h[n] >= n , which is wrong.

So all we need is to find the first h satisfies the condition (*), and we do not need to check the condition (**).
Below are the codes implement in different languages.

I. Python code – O(n) 

II. Java Code – O(n)
 

H-Index II [LeetCode]

 
Problem Description
Follow up for H-Index: What if the citations array is sorted in ascending order? Could you optimize your algorithm?
Solution
In the previous version, we sorted citations in reverse order. In this version of problem, the citations array is in ascending order.
I. O(n) Solution
As we already know in the previous version of finding H-Index, we can solve the problem in linear time. Below is the 1-line Python solution:
 
 
II. O(logn) Solution
Since we already knew about how to solve the problem in linear time, we try to find another solution. As usual, the idea is to find O(logn) solution by binary search.
Our normal logic when using binary search is that we have a start andend point. We will check the middle= (start + end) /2. (In some language, to avoid overflow we can use (end-start)/2 + start )
What we need to find out now is when we should stop the search, or going to left or right.

1. When we should stop?
Suppose that middle is not in final stage (where start > end) and we decide to stop. There are (n-middle) elements are at least citations[middle], and other (middle ) elements are at most citations[middle]. This suggests that ifcitations[middle] = (n-middle) = h (where n is the number of elements incitations),  then we have at least h elements are at least h, and other middle = n-h elements are at most h. So h is an H-index.
Is it h  the largest? Suppose there is another h” > h and h” is also an H-index. Consider the element at n-h”. Since there are at least h” elements which are at least h”, then citations[n-h”] >= h”>h. However, we have n-h” < n-h, thereforecitations[n-h”] < citations[n-h] = h, or we have h > h. This is contradictory. So his the largest H-index. This means that we should stop the search here!

2. When we should go to the left and right
The analysis in (1) suggest that we should compare citations[middle] and (n-middle) in order to decide whether we should go to left or right in binary search.

a. If citations[middle] < n-middle
Is it possible to stop or go left in this case? Suppose it is. That means there is a number h >= n-middle, and h is an H-index. Then there are h elements which are at least h. Therefore:
citations[n-h] > h >= n-middle > citations[middle]
However,
h >= n-middle  => middle >= n-h
That means citations[n-h] <= citations[middle]. This is contradictory.
So we should go to the right in this case.

b. If citations[middle] > n-middle
This is not same as the case (a). Easily we can see that h = n-middle is an H-index. But there might be other elements on the left satisfy the H-index condition. So we simply go to the left to search more.
Note that the current position is end + 1 after we set the end pointer to middle-1. And also, while searching on the left, if we go to the left another time, then current h= n-middle is not the largest H-index. But if we never go to the left again, then the pointer end keeps unchanged, so we can simply return n-end-1when the search is terminated.
We come to coding now.

I. Python Code

II. Java Code