Effective expert selection for anonymous peer review is a critical step in the process of publishing research in refereed scientific journals, which remain the most important platform for the dissemination of scientific knowledge. Demand for peer review is increasing as the number of researchers, journals and publications increases. In the field of biomedicine alone, the estimated number of titles currently published stands between 13,000 and 14,000, of which 5,300 are indexed in the MEDLINE database (Journals database at the NCBI, [1]).
Upon successful submission of a manuscript for publication in a journal, editors attempt to quickly identify suitable reviewers, sometimes with the assistance of authors who may have been prompted to provide several suggestions. This procedure demands a fair knowledge of the experts in the manuscript’s area of knowledge from both authors and editors. Inevitably, authors and editors will concentrate their requests on a small set of referees, usually senior authors that they know and trust. As a result, senior authors are overloaded with demands.
The bibliography offers a resource to find experts. However, the increase in the rate of production of new research makes it increasingly difficult to track all the publications coming out from even narrow fields of research and many authors that could potentially be good reviewers may not be requested. An approach to ease this problem is the development of computational methods to assist authors and editors in reviewer selection based on the literature. Such methods have the potential to facilitate the task and should produce less biased and more systematic expert selections than manual protocols.
Ideally, we would like the computer to point to potential reviewers for a given manuscript using just the manuscript content as input. One straightforward strategy in this direction is to search the database of (peer-reviewed) scientific literature for the documents most similar to the manuscript we want to review, and then suggest the authors of these documents as experts. A widely used measure of document similarity is the cosine between the abstracts of the documents, encoded as vectors of the frequencies of the words they contain [2]. For example, the BioMed Central editorial process uses this approach, presenting the associate editor who is handling the submission with the cosine values between the abstract of the submitted manuscript and abstracts referenced in MEDLINE.
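The cosine measure described above can be illustrated with a minimal sketch. The tokenizer, the use of raw word counts without weighting or stop-word removal, and the function names `word_counts`, `cosine` and `rank_abstracts` are assumptions made for illustration; they are not taken from any of the systems cited here.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Encode a text as a sparse vector of lowercase word frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_abstracts(manuscript, abstracts):
    """Rank abstracts (a dict of id -> text) by similarity to the manuscript."""
    query = word_counts(manuscript)
    scores = [(doc_id, cosine(query, word_counts(text)))
              for doc_id, text in abstracts.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

In such a scheme, the authors of the top-ranked abstracts would be the candidate reviewers. A practical system would likely weight words by their rarity in the corpus rather than use raw counts.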
Two web server applications find similar abstracts in MEDLINE. An early one is eTBlast [3], which uses the same principle with a more elaborate measure of text similarity that takes into account word frequencies and word order in the text. A more recent one is Jane [4], a straight implementation of Lucene's [5] MoreLikeThis algorithm, which ignores word order and instead weights words by their relevance according to their frequencies in a whole corpus.
Here we propose a more comprehensive approach to computational selection of peer reviewers, which relies on comparison of a word profile of the manuscript not to that of other single manuscripts but to the collection of manuscripts authored by each potential peer reviewer. This approach necessitates building word profiles for authors. However, the problem that considerably complicates the matter of identifying the manuscripts authored by one individual, especially in an automatic way, is that many authors share last names and initials with other authors. Particularities of names across countries further complicate the issue. For example, most Chinese last names are extremely common, with the eleven most common being shared by about 40% of the Chinese population, yet their wide variety of given names is lost by Western abbreviation practices [6]. Ambiguity due to given name abbreviation is a problem that affects other Asian scientists as well [7, 8] in a manner well beyond the matter discussed here [9].
Dealing with author name ambiguity remains a hard problem. For the biomedical community, an obvious and ideal solution would be to have each author assigned a unique identifier in MEDLINE upon their first publication. However, this solution is not trivial to implement, as it would require the combined effort of a coordinating organization, such as the NCBI, and the whole of the scientific community. Even if implemented today, it would not resolve the name ambiguity in the large body of prior literature.
Meanwhile, the problem is only worsening due to the ever-increasing number of scientists. Computational efforts, mostly industry-led, are being made to implement algorithms that would partially address this matter by parsing MEDLINE. Some initiatives combine registration of users with profile generation; they vary widely in their degree of integration with companies and in the accessibility of their methods and data. ResearcherID from Thomson Reuters (http://www.researcherid.com/) is a company resource linked to other Thomson databases such as ISI Web of Knowledge. BioMed Experts from Collexis (http://www.biomedexperts.com/) contains a collection of automatically generated profiles for authors in MEDLINE based, according to their web site, on the concepts associated with the identifiers and on co-authors. Users can register and modify the profiles. Author-ity is an academic effort to generate author profiles [10] and offers a database of disambiguated author names in MEDLINE for download and a web interface to query it (http://arrowsmith.psych.uic.edu/arrowsmith_uic/author2.html).
Among the strategies to disambiguate authors that share the same name are the use of keywords that identify a particular subject of research, collaborators co-signing publications with the authors (networks of collaborators), physical location extracted from the affiliation data usually complemented with the years of publication, journal subject class (e.g. journals in the area of cardiology), and even co-citations in web pages [11].
Here, we have chosen to implement an approach based on co-authorship because it is straightforward and, in principle, can be easily applied in an unbiased manner to every single name. Accordingly, we have attempted to disambiguate every author name in MEDLINE by co-authors and assigned a different identifier to each disambiguated instance. Next, for each identifier we derived a profile of keywords extracted from the abstracts of the MEDLINE references associated with it. Given a manuscript, our method uses these profiles to suggest peer reviewers based on their similarity to the keyword profile deduced from the manuscript.
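One way to picture disambiguation by co-authors is sketched below. The linking rule used here (two papers carrying the same author name are attributed to the same individual if they share at least one other co-author name, applied transitively) is an assumption made for illustration, as are the function name `disambiguate` and the input layout; the exact criteria of our method are not reproduced by this sketch.

```python
from collections import defaultdict


def disambiguate(name, papers):
    """Split the papers signed with `name` into groups linked by co-authors.

    `papers` is a list of (paper_id, set_of_author_names) pairs (hypothetical
    layout). Returns a list of sets of paper ids, one set per presumed
    individual. Co-author names are compared as plain strings, so they may
    themselves be ambiguous.
    """
    relevant = [(pid, set(authors) - {name})
                for pid, authors in papers if name in authors]
    parent = {pid: pid for pid, _ in relevant}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_paper_with = {}
    for pid, coauthors in relevant:
        for coauthor in coauthors:
            if coauthor in first_paper_with:
                # A shared co-author links this paper to an earlier one.
                parent[find(pid)] = find(first_paper_with[coauthor])
            else:
                first_paper_with[coauthor] = pid

    groups = defaultdict(set)
    for pid, _ in relevant:
        groups[find(pid)].add(pid)
    return sorted(groups.values(), key=sorted)


papers = [
    (1, {"Smith J", "Lee K"}),
    (2, {"Smith J", "Lee K", "Chen W"}),
    (3, {"Smith J", "Garcia M"}),
    (4, {"Doe A", "Lee K"}),
]
instances = disambiguate("Smith J", papers)
```

In this toy example, papers 1 and 2 share the co-author "Lee K" and fall into one group, while paper 3 forms a separate group. Each group would then receive its own identifier and a keyword profile built from the abstracts of its papers.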