As a first step in this project, the team developed a prototype for mapping scholarly content against five of the UN’s SDGs, with Digital Science chosen as a technology partner. Keyword search strings for five SDGs were defined with input from subject matter experts to produce training sets based on publications from the Dimensions platform. The training sets were then used to apply natural language processing and supervised machine learning, resulting in a classification scheme covering five SDGs: SDG 3: Good Health and Well-being; SDG 4: Quality Education; SDG 7: Affordable and Clean Energy; SDG 11: Sustainable Cities and Communities; and SDG 16: Peace, Justice, and Strong Institutions. Initial results from this work were released in December 2019.
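To make this two-stage approach more concrete, the sketch below illustrates how publications matched by expert-defined keyword strings can seed a supervised text classifier for one SDG. It is a minimal illustration only, assuming scikit-learn and toy example abstracts; it is not the actual Digital Science pipeline, whose features, model and training data are far more extensive.

```python
# Minimal illustrative sketch (assumed tooling, not the Digital Science pipeline):
# publications retrieved by an expert-defined keyword search string act as
# positive training examples for a per-SDG supervised classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: abstracts labelled 1 if retrieved by the
# SDG 3 keyword search string, 0 otherwise.
abstracts = [
    "Community health interventions reduce child mortality in low-income settings.",
    "A new catalyst improves hydrogen fuel cell efficiency for grid storage.",
]
labels = [1, 0]  # 1 = SDG 3 relevant, 0 = not relevant

sdg3_classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),   # simple NLP feature extraction
    LogisticRegression(max_iter=1000),       # supervised learner
)
sdg3_classifier.fit(abstracts, labels)

# The fitted model can then score publications the keyword strings alone missed.
print(sdg3_classifier.predict_proba(["Vaccination coverage in rural areas."]))
```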
In early 2020, Digital Science applied the resulting method and algorithm to the remaining 12 goals, releasing results for all 17 goals in April 2020, and making these freely and permanently accessible via Dimensions.
As a result of this mapping, it was possible to take a closer look at this complete corpus of SDG content. The mapping of research content to the SDGs has been the subject of several projects, including the SDG bibliometrics analysis of the Aurora Universities Network and the SDG dashboard of the VU Amsterdam. The results of these projects depend heavily on the methodology chosen as well as on the interpretation and translation of the SDGs into relevancy mappings. In general, three different methodologies can be applied: i) retrieving content that explicitly mentions the SDGs, ii) a set of keywords that try to ‘translate’ the SDG targets into search strings (the method currently used by the majority of other SDG relevancy mapping projects; a toy example follows this paragraph), and iii) a supervised machine learning algorithm which is itself seeded with keyword search strings (the method chosen for this project). In a recent blog post, Ismael Rafols from the Centre for Science and Technology Studies (CWTS) in Leiden highlighted the differences in approaches and concluded that “indicators on the contributions of science to the SDGs are not (yet) robust”. Researchers from the University of Bergen came to a similar conclusion when they compared the results of keyword search strings they had developed with Elsevier’s SDG classifier used in its SciVal product (also based on keyword search strings). They found little overlap between the two result sets, even though both rely on the same ‘keyword search string’ methodology, and concluded that “currently available SDG rankings and tools should be used with caution at their current stage of development.” Despite these known limitations, we believe that this corpus represents a meaningful subset of the overall scientific literature with which to further investigate the main research questions of this project.
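The toy example below shows what a ‘keyword search string’ in the sense of methodology (ii) might look like when applied to titles and abstracts. The terms are invented for illustration and loosely inspired by SDG 7; real projects use much larger, expert-curated strings, which is precisely why results are so sensitive to the choices made.

```python
# Illustrative only: a toy keyword search string for methodology (ii),
# loosely inspired by SDG 7 (Affordable and Clean Energy). The term list is an
# assumption for illustration, not any project's actual search string.
import re

SDG7_TERMS = [
    "renewable energy", "solar power", "wind energy",
    "energy efficiency", "clean cooking fuel",
]
pattern = re.compile("|".join(SDG7_TERMS), flags=re.IGNORECASE)

def matches_sdg7(title_and_abstract: str) -> bool:
    """Return True if the text matches the toy SDG 7 search string."""
    return bool(pattern.search(title_and_abstract))

print(matches_sdg7("Grid integration of solar power in rural microgrids"))  # True
```

Small changes to such a term list can noticeably change which publications are retrieved, which helps explain the low overlap observed between independently developed keyword-based mappings.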
Our bibliometric analysis investigates whether we can see any signs that OA is particularly beneficial for user groups outside of the core academic readership. In Springer Nature’s previously mentioned Hybrid OA white paper, the effects of OA on Hybrid journals were examined by comparing OA and Subscription documents in terms of usage (downloads), research impact (citations), and broader impact (Altmetric attention). Taking a similar approach, this study investigates the same usage and attention measures. Since OA status is not the only factor that influences these metrics, we created a model that aims to correct for the influence of variables at the document level (SDG, subject field, publication type and whether the document acknowledged any external funding), at the author level (institutional reputation, using the Times Higher Education World University Rankings as a proxy, and country) and at the journal level (Journal Impact Factor, as a proxy for perceived journal prestige). We used negative binomial regression models for all the analyses.
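As a hedged sketch of this modelling approach, the snippet below fits a negative binomial regression of one count outcome (downloads) on OA status while controlling for document-, author- and journal-level covariates. The column names and the statsmodels GLM formulation are assumptions for illustration; the exact model specification used in the study may differ.

```python
# Sketch only: negative binomial regression of a count metric on OA status with
# covariates at the document, author and journal level. Column names are assumed.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("sdg_documents_2017.csv")  # hypothetical analysis file

model = smf.glm(
    formula=(
        "downloads ~ is_oa "                      # OA vs Subscription
        "+ C(sdg) + C(subject_field) + C(pub_type) + has_funding "  # document level
        "+ C(country) + the_rank_band "           # author level (THE ranking proxy)
        "+ jif"                                   # journal level (Impact Factor proxy)
    ),
    data=df,
    family=sm.families.NegativeBinomial(),
).fit()
print(model.summary())
```

The same structure would apply, under the same assumptions, to the citation and Altmetric attention outcomes, each modelled as a separate regression.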
SDG-related content from the period 2010 to 2019 was downloaded from the Dimensions interface in May 2020 and further enriched with usage data from Springer Nature and data from Altmetric. In addition, only content containing information in all necessary metadata fields and defined as one of three publication types – article, proceeding or chapter – was considered.
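The filtering step can be pictured as follows; the field names and file name below are assumptions for illustration rather than the actual Dimensions export schema.

```python
# Illustrative filtering of the downloaded corpus (assumed field names): keep only
# records with all required metadata present and one of the three publication types.
import pandas as pd

records = pd.read_csv("dimensions_sdg_2010_2019.csv")  # hypothetical export

REQUIRED_FIELDS = ["doi", "pub_year", "pub_type", "journal", "country", "downloads"]
ALLOWED_TYPES = {"article", "proceeding", "chapter"}

complete = records.dropna(subset=REQUIRED_FIELDS)
filtered = complete[complete["pub_type"].str.lower().isin(ALLOWED_TYPES)]
```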
Since all metrics build up over time, the analysis focuses on publications from a single year in order to guarantee a ‘like-for-like’ comparison. The publication year 2017 was chosen (giving an average time frame of roughly three years, since the metrics were pulled in mid-2020), which offers a good compromise between recency on the one hand and a sufficient time frame for the various metrics to accumulate on the other. In all, there are 36,823 Springer Nature documents included in the downloads analysis (where COUNTER usage data was available for Springer Nature records), and a larger sample of 358,293 documents for citations and Altmetric attention across all publishers available via the Dimensions database.
In addition to looking at this research question from a global perspective, we also repeated the analysis restricted to documents with at least one author affiliated with a Dutch institution.
Our user survey explores who is reading content, looking in particular at non-academic users. Who are they, how many are there, for what purposes do they use research content, and how do they differ from the core academic user base?
During May, June and July 2020, Springer Nature ran a survey of readers of its online documents hosted on nature.com, link.springer.com and biomedcentral.com. Visitors to a document page who remained for more than 30 seconds were shown a slide-out banner inviting them to take part in an online survey about themselves and their use of the content. The survey was hosted online on Qualtrics. In total, 5,994 people answered the survey.
It is possible that this method resulted in selection bias. Firstly, the 30-second delay before showing the survey invitation was intended to exclude users with only a passing interest in the document concerned, but as a result may have skewed the sample slightly towards more engaged users. Secondly, users interested or particularly engaged in open research might have been more attracted by the survey announcement. And thirdly, there may also be a general selection bias towards people who are willing to take part in surveys.
A copy of the questionnaire has been made available along with the anonymised raw data in Zenodo.
We segmented respondents into three groups, based on the type of organisation they worked in, their stated role and job title, and the degree to which primary research is a major driver of their work. The three groups were named “Core”, “Halo” and “General” users. Core users were defined as those conducting and publishing research, primarily from academia. The non-core respondents were then split depending on whether they were in a role where research is likely to be a major influence or motivation (Halo) or not (General); a sketch of this segmentation logic follows below. We then looked at the different proportions of usage by segment between OA and Subscription content and between SDG-tagged and non-SDG-tagged content, as well as exploring variance in how content is used.
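The segmentation logic can be summarised as in the sketch below. The function name, input fields and the simplified rules are assumptions for illustration; the actual assignment drew on the full combination of organisation type, role, job title and stated research motivation described above.

```python
# Simplified sketch of the Core / Halo / General segmentation (assumed rules).
def segment_respondent(conducts_and_publishes_research: bool,
                       research_is_major_driver: bool) -> str:
    """Assign a survey respondent to the Core, Halo or General segment."""
    if conducts_and_publishes_research:
        return "Core"      # researchers conducting and publishing research,
                           # primarily based in academia in the survey data
    if research_is_major_driver:
        return "Halo"      # research is a major influence or motivation for their work
    return "General"       # research plays little direct role in their work

print(segment_respondent(False, True))  # -> "Halo"
```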
Figure 1. Slide-out banner used to invite participation in the survey