A shopping spree for data
My French colleagues Antoine Blanchard and Diane Thierry (from Datactivist) and I had the opportunity to carry out a study on the costs of article processing charges (APCs) for the French Ministry of Higher Education and Research (MESR).
Our proposal – to build a dataset with articles by French authors from the ground up and use that for a calculation of the APC costs – was accepted. We had a plan to do this, but – as is the case with many plans – we had to adjust it many times to circumvent unexpected obstacles.
It became a shopping spree for data. We started with data from the French Open Science Barometer (BSO). They delivered data on all journal articles with a French author for the years 2013-2020. We put this in our shopping cart (my colleague Diane did all the shopping cart work in R) and started shopping around for more relevant data.
Who are the corresponding authors? We got these data from the Web of Science. Which articles were published Open Access in Gold or hybrid journals? We got these data from OpenAlex. Couperin gave us a dataset of the journal titles they had contracts with, and with DOAJ and QOAM data we identified hybrid, Gold and Diamond journal titles. BSO had already made the connection with OpenAPC data for the prices. I have to admit that our shopping efforts sometimes failed: we had initially misunderstood how some datasets were structured, so we could not incorporate them.
However, in the end we were able to reach our goal: a dataset with trustworthy data on articles by French authors from 2012 to 2020. This made a retrospective analysis possible, and on that basis Antoine and Diane built a model in R for a prospective analysis.
Our whole data shopping spree and the analyses made it a super interesting study, and I think the results have interesting implications for the future. The main results are shown below.
You can see in the picture that, if all trends seen from 2012 to 2020 continue unchanged, French institutions will spend around €50 million in 2030 on APCs in the wild (the red line). With the help of Couperin, we also made an estimate of the subscription expenditures in that year (the grey line). Finally, we calculated the cost in a fully Open Access world, which we defined as 90% APC-paid OA and 10% Diamond OA.
Personally, I found it very interesting that the hybrid situation (subscriptions plus APCs in the wild) would cost around €150 million in 2030, while the full OA situation would require around €170 million in APC payments. That is not much of a difference once you take into account all the margins of error in these calculations.
Let me finish with a warning: predicting the future is always a tricky business. Yet we noticed at presentations that, because we used a lot of data and some modelling, some people think these predictions are absolutely sure to happen. In reality, these predictions are based on the 2013-2020 data and are nothing more (or less) than an extension of the trend lines of that period, albeit with a very complicated statistical model.
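To make that warning concrete, here is a minimal sketch of what "extending a trend line" means in its simplest form. It is not the model Antoine and Diane built (that one was written in R and is far more elaborate). It is a plain least-squares straight line in Python, and the yearly amounts are invented placeholders, not figures from our study. The function names are hypothetical as well.

```python
def fit_linear_trend(years, values):
    """Ordinary least-squares fit of: value = intercept + slope * year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(slope, intercept, year):
    """Read the fitted line at a year outside the observed period."""
    return intercept + slope * year


# Invented placeholder amounts in million euro (NOT data from the study).
years = list(range(2013, 2021))
spend = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]

slope, intercept = fit_linear_trend(years, spend)
projection_2030 = extrapolate(slope, intercept, 2030)
print(f"Projected 2030 amount: {projection_2030:.1f} million euro")
```

The projection for 2030 is simply the 2013-2020 pattern carried forward. Nothing in the fit knows about new contracts, price changes or policy shifts after 2020, which is exactly why such a number should be read as a scenario rather than a certainty.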