unpackaging services for research code developers


logos staff week

  • What is a library?
  • it’s a collection of routines that may be compiled and made ready to be used in programmes
  • What does a librarian do to help researchers manage their source code?
  • they build a collection of routines that may be standardized and made available to be used in research projects

Figure 1: Library services as a package

  • FOSS means Free and Open Source Software. Open source means that the source code is open, but what is source code?
  • function | algorithm > source code (human readable) > binaries (machine readable)

function or algorithm

“I would like a script that when I run it says Good morning, Hello, good evening or good night according to the hour of the day”

source code

import datetime

# Check the current hour
hour = datetime.datetime.now().hour

# Determine the greeting based on the hour
if 6 <= hour < 12:
    print("Good morning")
elif 12 <= hour < 18:
    print("Hello")
elif 18 <= hour < 22:
    print("Good evening")
else:
    print("Good night")

binaries
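
As a rough illustration of this last step, the sketch below compiles a function version of the greeting script and shows the bytes that result. Two assumptions are made here: the script is wrapped in a hypothetical `greeting(hour)` function so that it can be compiled and called, and CPython bytecode stands in for a binary (it is run by the Python interpreter, not directly by the processor).

```python
import dis

# Hypothetical function version of the greeting script shown earlier.
SOURCE = '''
def greeting(hour):
    if 6 <= hour < 12:
        return "Good morning"
    elif 12 <= hour < 18:
        return "Hello"
    elif 18 <= hour < 22:
        return "Good evening"
    return "Good night"
'''

# Human-readable source code is compiled into a code object.
module_code = compile(SOURCE, "<greeting>", "exec")
namespace = {}
exec(module_code, namespace)
greeting = namespace["greeting"]

# The machine-oriented form: raw bytecode, unreadable without tooling.
raw_bytes = greeting.__code__.co_code
print(raw_bytes.hex())

# A disassembler gives a readable view of the same bytes.
dis.dis(greeting)
```

The hexadecimal string is what a person sees without the source: the function still runs, but its logic can no longer be reviewed or adapted easily.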



1. Opportunities

Figure 2

  • librarians are advocates of FOSS (👍 Koha rather than Ex Libris, 👍 Zotero rather than EndNote, 👍 LibreOffice rather than Microsoft Office, etc.)
  • GDPR is easier to enforce with auditable code
  • if code is auditable, development is open to contributions
  • Open Science works better with Open Source Software
  • researchers are not informaticians, neither are we

researchers are not informaticians, neither are we

  • Researchers are more reluctant to share their code than their data
  • GenAI tools make source code easier to produce but more difficult to adapt and review
  • 😺 understanding allows adaptation and autonomy
  • 😾 from scratch generation with GenAI leads to dependency and complexity

Open Science works better with FOSS : a matter of transparency

library(lubridate)
df$date <- ydm(df$date)
class(df$date)
  • 📓 Ziemann et al. (2023)
  • 📓 Hinsen & Rougier (2017)
Figure 3
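
One possible reading of the snippet above, sketched in Python: when the conversion is written in open source code, a reviewer can see which date convention was applied. The raw value `"2023-04-05"` is a hypothetical example, and the two format strings are chosen to mirror lubridate's `ymd()` and `ydm()`.

```python
from datetime import datetime

raw = "2023-04-05"  # hypothetical value taken from a date column

# Same string, two conventions, two different dates.
as_ymd = datetime.strptime(raw, "%Y-%m-%d").date()  # year-month-day
as_ydm = datetime.strptime(raw, "%Y-%d-%m").date()  # year-day-month

print(as_ymd)  # 2023-04-05 (5 April)
print(as_ydm)  # 2023-05-04 (4 May)
```

Both calls succeed without any warning, so only the source code documents the choice that was made.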

Open Science works better with FOSS : a matter of accessibility

Figure 4

2. A more complex lifecycle

Figure 5

the software life cycle is more complex than a dataset's

Going upstream from the Archival phase : Software Heritage

  • Software Heritage
  • link with HAL archive
Figure 6

Going upstream from the Archival phase : Software Heritage

Figure 7
swh:1:cnt:0c1741c1fb0150f111625d02277407f628c31bac;origin=https://github.com/virtualagc/virtualagc;visit=swh:1:snp:cdcd2bc43331a436e8c659ba93175ef7d7eb339b;anchor=swh:1:rev:4e5d304eb7cd5589b924ffb8b423b6f15511b35d;path=/Luminary116/THE_LUNAR_LANDING.agc;lines=244-260

resolves directly to

and can be rendered as a BibTeX citation

📓 Gruenpeter (2020)
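
The structure of the identifier above can be sketched in a few lines of Python. This is a minimal illustration, not the official Software Heritage tooling: `parse_swhid` and `content_swhid` are hypothetical helper names, and the second function relies on the fact that, for file contents, the hash is the one Git computes for a blob, which is what makes the identifier intrinsic.

```python
import hashlib

SWHID = (
    "swh:1:cnt:0c1741c1fb0150f111625d02277407f628c31bac"
    ";origin=https://github.com/virtualagc/virtualagc"
    ";visit=swh:1:snp:cdcd2bc43331a436e8c659ba93175ef7d7eb339b"
    ";anchor=swh:1:rev:4e5d304eb7cd5589b924ffb8b423b6f15511b35d"
    ";path=/Luminary116/THE_LUNAR_LANDING.agc"
    ";lines=244-260"
)


def parse_swhid(swhid):
    """Split a SWHID into its core identifier and its qualifiers."""
    core, *qualifier_parts = swhid.split(";")
    scheme, version, object_type, object_hash = core.split(":")
    qualifiers = dict(part.split("=", 1) for part in qualifier_parts)
    return {
        "scheme": scheme,
        "version": version,
        "type": object_type,  # cnt = file content
        "hash": object_hash,
        "qualifiers": qualifiers,
    }


def content_swhid(data):
    """Compute the identifier of a file content from its bytes alone."""
    header = b"blob %d\x00" % len(data)  # same header as a Git blob
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


parsed = parse_swhid(SWHID)
print(parsed["type"], parsed["hash"])
print(parsed["qualifiers"]["path"], parsed["qualifiers"]["lines"])
```

Because the core identifier is derived from the bytes themselves, anyone holding a copy of the file can recompute it and check that nothing has changed; the qualifiers only add context (where it was found, which lines are cited).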

Going upstream from the Archival phase : HAL

Does HAL play a pivotal role in the findability of research outputs and artifacts?

Going upstream from the Archival phase : the forge

  • version control is required for open science
  • a forge is not an archive
  • Institutional GitLab instances are usually closed to external collaboration by default
Figure 8

a forge is not an archive

Repositories on forges can be damaged or deleted by their owners, including by accident, and the forges themselves may be closed down, as happened to two popular forges that were shut down in 2015 (Gitorious) and 2016 (Google Code), invalidating all the URLs that linked to them.

📓 Rougier et al. (2025)

3. Documentation, tests and licences

Figure 9

mapping metadata files

Documentation is the glue that binds a data science project together

📓 Ziemann et al. (2023)

  • Software Management Plan (form) -> Data Management Plan
  • README file (instructions) -> Software Heritage
  • CODEMETA file (JSON) -> HAL / Software Heritage
  • Citation File Format (YAML) -> Software Heritage
  • Software Description Templates required by journals -> Reviewers
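
Two of the files listed above can be produced from a single project description. The sketch below is a minimal illustration with placeholder values: the project name, author, repository URL and the helper names `to_codemeta` and `to_citation_cff` are all hypothetical, and only a handful of the fields defined by CodeMeta and the Citation File Format are shown.

```python
import json

# Hypothetical project description; every value is a placeholder.
project = {
    "name": "my-analysis",
    "version": "0.1.0",
    "license_spdx": "GPL-3.0-or-later",
    "repository": "https://example.org/my-analysis.git",
    "author_given": "Jane",
    "author_family": "Doe",
}


def to_codemeta(p):
    """Build a minimal codemeta.json document (JSON-LD)."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": p["name"],
        "version": p["version"],
        "license": "https://spdx.org/licenses/" + p["license_spdx"],
        "codeRepository": p["repository"],
        "author": [
            {
                "@type": "Person",
                "givenName": p["author_given"],
                "familyName": p["author_family"],
            }
        ],
    }


def to_citation_cff(p):
    """Build a minimal CITATION.cff document (YAML written by hand)."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{p["name"]}"',
        f'version: "{p["version"]}"',
        f'license: {p["license_spdx"]}',
        f'repository-code: "{p["repository"]}"',
        "authors:",
        f'  - family-names: "{p["author_family"]}"',
        f'    given-names: "{p["author_given"]}"',
    ]
    return "\n".join(lines) + "\n"


codemeta_text = json.dumps(to_codemeta(project), indent=2)
cff_text = to_citation_cff(project)
print(codemeta_text)
print(cff_text)
```

Keeping one description and generating both files from it avoids the two drifting apart when the version or the author list changes.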

Reusability by researchers through appropriate licence

Figure 10

Reusability by researchers through appropriate licence

Dual licence :

  • GPL licence for individuals (always reusable by researchers)

  • GPL or a permissive licence sold to commercial entities

4. Replicability

Figure 11

Replicability challenges : definition

  • reproducibility = same software but other data
  • replicability = same software, same data

Replicability challenges : parameters

Figure 12
  • jargon : “Dust” means “Pour sugar”
  • ingredients (some won’t be available within a couple of years)
  • some programme available in the oven
  • specialized frying pans
  • power, oven’s volume and heat

replicability implies different actions at different levels

  • Dust -> “Pour sugar” -> Language evolution
  • Ingredients -> Software Dependencies
  • some programme available in the oven -> Operating systems
  • specialized frying pans -> System Dependencies
  • power, oven volume -> GPU, system firmware
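
Several of the levels above can be recorded from inside a script. A minimal Python sketch, assuming only the standard library: the dictionary keys are hypothetical labels chosen to echo the list above, and GPU or firmware details are beyond what these calls report.

```python
import platform


def describe_environment():
    """Record the language, operating system and hardware levels."""
    return {
        # Language evolution: which interpreter version ran the code
        "language": f"Python {platform.python_version()}",
        "implementation": platform.python_implementation(),
        # Operating system level
        "operating_system": f"{platform.system()} {platform.release()}",
        # Hardware level (architecture only)
        "architecture": platform.machine(),
    }


environment = describe_environment()
for level, value in environment.items():
    print(f"{level}: {value}")
```

Saving this output next to the results plays the same role as the session information printed at the end of this presentation.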

software dependency levels : virtual environments

renv::init() # create a project-local library (the renv/ folder inside the project) and a renv.lock file
renv::status() # check that no package is used by the project without being recorded in the renv.lock file
renv::snapshot() # record in renv.lock the versions of the packages the project depends on
renv::restore() # reinstall in our own environment the package versions recorded in another person's renv.lock

📓 Package Renv (n.d.)
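
The idea behind the renv workflow above (record versions, then compare or restore) can be sketched in Python with the standard library. This is an illustration of the principle only, not an equivalent of renv: `snapshot_environment` and `compare_to_lock` are hypothetical helper names, and the lock is a plain dictionary rather than a lock file format.

```python
from importlib import metadata


def snapshot_environment():
    """Return {package name: version} for the installed distributions."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip installs whose metadata has no name
            versions[name.lower()] = dist.version
    return dict(sorted(versions.items()))


def compare_to_lock(lock, current):
    """List packages whose installed version differs from the lock.

    Returns {name: (locked version, installed version or None)}.
    """
    return {
        name: (wanted, current.get(name))
        for name, wanted in lock.items()
        if current.get(name) != wanted
    }


# Hypothetical lock recorded by a colleague, and our own environment.
colleague_lock = {"examplepkg": "1.2.0", "otherpkg": "0.9.1"}
our_environment = {"examplepkg": "1.3.0"}

print(compare_to_lock(colleague_lock, our_environment))
```

Here the comparison reports one package at a different version and one that is missing, which are the two situations a restore step has to repair.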

system dependency levels : containers

Figure 13

system dependency levels : containers

Figure 14

Managing system and software dependencies with a package manager dedicated to replicability

Findability : HAL
Accessibility : Software Heritage
Interoperability : virtual environments, containers, replicable package managers
Reusability : Licence, Documentation, tests

Data Without Software Are Just Numbers

📓 Davenport et al. (2020)

figures

Figure: source and credits
Figure 1: Library as an R package, made with ChatGPT with the following prompt: “fill an hexagone (the common shape for R packages logos) with a drawing of a uni library”
Figure 2: Foss_librarians, by Frederick Noronha
Figure 3: meme whose origins are lost in the roots of digital times
Figure 4: meme designed with Framasoft
Figure 5: simplified pull request
Figure 6: screenshot of the lunar landing guidance code archived on the Software Heritage website
Figure 7: Eagle Lunar Module, Wikipedia, CC BY-SA
Figure 8: GitHub-like anvil, made with ChatGPT with the following prompt: “draw an anvill and fill it with the colored monthly commit diagramme that can be found on github repositories but change the color from green to blue”
Figure 9: Read the fucking Manual, RTFM, Michele M. F.
Figure 10: Hotwax System
Figure 11: Now and Future Ruby’s code
Figure 12: Joyful baking moment, public domain
Figure 13: Docker Rationale
Figure 14: Dockerfile with Renv

software used for this presentation

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.3.3    fastmap_1.2.0     cli_3.6.3         tools_4.3.3      
 [5] htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10       rmarkdown_2.29   
 [9] knitr_1.49        jsonlite_1.8.9    xfun_0.50         digest_0.6.37    
[13] rlang_1.1.4       evaluate_1.0.1   

References

Davenport, J. H., Grant, J., & Jones, C. M. (2020). Data Without Software Are Just Numbers. Data Science Journal, 19(1). https://doi.org/10.5334/dsj-2020-003
Gruenpeter, M. (2020). Intrinsic and extrinsic identifiers. In Software Heritage. https://www.softwareheritage.org/2020/07/09/intrinsic-vs-extrinsic-identifiers/
Hinsen, K., & Rougier, N. P. (2017, April). ReScience. Open Science, Transparence Et Évaluation. Perspectives Et Enjeux Pour Les Chercheurs. https://hal.science/hal-01573262
Package renv : Présentation et retour d’expérience. (n.d.). Retrieved May 13, 2025, from https://elisemaigne.pages.mia.inra.fr/2021_package_renv/presentation.html#41
Rougier, N., Di Cosmo, R., Hinsen, K., Maurice, C., Le Berre, D., Monat, R., Louvet, V., Jullien, N., Granger, S., & Maumet, C. (2025). Code beyond FAIR. https://inria.hal.science/hal-04930405v1
Ziemann, M., Poulain, P., & Bora, A. (2023). The five pillars of computational reproducibility: Bioinformatics and beyond. Briefings in Bioinformatics, 24(6), bbad375. https://doi.org/10.1093/bib/bbad375