10  Basic reproducibility: freezing packages

We are at a stage where the analysis is done. Converting our scripts into Rmds was quite easy to justify because writing the Rmds is also writing the report that we need to send to our boss (or our research paper, etc). But it might be harder to justify writing further documentation or package the functions we’ve had to write for reuse later and otherwise ensure that the analysis is and stays reproducible. So we are going to start with the simplest and most cost-effective solution to the reproducibility issue, which is recording the versions of the packages that were used. This is quite easy and quick to do and provides at least some hope that the analysis will stay reproducible. But this will not do anything to make the analysis more easily re-usable, will not improve the documentation, nor ensure that what we wrote is indeed correct. For this, we would need to write tests, which are missing from our current analysis. We only wrote one test, which made sure that all the communes were accounted for. This is why going with a package is so useful (and I need to stress here that the aim of writing a package is NOT to upload it to CRAN): packages offer us a great framework for documenting, testing and sharing our code (even if only sharing internally in your company/team, or even just future you). So in order to take advantage of what packaging our code has to offer, we will learn about the {fusen} package in the next chapter. {fusen} will enable us to convert our Rmd files into a package quite quickly; your analysis is much closer to being a package than you think. As you shall see in the next chapter, going from our Rmds to a fully functioning package is much easier than you expect, even if you’ve never written a package in your life.

So I hope that I made my point clear: it is not recommended to stop at this stage, but I also recognize that we live in the real world with real physical constraints. So because we live in this imperfect world, sometimes we need to deliver imperfect work. So let’s see what we can do that is very cheap in terms of effort and time, but that still allows us to have some hope of having our analysis reproducible, all thanks to the {renv} package. It is quite easy to get started with {renv}: simply install it, and get a record of the used packages and their versions with one single command. This record gets saved inside a file that can then be used to restore this project’s library in the future, and without interfering with the other packages that you already have installed on your machine. You see, {renv} creates a per-project library (a library is the set of R packages installed on your machine) which means that you can have as many versions of {dplyr} as needed (one per project). The right version of {dplyr} will be installed and used for the right project only, and without interfering with other installed versions.

Let’s see how this works by creating such a project-specific library for our little project.

10.1 Recording packages’ version with {renv}

So now that you’ve used functional and literate programming, we need to start thinking about the infrastructure surrounding our code. By infrastructure I mean:

  • the R version;
  • the packages used for the analysis;
  • and otherwise the whole computational environment, even the computer hardware itself.

{renv} enables you to create Reproducible Environments and is a package that takes care of point number 2: it allows you to easily record the packages that were used for a specific project. This record is a file called renv.lock which will appear at the root of your project once you’ve set up {renv} and run it. You can use {renv} once you’re done with an analysis like in our case, or better yet, immediately at the start of the project. You can keep updating the renv.lock file as you add or remove packages from your analysis. The renv.lock file can then be used to restore the exact same package library that was used for your analysis on another computer, or on the same computer but in the future.

This works because {renv} does more than simply create a list of the used packages and recording their versions inside the renv.lock file: it actually creates a per-project library that is completely isolated from the main, default, R library on your machine, but also from the other {renv} libraries that you might have set up for your other projects (remember, the library is the set of R packages installed on your computer). To save time when setting up an {renv} library, packages simply get copied over from your main library instead of being re-downloaded and re-installed (if the required packages are already installed in your default library).

To get started, install the {renv} package (make sure to start a fresh R session):

install.packages("renv")

and then go to the folder containing the Rmds we wrote together in the previous chapter. Make sure that you have the two following files in that folder:

  • save_data.Rmd, the script that downloads and prepares the data;
  • analyse_data.Rmd, the script that analyses the data.

Also, make sure that the changes are correctly backed up on Github.com, so if you haven’t already, commit and push any change to the rmd branch. Because we will be experimenting with a new feature, create a new branch called renv. You should know the drill by now, but if not simply follow along:

owner@localhost ➤ git checkout -b renv
Switched to a new branch 'renv'

We will now be working on this branch. Simply work as usual, but when pushing, make sure to push to the renv branch:

owner@localhost ➤ git add .
owner@localhost ➤ git commit -m "some changes"
owner@localhost ➤ git push origin renv

Once this is done, start an R session, and simply type the following in a console:

renv::init()

You should see the following:

* Initializing project ...
* Discovering package dependencies ... Done!
* Copying packages into the cache ... [76/76] Done!
The following package(s) will be updated in the lockfile:

# CRAN ===============================
***and then a long list of packages***

The version of R recorded in the lockfile will be updated:
- R              [*] -> [4.2.2]

* Lockfile written to 'path/to/housing/renv.lock'.
* Project 'path/to/housing' loaded. [renv 0.16.0]
* renv activated -- please restart the R session.

Let’s take a look at the files that were created (if you prefer using your file browser, feel free to do so, but I prefer the command line):

owner@localhost ➤ ls -la
total 1070
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:44 .
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:35 ..
-rw-r--r-- 1 owner Domain Users    27 Feb 27 12:44 .Rprofile
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:40 .git
-rw-r--r-- 1 owner Domain Users   306 Feb 27 12:35 README.md
-rw-r--r-- 1 owner Domain Users  2398 Feb 27 12:38 analyse_data.Rmd
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:44 renv
-rw-r--r-- 1 owner Domain Users 20502 Feb 27 12:44 renv.lock
-rw-r--r-- 1 owner Domain Users  6378 Feb 27 12:38 save_data.Rmd

As you can see, there are two new files and one folder. The new files are the renv.lock file that I mentioned before and a file called .Rprofile. The new folder is simply called renv. The renv.lock is the file that lists all the packages used for the analysis. .Rprofile files are files that get read by R automatically at startup (as discussed at the very beginning of part one of this book). You should have a system-wide one that gets read on startups of R, but if R discovers an .Rprofile file in the directory it starts on, then that file gets read instead. Let’s see the contents of this file (you can open this file in any text editor, like Notepad on Windows, but then again I prefer the command line):

owner@localhost ➤ cat .Rprofile

The files contains the single line:

source("renv/activate.R")

activate.R is an R script that was also create by renv::init(), which you can find in the renv folder. Let’s take a look at the contents of this folder:

owner@localhost ➤ ls -la renv
total 107
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:44 .
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:35 ..
-rw-r--r-- 1 owner Domain Users    27 Feb 27 12:44 activate.R
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:40 .gitignore
drwxr-xr-x 1 owner Domain Users     0 Feb 27 12:40 library
-rw-r--r-- 1 owner Domain Users  6378 Feb 27 12:38 settings.dcf

So inside the renv folder, there is another folder called library: this is the folder that contains our isolated library for just this project. That’s something that we would not want to back up on Github as it grows quite large. To avoid tracking this folder and backing it up on Github.com a file called .gitignore (notice the . at the start of the name) gets used. This is a file that contains the paths to other files and folders that should be ignored. If you open it, you will see that the library/ folder is listed there so it will be ignored. You can have as many .gitignore files as necessary, but if you put one in the root folder of your project, then that .gitignore will work project-wide.

For example, if you are working with sensitive data, you could also add a .gitignore file in the root of the project’s directory, and simply list the folder containing the sensitive data. Create this file using your favourite text editor and simply add, for example if you’re working with sensitive data, the following:

datasets/

This will prevent the datasets/ folder from being tracked and backed up.

Let’s restart a fresh R session in our project’s directory; you should see the following startup message:

* Project 'path/to/housing' loaded. [renv 0.16.0]

This means that this R session will use the packages installed in the isolated library we’ve just created. Let’s now take a look at the renv.lock file:

owner@localhost ➤ cat renv.lock
{
"R": {
  "Version": "4.2.2",
  "Repositories": [
  {
   "Name": "CRAN",
   "URL": "https://packagemanager.rstudio.com/all/latest"
  }
  ]
},
"Packages": {
  "MASS": {
    "Package": "MASS",
    "Version": "7.3-58.1",
    "Source": "Repository",
    "Repository": "CRAN",
    "Hash": "762e1804143a332333c054759f89a706",
    "Requirements": []
  },
  "Matrix": {
    "Package": "Matrix",
    "Version": "1.5-1",
    "Source": "Repository",
    "Repository": "CRAN",
    "Hash": "539dc0c0c05636812f1080f473d2c177",
    "Requirements": [
      "lattice"
    ]

    ***and many more packages***

The renv.lock file is a json file listing all the packages, as well as their dependencies, that are used for the project. At the top of the file, the R version that was used to generate the renv.lock file is also named. It is important to remember that when you’ll use {renv} to restore a project’s library on a new machine, the R version will not be restored: so in the future, you might be restoring this project and running old versions of packages on a newer version of R, which may sometimes be a problem (but we’re going to discuss this later).

So… that’s it. You’ve generated the renv.lock file, which means that future you, or someone else, can restore the library that you used to write this analysis. All that’s required is for that person (or future you) to install {renv} and then use the renv.lock file that you generated to restore the library. Let’s see how this works by cloning the following Github repository on this link1 (forked from this one here2):

owner@localhost ➤ git clone git@github.com:b-rodrigues/targets-minimal.git

You should see a targets-minimal folder on your computer now. Start an R session in that folder and type the following command:

renv::restore()

You should be prompted to activate the project before restoring:

This project has not yet been activated.
Activating this project will ensure the project library
is used during restore.
Please see `?renv::activate` for more details.

Would you like to activate this project before restore? [Y/n]: 

Type Y and you should see a list of packages that need to be installed. You’ll get asked once more if you want to proceed, type y and watch as the packages get installed. If you pay attention to the links, you should see that many of them get pulled from the CRAN archive, for example:

Retrieving 'https://cloud.r-project.org/src/contrib/Archive/vroom/vroom_1.5.5.tar.gz'

Notice the word “Archive” in the url? That’s because this project uses {vroom} 1.5.5, but as of writing (early 2023), {vroom} is at version 1.6.1.

Now, maybe you’ve run renv::restore(), but the installation of the packages may have failed. If that’s the case, let me explain what likely happened.

I tried restoring the project’s library on two different machines: a Windows laptop and a Linux workstation. renv::restore() failed on the Windows laptop, but succeeded on the Linux workstation.

Why does that happen? Well in the case of the Windows laptop, compilation of the {dplyr} package failed. This is likely because my Windows laptop does not have the right version of Rtools installed. If you look inside the renv.lock file that came with the targets-minimal project, you should notice that the recorded R version is 4.1.0, but I’m running R 4.2.2 on my computers. So libraries get compiled using Rtools 4.2 and not Rtools 4.0 (which includes the libraries for R 4.1 as well).

So in order to run this project successfully, I should install the right version of R and Rtools which on Windows should not be too complicated. But that might be a problem on other operating systems. Does that mean that {renv} is useless? No, not at all.

At a minimum, {renv} ensures that a project’s library doesn’t interfere with another project’s library while you’re working on different projects. You can be working on several projects and be sure that if you update your library (for example, to use a nice new function from one specific package), that update will only affect the project where the library was updated, and not the other libraries from the other projects. When working with a system-wide library, updating packages for a project could introduce breaks in other projects. Without {renv}, such breaks could happen because another function coming from some other package that also got updated and that you use in another long-term project got removed, or renamed, or simply works differently now. This types of issues can be avoided by using a per-project library, which is exactly what {renv} does.

But also, apart from that already quite useful feature, renv.lock files provide a very useful blueprint for Docker, which we are going to explore in a future chapter. Only to give you a little taste of what’s coming: since the renv.lock file lists the R version that was used to record the packages, we can start from a Docker image that contains the right version of R. From there, restoring the project using renv::restore() should succeed without issues. If you have no idea what this all means, do not worry, you will know by the end of the book, so hang in there.

So should you use {renv}? I see two scenarios where it makes sense:

  • You’re done with the project and simply want to keep a record of the packages used. Simply call renv::init() at the end of the project and commit and push the renv.lock file on Github.
  • You want to use {renv} from the start to isolate the project’s library from your whole R installation’s library to avoid any interference (I would advise you to do it like this).

In the next section, we’ll quickly review how to use {renv} on a “daily basis”.

10.1.1 Daily {renv} usage

So let’s say that you start a new project and want to use {renv} right from the start. You start with an empty directory, and add a template .Rmd file, and let’s say it looks like this:

---
title: "My new project"
output: html_document
---

```{r setup, include=FALSE}
library(dplyr)
```


## Overview

## Analysis

Before continuing, make sure that it correctly compiles into a HTML file by running rmarkdown::render("test.Rmd") in the correct directory.

In the setup chunk you load the packages that you need. Now, save this file, and start a fresh session in the same directory and run renv::init(). You should see the familiar prompts described above, as well as the renv.lock file (which will only contain {dplyr} and its dependencies).

Now, after the library(dplyr) line, add the following library(ggplot2) (or any other package that you use on a daily basis). Make sure to save the .Rmd file and try to render it again by using rmarkdown::render("test.Rmd") (or if you’re using RStudio, by clicking the right button), but, spoiler alert, it won’t work. Instead you should see this:

Quitting from lines 7-9 (my_new_project.Rmd)
Error in library(ggplot2) : there is no package called 'ggplot2'

Don’t be confused: remember that {renv} is now activated, and that each project where {renv} is enabled has its own project-specific library. You may have {ggplot2} installed on your system-wide library, but this {renv}-enabled project does not have it yet on its own specific library. This means that you need to install {ggplot2} for your project. To do so, simply start an R session within your project and run install.packages("ggplot2"). If the version installed on your system-wide library is the latest version available on CRAN, the package will simply be copied over from the system-wide to the project-specific library, if not, the latest version will be downloaded onto your project’s library. You can now update the renv.lock file. This is done using renv::snapshot(); this will show you a list of new packages to record inside the renv.lock file and ask you to continue:

**list of many packages over here**

Do you want to proceed? [y/N]: 
* Lockfile written to 'path/to/my_new_project/renv.lock'.

If you now open the renv.lock file, and look for the string "ggplot2" you should see it listed there alongside its dependencies. Let me reiterate: this version of {ggplot2} is now unique to this project. You can work on other projects with other versions of {ggplot2} without interfering with this one. If for some reason you didn’t want to have the latest version of {ggplot2} installed in the project-specific library, you could have installed an older version, still thanks to {renv}. For example, to install an older version of {AER}:

renv::install("AER@1.0-0") # this is a version from August 2008

But just like in the previous section, where we wanted to restore an old project that used {renv}, installation of older packages may fail. If you need to use old packages there are other easier approaches which we are also going to explore in this chapter.

When you install more of the required packages for your project, you need to call renv::snapshot() to add the packages to the renv.lock file. Once you’re done with your project, call renv::snapshot() one last time to make sure that every dependency is correctly accounted for. Don’t forget to version the renv.lock file using Git!

10.1.2 Collaborating with {renv}

{renv} is also quite useful when collaborating. You can start the project and generate the lock file, and when your teammates clone the repository from Github, they can get the exact same package versions as you. All of you only need to make sure that everyone is running the same R version to avoid any issues.

There is a vignette on just this that I invite you to read for more details, see here3.

10.1.3 {renv}’s shortcomings

As useful as {renv} is, it also comes with some shortcomings (which I’ve already alluded to before). It is quite important to understand what {renv} does and what it doesn’t do, and why {renv} alone is not enough to ensure the long-term reproducibility of your projects.

The first problem, and I’m repeating myself here, is that {renv} only records the R version used for the project, but does not restore it when calling renv::restore(). You need to install the right R version yourself. On Windows this should be fairly easy to do, but then you need to start juggling R versions and know which scripts need which R version, which can get confusing.

There is the {rig} package that makes it easy to install and switch between R versions that you could check out4 if you’re interested. However, I don’t think that {rig} should be used for our purposes. I believe that it is safer to use Docker instead, and we shall see how to do so in the coming chapters.

The other issue of using {renv} is that future you, or your teammates or people that want to reproduce your results need to install packages that may be quite difficult to install, either because they’re very old by now, or because their dependencies are difficult to satisfy. Have you ever tried to install a package that depended on {rJava}? Or the {rgdal} package? Installing these packages can be quite challenging because they need specific system requirements that may be impossible for you to install (either because you don’t have admin rights on your workstation, or because the required version of these system dependencies is not available anymore). Having to install these packages (and potentially quite old versions at that) can really hinder the reproducibility of your project. Here again, Docker provides a solution. Future you, your teammates or other people simply need to be able to run a Docker container, which is a much lower bar than installing these old libraries.

Note also that during your journey, you will want to improve your scripts, you maybe will also want to upgrade some dependencies to take advantage of recent developments in some packages. This will require upgrading the {renv} environment. To do so, run update.packages() to update the packages, and then renv::snapshot() to to generate a new renv.lock file. But how will you be sure that these upgrades will not impair some other parts of your code? Turning the analysis into a package will help with this, as I’ll explain in the next chapter, because once the analysis is turned into a package, testing gets much easier. If the tests indicate that something went wrong, you can simply restore the previous lock file using Git and restore the old library.

I want to stress that this does not mean that {renv} is useless: we will keep using it, but together with Docker to ensure the reproducibility of our project. As I’ve written above already, at a minimum {renv} ensures that a project’s library doesn’t interfere with another project’s library and this is in itself already quite useful. The renv.lock file also provides a blueprint that can be used (by you or someone else) to build a Docker image for long-term reproducibility.

Let’s now quickly discuss two other packages before finishing this chapter, which provide an answer to the question: how to rerun an old analysis if no renv.lock file was made available at the time the analysis was made?.

10.2 Becoming an R-cheologist

So let’s say that you need to run an old script, and there’s no renv.lock file around for you to restore the right package library. There might still be a solution (apart from running the script on the current version on R and packages, and hope that everything goes well), but for this you need to at least know roughly when that script was written. Let’s say that you know that this script was written back in 2017, somewhere around October. If you know that, you can use the {rang} or {groundhog} packages to download the packages as of October 2018 in a separate library and then run your script.

{rang} is fairly recent as of writing (February 2023) so I won’t go into too much detail now, as it is likely that the package will keep evolving rapidly in the coming weeks. So if you want to use it and follow its development, take a look at its Github repository here5 and read the prepint6 (Chan and Schoch 2023).

{groundhog} (website7) is another option that has been around for more time and is fairly easy to use. Suppose that you have a script from October 2018 that looks like this:

library(purrr)
library(ggplot2)

data(mtcars)

myplot <- ggplot(mtcars) +
  geom_line(aes(y = hp, x = mpg))

ggsave("/home/project/output/myplot.pdf", myplot)

If you want to run this script with the versions of {purrr} and {ggplot2} that were current in October 2017, you can achieve this by simply changing the library() calls:

groundhog::groundhog.library("
    library(purrr)
    library(ggplot2)",
    "2017-10-04"
    )

data(mtcars)

myplot <- ggplot(mtcars) +
  geom_line(aes(y = hp, x = mpg))

ggsave("/home/project/output/myplot.pdf", myplot)

but you will get the following message:

-----------------------------------------------
|IMPORTANT.
|    Groundhog says: you are using R-4.2.2, but the version of R current
|    for the entered date, '2017-10-04', is R-3.4.x. It is recommended
|    that you either keep this date and switch to that version of R, or
|    you keep the version of R you are using but switch the date to
|    between '2022-04-22' and '2023-01-08'.
|
|    You may bypass this R-version check by adding:
|    `tolerate.R.version='4.2.2'`as an option in your groundhog.library()
|    call. Please type 'OK' to confirm you have read this message.
|   >ok

Groundhog is speaking to us, advising us to switch to the version of R that was current at that time the script was written. If we want to ignore the message’s advice, and add tolerate.R.version = '4.2.2', we may get the script to run anyways:

groundhog.library("
    library(purrr)
    library(ggplot2)",
    "2017-10-04",
    tolerate.R.version = "4.2.2")

data(mtcars)

myplot <- ggplot(mtcars) +
  geom_line(aes(y = hp, x = mpg))

ggsave("/home/project/output/myplot.pdf", myplot)

But just like for {renv} (or {rang}), installation of the packages can fail, and for the same reasons (unmet system requirements most of the time).

So here again, the solution is to take care of the missing piece of the reproducibility puzzle, which is the whole computational environment itself.

10.3 Conclusion

In this chapter you had a first (maybe a bit sour) taste of reproducibility. This is because while the tools presented here are very useful, they will not be sufficient if we want our project to be truly reproducible, especially in the long term. There are too many things that can go wrong when re-installing old package versions, so we must instead provide a way for users to not have to do it at all. This is where Docker is going to be helpful. But before that, we need to hit the development bench again. We are actually not quite done with our project; before going to full reproducibility, we should turn our analysis into a package. And as you will see, this is going to be much, much, easier than you might expect. You already did 95% of the job! There are many advantages to turning our analysis into a package, and not only from a reproducibility perspective.


  1. https://is.gd/jMVfCu↩︎

  2. https://is.gd/AAnByB↩︎

  3. https://is.gd/sXpWVp↩︎

  4. https://is.gd/dvH2Sj↩︎

  5. https://is.gd/sQu7NV↩︎

  6. https://arxiv.org/abs/2303.04758↩︎

  7. https://groundhogr.com↩︎