17 The end
Congratulations, you’re done with the book and I hope you learned a thing or two.
Part 1 focused on teaching you best practices, tools and techniques to make your code as clean as possible. In part 2, I taught you how to turn your project into a pipeline, and then how to make this pipeline reproducible using {renv} and Docker. To summarise, here are all the things that we need to think about to write a RAP:
- Write code that is as clean as possible: keep it DRY, document and test it well;
- Record package dependencies of the project;
- Record the R version that you use;
- Use a tool that builds the project for you;
- Record the computational environment.
If you tick all these boxes, you, or anyone else, should not have any problems reproducing the results of your project. While it may seem that ticking these boxes takes up valuable time from other tasks, if you use the techniques and tools that I’ve shown you in part 1, this should not be the case, and you might even end up gaining time. The only exception to this will be preparing a Docker image, but if you supply at the very least an renv.lock file, creating a Docker image to run the project could be done much later, and only if it’s really needed (and maybe even by someone else).
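As a reminder of what the two "record" items in the checklist above can look like in practice, here is a minimal sketch that uses only base R. The function name `record_environment()` and the output file name are hypothetical and chosen purely for illustration; in a real project, the renv.lock file written by {renv} plays this role.

```r
# Minimal sketch, base R only.
# `record_environment()` and "environment_record.txt" are hypothetical names,
# used for illustration; they are not part of {renv} or of any other package.
record_environment <- function(path = "environment_record.txt") {
  info <- c(
    paste("R version:", R.version.string),   # the R version in use
    paste("Platform:", R.version$platform),  # the platform R was built for
    "Session information:",
    capture.output(print(sessionInfo()))     # loaded packages and their versions
  )
  writeLines(info, path)
  invisible(path)                            # return the path without printing it
}

# Write the record to a temporary file and show its first lines.
record_file <- record_environment(tempfile(fileext = ".txt"))
head(readLines(record_file))

# In a project managed with {renv}, the equivalent step is renv::snapshot(),
# which writes renv.lock (the R version plus the package versions).
```

A plain-text record like this is only a fallback: it documents the environment but cannot restore it, which is why supplying an renv.lock file remains the better minimum.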
“So what?”
If you’ve reached this conclusion and are still thinking “meh, yeah, reproducibility is nice and all, but… so what?” I hope that this last attempt of mine to convince you that RAPs are important will be successful.
So, why bother building RAPs? Firstly, there are purely technical considerations. It is not impossible that, in the fairly near future, we will work on ever thinner clients while the heavy-duty computations run in the cloud. Should this be the case, being comfortable with the topics discussed in this book will be valuable. Also, in this very near future, large language models will be able to write most, if not all, of the boilerplate code required to set up a RAP. This means that you will be able to focus on analysis, but you will still need to understand what the different pieces of a RAP are, and how they fit together, in order to understand the code that the large language model prepared for you, and to revise it if needed. And it is not a stretch to imagine that simple analyses could be taken over by large language models as well. So you might very soon find yourself in a position where you are not the one doing an analysis and setting up a RAP, but are instead checking, verifying and adjusting an analysis and a RAP built by an AI. Being familiar with the concepts laid out in this book will help you perform these tasks successfully in a world where every data scientist has AI assistants.
But more importantly, the following factors are inherently part of data analysis:
- transparency;
- sustainability;
- scalability.
It doesn’t matter if you’re working in research, for a public institution or a private sector company: the three points above are incredibly important and it’s impossible to perform data analysis without taking them into consideration, regardless of whether AIs take over some, or most, of the tasks you perform today. In the case of research, the publish or perish model has distorted incentives so much that, unfortunately, a lot of researchers are focused on getting published as quickly as possible, and see the three factors listed above as hurdles to getting published quickly. Herculean efforts have to be made to reproduce studies that are not reproducible, and more often than not, people who try to reproduce the results are unsuccessful. Thankfully, things are changing, and more and more efforts are being made to make research reproducible by design, and not as an afterthought. In the private sector, tight deadlines lead to the same problem: analysts think that making the project reproducible is a hindrance to delivering on time. But here as well, taking the time to make the project reproducible will help ensure that what is delivered is of high quality, and it will also make reusing existing code in future projects much easier, accelerating development even further.
Data analysis, at whatever level and for whatever field, is not just about getting to a result: the way to get to the result is part of it! This is true for science, this is true for industry, this is true everywhere. You get to decide where on the iceberg of reproducibility you want to settle, but the lower, the better.
So why build RAPs? Well, because there’s no alternative if you want to perform your work seriously.