<- "https://is.gd/1vvBAc"
url
<- tempfile(fileext = ".xlsx")
raw_data
download.file(url, raw_data, method = "auto", mode = "wb")
<- excel_sheets(raw_data)
sheets
<- function(..., sheet){
read_clean read_excel(..., sheet = sheet) |>
mutate(year = sheet)
<- map(
raw_data
sheets,~read_clean(raw_data,
skip = 10,
sheet = .)
|>
) bind_rows() |>
clean_names()
<- raw_data |>
raw_data rename(
locality = commune,
n_offers = nombre_doffres,
average_price_nominal_euros = prix_moyen_annonce_en_courant,
average_price_m2_nominal_euros = prix_moyen_annonce_au_m2_en_courant,
average_price_m2_nominal_euros = prix_moyen_annonce_au_m2_en_courant
|>
) mutate(locality = str_trim(locality)) |>
select(year, locality, n_offers, starts_with("average"))
}
9 Rewriting our project
In this chapter, we will use what we’ve learned until now to rewrite our project. As a reminder, here are the scripts we wrote together:
- save_data.R: https://is.gd/7PhUjd
- analysis.R: https://is.gd/X7XXJg
The analysis.R file already includes one change: the one from the chapter on collaborating with Github, where Bruno wrote a function to make the plots for each commune.
If you skipped part one of the book, or for any other reason do not have a Github repository with these two files yet, then now is the time to create one. Create a repository and name it housing_lux or anything you’d like, and put these two files there. I will assume that you have these files safely versioned, and will not be telling you systematically when to commit and push. Simply do so as often as you’d like! You should have a repository with a master or main branch containing these two scripts. On your computer, calling git status in Git Bash (on Windows) or in a terminal (for Linux and macOS) should result in this:
owner@localhost ➤ git status
On branch master
nothing to commit, working tree clean
If that’s the case, congrats, we can start working. Start by creating a new branch, and call it rmd:
owner@localhost ➤ git checkout -b rmd
Switched to a new branch 'rmd'
We will now be working on this branch. Simply work as usual, but when pushing, make sure to push to the rmd branch:
owner@localhost ➤ git add .
owner@localhost ➤ git commit -m "some changes"
owner@localhost ➤ git push origin rmd
This will push whatever changes you’ve made to the rmd branch. By using two branches like this, you keep the original .R scripts in the main branch, and will end up with the .Rmd files in the rmd branch.
Before moving forward, now is the right moment to discuss why you would want to convert the scripts into Rmds. There are several reasons. First, as argued in the chapter on literate programming, a document that mixes prose and code is easier to read and share than a script. Next, since an Rmd file can get knitted into any type of document (PDF, Word, etc.), it also makes it easier to arrive at what interests us: the output. A script is simply a means, not an end. The end is (in most cases) a document, so we might as well use literate programming to avoid the cursed loop of changing the script, editing the document, going back to the script, and so on.
But there is yet another benefit: even if the Rmd file is not supposed to get shared with anyone else, we will, later on, use it as our starting point for the Rmd first method of package development as promoted by Sébastien Rochette, the author of {fusen}. This Rmd first method involves making use of a development Rmd file that contains all the usual steps that we would take to create a package. This is in contrast with the usual package development process, in which we would type the required commands one by one in the console. The functions, tests, and documentation that we want to add to the package get defined using Rmd files as well. This makes them much easier to read and also share with a non-technical audience. All these Rmd files can then be converted (or inflated, in {fusen} jargon) to create a fully working package. If this sounds complicated or confusing, don’t worry. Trust the process, push on, and all the pieces of the puzzle will elegantly fit together in a couple of chapters.
In the following sections I will rewrite the scripts using functional and literate programming. If you don’t want to rewrite everything yourself, don’t worry: I link the final Rmd files at the end of each section. But I would advise that you follow along and write everything out, as it will make absorbing the contents much simpler.
9.1 An Rmd for cleaning the data
So, let’s start with the save_data.R script. Since we are going to use functional programming and literate programming, we are going to start from an empty .Rmd file. So open an empty .Rmd file and start with the following lines:
---
title: "Nominal house prices data in Luxembourg - Data cleaning"
author: "Put your name in here"
date: "`r Sys.Date()`"
---
```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(ggplot2)
library(janitor)
library(purrr)
library(readxl)
library(rvest)
library(stringr)
```
## Downloading the data
We start by writing a header to define the title of the document, the name of the author, and the current date using inline R code (you can also hardcode the date as a string if you prefer). We then load packages in a chunk with the options warning=FALSE and message=FALSE, which avoid showing the packages’ startup messages in the knitted document. Then we start a new section called ## Downloading the data, and add a paragraph explaining from where and how we are going to download the data:
This data is downloaded from the Luxembourgish [Open Data
Portal](https://data.public.lu/fr/datasets/prix-annonces-des-logements-par-commune/)
(the data set called *Série rétrospective des prix
annoncés des maisons par commune, de 2010 à 2021*),
and the original data is from the "Observatoire de
l'habitat". This data contains prices for houses sold
since 2010 for each Luxembourgish commune.
The function below uses the permanent URL from the Open Data
Portal to access the data, but I have also rehosted the data,
and I use my link to download it (for archival purposes):
This is much more detailed than using comments in a script, which is one of the benefits of literate programming. Then comes a function to download and read the data. This function simply wraps the lines from our original script that did the downloading and the cleaning. As a reminder, here are the lines from the original script, which I will then rewrite as a function:
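```r
url <- "https://is.gd/1vvBAc"

raw_data <- tempfile(fileext = ".xlsx")

download.file(url, raw_data, method = "auto", mode = "wb")

sheets <- excel_sheets(raw_data)

read_clean <- function(..., sheet){
  read_excel(..., sheet = sheet) |>
    mutate(year = sheet)
}

raw_data <- map(
  sheets,
  ~read_clean(raw_data,
              skip = 10,
              sheet = .)
  ) |>
  bind_rows() |>
  clean_names()

raw_data <- raw_data |>
  rename(
    locality = commune,
    n_offers = nombre_doffres,
    average_price_nominal_euros = prix_moyen_annonce_en_courant,
    average_price_m2_nominal_euros = prix_moyen_annonce_au_m2_en_courant
  ) |>
  mutate(locality = str_trim(locality)) |>
  select(year, locality, n_offers, starts_with("average"))
```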
and here is the same code, but as a function:
```{r}
get_raw_data <- function(url = "https://is.gd/1vvBAc"){

  raw_data <- tempfile(fileext = ".xlsx")

  download.file(url,
                raw_data,
                mode = "wb")

  sheets <- excel_sheets(raw_data)

  read_clean <- function(..., sheet){
    read_excel(..., sheet = sheet) %>%
      mutate(year = sheet)
  }

  # map over the sheets (one per year) and row-bind the results
  raw_data <- map_dfr(
    sheets,
    ~read_clean(raw_data,
                skip = 10,
                sheet = .)) %>%
    clean_names()

  raw_data %>%
    rename(
      locality = commune,
      n_offers = nombre_doffres,
      average_price_nominal_euros = prix_moyen_annonce_en_courant,
      average_price_m2_nominal_euros = prix_moyen_annonce_au_m2_en_courant
    ) %>%
    mutate(locality = str_trim(locality)) %>%
    select(year, locality, n_offers, starts_with("average"))
}
```
As you see, it’s almost exactly the same code. So why use a function? Our function has the advantage that it takes the URL of the data as an argument. This means that we can use it on other datasets (let’s remember that we are focusing here on prices of houses, but there’s another dataset with prices of apartments) or on an updated version of this dataset (which gets updated yearly). We can now more easily re-use this function later on (especially once we’ve turned this Rmd into a package in the next chapter). You can decide to show the source code of the function or hide it with the chunk option include=FALSE or echo=FALSE (the difference between include and echo is that include hides both the source code chunk and the output of that chunk). Showing the source code in the output of your Rmd file can be useful if you want to share it with other developers.
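To illustrate the difference between the two options, here is a sketch (these two chunks are my illustration, not part of the original file):

```{r, echo = FALSE}
# echo = FALSE: this chunk runs and its output appears
# in the knitted document, but the source code is hidden
get_raw_data(url = "https://is.gd/1vvBAc")
```

```{r, include = FALSE}
# include = FALSE: this chunk runs, but neither the source
# nor the output appears in the knitted document
raw_data <- get_raw_data(url = "https://is.gd/1vvBAc")
```

The next part of the Rmd file is simply using the function we just wrote: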
```{r}
raw_data <- get_raw_data(url = "https://is.gd/1vvBAc")
```
We can now continue by explaining what’s wrong with the data and what cleaning steps need to be taken:
We need to clean the data: "Luxembourg" is "Luxembourg-ville" in 2010 and 2011,
then simply "Luxembourg". "Pétange" is also spelled inconsistently, and we also
need to convert columns to the right type. We also directly remove rows where
the locality contains information on the "Source":
```{r}
clean_raw_data <- function(raw_data){
raw_data %>%
mutate(locality = ifelse(grepl("Luxembourg-Ville", locality),
"Luxembourg",
locality),
locality = ifelse(grepl("P.tange", locality),
"Pétange",
locality)
) %>%
filter(!grepl("Source", locality)) %>%
mutate(across(starts_with("average"), as.numeric))
}
```
```{r}
flat_data <- clean_raw_data(raw_data)
```
The text above the chunk explains what we’re doing and why we’re doing it, and the chunk itself defines a function (based on what we already wrote). Here again, the advantage of having this as a function is that it will be easier to run on updated data.
We now continue with establishing a list of communes:
We now need to make sure that we got all the communes/localities
in there. There were mergers in 2011, 2015 and 2018. So we need
to account for these localities.
We now scrape data on former Luxembourgish communes from Wikipedia:
```{r}
get_former_communes <- function(
url = "https://is.gd/lux_former_communes",
min_year = 2009,
table_position = 3
){
read_html(url) %>%
html_table() %>%
pluck(table_position) %>%
clean_names() %>%
filter(year_dissolved > min_year)
}
```
```{r}
former_communes <- get_former_communes()
```
We can scrape current communes:
```{r}
get_current_communes <- function(
url = "https://is.gd/lux_communes",
table_position = 2
){
read_html(url) |>
html_table() |>
pluck(table_position) |>
clean_names() |>
filter(name_2 != "Name") |>
rename(commune = name_2) |>
mutate(commune = str_remove(commune, " .$"))
}
```
```{r}
current_communes <- get_current_communes()
```
This is quite a long chunk, but there is nothing new in here, so I won’t explain it line by line. What’s important is that the code doing the actual work is all being wrapped inside functions. I reiterate: this will make reusing, testing and documenting much easier later on. Using the objects former_communes and current_communes, we can now build the complete list:
Let’s now create a list of all communes:
```{r}
get_test_communes <- function(former_communes, current_communes){
communes <- unique(c(former_communes$name, current_communes$commune))
# we need to rename some communes
# Different spelling of these communes between wikipedia and the data
communes[which(communes == "Clemency")] <- "Clémency"
communes[which(communes == "Redange")] <- "Redange-sur-Attert"
communes[which(communes == "Erpeldange-sur-Sûre")] <- "Erpeldange"
communes[which(communes == "Luxembourg City")] <- "Luxembourg"
communes[which(communes == "Käerjeng")] <- "Kaerjeng"
communes[which(communes == "Petange")] <- "Pétange"
communes
}
```
```{r}
communes <- get_test_communes(former_communes, current_communes)
```
Once again, we write a function for this. We need to merge these two lists, and make sure that the spelling of the communes’ names is unified between this list and the communes’ names in the data.
We now run the actual test:
Let’s test to see if all the communes from our dataset are represented.
```{r}
setdiff(flat_data$locality, communes)
```
If the above code doesn’t show any communes, then this means that we are accounting for every commune.
This test is quite simple, and we will see how to create something a bit more robust and useful later on.
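If you wanted the check to fail loudly instead of relying on visual inspection, a minimal sketch could look like this (my own illustration, not code from the original scripts):

```{r, eval = FALSE}
# stop knitting with an explicit error if any locality is unaccounted for
missing_communes <- setdiff(flat_data$locality, communes)

if(length(missing_communes) > 0){
  stop("Localities missing from the communes list: ",
       paste(missing_communes, collapse = ", "))
}
```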
Now, let’s extract the national average from the data and create a separate dataset with the national level data:
Let’s keep the national average in another dataset:
```{r}
make_country_level_data <- function(flat_data){
country_level <- flat_data %>%
filter(grepl("nationale", locality)) %>%
select(-n_offers)
offers_country <- flat_data %>%
filter(grepl("Total d.offres", locality)) %>%
select(year, n_offers)
full_join(country_level, offers_country) %>%
select(year, locality, n_offers, everything()) %>%
mutate(locality = "Grand-Duchy of Luxembourg")
}
```
```{r}
country_level_data <- make_country_level_data(flat_data)
```
and finally, let’s do the same but for the commune level data:
We can finish cleaning the commune data:
```{r}
make_commune_level_data <- function(flat_data){
flat_data %>%
filter(!grepl("nationale|offres", locality),
!is.na(locality))
}
```
```{r}
commune_level_data <- make_commune_level_data(flat_data)
```
We can finish with a chunk to save the data to disk:
We now save the datasets in a folder for further analysis (keep the chunk option
`eval = FALSE` to avoid running it when knitting):
```{r, eval = FALSE}
write.csv(commune_level_data,
"datasets/house_prices_commune_level_data.csv",
row.names = FALSE)
write.csv(country_level_data,
"datasets/house_prices_country_level_data.csv",
row.names = FALSE)
```
This last chunk is something I like to add to my Rmd files. Instead of showing it in the final document but not evaluating its contents using the chunk option eval = FALSE, like I did, you could use include = FALSE, so that it doesn’t appear in the compiled document at all. The first time you compile this document, you could change the option to eval = TRUE, so that the data gets written to disk, and then change it back to eval = FALSE to avoid overwriting the data on subsequent knittings. This is up to you, and it also depends on who the audience of the knitted output is (do they want to see this chunk at all?).
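If you prefer the include = FALSE route, the chunk could look like this (a variant I am sketching here, not part of the original file):

```{r, include = FALSE}
# runs when knitting, but leaves no trace in the compiled document
write.csv(commune_level_data,
          "datasets/house_prices_commune_level_data.csv",
          row.names = FALSE)

write.csv(country_level_data,
          "datasets/house_prices_country_level_data.csv",
          row.names = FALSE)
```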
Ok, and that’s it. You can take a look at the finalised file here1. You can now remove the save_data.R script, as you have successfully ported the code over to an RMarkdown file. If you have not done it yet, you can commit these changes and push.
Let’s now do the same thing for the analysis script.
9.2 An Rmd for analysing the data
We will follow the same steps as before to convert the analysis script into an analysis RMarkdown file. Instead of showing the whole file here, I will show you two important points.
The first point is removing redundancy. In the original script, we had the following lines:
```r
# Let's compute the Laspeyeres index for each commune:
commune_level_data <- commune_level_data %>%
  group_by(locality) %>%
  mutate(p0 = ifelse(year == "2010",
                     average_price_nominal_euros,
                     NA)) %>%
  fill(p0, .direction = "down") %>%
  mutate(p0_m2 = ifelse(year == "2010",
                        average_price_m2_nominal_euros,
                        NA)) %>%
  fill(p0_m2, .direction = "down") %>%
  ungroup() %>%
  mutate(
    pl = average_price_nominal_euros/p0*100,
    pl_m2 = average_price_m2_nominal_euros/p0_m2*100)

# Let's also compute it for the whole country:
country_level_data <- country_level_data %>%
  mutate(p0 = ifelse(year == "2010",
                     average_price_nominal_euros,
                     NA)) %>%
  fill(p0, .direction = "down") %>%
  mutate(p0_m2 = ifelse(year == "2010",
                        average_price_m2_nominal_euros,
                        NA)) %>%
  fill(p0_m2, .direction = "down") %>%
  mutate(
    pl = average_price_nominal_euros/p0*100,
    pl_m2 = average_price_m2_nominal_euros/p0_m2*100)
```
As you can see, this is almost exactly the same code twice. The only difference between the two snippets is that we need to group by commune when computing the Laspeyeres index for the communes (remember, this index will make comparisons easy). Instead of repeating 99% of the lines, we should create a function that groups the data if it is the commune level data, and doesn’t group it if it’s the national data. Here is this function:
```{r}
get_laspeyeres <- function(dataset, start_year = "2010"){

  # turn the name of the object passed as `dataset` into a string
  which_dataset <- deparse(substitute(dataset))

  # group by locality only for the commune level data
  group_var <- if(grepl("commune", which_dataset)){
    quo(locality)
  } else {
    NULL
  }

  dataset %>%
    group_by(!!group_var) %>%
    mutate(p0 = ifelse(year == start_year,
                       average_price_nominal_euros,
                       NA)) %>%
    fill(p0, .direction = "down") %>%
    mutate(p0_m2 = ifelse(year == start_year,
                          average_price_m2_nominal_euros,
                          NA)) %>%
    fill(p0_m2, .direction = "down") %>%
    ungroup() %>%
    mutate(
      pl = average_price_nominal_euros/p0*100,
      pl_m2 = average_price_m2_nominal_euros/p0_m2*100)
}
```
So, the first step is naming the function. We’ll call it get_laspeyeres(), and it’ll be a function of two arguments. The first is the data (commune or national level data) and the second is the starting year of the data. This second argument has a default value of "2010": this is the year the data starts, and so the year in which the Laspeyeres index will have a value of 100.
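The function then covers both cases with a single call per dataset, presumably something like this (these calls are my illustration of the pattern; the final Rmd is linked at the end of the section):

```{r, eval = FALSE}
# the deparse(substitute()) trick inside get_laspeyeres()
# relies on the names of the objects passed in
commune_level_data <- get_laspeyeres(commune_level_data)
country_level_data <- get_laspeyeres(country_level_data)
```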
The following lines are probably the most complicated:
```r
which_dataset <- deparse(substitute(dataset))

group_var <- if(grepl("commune", which_dataset)){
  quo(locality)
} else {
  NULL
}
```
The first line replaces the variable dataset by the expression that was bound to it (that’s what substitute() does), for example commune_level_data, and then converts this expression into a string (using deparse()). So when the user provides commune_level_data, which_dataset will be defined as equal to "commune_level_data". We then use this string to detect whether the data needs to be grouped or not: if we detect the word "commune" in the which_dataset variable, we set the grouping variable to locality, if not, to NULL. But you might have the following questions: why is locality given as an input to quo(), and what is quo()?
A simple explanation: locality is a column of commune_level_data. If we don’t quote it using quo(), our function will look for a variable called locality in the body of the function; since there is no variable defined that is called locality in there, the function will then look for it in the global environment. But locality is not defined in the global environment either: it is a column in our dataset. So we need a way to tell our function: don’t worry about evaluating this just yet, I’ll tell you when it’s time.
So by using quo(), we can delay evaluation. So how can we tell the function that it’s time to evaluate locality? This is where we need !! (pronounced bang-bang). You’ll see that !! gets used on group_var inside group_by():
group_by(!!group_var)
So if we are calling the function on commune_level_data, then group_var is equal to quo(locality); if not, it’s NULL. !!group_var means that now it’s time to evaluate group_var (or rather, locality). Because !!group_var gets replaced by quo(locality), and because group_by() is a {dplyr} function that knows how to deal with quoted variables, locality gets looked up among the columns of the data frame. If group_var is NULL, nothing happens, so the data doesn’t get grouped.
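To see these mechanics in isolation, here is a small toy example you can run in the console (my own illustration, using the built-in mtcars data rather than the housing data):

```{r, eval = FALSE}
library(dplyr)

summarise_mpg <- function(dataset){
  # capture the *name* the caller used for `dataset` as a string
  which_dataset <- deparse(substitute(dataset))

  # quote the column if the object's name contains "grouped";
  # quo() delays evaluation, so `cyl` is not looked up here
  group_var <- if(grepl("grouped", which_dataset)){
    quo(cyl)
  } else {
    NULL
  }

  dataset %>%
    group_by(!!group_var) %>% # bang-bang: evaluate the quoted column now
    summarise(mean_mpg = mean(mpg))
}

grouped_cars <- mtcars
summarise_mpg(grouped_cars) # one row per value of cyl
summarise_mpg(mtcars)       # a single row, no grouping
```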
This is a big topic unto itself, so if you want to know more you can start by reading the famous {dplyr} vignette called Programming with dplyr here2. If you use {dplyr} a lot, I recommend you study this vignette, because mastering tidy evaluation (the name of this framework) is key to becoming comfortable with programming using {dplyr} (and other tidyverse packages). You can also read the chapter I wrote on this in my other free ebook3.
The next lines of the script that we need to port over to the Rmd are quite standard: we write code to create some plots (which was already refactored into a function in the chapter on collaborating with Github). But remember, we want an Rmd file that can be compiled into a document that can be read by humans. This means that to make the document clear, I suggest that we create one subsection per commune that we plot. Thankfully, we learned all about child documents in the literate programming chapter, and this is what we will use to avoid having to repeat ourselves. The first part is simply the function that we’ve already written:
```{r}
make_plot <- function(commune){
commune_data <- commune_level_data %>%
filter(locality == commune)
data_to_plot <- bind_rows(
country_level_data,
commune_data
)
ggplot(data_to_plot) +
geom_line(aes(y = pl_m2,
x = year,
group = locality,
colour = locality))
}
```
Now comes the interesting part:
```{r, results = "asis"}
res <- lapply(communes, function(x){
knitr::knit_child(text = c(
'\n',
'## Plot for commune: `r x`',
'\n',
'```{r, echo = FALSE}',
'print(make_plot(x))',
'```'
),
envir = environment(),
quiet = TRUE)
})
cat(unlist(res), sep = "\n")
```
I won’t explain this now in great detail, since that was already done in the chapter on literate programming. Before continuing, really make sure that you understand what is going on here. Take a look at the finalised file here4. You’ll notice that at the start of the RMarkdown file, I also load some packages and the data saved by the save_data.Rmd RMarkdown file.
You can see what the outputs look like by browsing to the links below:
- save_data.html, compiled from the save_data.Rmd source5
- analyse_data.html, compiled from the analyse_data.Rmd source6
Of course, you could compile the files into Word documents or PDFs, depending on your needs, and you could write many more details than I did. I wanted to keep it short; the point of this chapter was to show you how to use literate programming, not to write a very detailed analysis.
9.3 Conclusion
This chapter was short, but quite dense, especially when we converted the analysis script to an Rmd, because we had to use two advanced concepts: tidy evaluation and RMarkdown child documents. Tidy evaluation is not a topic that I wanted to discuss in this book, because it doesn’t have anything to do with the main topic at hand. However, part of building a robust, reproducible pipeline is avoiding repetition, and in this sense, programming with {dplyr} and tidy evaluation are quite important. As suggested before, take a look at the linked vignette above, and then at the chapter from my other free ebook. This should help get you started.
The end of this chapter marks an important step: many analyses stop here, and this can be due to a variety of reasons. Maybe there’s no time left to go further, and after all, you’ve got the results you wanted. Maybe this analysis is useful, but you don’t necessarily need it to be reproducible in 5, 10 years, so all you want is to make sure that you can at least rerun it in some months or only a couple of years later (but be careful with this assessment, sometimes an analysis that wasn’t supposed to be reproducible for too long turns out to need to be reproducible for way longer than expected…).
Because I want this book to be a pragmatic guide, I will now talk about putting in the least amount of effort to make your current analysis reproducible: freezing package versions, which I will show you in the next chapter.