2 Before we start
This is not an introductory book, so before tackling it, make sure that you are familiar with the topics presented below. If you read this chapter and everything is obvious or known to you, then you should have no trouble following along. If instead what you read here is cryptic, then take some time to improve your understanding of these topics first.
2.1 Essential knowledge
It’s important to know the parts that constitute R. Let’s make something clear: R is not RStudio, or whatever interface you are using to interact with R. R is a domain-specific interpreted programming language. It is domain-specific because its primary use is statistics. It is interpreted because results are returned immediately when you execute code in the console. In other words, when you write 1+1 in the console, you get back 2 immediately. There are programming languages, called compiled programming languages, that require code to be compiled into binaries before execution. C is such a language. The fact that R is interpreted makes interactive exploratory data analysis easy, but it also has drawbacks, which I will discuss in detail later in the book. R’s console is an example of a REPL (Read-Eval-Print Loop) environment: code gets read and evaluated, the result gets printed, and the console returns to the read state, starting the loop over.
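The read, evaluate, print cycle can be made visible by performing each step by hand. What follows is only a sketch using base R's parse(), eval() and print(); the variable names are illustrative, and the real console does more than this (error handling, deciding when to print) behind the scenes.

```r
# Read: turn the text typed by the user into an unevaluated R expression
code_typed <- "1 + 1"
expression_read <- parse(text = code_typed)[[1]]

# Eval: compute the value of that expression
result <- eval(expression_read)

# Print: show the value, as the console would
print(result)
```

After printing, the console simply waits for the next line of input, which is the "loop" part.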
To make working with R easier, you should not write code directly in the console and execute it, but instead write it in a text file. You can keep these text files, update them, and share them with collaborators. Such text files are called scripts. You could write these scripts using the most basic text editor included in your operating system (that would be Notepad.exe on Windows, for example), but you should instead use a text editor made specifically to make programming easier. Popular choices among R users include RStudio, Visual Studio Code, or maybe something more exotic like Emacs combined with ESS (my personal choice). Whatever text editor you choose, take time to configure it and learn how to use it. You will spend many, many, many hours inside that text editor. The code you write in that text editor is what’s going to feed you and your family. Learn your chosen text editor’s keyboard shortcuts and other advanced features. This initial investment will pay for itself many times over. Also, you need to know what an actual text file is. A document written in Word (with the .docx extension) is not a text file. It looks like text, but is not. The .docx format is a much more complex format with many layers of abstraction. “True” plain text files can be opened with the simplest text editor included in your operating system. I’ve had students try to create text files with word processors like MS Word and then be confused when things did not work.
As stated before, R is a domain-specific programming language mainly used for doing statistics, or whatever modernized term you may prefer, like “data science”. Its base capabilities can be extended by installing packages. For example, a base installation of R provides you with useful functions like mean() or sd(), to compute the average or standard deviation of a vector of numbers, or rnorm() to draw random variates from a Gaussian (Normal) distribution. However, a base installation has no function to train a random forest. If you need to train a random forest, you need to install a package using the install.packages("randomForest") command. This installs the {randomForest} package (in the rest of the book, I will surround package names with curly braces). The collection of packages installed is called a “library”. Packages get downloaded from CRAN, the Comprehensive R Archive Network. There is no doubt in my mind that the reason R became so popular is that it is quite easy to write packages for it; and this is something that we will learn as well! Some packages are written with other programming languages, very often Fortran or C++. The code included in these packages is then compiled and can be executed by R using a user-facing function. For example, if you dig into the source code of the {randomForest} package, you will find C and Fortran code. This is important to know, because sometimes R packages need to be compiled by install.packages(), and this compilation can sometimes fail (especially on Linux, but more on that later in the book).
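As a small illustration of the difference between what a base installation provides and what needs a package, the sketch below uses only base functions, then checks whether {randomForest} is present in the library without installing anything. The seed, the sample size and the variable names are arbitrary choices made for this example.

```r
set.seed(42)  # fix the random number generator so the draw is repeatable

# Draw 1000 variates from a Normal distribution with mean 10 and sd 2
draws <- rnorm(1000, mean = 10, sd = 2)

mean(draws)  # sample average, close to 10
sd(draws)    # sample standard deviation, close to 2

# Check whether {randomForest} is in the library, without installing it
has_rf <- requireNamespace("randomForest", quietly = TRUE)
if (!has_rf) {
  message("{randomForest} is not installed; ",
          "install.packages(\"randomForest\") would fetch it from CRAN")
}
```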
When you use R, you will load data sets, create plots, train models, etc. These data sets, plots and models are all objects, and they get saved in the global environment. To see a list of objects currently available in the global environment, type ls() in the R console. When you quit R, you get asked to save the workspace: this will save the current state of the global environment and load it next time you start R. I highly recommend that you do not save the workspace. If you are using RStudio, you can change this behaviour in the global options (under Workspace, set Save workspace to .RData on exit to Never). Other editors might have a similar option. Saving and loading the workspace makes it impossible to start with a fresh R session (unless you start R with the --vanilla flag), which can cause issues that are difficult to pinpoint.
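To see how objects accumulate, here is a minimal sketch; the object names are made up for the example. When run at the top level of a script or in the console, ls() lists the global environment.

```r
# Create two objects
my_numbers <- c(1, 2, 3)
my_data <- data.frame(x = my_numbers, y = my_numbers^2)

ls()  # the names of the objects defined so far

# Remove one object and check that it is gone
rm(my_data)
"my_data" %in% ls()
```

Anything left over from a previous session would show up in that same listing, which is why a restored workspace can make a script appear to work when it actually depends on objects it never creates.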
You should also be comfortable with paths and your computer’s file system. Comfortable means having no problems finding where a file gets downloaded, for example, or being able to navigate to any folder, either through a GUI file browser or through a terminal (if you’re familiar with navigating your computer using the terminal, you will have an easier time with this book). I also highly recommend that you use relative paths in your scripts, not absolute paths. In other words, don’t start your scripts with a line such as:
setwd("H:/Username/Projects/housing_regression/")
but instead, use “Projects” if you’re using RStudio, or similar features from your preferred IDE. This way, you can use relative paths, which makes collaboration much easier. Using “Projects” in RStudio, if you need to load data, you can simply write:
dataset <- read.csv("data.csv")
and you don’t need to set the working directory using setwd() with an absolute path, which obviously will not exist on your collaborators’ computers.
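To make the relative-path idea concrete, the sketch below simulates a project folder in a temporary directory. The folder name, the column names and the values are all made up for the example, and setwd() appears only to imitate what opening a project does for you (start the session in the project folder); it is not something to put in your own scripts.

```r
# Stand-in for a project folder (hypothetical; in real use, your project root)
project_dir <- file.path(tempdir(), "housing_regression")
dir.create(project_dir, showWarnings = FALSE)

# Put a small made-up data set inside the project
write.csv(data.frame(rooms = c(2, 3, 4), price = c(150, 210, 260)),
          file.path(project_dir, "data.csv"), row.names = FALSE)

# Simulate opening the project: the session starts in the project folder.
# setwd() returns the previous working directory, which is restored below.
old_wd <- setwd(project_dir)
dataset <- read.csv("data.csv")  # relative path, resolved against the project folder
setwd(old_wd)

dataset
```

The read.csv() line is the only one a collaborator needs, and it works unchanged wherever the project folder lives on their machine.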
There is also the {here} package that makes using relative paths easier, but I won’t discuss it in this book. If you’re interested, you can read this post[1].
You should be familiar with writing functions. This book has a whole chapter on functional programming, and I will teach you how to write functions, but if you’re already familiar with this, then it will make going through that chapter easier.
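As a quick self-check, the following sketch shows the level of familiarity assumed here. The function standardize() and the vector scores are hypothetical, written only for this illustration.

```r
# Rescale a numeric vector to mean 0 and standard deviation 1
standardize <- function(x, na.rm = FALSE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

scores <- c(12, 15, 9, 20, 14)
standardize(scores)

# Functions are ordinary objects, so they can be passed to other functions
sapply(list(a = 1:5, b = c(10, 20, 30)), function(v) max(standardize(v)))
```

If defining a function with a default argument and passing an anonymous function to sapply() both look familiar, you are well prepared for the functional programming chapter.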
Finally, you should know how to ask for help. If you need help with this book, feel free to open an issue on the book’s GitHub repo here[2], or open a thread on the book’s Leanpub forum (if you bought a copy) over here[3]. The same goes for R packages: if you have an issue with one, look for its repository, as the source code of many packages (but not all) is hosted on GitHub. You can also try to reach out to the author, or open a thread on Stack Overflow. Whatever you do, make sure that you do your homework first:
- Read the documentation. Maybe you’re using the tool wrong.
- Take note of the error message. Error messages can be cryptic sometimes, but as you gain experience, you will learn to decipher them.
- Write down the simplest script possible that reproduces the issue you’re facing. This is called an MRE, “Minimal Reproducible Example”. If you need to open a thread asking for help, post this MRE: it will make helping you much easier. For general advice on how to write an MRE, you can read this classic blog post[4].
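To give an idea of what an MRE can look like, here is a sketch built around a made-up problem (a column that looks numeric but is stored as text). The data and names are invented for the illustration; the point is the shape: inline data, the one call that misbehaves, and version information.

```r
# Minimal reproducible example: mean() fails on a column that looks numeric

# 1. A tiny, made-up data set created inline, so no external file is needed
df <- data.frame(id = 1:3, value = c("1.5", "2.5", "3.5"))

# 2. The single call that misbehaves. tryCatch() is used here only to capture
#    the warning text so it can be pasted verbatim into the help request.
problem <- tryCatch(mean(df$value), warning = function(w) conditionMessage(w))
print(problem)

# 3. Version information that helpers usually ask for
R.version.string
```

Anyone can paste this into a fresh R session and see the same message, which is exactly what makes it easy to help.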
Finally, keep in mind the following saying from my father, a mason (the ones that lay bricks, not the ones meeting in secrecy to govern the world):
The tools are always right. If you’re using a tool and it’s not behaving as expected, it is much more likely that your expectations are wrong. Take this opportunity to review your knowledge of the tool.