Chapter 2 What is R?

R is an open-source programming language. It is run either interactively or by an interpreter from commands saved in text files, or “scripts”. This makes it similar to programming languages like Python or Ruby.

However, R differs from those languages in its primary use case. Though R is becoming increasingly general-purpose, it was originally conceived as an open-source replacement for the statistical programming language ‘S’. As such, R is first and foremost a statistical programming language and, as a result, is widely used in academic data science and statistics programmes around the world.

Growing out from its academic roots, R is now often considered the ‘lingua franca’ of data, and is in widespread use across major industries and governments around the world. R is well-suited to processing and analysing data, and features a graphics system that is capable of producing print quality charts with relative ease.

The ongoing development of the R language is governed by the not-for-profit R Foundation, which was founded by members of the core R development team.

There is also an industry body, called the ‘R Consortium’, which works with the R Foundation, predominantly by sponsoring infrastructure projects.

2.1 Target environments

R is available pre-compiled for Windows, MacOS and some Linux distributions. In addition, as an open-source language, the source code is available for inspection or compilation for other platforms.

2.2 Interactive Use

One of the most common ways for people to use R is interactively. R users will often conduct what is referred to as “Exploratory Data Analysis” (EDA) in this manner. During EDA, one or more datasets are loaded into R and the user literally “explores” the data; cutting it up or combining it in different ways, running analyses and experiments, and producing charts to get a feel for the types of insights they derive from it.

Often, the results of such interactive use will make their way into a report or presentation or, in some other way, contribute to informed decision making within the user’s organisation.

2.3 Non-Interactive Use

Sometimes called “unattended” or “batch” use, this differs from the exploratory use described in the previous section. Batch use is the opposite; R commands written into a script file can be run against a dataset to produce a defined output, such as a chart or another dataset which contains the results of the analysis. Such scripts are designed to be repeatable and will often have the additional rigour and robustness written into them that one would expect from any other programming language, this makes this mode of operation a good fit for ‘production’ use cases.

The batch way of using R usually involves the invoking of a pre-written script either manually or automatically, perhaps triggered at a particular time or by a specific event. It is also generally the mode of operation that is used when writing code to run on High Performance Compute (HPC) clusters for complex simulations.

2.4 Input Data Sources

All manner of input sources are supported within R - the most common being flat files of some sort, like plain text, JSON, Excel or CSV files, or databases such as SQL Server, Oracle, or MySQL. It is also possible to connect R to Spark clusters or use it to drive Hadoop jobs. If that weren’t enough, additional packages will let you work perform web scraping, or work directly with web APIs and pull data in that way.

2.5 R Inside Databases

Given the huge surge in interest in R over the last decade, it is perhaps not surprising that several database vendors have sought to embed the analytical power of R within the database itself. This usually works in a similar way to a database stored procedure. It is a piece of code, stored within the database, that when run produces an output, which is itself generally another dataset.

There are several benefits to running R inside the database, but perhaps the most important is that the entire analysis, including the data storage, is done in situ and as the potential bottleneck of moving the data across the network to be processed is removed.

Vendors with versions of R that run inside the database include, Oracle (with Oracle R Enterprise, or ORE), Microsoft (with ‘Machine Learning Services for SQL Server’) and Teradata (with Aster R).

2.6 Commercial Versions of R

The open-source version of R – licensed under the GPL – is heavily used in commercial and government environments. Companies like Google, Walmart and Amazon all use R to generate insights that can help drive business. Government departments around the world use R to provide valuable insight into some of the most pressing problems in today’s society. Despite the success of open-source R, there are several commercial distributions available, each targeting a different need.

Microsoft has “Microsoft ML Server”, which incorporates their own R distribution. This version of R is compiled using the commercial Intel “Math Kernel Libraries”, which claim improved performance when compared to the open-source maths libraries generally used. It also ships with additional proprietary extension libraries and is well integrated in the Microsoft ecosystem. This distribution aims to provide faster results and work better with larger datasets.

If you work in a heavily regulated industry, Mango Solutions offers an extensively validated build of R, called ValidR. This is distribution of open-source R that has had thorough and rigorous additional testing applied to it.

TIBCO, makers of the Spotfire analytics product, has a product called TERR (TIBCO Enterprise Runtime for R) which is an alternative, non-GPL implementation of the R interpreter, designed with improved speed in mind. It is embedded in several TIBCO products, including Spotfire.

These – along with the in-database versions of R mentioned above – are the commercial versions you’ll most likely encounter in the field, though none is in as widespread use as the official open-source version.