I am a programmer and architect (the kind that writes code) with a focus on testing and open source; I maintain the PHPUnit_Selenium project. I believe programming is one of the hardest and most beautiful jobs in the world. Giorgio is a DZone MVB and is not an employee of DZone and has posted 636 posts at DZone.

An Introduction to the R Language

02.02.2012
R is a language for statistical computing. In a world of big data and scientific approaches to startup ideas, it pays to have in your toolbox a tool for statistical analysis and mathematical computation that is more powerful in that domain than a general-purpose language.

Yet another language?

R is oriented toward mathematics rather than general-purpose computation, and has many similarities with Matlab and Octave; for example, it is accessible to people without a computer science background. Moreover, it is open source.

This support for mathematical computation is reflected in the many libraries included out of the box for working with distributions, estimation, and inference tests. R also has a simplified syntax for mathematical expressions (e.g. the ~ operator for specifying a regression formula).
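
As a rough illustration of those bundled statistical tools, the sketch below uses only functions available in a default R session: rnorm() draws samples, pnorm() and qnorm() evaluate the normal distribution, and t.test() runs a one-sample t test. The seed, the sample size, and the variable names are arbitrary choices made for this example.

```r
set.seed(1)                               # arbitrary seed, only for reproducibility
samples <- rnorm(100, mean = 5, sd = 2)   # 100 draws from a normal with mean 5, sd 2

p_below <- pnorm(1.96)                    # P(Z <= 1.96) for a standard normal
q_975 <- qnorm(0.975)                     # quantile q such that P(Z <= q) = 0.975

# One-sample t test of the hypothesis that the true mean is 5
test <- t.test(samples, mu = 5)
test$p.value                              # p-value of the test
test$conf.int                             # 95% confidence interval for the mean
```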

R is also strong on general programming concepts, unlike Matlab: it offers classes and objects (a la Clojure), anonymous functions and closures, and named parameters. It is a whole environment rather than just a language interpreter: built-in functions provide graphing capabilities (farewell, gnuplot), and the shell comes with completion and history.
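
A minimal sketch of closures, named parameters, and anonymous functions follows. The names make_counter, counter, and step are hypothetical and chosen only for this example.

```r
# A function that returns a closure: the inner function keeps access to
# `count` after make_counter() has returned.
make_counter <- function() {
  count <- 0
  function(step = 1) {
    count <<- count + step   # <<- assigns in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()                    # count becomes 1
counter(step = 5)            # named parameter; count becomes 6

# Anonymous function passed directly to sapply()
squares <- sapply(1:4, function(x) x^2)
squares
```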

The practical side of R is taken care of by bundled utilities for reading data from files and databases (MySQL, Oracle, JDBC, ODBC), and for saving results (e.g. the whole current workspace or single variables).
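
For files, a round trip might look like the sketch below. The data frame and its column names are made up, and temporary files are used so that the example is self-contained; database access is not shown.

```r
# Hypothetical sample data, written to a temporary CSV file
measurements <- data.frame(id = 1:3, value = c(2.5, 3.7, 1.2))
csv_path <- tempfile(fileext = ".csv")
write.csv(measurements, csv_path, row.names = FALSE)

# Read the file back into a data frame
loaded <- read.csv(csv_path)

# Save a single variable, remove it, then restore it under its original name
rdata_path <- tempfile(fileext = ".RData")
save(measurements, file = rdata_path)
rm(measurements)
load(rdata_path)             # re-creates `measurements`

# save.image(file = ...) would save the whole workspace instead
```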

Like Matlab and Octave, R can be used for quick prototyping; after an algorithm is implemented and validated, it can be translated into C or other languages. The reasons for the translation include better performance and the portability of the code across different machines, operating systems, or programmers.

Installation

The installation process depends on your system, but four Linux distributions are supported via repositories (with RPM and Deb packages).

People from other domains (like statistics) install R every day, so it's not like compiling the kernel or hunting for missing libraries. Everything is already compiled and automated, which is a plus compared with other niche languages that make you download different groups of JARs on the assumption that a programmer can handle it.

R basic syntax and data structures

Before beginning, you must know that R's assignment operator is <- and not =. The = operator will work in many cases, but <- is more general as it can be used anywhere; it is the real equivalent of the assignment operator you have used in C-like languages.

> if (a = 4) 1
Error: unexpected '=' in "if (a ="
> if (a <- 4) 1
[1] 1

Moreover, as you can already see from the code above, there is no need for semicolons.

Numbers in R are either numeric (which actually means double-precision floating point) or integer.

> answer <- 42
> class(answer)
[1] "numeric"
> answers <- c(42, 43)
> class(answers)
[1] "numeric"
> answers <- 42:43
> class(answers)
[1] "integer"
> class(as.integer(42))
[1] "integer"
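
Besides as.integer(), integer values can be written directly with the L suffix. The short sketch below shows how the two numeric types interact; the variable names are only illustrative.

```r
int_literal <- 42L           # the L suffix creates an integer literal
class(int_literal)           # "integer"
typeof(42)                   # "double": the storage type behind "numeric"

int_sum <- 42L + 1L          # integer arithmetic stays integer
mixed_sum <- 42L + 1         # mixing in a double promotes the result to numeric
ratio <- 5L / 2L             # / always returns a double (2.5)
quotient <- 5L %/% 2L        # %/% is integer division (2L)
```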

Booleans are represented by the constants TRUE and FALSE:

> if (TRUE) 42
[1] 42

Strings are also a first-class type, with easier handling than with C libraries:

> message <- "hello"
> message
[1] "hello"

Basically, every basic R value is a vector, much as in Matlab/Octave; even scalars are just vectors of length 1. Vectors are created with the concatenation function c():

> my_vector <- c(1, 2, 3)
> my_vector
[1] 1 2 3
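
Because everything is a vector, arithmetic and indexing work on whole vectors at once. A brief sketch, with illustrative variable names:

```r
my_vector <- c(1, 2, 3)

doubled <- my_vector * 2                # arithmetic applies element by element
summed <- my_vector + c(10, 20, 30)     # element-wise sum of two vectors
first_two <- my_vector[1:2]             # indexing starts at 1, not 0
big_ones <- my_vector[my_vector > 1]    # logical indexing keeps matching elements
length(42)                              # a scalar is a vector of length 1
```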

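Lists, however, can store values of any type, while vectors must be homogeneous; c() coerces mixed values to a common type, which is why 42 becomes the string "42" in the second call below: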

> list(42, "a")
[[1]]
[1] 42

[[2]]
[1] "a"

> c(42, "a")
[1] "42" "a"

Moreover, both lists and vectors can act as maps, since their elements can be given string names and looked up by name.
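
A small sketch of this map-like usage; the names and values (ages, config, and their contents) are invented for the example.

```r
ages <- c(alice = 30, bob = 25)    # a named numeric vector
ages["alice"]                      # lookup by name; the result keeps its name
ages[["bob"]]                      # [[ ]] returns the bare value
ages["carol"] <- 41                # assigning to a new name appends an element
names(ages)

config <- list(host = "localhost", port = 8080)   # a named list with mixed types
config$port                        # $ is a shorthand for list lookup by name
config[["host"]]
```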

Matrices and data frames are more complex structures. Matrices are the extension of vectors to 2 dimensions, while data frames are similar tabular structures whose columns can hold values of different types. The difference between them is analogous to the one between vectors and lists.
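
The sketch below builds one of each; the contents of the matrix and of the people data frame are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)     # filled column by column: 2 rows, 3 columns
dim(m)                         # number of rows and columns
m[2, 3]                        # element at row 2, column 3
product <- t(m) %*% m          # matrix product of the transpose with m

# Hypothetical data: each column has its own type
people <- data.frame(
  name = c("Ann", "Bob"),
  age = c(31, 27),
  stringsAsFactors = FALSE
)
people$age                               # columns are accessed like list elements
over_30 <- people[people$age > 30, ]     # rows filtered with a logical condition
```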

There are many more types and useful functions bundled with the interpreter. If you come across an unknown call, type ?entity at the prompt to open the corresponding help page; entity can be a function or a type name.

A quick example: linear regression

R can seamlessly perform linear regression, a staple problem in statistics and machine learning. The operation consists of finding the parameters of a linear combination that best fits several samples of input and output variables. In our case, we want to find the parameters q and m in the model correlated_data = q + m * data.

> data <- c(1, 2, 3, 4)
> data
[1] 1 2 3 4
> mean(data)
[1] 2.5
> var(data)
[1] 1.666667
> sd(data)
[1] 1.290994
> correlated_data <- c(2, 4, 7, 7.5)
> fm<-lm(correlated_data ~ data)
> fm

Call:
lm(formula = correlated_data ~ data)

Coefficients:
(Intercept)         data  
       0.25         1.95  

> attributes(fm)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

$class
[1] "lm"
> fm$coefficients
(Intercept)        data
       0.25        1.95
> fm$coefficients['data']
data
1.95 
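
To connect the fitted coefficients back to the model correlated_data = q + m * data, the following sketch recomputes the fit and uses it for a prediction, both by hand and with predict(). The new input value 5 and the variable names q, m, manual, and predicted are choices made for this example.

```r
data <- c(1, 2, 3, 4)
correlated_data <- c(2, 4, 7, 7.5)
fm <- lm(correlated_data ~ data)

q <- coef(fm)[["(Intercept)"]]   # intercept of the fitted line
m <- coef(fm)[["data"]]          # slope of the fitted line

# Prediction for a new input, by hand and with predict()
manual <- q + m * 5
predicted <- predict(fm, newdata = data.frame(data = 5))

residuals(fm)                    # observed minus fitted values
summary(fm)$r.squared            # fraction of variance explained by the fit
```
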
Published at DZone with permission of Giorgio Sironi, author and DZone MVB.