Learning R for PL/SQL Developers—Part 3
 
by Arup Nanda

Welcome to the third installment of this series. In the previous two installments, you learned how to use variables of different types and various kinds of loops and decision logic.

In this installment, you will lean how to define various types of collections of data, not just atomic variables. Being a data-oriented language, collections are vital in R because most operations are done on collections, not on individual pieces of data.

At the end of this article, there is a summary of what you learned followed by a quiz you can use to test what you've learned.

Introduction

There are five types of collections in R. In a nutshell, they are different representations of collections of data in R.

  • Vector: A collection of only one dimension composed of elements of one data type only
  • Factor: Discrete data points that can be used as identifiers only
  • Matrix: A collection of two dimensions of elements of the same data type
  • List: A collection of one dimension (the same as is the case with a vector), but the elements can be of multiple data types
  • Data frame: A two-dimensional collection of elements of multiple data types
  • Array: A multidimensional collection of elements of one data type only

We will examine each of them and how they are used in different situations. As with the previous installments in this series, you will learn these compared to the equivalent in PL/SQL.

But first, let's explore the options available for collections in PL/SQL before exploring the same in R.

There are three basic kinds of collections in PL/SQL:

Nested Tables in PL/SQL

These are just a list of items of the same data type. The list is ordered and the elements can be addressed by their position in the array. There is no "index" to reference them by; the only index is their position. Therefore, this type of data can't be used for key-value pairs.

Let's see an example where we will store the days of the week to a variable.

-- pl1.sql
declare
   type ty_tabtype is table of varchar2(30);
   l_days_of_the_week ty_tabtype;
begin
   l_days_of_the_week := ty_tabtype (
          'Sun',
          'Mon',
          'Tue',
          'Wed',
          'Thu',
          'Fri',
          'Sat'
   );
   -- let's print the values
   for d in l_days_of_the_week.FIRST .. l_days_of_the_week.LAST loop
          dbms_output.put_line ('Day '||d||' = '||
                  l_days_of_the_week (d)
           );
   end loop;
end;

Here's the output:

Day 1 = Sun
Day 2 = Mon
Day 3 = Tue
Day 4 = Wed
Day 5 = Thu
Day 6 = Fri
Day 7 = Sat

Note that the elements are all of the same data type, which, in this case is varchar2(30). You can't mix any other data type in. But you can keep adding values, as you need, to this list.

Varrays in PL/SQL

The second type of collections in PL/SQL is called varray. It's similar to nested tables, except that the maximum number of elements is fixed. The actual number of elements vary and depend on the elements added or subtracted at runtime. Here is an example:

-- p2.sql
declare
   type ty_weekdays_list is varray(7) of varchar2(3);
   v_week_days ty_weekdays_list;
   v_count number;
begin
   v_week_days := ty_weekdays_list('Sun','Mon','Tue','Wed','Thu','Fri','Sat');
   v_count := v_week_days.count;
   for i in 1..v_count loop
          dbms_output.put_line(i||'='||v_week_days(i));
   end loop;
end;

Note: The values are also of the same data type.

Associative Arrays in PL/SQL

The third type of collection is an associative array, which is also called a PL/SQL table. An associative array is an arbitrary collection of keys and values. The important properties of associative arrays are

  • They are empty (but not null) until you populate them.
  • They can hold an any number of elements. You don't need to specify the number while creating the array.
  • You can access the elements of the array without knowing their positions.

You have to reference the elements by an index, not by position. Therefore, associative arrays are also called unordered lists.

To demonstrate a use case, let's see an example where we need to hold some book titles and their authors' names using a key-value pair structure. We build an associative array of varchar2(30) values (the "value" part of key-value pair) indexed by another varchar2 (the "key" part of the key-value pair). The book title is the key and the author's name is the value. We will initially populate this array with four books, as shown below:

Book Author
Pride and Prejudice Jane Austen
1984 George Orwell
Anna Karenina Leo Tolstoy
Adventures of Tom Sawyer Mark Twain

In the programs below, we will populate the array and then select from the array.

--pl3.sql
declare
   type ty_tabtype is table of varchar2(30)
          index by varchar2(30);
  l_books        ty_tabtype;
  i              varchar2(30);
begin
   l_books ('Pride and Prejudice') := 'Jane Austen';
   l_books ('1984') := 'George Orwell';
   l_books ('Anna Karenina') := 'Leo Tolstoy';
   l_books ('Adventures of Tom Sawyer') := 'Mark Twain';
   --
   -- now let's display the books and their authors
   --
   i := l_books.first;
   while i is not null loop
     dbms_output.put_line('Author of '||i||' = '||
                  l_books (i));
      i := l_books.next(i);
   end loop;
end;

Here is the output:

Author of 1984 = George Orwell
Author of Adventures of Tom Sawyer = Mark Twain
Author of Anna Karenina = Leo Tolstoy
Author of Pride and Prejudice = Jane Austen

Now Let's see how these PL/SQL collections translate to R.

Vectors in R

Features of many R collections overlap those provided by PL/SQL collections; so I am going to start with an explanation of R collections and show their equivalence to PL/SQL collections.

Take the first one: nested table in PL/SQL. In the PL/SQL example, we used the days of the week. This is a one-dimensional array. In R, this is called a vector. Actually, a vector is a lot more, but we will start with the basic explanation.

Here is how we define the same days of the week in R.

Note the use the function c(), which creates a vector.

> dow <- c('Sun','Mon','Tue','Wed','Thu','Fri','Sat')
> dow
[1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"

All the elements of the vector must be of the same data type. If you give multiple data types, R will convert all of them to the superset data type that can accommodate all the values. You can access the elements of a vector by their position.

> dow[1]
[1] "Sun"

Notice that the first position is indexed as "1," just as in PL/SQL; it is not indexed as "0," which is the case in other programming languages such as Python or C. You don't have to remember how many elements are present in the vector. You can use the length() function.

> length(dow)
[1] 7

Note the c function; that's what defines a vector. Let's explore it with a simpler example of a vector of 10 numbers from 1 to 10.

> v1 <- c(1,2,3,4,5,6,7,9,10)

You can also define that vector using the range notation you learned in Part 2 of this series.

> v1 <- 1:10

This is similar to the FOR i IN 1..20 LOOP construct in PL/SQL. If you check the data type of the variable v1, you will notice that it is labeled "numeric":

> class(v1)
[1] "numeric"

That's because the elements of the vector are numeric. So, how do you know if a variable is a vector? Use a built-in function:

> is.vector(dow)
[1] TRUE

Vectors allow operations as a whole. Here is how you multiply all the elements of the vector by 2 at once:

> v2 <- v1 * 2
> v2

[1]  2  4  6  8 10 12 14 18 20

And here is how you subtract one vector from another:

> v2 - v1
[1]  1  2  3  4  5  6  7  9 10

You can check if a value is present in a vector, that is, whether it is one of the elements:

> v1 <- c(1,2,3,4,5,6,7,9,10)
> 4 %in% v1
[1] TRUE

This returned TRUE because 4 is one of the elements of the vector variable v1. Likewise, you can check whether the elements of a vector can be found on another vector.

Let's define another vector called v2:

> v2 <- c(1,2)
> v2 %in% v1
[1] TRUE TRUE

It returned TRUE for both values because both 1 and 2 (elements of vector v2) are found in vector v1. By the way, the positions of the vectors are not important.

Let's see another vector, v3, with two elements (4 and 5) in the first and second elements. The following checks for the presence of (4,5) in vector v1.

> v3 <- c(4,5)
> v2 %in% v1
[1] TRUE TRUE

Let's extend the example by using another vector, (11,5), and checking if the values are present in v1.

> v3 <- c(11,5)
> v3 %in% v1
[1] FALSE  TRUE

The first element of v3, 11, is not present in v1; so the return value for that is FALSE. The second element, 5, was found in v1; so the return value is TRUE.

Vectors have elements of the same data type, not of different data types. What if you combine multiple data types in a vector? Let's see:

> v4 <- c(1,2,'Three')
> v4
[1] "1"     "2"     "Three"

Notice how R converted everything to character format, even if you input only numeric values for the first two elements. If you check the data types, as follows:

> class(v4)
[1] "character"

They will show as character. If you combine logical values, they will be converted to numbers.

> v5 <- c(1,2,TRUE,FALSE)
> v5
[1] 1 2 1 0

Note how TRUE and FALSE have been converted to 1 and 0, respectively. If you introduce a character, all the items will be converted to characters.

> v5 <- c(1,2,'Three',TRUE,FALSE)
> v5
[1] "1"     "2"     "Three" "TRUE"  "FALSE"

Note how the logical values TRUE and FALSE were merely converted to their character representations—not to 1 and 0, as was done earlier.

There are special operations on the elements of vector where all the elements are of logical values. Recall from Part 1 that "|" is the logical OR operator and "&" is the logical AND operator.

> v1 <- c(TRUE,TRUE,FALSE,FALSE)
> v2 <- c(TRUE,FALSE,TRUE,FALSE)
> v1
[1]  TRUE  TRUE FALSE FALSE
> v2
[1]  TRUE FALSE  TRUE FALSE
> v1 & v2
[1]  TRUE FALSE FALSE FALSE
> v1 | v2
[1]  TRUE  TRUE  TRUE FALSE

There are two more operators, "&&" and "||," which are similar to "&" and "|," respectively; but they operate on only the first element of the vector. Therefore, they return only one value:

> v1 && v2
[1] TRUE

This returns TRUE because the first elements of both vectors v1 and v2 are TRUE, and TRUE & TRUE equals TRUE.

> v1 = c(1,2,3)
> v2 = c(10,20,30)
> v1 * v2
[1] 10 40 90
> v2 - v1
[1] 9 18 27
> c(v1,v2)
[1] 1 2 3 10 20 30

You can't operate on vectors of different lengths together.

> v3 = c(1,2)
> v3 * v1
[1] 1 4 3
Warning message:
In v3 * v1 :
longer object length is not a multiple of shorter object length
Accessing Values in R Vectors

You access the elements of the vector using square brackets.

> v1 [1]
[1] 1

The negative index has a different connotation in R than what you might be used to. It means excluding the values in that place. For instance, -1 means remove the first element and return the rest. It does not mean starting from the end, as seen in some languages such as Python.

> v1 [-1]
[1] 2 3

If an index is not valid, R returns NA, which is the equivalent of NULL in PL/SQL.

> v1 [9]
[1] NA

If you want to access a contiguous set of values from a vector, use the m:n notation:

> v2 [2:3]
[1] 20 30

What if you want to access a discrete set of values—not a contiguous set of values—from a vector, for example, the first and third values? You would think you'd use something like this, wouldn't you?

> v1[1,3]

Unfortunately, that will throw an error:

Error in v1[1, 3] : incorrect number of dimensions

What happened? The m,n notation is used for something else you will learn later. To access the discrete elements, you will need to pass the indexes as a vector:

> v2[c(1,3)]
[1] 10 30

Interestingly, indexes can be logical boolean values as well.

> v2
[1] 10 20 30
> v3 = c(T,F,T)
> v2[v3]
[1] 10 30

In this case, the first element of v3 is T, that is, TRUE; so it gets the first element of v2. The second element of v3 is F, or FALSE; so the second element of v2 (which is 20) is not displayed. The third element of v3 is T; so the third element (30) is displayed.

Features of R Vectors That Are Similar to PL/SQL Associative Arrays

Remember associative arrays in PL/SQL, which allow you to assign names to variables, such as a key-value pair? The example we used was book titles and author names. Life is so much simpler if you can look up the element by its key instead of trying to figure out its position and access it via the positional index, doesn't it?

Vectors can do that as well. You do that by a labeling the elements, or, in R terms, naming the elements. Let's see an example of sales figures for a company in different quarters.

> sales <- c(100,200,150,75)

This doesn't tell us the whole story. Does it start with Quarter 1? Does 100 correspond to Quarter 1 or Quarter 4? We can make this clear by naming the elements:

> quarters <- c('Quarter 1','Quarter 2','Quarter 3','Quarter 4')
> names(sales) <- quarters

Now, if we display the sales variable, we get the following:

> sales
Quarter 1 Quarter 2 Quarter 3 Quarter 4
      100       200       150        75

Note how the values have labels now, making the meaning clearer. Another way to name the elements is to assign the names while creating the vector itself.

> sales <- c('Quarter 1'=100,'Quarter 2'=200,'Quarter 3'=150,'Quarter 4'=75)
> sales
Quarter 1 Quarter 2 Quarter 3 Quarter 4
      100       200       150        75

When elements are labeled, you can use the labels to access the elements, not the positional index, exactly how a key-value pair would be accessed:

> sales['Quarter 3']
Quarter 3
      150

But when you write a program, you just need the value; not the label itself, which can be source of error. To suppress it, you use the double square brackets syntax.

> sales[['Quarter 3']]
[1] 150

Vectors are the most used data type in R. Behind the scenes, all the data types are just vectors of some sort. You can compare the vectors to an atomic data type.

Factors in R

Vectors provide a way to create a collection of values, which are important in our data manipulation. Sometimes these values are merely identifiers. Take for instance the following vector:

> deptname <- c("Marketing and Public Relations", "Finance and Treasury", "Engineering and Technology", "Operations and Process Control")

This is a pretty long variable. However these elements are merely descriptors. There is no significance of the names relative to one another, and you will not operate on the names; but the names will have to be used by R in their entirety. So mere pointers to these values will be sufficient.

There is also another property of this type of variables. Consider a vector of department numbers:

c(10,20,30,40,50)

These are just numbers representing the department. There is nothing more to it. For instance, you can't say that department 30 is bigger than department 20, nor can you say that there is an average of the department numbers. These numbers are merely descriptive. If R knows about this property, it's better for memory management and processing, because these will not just be usual numbers; instead they will be specific values. In R, these are called factors. You can create a factor using the factor() function on any vector.

Here is how you can create a factor of department numbers.

> dept <- factor(c(10,20,30,40,50))

If you check the data type of this variable, you can see it's a factor.

> class(dept)
[1] "factor"  

This is what you see if you examine the contents:

> dept
[1] 10 20 30 40 50
Levels: 10 20 30 40 50

Note the contents. They are called levels of the factor. The levels are individual distinct values in the factor. So, if you pass the same value multiple times, it will be in the factor but it will not be listed multiple times in levels. Here is a factor with multiple occurrences of the same value.

> dept <- factor(c(10,10,10,20,20,30,40,50,50))
> dept
[1] 10 10 10 20 20 30 40 50 50
Levels: 10 20 30 40 50

You can also pull up the levels of a factor explicitly by using the level() function:

> levels(dept)
[1] "10" "20" "30" "40" "50"

Numbers like those in this list probably don't make much sense. So you need to assign descriptive levels. The levels() function accomplishes that.

> levels(dept) <- c("Marketing","Sales","Finance","Operations","IT")
> dept
[1] Marketing  Sales      Finance    Operations IT         IT        
Levels: Marketing Sales Finance Operations IT

See how the factor levels are different now? You can access the elements of factor the same way as you access matrices.

> dept[2]
[1] Sales
Levels: Marketing Sales Finance Operations IT

In this example, the levels are merely descriptive; there is no comparative relationship among them. For instance, the Marketing department is not greater than or smaller than, say, the Finance department. If you force such a comparison, you will get an error:

> dept[2] > dept [1]
[1] NA
Warning message:
In Ops.factor(dept[2], dept[1]) : '>' not meaningful for factors

But sometimes there may be a comparative relationship. Take for instance a factor containing all the titles in a company. The title "president" is higher than "vice president," which is higher than "director," and so on. This is called an ordered factor. To indicate the factor is ordered, you have to include the ordered parameter and set it to T when creating the factor, as shown below. Here, we also added the labels.

> title <- factor(c('associate'=1,'manager'=2,'director'=3,'vp'=4,'president'=5), ordered = T)
> title
associate   manager  director        vp president 
        1         2         3         4         5 
Levels: 1 < 2 < 3 < 4 < 5

Note how the Levels attribute is shown now. 1 is less than 2, which is less than 3, and so on. In this case, you can compare factors. Here is a check:

> title["vp"] > title['manager']
[1] TRUE
> title["vp"] > title['president']
[1] FALSE

In large datasets, the factors will be quite large, which makes visual comparison difficult. This approach of declaring ordered factors comes in handy at that time.

Lists in R

While you can see the obvious advantage of vectors, there is an important limitation: all the elements must be of the same data type, which can be pretty huge based on the use case. So, here comes a similar but different collection—list—which is like a vector but the elements can be of any data type. You define a list using the function list().

v1 <- list(1,'a',T)

Just to make sure you have created it with elements of different data types, display the v1 variable:

> v1
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

Does the output look familiar? If you notice, the double square bracket notation, [[]], came from the vector representation. The elements are all of a different type, as you can see. But under the covers, it's just a vector of multiple vectors. You can confirm that by checking for list and vector:

> is.list(v1)
[1] TRUE
> is.vector(v1)
[1] TRUE

The above can be written as follows:

> v1 <- c(c(1),c('a'),c(T))
> v1
[1] "1"    "a"    "TRUE"

You can create two variables and compare them:

> v1 <- list(1,'a',T)
> v2 <- c(c(1),c('a'),c(T))
> v1 == v2
[1] TRUE TRUE TRUE

They are the same. Because list is a vector of vectors, you can also define sort of a multidimensional representation of data. I call is "sort of" just because there are better ways to handle multiple dimensions. Here are three vectors, of numbers, characters, and logical data types, named n1, c1, and l1 respectively.

n1 <- c(1,2,3,4,5)
c1 <- c("First","Second","Third")
l1 <- c(T,F,T,T,F)

We can create a list from all these, as shown below:

list1 <- list(n1,c1,l1)

If you display the list, you will see its contents:

> list1
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "First" "Second" "Third"

[[3]]
[1] TRUE FALSE TRUE TRUE FALSE

If you want to address the first element of the list, you will need to access it by its position:

> list1[1]
[[1]]
[1] 1 2 3 4 5

This will return a vector, as expected. If you want to access the first element of this vector, you will need to access it by index.

> list1[[1]][1]
[1] 1

But positional indexes can be difficult to use. It can be made easier. Remember the naming of the elements in vectors? The same can be done in lists too, using the same function names() and passing a vector of values.

Suppose you want to name them "Numbers," "Spelled," and "Booleans," respectively. You can use the following:

names(list1) <- c("Numbers", "Spelled", "Booleans")

Now, if you display the variable, you will see different headers for the elements.

> list1
$Numbers
[1] 1 2 3 4 5

$Spelled
[1] "First"  "Second" "Third" 

$Booleans
[1]  TRUE FALSE  TRUE  TRUE FALSE

This helps in accessing individual elements of the list.

> list1 [1]
$Numbers
[1] 1 2 3 4 5

To access a specific element of the list, you can use the label as well.

> list1 ["Numbers"]
$Numbers
[1] 1 2 3 4 5

If you ever want to change lists to vectors, simply use the unlist() function:

> unlist(list1)
 Numbers1  Numbers2  Numbers3  Numbers4  Numbers5  Spelled1  Spelled2  Spelled3 
      "1"       "2"       "3"       "4"       "5"   "First"  "Second"   "Third" 
Booleans1 Booleans2 Booleans3 Booleans4 Booleans5 
   "TRUE"   "FALSE"    "TRUE"    "TRUE"   "FALSE" 

Now each element can be addressed as in a vector. For example, list1[13]references the thirteenth element. Sometimes we might need to do this "flattening" of the output, as you will learn later in the series.

Matrices in R

A matrix is a two-dimensional representation of data elements; but of the same data type. Remember, it has to be exactly two dimensions—rows and columns—not three dimensions. You create a matrix using the matrix() function.

In the following example, we create a matrix with 20 elements (1:20, if you remember from Part 2, creates a sequence of 20 numbers from 1 to 20).

> m1 <- matrix(1:20)
> m1
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
[7,] 7
[8,] 8
[9,] 9
[10,] 10
[11,] 11
[12,] 12
[13,] 13
[14,] 14
[15,] 15
[16,] 16
[17,] 17
[18,] 18
[19,] 19
[20,] 20

Note how it created a single column matrix, with 20 rows. If you want rows and columns, you need to specify that. The nrow parameter specifies the number of rows of the resultant matrix.

> m1 <- matrix(1:20, nrow=4)
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

Note there are four columns because you mentioned that while creating the matrix. R distributed the values and ended up with five columns.

What if you had wanted a fixed number of columns instead? You can request that with the ncol parameter.

> m1 <- matrix(1:20, ncol=4)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

You can specify both ncol and nrow. If the values are more than required, R will discard them. Note how in the following example, only 16 of your initial 20 values made it to the matrix of four rows and four columns.

> m1 <- matrix(1:20, nrow=4, ncol=4)
> m1
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

By the way, matrices can be generated with discrete values as well, not just sequential values. Here is an example.

> m1 <- matrix(c('a','e','i','o','u','1'), nrow=3)
> m1
     [,1] [,2]
[1,]  "a"  "o"
[2,]  "e"  "u"
[3,]  "i"  "1"

Note how the data has been distributed first down and then across. What if that's not what you want? You can tell the matrix to be across rows instead of columns. Another optional parameter, byrow, when set to TRUE, accomplishes that.

> m1 <- matrix(c('a','e','i','o','u','1'), nrow=3, byrow=T)
> m1
     [,1] [,2]
[1,]  "a"  "e"
[2,]  "i"  "o"
[3,]  "u"  "1"

Now that you know how a matrix is defined, you might notice some annoying things about how it is shown. The first thing you might notice is the labeling. The rows and columns do not have headers. Instead they have just a default header with some commas and numbers, which doesn't add a lot of value. To add labels, you can use another parameter called dimnames. The parameter takes a vector containing the row labels and another containing the column labels.

> m1 <- matrix(c(1,2,3,4,5,6,7,8,9,10), nrow=2, byrow = T, dimnames = list(c("Row1","Row2"),c("Col1","Col2","Col3","Col4","Col5")))
> m1
     Col1 Col2 Col3 Col4 Col5
Row1    1    2    3    4    5
Row2    6    7    8    9   10

It's not just at the declaration time that you have the opportunity to mention the labels. If you already have a matrix, you can add labels later. Or, if you want to rename the labels, you can do that using the dimnames() function, as shown below:

> dimnames(m1) <- list(c("R1","R2"))
> m1
   [,1] [,2] [,3] [,4] [,5]
R1    1    2    3    4    5
R2    6    7    8    9   10

Another way to change the column name is with the colnames() function:

colnames(m1) <- c("Col1","Col2","Col3","Col4","Col5")

Now if you display m1, you'll see this:

> m1    
   Col1 Col2 Col3 Col4 Col5
R1    1    2    3    4    5
R2    6    7    8    9   10

Similarly, another way to change the row names is with the rownames() function. Let's use that to change the names of the rows back to Row1, Row2, and so on.

rownames(m1) <- c("Row1", "Row2")
Accessing a Matrix

Well, that's a lot about naming the rows and columns of the matrix. But a matrix is useless without some way to access the data. It's super easy. Just use the x and y coordinates. Always remember: rows and then columns, not other way around. For instance, to select the element in the first row and second column, just use this:

> m1[1,2]
[1] 2

If you want to select multiple columns, just use the range notation. For instance, to select columns 2 until 3, use 2:3, as shown below:

> m1[1,2:3]
Col2 Col3 
   2    3 

If you want to select multiple but discrete columns, not a range, just use the column numbers as a vector. For instance, to select columns 2, 3, and 4, use c(2,3,4), as shown below:

> m1[1,c(2,3,4)]
Col2 Col3 Col4 
   2    3    4 

If you want to select all the columns of row 1, you can of course use c(1,2,3,4,5); but what if you don't know the number of columns? No worries. you can just omit the column reference:

> m1[1,]
Col1 Col2 Col3 Col4 Col5 
   1    2    3    4    5 

Note there is a comma and nothing after the comma— m1[1,], not m1[1]—such as you would do with an implied second parameter. If you don't have a value as a parameter, R assumes all the columns. If you don't place a comma, R picks up only one element starting from the left top-most going downward, for instance:

> m1[2]
[1] 6

So, be careful about the placement of values and the comma. In PL/SQL, if you have a comma, it means you are supplying the next parameter. It can be null, but it must be supplied. If you want the next parameter to be the default, you simply don't put the comma. If you put the comma and then don't mention the value of the next parameter, you will get a syntax error. In R, you will not get a syntax error and the behavior will be very different. Likewise, you can omit the first parameter as well, which then defaults to all the rows of the column.

For instance, the following will bring up all the rows of column 1.

> m1[,1]
Row1 Row2 
   1    6 
> m1[,2]
Row1 Row2 
   2    7 

But, note that the output comes as rows, even though it's from a column. At least the labels are correct: Row1, Row2, and so on. If you select multiple columns, for example, columns 1 until 3, you will see the output as expected.

> m1[,1:3]
     Col1 Col2 Col3
Row1    1    2    3
Row2    6    7    8

But, thinking like database professionals, we probably won't like to address them as numbers. Because we have the rows and columns named, can't we address them using the names, as you would do in a table? The answer is, of course you can. Here is how we access row Row1 and column Col2:

> m1["Row1","Col2"]
[1] 2

Another way to create a matrix is to use the rbind() function, which adds a row to the matrix.

> row1 <- c(11,12,13,14)
> row2 <- c(21,22,23,24)
> row3 <- c(31,32,33,34)
> m1 <- rbind(row1,row2,row3)
> m1
     [,1] [,2] [,3] [,4]
row1   11   12   13   14
row2   21   22   23   24
row3   31   32   33   34
> m2 <- cbind(row1,row2,row3)
> m2
     row1 row2 row3
[1,]   11   21   31
[2,]   12   22   32
[3,]   13   23   33
[4,]   14   24   34

What if you use less number of values in rbind()? Let's see:

> row4 <- c(41,42)
> m1 <- rbind(row1,row2,row3, row4)
> m1
     [,1] [,2] [,3] [,4]
row1   11   12   13   14
row2   21   22   23   24
row3   31   32   33   34
row4   41   42   41   42

Row4 didn't have enough columns. It had only two, but the total columns needed (four) is a multiple of two. Therefore, the available values simply got repeated. However, if you give a nondivisible value as the number of elements, you will get an error:

> row4 <- c(41,42,43)
> m1 <- rbind(row1,row2,row3, row4)
Warning message:
In rbind(row1, row2, row3, row4) :
  number of columns of result is not a multiple of vector length (arg 4)
Operations on a Matrix

But why would you even want to create values as matrices? Ease of operations, of course. R allows you to operate on matrices as a whole. Here is an example.

Create a matrix of 20 numbers: four rows and five columns:

> m1 <- matrix(1:20, nrow=5)
> m2 <- matrix(1:20, nrow=5)

> m2
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

If you want to multiply 2 by every element of this matrix, just multiply.

> m2 * 2
     [,1] [,2] [,3] [,4]
[1,]    2   12   22   32
[2,]    4   14   24   34
[3,]    6   16   26   36
[4,]    8   18   28   38
[5,]   10   20   30   40

In PL/SQL and other languages such as C or Java, you would have had to write a loop visiting all the elements and multiplying. Not in R. You just add, subtract, multiply, raise to the power of, and so on, just as with any regular unary operation. You can even multiply a matrix by another matrix:

> m1
     Col1 Col2 Col3 Col4
Row1   11   12   13   14
Row2   21   22   23   24
Row3   31   32   33   34
Row4   41   42   41   42
> m2
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> m1 * m2
     Col1 Col2 Col3 Col4
Row1   11   60  117  182
Row2   42  132  230  336
Row3   93  224  363  510
Row4  164  336  492  672

Let's stop the discussion on matrices for now and discuss the next important collection in R: arrays.

Arrays in R

Matrices can accomplish most of the data analysis tasks. Most of the data you will get is in two dimensions. But what if your dataset comes in more than two dimensions? An array comes to rescue here. It is a multidimensional matrix.

Here is how you can define a one-dimensional array.

> v1 <- array(data=1:30)
> v1
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

As you can see, it looks like a vector; it's not of much value. An optional parameter creates a multidimensional array. This parameter—dim—allows you to define multidimensional arrays:

> v2 <- array(1:30, dim=c(5,5))
> v2
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

But this also looks like another familiar structure: the matrix. So it's not of much value. Let's define a third dimension:

> v3 <- array(1:30, dim=c(5,5,5))
> v3
, , 1

[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25

, , 2

[,1] [,2] [,3] [,4] [,5]
[1,] 26 1 6 11 16
[2,] 27 2 7 12 17
[3,] 28 3 8 13 18
[4,] 29 4 9 14 19
[5,] 30 5 10 15 20

, , 3
... output truncated ...

As you can see, it created a three-dimensional data structure. We provided only 30 values (as shown in the sequence 1:30), not enough for the 125 values needed for the array (5 x 5 x 5 = 125); so R repeated the values as much as needed to fill the array.

How do we access the elements of the array? The exact way we addressed them in matrix: variable[rownumber, column number, third dimension]. You start with the row, then the column, and finally all the other dimensions defined in the array.

Here's how to refer to the earlier output:

> v3[2,2,1]
[1] 7

Note that the third dimension is printed in the output first, as headings are when you display all the values of a list variable. There is a matrix for each of the third dimensions. Like the matrix, you can omit the exact position of a dimension and you will get all the values for that dimension, for instance,

> v3[1,1,]
[1] 1 26 21 16 11

Here we omitted the third dimension; so we get the first row and first column of all the matrices for all values of the third dimensions, that is, v3[1,1,1], v3[1,1,2], v3[1,1,4], and so on. Just as with a matrix, you can add and multiply values directly to the array.

For example, this

v3 * 10 

will multiply 10 by all the elements of the array.

You can multiply arrays by other arrays as well. If you want a sum of all the first row and first column from all the third dimensions, you can use this:

> sum(v3[1,1,])
[1] 75

Just as with a matrix, you can name the dimensions of the array as well.

colnames(v3) = c("Col1","Col2","Col3","Col4","Col5")
rownames(v3) = c("Row1","Row2","Row3","Row4","Row5")

The following is a special naming for the third dimension, which you didn't see with matrix:

dimnames(v3)[[3]] <- c("Third1","Third2","Third3","Third4","Third5")

Now if you print v3, you will see the names:

> v3
, , Third1

Col1 Col2 Col3 Col4 Col5
Row1 1 6 11 16 21
Row2 2 7 12 17 22
Row3 3 8 13 18 23
Row4 4 9 14 19 24
Row5 5 10 15 20 25

, , Third2

Col1 Col2 Col3 Col4 Col5
Row1 26 1 6 11 16
Row2 27 2 7 12 17
Row3 28 3 8 13 18
Row4 29 4 9 14 19
Row5 30 5 10 15 20

, , Third3
... output truncated ...

And, just as with a matrix, you will be able to address the elements by their dimension names instead of the positions.

> v3["Row1","Col2","Third3"]
[1] 26

So, as you can see, arrays are just like matrices, but with multiple dimensions that you can name and address elements by.

Data Frames in R

Being a database professional, you probably already know that data comes in all types, not just in one type. For instance, take employee data. You probably have employee ID (EmpID), which is a number; the name (Name), which is characters, and another column called AtHQ, which is a logical value TRUE or FALSE that shows whether the employee is located at the company headquarters. This is what you typically see in database tables, spreadsheets, and even in text files.

EmpID Name AtHQ
1 John Smith T
2 Jane Doe F

In PL/SQL, you would have to define a record type and create a collection on that record. In Oracle Database, you would have to create a table with those columns and specified data types. In R, the collection is called a data frame.

A data frame is just like a matrix, but with a very important difference: the elements can contain different data types. Recall that all the elements in a matrix must be of the same data type. Because a data frame removes that limitation, it can be used in many data analysis cases.

Let's create a data frame for the above information. You create a data frame via the data.frame() function.

> df1 <- data.frame(1,"John Smith",T)

If you show the data frame, you can see that the column labels are not exactly what you intended.

> df1
  X1 X.John.Smith.    T
1  1 John Smith    TRUE 

R simply puts whatever it feels is right. Let's put appropriate labels for the columns.

> colnames(df1) <- c("EmpId","Name","AtHQ")
> df1
  EmpId Name       AtHQ
1     1 John Smith TRUE

When you want to add rows to the data frame, just create a new data frame and add it to the other one using the rbind() function.

df2 <- data.frame(2,"Jane Doe",F)
colnames(df2) <- c("EmpId","Name","AtHQ")
df1 <- rbind(df1,df2)

Now if you check df1, you'll see this:

> df1
  EmpId Name       AtHQ
1     1 John Smith TRUE
2     2 Jane Doe   FALSE

Assigning row and column names makes accessing the elements of the data frame immensely easy. To show all the values of column EmpID, here is what you need to write:

> df1["EmpId"]
EmpId
1 1
2 2

But in a data frame, you have another method of writing the column values. It's the name of the data frame followed by the column name separated by a "$" sign. Here is the example:

> df1$EmpId
[1] 1 2

Note that while this also gets the employee IDs, the values come back as a vector. In the previous approach, the value returned was another data frame. This might not make much difference in your work, but it might. So, keep that mind.

Finally, you can access the rows of the data frame using the "[]" notation.

> df1[1,]
EmpId Name AtHQ
1 1 John Smith TRUE

Here, we retrieved all the columns of row 1. If you want only a selected column, for example, EmpId, you can give it as an index:

> df1[1,"EmpId"]
[1] 1

What about row names? Yes, you can name rows in a data frame, using the same rownames() function you saw with matrix:

> rownames(df1) <- c("Employee1","Employee2")
> df1
EmpId Name AtHQ
Employee1 1 John Smith TRUE
Employee2 2 Jane Doe FALSE

However, naming the rows doesn't help because rows are merely observations (which are analogous to records in a table) of multiple variables (which are columns). It makes sense to name these variables; but naming rows might not make sense. When you add a row, you have to remember to add the row name for that as well.

Working with an Actual Data Frame

In the previous text, I just wanted to give you a flair of the various collections. Let's see some operations on an actual dataset, not made up values, as we saw before.

Fortunately for us, R comes with many built-in datasets. To find these datasets, just use the data() function at the R prompt:

> data()

This will bring up a different window with the datasets. Here is a small excerpt from that window:

Data sets in package 'datasets':

AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
... output truncated ...

We will use one dataset, airquality, which shows the daily air quality measurements in New York from May to September 1973. How do I know that? I didn't have to guess. Just use help() function to see what it is about. Enter help(airquality) at the R command prompt and you will see a browser window pop up to explain what the dataset is. In that help window, I see that it's a data frame with the following properties:

A data frame with 154 observations on 6 variables. 

[,1]  Ozone     numeric  Ozone (ppb)

[,2]  Solar.R   numeric  Solar R (lang)

[,3]  Wind      numeric  Wind (mph)

[,4]  Temp      numeric  Temperature (degrees F)

[,5]  Month     numeric  Month (1--12)

[,6]  Day       numeric  Day of month (1--31) 

Let's see some of the example data:

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Let's see how we can use this data frame for some analysis. First, I am tired of typing "airquality" every time. So I will create a nice little variable from the dataset.

> a <- airquality

From now on, I will just use "a" for the data. Recall from the first article in this series that you can use the str() function to find the structure of a data frame.

> str(a)
'data.frame': 153 obs. of  6 variables:
 $ Ozone  : num  41 36 12 18 25 28 23 19 8 25 ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The str() function is analogous to the describe command in SQL*Plus to describe the structure of a table. Consider the above output carefully. It shows there are 153 "observations." There are six variables, analogous to columns in a table. So this is similar to a table with six columns and 153 rows. The subsequent parts of the output shows the columns, their data type, and some of the sample.

For my first analysis, I just want to look at the quick summaries of all the available attributes:

> summary(a)
    Ozone        Solar.R         Wind           Temp
Min.   :   1.00  Min.   :  7.0   Min. : 1.700   Min. :56.00
1st Qu.:  18.00  1st Qu.:115.8   1st Qu.: 7.400 1st Qu.:72.00
Median :  31.50  Median :205.0   Median : 9.700 Median :79.00
Mean   :  42.13  Mean   :185.9   Mean : 9.958   Mean :77.88
3rd Qu.:  63.25  3rd Qu.:258.8   3rd Qu.:11.500 3rd Qu.:85.00
Max.   : 168.00  Max.   :334.0   Max. :20.700   Max. :97.00
  NA's :37         NA's :7
   Month          Day
Min.   :5.000 Min.   : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean   :6.993 Mean   :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max.   :9.000 Max.   :31.0

This is a very good starting point that tells me some of the important details of the dataset: the minimum, the maximum, the median, the mean, the first quantile, and the third quantile for each attribute (or column). It also shows how many values are not available, shown as NA, which is equivalent of NULL in Oracle Database.

From the output, we know that 37 records have NA in the Ozone column and seven records have NA in the Solar.R column. The other columns don't have NA counts, which means they don't have NA values. Interestingly, it does not show a key factor in data analysis—standard deviation—but you can get it easily if you want, as you will see later.

Let's see some questions that might be asked and how we get the answers. We need to find out the daily temperatures in the month of June. You can reference the cells of the data frame using the absolute values (row m and column n), or you can use the logical values for the cell. To get the logical value, you should use the comparison operator. Because you are looking for the month of June, that is, 6 under the Month column, use Month==6 as the index. And because you want just the Temp column, you should use the $Temp notation.

> a$Temp[a$Month==6]
 [1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76
[21] 77 76 76 76 75 78 73 80 77 83

How about we get a bit specific, for example, June 30th? The same principle applies. You just need to extend the comparison operation to be a bit more restrictive.

> a$Temp[a$Month==6 & a$Day==30]
[1] 83

But mostly you will be interested in some kind of statistical inference on a group of data, for example, the average (or mean) temperature in June, not just a single value. The mean() function accomplishes that.

> mean(a$Temp[a$Month==6])
[1] 79.1

Or, the maximum temperature in June:

> max(a$Temp[a$Month==6])
[1] 93

Remember the summary function that gives you all quick stats at a glance? Won't you want to have the same here? We would love to slice the data for June and apply the summary. Let's create a new variable, called june, to hold data for June.

> june <- a$Temp[a$Month==6]
> june
[1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76 77 76 76 76 75
[26] 78 73 80 77 83

But what exactly is june?

> class(june)
[1] "integer"

It says it's an integer, but we wanted a data frame. Check if it is a data frame:

> is.data.frame(june)
[1] FALSE

So it's not a data frame. Well, what is it? Could it be a vector?

> is.vector(june)
[1] TRUE

It is a vector. But we want a data frame, so we will convert it to one.

> june <- data.frame(june)
> class(june)
[1] "data.frame"

Now that we have our data frame, we can apply the summary() function.

> summary(june)
june
Min.   :65.00
1st Qu.:76.00
Median :78.00
Mean   :79.10
3rd Qu.:82.75
Max.   :93.00

Let's get the standard devision of this dataset using the sd() function.

> sd(june$june)
[1] 6.598589
Subsetting

What if you want to get all the data, not just Temp, of the data frame for the month of June? In SQL, you would write something like this:

select *
from df1
where Month = 6;

In R, it's also trivial. There are two ways to do it. I will start with the first—and my favorite—method, because it's just plain easy to read. It's by using the subset() function.

> june<-subset(a, a$Month==6)
> june
   Ozone Solar.R Wind Temp Month Day
32    NA     286  8.6   78     6   1
33    NA     287  9.7   74     6   2
34    NA     242 16.1   67     6   3
35    NA     186  9.2   84     6   4
36    NA     220  8.6   85     6   5
37    NA     264 14.3   79     6   6
38    29     127  9.7   82     6   7
39    NA     273  6.9   87     6   8
40    71     291 13.8   90     6   9
41    39     323 11.5   87     6  10
42    NA     259 10.9   93     6  11
43    NA     250  9.2   92     6  12
44    23     148  8.0   82     6  13
45    NA     332 13.8   80     6  14
46    NA     322 11.5   79     6  15
47    21     191 14.9   77     6  16
48    37     284 20.7   72     6  17
49    20      37  9.2   65     6  18
50    12     120 11.5   73     6  19
51    13     137 10.3   76     6  20
52    NA     150  6.3   77     6  21
53    NA      59  1.7   76     6  22
54    NA      91  4.6   76     6  23
55    NA     250  6.3   76     6  24
56    NA     135  8.0   75     6  25
57    NA     127  8.0   78     6  26
58    NA      47 10.3   73     6  27
59    NA      98 11.5   80     6  28
60    NA      31 14.9   77     6  29
61    NA     138  8.0   83     6  30

The other approach is this:

> a[a$Month==6,]

Note the expression carefully; there is a comma after the conditional expression. Why is that? Recall from the previous discussion that to address an element you have to use the [RowNumber, ColumnNumber] notation. RowNumber is specified by a$Month==6, but for columns, we don't want to restrict anything. We want to select all the columns; therefore, we left it as null.

Now you can see why I like the subset() approach more. It's several times more readable. However, subset() doesn't work when you want to assign values based on conditions. For instance, suppose you want to assign the value 25 to Ozone when the value is NA. subset won't work; you have to leverage the subscript approach.

> a[is.na(a$Ozone),]
    Ozone Solar.R Wind Temp Month Day
5      NA      NA 14.3   56     5   5
10     NA     194  8.6   69     5  10
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26
27     NA      NA  8.0   57     5  27
32     NA     286  8.6   78     6   1
33     NA     287  9.7   74     6   2
34     NA     242 16.1   67     6   3
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
37     NA     264 14.3   79     6   6
39     NA     273  6.9   87     6   8
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
45     NA     332 13.8   80     6  14
46     NA     322 11.5   79     6  15
52     NA     150  6.3   77     6  21
53     NA      59  1.7   76     6  22
54     NA      91  4.6   76     6  23
55     NA     250  6.3   76     6  24
56     NA     135  8.0   75     6  25
57     NA     127  8.0   78     6  26
58     NA      47 10.3   73     6  27
59     NA      98 11.5   80     6  28
60     NA      31 14.9   77     6  29
61     NA     138  8.0   83     6  30
65     NA     101 10.9   84     7   4
72     NA     139  8.6   82     7  11
75     NA     291 14.9   91     7  14
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
107    NA      64 11.5   79     8  15
115    NA     255 12.6   75     8  23
119    NA     153  5.7   88     8  27
150    NA     145 13.2   77     9  27

Let's assign a value of 25 to all the cells where Ozone is NA.

> a[is.na(a$Ozone),]$Ozone <- 25

Let's test only one row: row 5, which was NA earlier:

> a[5,]
  Ozone Solar.R Wind Temp Month Day
5    25      NA 14.3   56     5   5

Now it's no longer NA; it's 25. Let's confirm there are no NAs in the Ozone column.

> a[is.na(a$Ozone),]
[1] Ozone   Solar.R Wind    Temp    Month   Day    
<0 rows> (or 0-length row.names)
Selecting Specific Columns from a Data Frame

What if you want to select only a specific column from the data frame, instead of all the columns, based on some condition? For instance, you want to select temperatures for the month of June.

In SQL, it would be something like this:

select temp
from dataframe
where Month = 6;

In R, you can build it upon the previous expression:

> subset(a, a$Month==6)$Temp
 [1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76 77 76 76 76 75 78 73 80 77 83

Note how we used the previous expression of subset, which yields a data frame containing all the columns from "a"—but only where Month matches 6—and we added $Temp to it to select only that column. As a further extension, if you want to get the minimum temperature in June, in SQL, you would write this:

select min(temp)
from dataframe1
where month = 6;

In R, you'd do this:

> min(subset(a, a$Month==6)$Temp)
[1] 65

By the way, the condition clause can also contain multiple conditions. For instance, if you want to select temperature on June 14, in SQL you would write something like this:

select Temp
from dataframe1
where Month = 6
and Day = 14;

In R, you would use the "&" character in place of AND:

> subset(a, a$Month==6 & a$Day==14)$Temp
[1] 80

There is an even simpler way for SQL-savvy developers. A package called sqldf makes the task as easy as writing SQL code.

> install.packages("sqldf")

Then load the package:

> library(sqldf)

Using the library, we can rewrite our code as follows:

> sqldf('select Temp from a where Month=6 and Day = 14')
  Temp
1   80

Well, that makes is super easy for SQL-literate readers. It returns a data frame, even if it's just one value. But as you saw earlier, you can compare that with any atomic variable. Here is an example:

> v1 <- sqldf('select Temp from a where Month=6 and Day = 14')> if (v1 > 75) { print ("This was a hot day day")}
[1] "This was a hot day day"

Let's go back to the subset approach. What if you wanted to select a few columns instead of all the columns? For instance, because you are selecting for the month of June only, you don't want to select the month number again. In addition, you might want to have the day in the first column. In SQL, you would write this:

select day, ozone, solar.r, temp
from dataframe
where month=6;

In R, you can use the select parameter of the substr() function.

> subset(a, a$Month==6, select=c(Day,Ozone, Solar.R, Temp))
   Day Ozone Solar.R Temp
32   1    25     286   78
33   2    25     287   74
34   3    25     242   67
35   4    25     186   84
36   5    25     220   85
37   6    25     264   79
38   7    29     127   82
39   8    25     273   87
40   9    71     291   90
41  10    39     323   87
42  11    25     259   93
43  12    25     250   92
44  13    23     148   82
45  14    25     332   80
46  15    25     322   79
47  16    21     191   77
48  17    37     284   72
49  18    20      37   65
50  19    12     120   73
51  20    13     137   76
52  21    25     150   77
53  22    25      59   76
54  23    25      91   76
55  24    25     250   76
56  25    25     135   75
57  26    25     127   78
58  27    25      47   73
59  28    25      98   80
60  29    25      31   77
61  30    25     138   83

Note that we had to pass the list columns as a vector (recall that the c function is used to create a vector). By the way, we just retrieved all the columns except one: Month. So instead of writing a long string of column names, we can simply write all columns except Month. There is no equivalent in SQL. In R, you can write this:

> subset(a, a$Month==6, select=-Month)
   Ozone Solar.R Wind Temp Day
32    25     286  8.6   78   1
33    25     287  9.7   74   2
34    25     242 16.1   67   3
35    25     186  9.2   84   4
36    25     220  8.6   85   5
... output truncated...

With the "[]" notation, you can achieve the same objective by writing this:

> a[a$Month==6,c("Day","Ozone","Solar.R","Temp")]
   Day Ozone Solar.R Temp
32   1    25     286   78
33   2    25     287   74
34   3    25     242   67
... output truncated ...

Here you learned how to manipulate a data frame using the basic R syntax. Earlier in this article, you saw briefly how the sqldf function makes queries SQL-like, which makes the learning curve for database professionals amazingly smaller. Actually, sqldf is designed for database access from within R. We will explore sqldf in more detail later in the series, in the installment dedicated to database access.

Summary

Here is a quick recap of what you learned in this article.

  • A vector is a collection of only one dimension composed of elements of one data type only.
  • A factor is like a vector—a one-dimensional collection of a single data type—but the values are merely representational variables, not values to be operated on.
  • A matrix is a collection of two dimensions of elements of the same data type. So, a matrix is a collection of vectors.
  • A list is a collection of one dimension, similar to vector, but the elements could be of different data types, similar to a database record.
  • A data frame is a two-dimensional collection of elements of multiple data types, similar to a database table.
  • An array is a multidimensional collection of elements of one data type only. So, an array is just like a matrix but with more than two dimensions.

To check for the type of collection for a variable x, use the function

  • is.vector(x) to check if it's a vector
  • is.list(x) to check if it's a list
  • is.factor(x) to check if it's a factor
  • is.data.frame(x) to check if it's a data frame

All data types can be operated on as a whole, for example, to multiply 2 by all the elements of a vector, matrix, or array, just call 2 * variable_name.

While defining the collections, you can optionally supply column headers, or labels, as shown below:

 list1 <- list(col1=1,col2="A",col3=T)
vector1 <- c(col1=1,col2=2,col3=3)

You can address the individual elements of the collection by position, for example, list1[1].

Or, you can access them by the column label, if that is defined: list1["col1"].

By default, the above will also show the column label. If you want the value alone, you use double square brackets: list1[["col1"]].

For two-dimensional objects such as matrices and data frames, you need to use the [row,column] format, for example, matrix1[1,3] to get the first row, third column.

If you want to get the entire first row, that is, all the columns, use matrix1[1,].

For lists, matrices, and data frames you can use the $ notation if the column labels have been created, for instance, matrix1$col1.

To add row and column names to an exiting collection, you use the rownames() and colnames() functions, respectively.

> matrix1
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6   11   16   21   26
[2,]    2    7   12   17   22   27
[3,]    3    8   13   18   23   28
[4,]    4    9   14   19   24   29
[5,]    5   10   15   20   25   30
> colnames(matrix1) <- c("col1","col2","col3","col4","col5","col6")
> rownames(matrix1) <- c("row1","row2","row3","row4","row5")

To add rows to a matrix or a data frame, you need to use the rbind() function.

To add columns, you need to use the cbind() function.

To selectively get the values from a collection, you use the logical operator within square brackets. For instance, to get all values in a matrix greater than 20, you use this:

> matrix1 > 20
col1 col2 col3 col4 col5 col6
row1 FALSE FALSE FALSE FALSE TRUE TRUE
row2 FALSE FALSE FALSE FALSE TRUE TRUE
row3 FALSE FALSE FALSE FALSE TRUE TRUE
row4 FALSE FALSE FALSE FALSE TRUE TRUE
row5 FALSE FALSE FALSE FALSE TRUE TRUE
> matrix1[matrix1 > 20]
[1] 21 22 23 24 25 26 27 28 29 30

That's it. Now it's time to test your understanding using the quiz.

Quiz

Below are 10 questions followed by the answers.

Question 1: What will be the result of the following code? If this will result in a syntax error, mention that.

v1 <- 100
v2 <- c(100)

if (v1==v2) {
  paste("They are same")
} else {
  paste("They are different")
}

Question 2: What will be the output of the following code? If you think it will result in a syntax error, mention that. Hint: the c() function creates a vector and a vector can contain values of only one data type.

> v2 <- c(1,T)
> v2[2]

Question 3: Refer to the dataset airquality. Write an R expression that pulls five rows at random and displays all the columns.

Question 4: What will be the output of the following expression? If this will produce an error, say so.

v1 <- list(a=1:10, b=1:20)

Question 5: I want to create a two-column matrix with column labels "a" and "b" and with 10 rows. Here is what I proposed. Will this accomplish what I need or produce a syntax error?

v1 <- matrix(c(a=1:10, b=1:10))

Question 6: You need to create a three-dimensional data type of all integers. What's the best collection type to use to do that?

Question 7: You want to check if v1 is a vector. It has been defined earlier (although you are not aware of that) as follows:

> v1 <- c(1:10)

So you give the following command:

> class(v1)

Will that give you a confirmation of how the variable was defined?

Question 8: A variable v1 has been defined as follows:

v1 <- 7

However, you are not aware of that. To check if it's a vector, you use the following command:

> is.vector(v1)
[1] TRUE

Why does it show as a vector? It's just an integer.

Question 9: Here is a data frame.

> df1
     col1 col2 col3 col4 col5 col6
row1    1    6   11   16   21   26
row2    2    7   12   17   22   27
row3    3    8   13   18   23   28
row4    4    9   14   19   24   29
row5    5   10   15   20   25   30

You want to extract row3 and two columns: col1 and col2. You gave the following expression:

> df1[row3,("col1","col2")]

Will it produce the desired result? If it will result in a syntax error, mention that.

Question 10: You defined a factor that is supposed to represent some sort of level of seniority among employees.

f1 <- factor(c(1,2,3,4))

So, 1 is less in seniority than 2 and so on. However, when you try to determine the minimum seniority, you get this:

> min(f1)
Error in Summary.factor(1:4, na.rm = FALSE) : 
  'min' not meaningful for factors

What happened? How can you eliminate the error?

Answers

Answer 1: It will not produce a syntax error. Recall that behind the scenes, all data types are essentially vectors. So you will get the message "They are same."

Answer 2: Note it carefully. This is a vector, as the c() function would create. You have assigned a number and a logical value (T). Elements of multiple data types are not allowed in vectors. So you can assume that it will fail with a syntax error. However, a vector does not actually fail. R will convert all the elements into a superset data type that can accommodate all the elements and make them of a single data type. In this case, it will be of type number, because logical values will be assigned a value of 0 or 1 if they are false or true, respectively. So, this vector will be stored as c(1,1) and v2[2] will yield 1.

Answer 3: To sample five values from a set of known rows, you have to use the sample() function, as shown below.

sample(total_no_of rows, 5)

To get the total number of rows, you have to use the nrow() function. So, here is the expression to get five random values from the value of the total number of rows:

> sample(nrow(a),5)
[1]  60  56 101  16  70

This gives a vector containing the index of the rows of the dataframe. So, we can use that to get the rows from the data frame.

> a[sample(nrow(a),5),]
    Ozone Solar.R Wind Temp Month Day
24     32      92 12.0   61     5  24
43     25     250  9.2   92     6  12
144    13     238 12.6   64     9  21
40     71     291 13.8   90     6   9
12     16     256  9.7   69     5  12

Answer 4: It will not produce error. Remember lists can contain items of different data types and structures. So this will be a two-element list with the elements being a 10-number array and a 20-number array.

Answer 5: It will not produce a syntax error, but it will not yield what you needed. The expression c(a=1:10,b=1:10)) will merely create a 20-element vector. Here is a demonstration:

> v1 <- c(a=1:10, b=1:10)
> v1
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
 1  2  3  4  5  6  7  8  9  10  1  2  3  4  5  6  7  8  9  10

So, the expression in the question is equivalent to this:

matrix(1 through 10 and 1 through 10 again)

It will be a single-column matrix, because rows are used first. To create a two-column matrix, I need to use this:

> v1 <- matrix(c(a=1:10, b=1:10), ncol=2)
> v1
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
[5,]    5    5
[6,]    6    6
[7,]    7    7
[8,]    8    8
[9,]    9    9
[10,]  10   10

And then I can name the columns using the following:

> colnames(v1) <- c("a","b")

Answer 6: The answer is simple: it's an array collection type. Matrix allows only two dimensions.

Answer 7: No; it will not. class(v1) will show integer, the data type of the elements of the vector. To know if it's a vector, you have to use is.vector() function.

> is.vector(v1)
[1] TRUE

Answer 8: The atomic data types are also vectors under the covers. So v1 is a vector of one element.

Answer 9: It will result in a syntax error. The correct syntax is this:

df1[c("row3"),c("col1","col2")]

Note the two columns are passed as a vector.

Answer 10: The factor has to be ordered. The default is unordered and, therefore, it is not possible to have a comparative relationship. This should have been defined as follows:

> f1 <- factor(c(1,2,3,4), ordered = T)
> min(f1)
[1] 1
Levels: 1 < 2 < 3 < 4
About the Author
Arup Nanda (arup@proligence.com) has been an Oracle DBA since 1993, handling all aspects of database administration, from performance tuning to security and disaster recovery. He was Oracle Magazine's DBA of the Year in 2003 and received an Oracle Excellence Award for Technologist of the Year in 2012.
Join the Database Community Conversation