Latest content
Explore and discover our latest tutorials
Tech Articles from your friends at Oracle and the developer community.
Part 1 | Part 2 | Part 3 of a series that presents an easier way to learn R by comparing and contrasting it to PL/SQL.
Welcome to the third installment of this series. In the previous two installments, you learned how to use variables of different types and various kinds of loops and decision logic.
In this installment, you will lean how to define various types of collections of data, not just atomic variables. Being a data-oriented language, collections are vital in R because most operations are done on collections, not on individual pieces of data.
At the end of this article, there is a summary of what you learned followed by a quiz you can use to test what you've learned.
There are five types of collections in R. In a nutshell, they are different representations of collections of data in R.
We will examine each of them and how they are used in different situations. As with the previous installments in this series, you will learn these compared to the equivalent in PL/SQL.
But first, let's explore the options available for collections in PL/SQL before exploring the same in R.
There are three basic kinds of collections in PL/SQL:
These are just a list of items of the same data type. The list is ordered and the elements can be addressed by their position in the array. There is no "index" to reference them by; the only index is their position. Therefore, this type of data can't be used for key-value pairs.
Let's see an example where we will store the days of the week to a variable.
-- pl1.sql
declare
type ty_tabtype is table of varchar2(30);
l_days_of_the_week ty_tabtype;
begin
l_days_of_the_week := ty_tabtype (
'Sun',
'Mon',
'Tue',
'Wed',
'Thu',
'Fri',
'Sat'
);
-- let's print the values
for d in l_days_of_the_week.FIRST .. l_days_of_the_week.LAST loop
dbms_output.put_line ('Day '||d||' = '||
l_days_of_the_week (d)
);
end loop;
end;
Here's the output:
Day 1 = Sun
Day 2 = Mon
Day 3 = Tue
Day 4 = Wed
Day 5 = Thu
Day 6 = Fri
Day 7 = Sat
Note that the elements are all of the same data type, which, in this case is varchar2(30)
. You can't mix any other data type in. But you can keep
adding values, as you need, to this list.
The second type of collections in PL/SQL is called varray. It's similar to nested tables, except that the maximum number of elements is fixed. The actual number of elements vary and depend on the elements added or subtracted at runtime. Here is an example:
-- p2.sql
declare
type ty_weekdays_list is varray(7) of varchar2(3);
v_week_days ty_weekdays_list;
v_count number;
begin
v_week_days := ty_weekdays_list('Sun','Mon','Tue','Wed','Thu','Fri','Sat');
v_count := v_week_days.count;
for i in 1..v_count loop
dbms_output.put_line(i||'='||v_week_days(i));
end loop;
end;
Note: The values are also of the same data type.
The third type of collection is an associative array, which is also called a PL/SQL table. An associative array is an arbitrary collection of keys and values. The important properties of associative arrays are
You have to reference the elements by an index, not by position. Therefore, associative arrays are also called unordered lists.
To demonstrate a use case, let's see an example where we need to hold some book titles and their
authors' names using a key-value pair structure. We build an associative array of varchar2(30)
values (the "value" part of key-value pair) indexed
by another varchar2
(the "key" part of the key-value pair). The
book title is the key and the author's name is the value. We will initially populate this array
with four books, as shown below:
Book | Author |
---|---|
Pride and Prejudice | Jane Austen |
1984 | George Orwell |
Anna Karenina | Leo Tolstoy |
Adventures of Tom Sawyer | Mark Twain |
In the programs below, we will populate the array and then select from the array.
--pl3.sql
declare
type ty_tabtype is table of varchar2(30)
index by varchar2(30);
l_books ty_tabtype;
i varchar2(30);
begin
l_books ('Pride and Prejudice') := 'Jane Austen';
l_books ('1984') := 'George Orwell';
l_books ('Anna Karenina') := 'Leo Tolstoy';
l_books ('Adventures of Tom Sawyer') := 'Mark Twain';
--
-- now let's display the books and their authors
--
i := l_books.first;
while i is not null loop
dbms_output.put_line('Author of '||i||' = '||
l_books (i));
i := l_books.next(i);
end loop;
end;
Here is the output:
Author of 1984 = George Orwell
Author of Adventures of Tom Sawyer = Mark Twain
Author of Anna Karenina = Leo Tolstoy
Author of Pride and Prejudice = Jane Austen
Now Let's see how these PL/SQL collections translate to R.
Features of many R collections overlap those provided by PL/SQL collections; so I am going to start with an explanation of R collections and show their equivalence to PL/SQL collections.
Take the first one: nested table in PL/SQL. In the PL/SQL example, we used the days of the week. This is a one-dimensional array. In R, this is called a vector. Actually, a vector is a lot more, but we will start with the basic explanation.
Here is how we define the same days of the week in R.
Note the use the function c()
, which creates a vector.
> dow <- c('Sun','Mon','Tue','Wed','Thu','Fri','Sat')
> dow
[1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
All the elements of the vector must be of the same data type. If you give multiple data types, R will convert all of them to the superset data type that can accommodate all the values. You can access the elements of a vector by their position.
> dow[1]
[1] "Sun"
Notice that the first position is indexed as "1," just as in PL/SQL; it is not indexed as
"0," which is the case in other programming languages such as Python or C. You don't
have to remember how many elements are present in the vector. You can use the length()
function.
> length(dow)
[1] 7
Note the c
function; that's what defines a vector. Let's explore
it with a simpler example of a vector of 10 numbers from 1 to 10.
> v1 <- c(1,2,3,4,5,6,7,9,10)
You can also define that vector using the range notation you learned in Part 2 of this series.
> v1 <- 1:10
This is similar to the FOR i IN 1..20 LOOP
construct in PL/SQL. If you
check the data type of the variable v1
, you will notice that it is
labeled "numeric"
:
> class(v1)
[1] "numeric"
That's because the elements of the vector are numeric. So, how do you know if a variable is a vector? Use a built-in function:
> is.vector(dow)
[1] TRUE
Vectors allow operations as a whole. Here is how you multiply all the elements of the vector by 2 at once:
> v2 <- v1 * 2
> v2
[1] 2 4 6 8 10 12 14 18 20
And here is how you subtract one vector from another:
> v2 - v1
[1] 1 2 3 4 5 6 7 9 10
You can check if a value is present in a vector, that is, whether it is one of the elements:
> v1 <- c(1,2,3,4,5,6,7,9,10)
> 4 %in% v1
[1] TRUE
This returned TRUE
because 4 is one of the elements of the vector variable
v1
. Likewise, you can check whether the elements of a vector can be found
on another vector.
Let's define another vector called v2
:
> v2 <- c(1,2)
> v2 %in% v1
[1] TRUE TRUE
It returned TRUE
for both values because both 1 and 2 (elements of vector
v2
) are found in vector v1
. By the way, the
positions of the vectors are not important.
Let's see another vector, v3
, with two elements (4 and 5) in the first
and second elements. The following checks for the presence of (4,5)
in
vector v1
.
> v3 <- c(4,5)
> v2 %in% v1
[1] TRUE TRUE
Let's extend the example by using another vector, (11,5)
, and checking
if the values are present in v1
.
> v3 <- c(11,5)
> v3 %in% v1
[1] FALSE TRUE
The first element of v3
, 11, is not present in v1
; so
the return value for that is FALSE
. The
second element, 5, was found in v1
; so the return value is TRUE
.
Vectors have elements of the same data type, not of different data types. What if you combine multiple data types in a vector? Let's see:
> v4 <- c(1,2,'Three')
> v4
[1] "1" "2" "Three"
Notice how R converted everything to character format, even if you input only numeric values for the first two elements. If you check the data types, as follows:
> class(v4)
[1] "character"
They will show as character
. If you combine logical values, they will be
converted to numbers.
> v5 <- c(1,2,TRUE,FALSE)
> v5
[1] 1 2 1 0
Note how TRUE
and FALSE
have been converted to
1
and 0
, respectively. If you introduce a
character, all the items will be converted to characters.
> v5 <- c(1,2,'Three',TRUE,FALSE)
> v5
[1] "1" "2" "Three" "TRUE" "FALSE"
Note how the logical values TRUE
and FALSE
were
merely converted to their character representations—not to 1
and
0
, as was done earlier.
There are special operations on the elements of vector where all the elements are of logical values. Recall from Part 1 that "|" is the logical OR operator and "&" is the logical AND operator.
> v1 <- c(TRUE,TRUE,FALSE,FALSE)
> v2 <- c(TRUE,FALSE,TRUE,FALSE)
> v1
[1] TRUE TRUE FALSE FALSE
> v2
[1] TRUE FALSE TRUE FALSE
> v1 & v2
[1] TRUE FALSE FALSE FALSE
> v1 | v2
[1] TRUE TRUE TRUE FALSE
There are two more operators, "&&" and "||," which are similar to "&" and "|," respectively; but they operate on only the first element of the vector. Therefore, they return only one value:
> v1 && v2
[1] TRUE
This returns TRUE
because the first elements of both vectors v1
and v2
are TRUE
,
and TRUE & TRUE
equals TRUE
.
> v1 = c(1,2,3)
> v2 = c(10,20,30)
> v1 * v2
[1] 10 40 90
> v2 - v1
[1] 9 18 27
> c(v1,v2)
[1] 1 2 3 10 20 30
You can't operate on vectors of different lengths together.
> v3 = c(1,2)
> v3 * v1
[1] 1 4 3
Warning message:
In v3 * v1 :
longer object length is not a multiple of shorter object length
You access the elements of the vector using square brackets.
> v1 [1]
[1] 1
The negative index has a different connotation in R than what you might be used to. It means excluding the values in that place. For instance, -1 means remove the first element and return the rest. It does not mean starting from the end, as seen in some languages such as Python.
> v1 [-1]
[1] 2 3
If an index is not valid, R returns NA
, which is the equivalent of NULL in
PL/SQL.
> v1 [9]
[1] NA
If you want to access a contiguous set of values from a vector, use the m:n
notation:
> v2 [2:3]
[1] 20 30
What if you want to access a discrete set of values—not a contiguous set of values—from a vector, for example, the first and third values? You would think you'd use something like this, wouldn't you?
> v1[1,3]
Unfortunately, that will throw an error:
Error in v1[1, 3] : incorrect number of dimensions
What happened? The m,n
notation is used for something else you will learn
later. To access the discrete elements, you will need to pass the indexes as a vector:
> v2[c(1,3)]
[1] 10 30
Interestingly, indexes can be logical boolean values as well.
> v2
[1] 10 20 30
> v3 = c(T,F,T)
> v2[v3]
[1] 10 30
In this case, the first element of v3
is T
,
that is, TRUE
; so it gets the first element of v2
. The
second element of v3
is F
, or FALSE
; so the second element of v2
(which is 20
) is not displayed. The third
element of v3
is T
; so the third element
(30
) is displayed.
Remember associative arrays in PL/SQL, which allow you to assign names to variables, such as a key-value pair? The example we used was book titles and author names. Life is so much simpler if you can look up the element by its key instead of trying to figure out its position and access it via the positional index, doesn't it?
Vectors can do that as well. You do that by a labeling the elements, or, in R terms, naming the elements. Let's see an example of sales figures for a company in different quarters.
> sales <- c(100,200,150,75)
This doesn't tell us the whole story. Does it start with Quarter 1? Does 100 correspond to Quarter 1 or Quarter 4? We can make this clear by naming the elements:
> quarters <- c('Quarter 1','Quarter 2','Quarter 3','Quarter 4')
> names(sales) <- quarters
Now, if we display the sales
variable, we get the following:
> sales
Quarter 1 Quarter 2 Quarter 3 Quarter 4
100 200 150 75
Note how the values have labels now, making the meaning clearer. Another way to name the elements is to assign the names while creating the vector itself.
> sales <- c('Quarter 1'=100,'Quarter 2'=200,'Quarter 3'=150,'Quarter 4'=75)
> sales
Quarter 1 Quarter 2 Quarter 3 Quarter 4
100 200 150 75
When elements are labeled, you can use the labels to access the elements, not the positional index, exactly how a key-value pair would be accessed:
> sales['Quarter 3']
Quarter 3
150
But when you write a program, you just need the value; not the label itself, which can be source of error. To suppress it, you use the double square brackets syntax.
> sales[['Quarter 3']]
[1] 150
Vectors are the most used data type in R. Behind the scenes, all the data types are just vectors of some sort. You can compare the vectors to an atomic data type.
Vectors provide a way to create a collection of values, which are important in our data manipulation. Sometimes these values are merely identifiers. Take for instance the following vector:
> deptname <- c("Marketing and Public Relations", "Finance and Treasury", "Engineering and Technology", "Operations and Process Control")
This is a pretty long variable. However these elements are merely descriptors. There is no significance of the names relative to one another, and you will not operate on the names; but the names will have to be used by R in their entirety. So mere pointers to these values will be sufficient.
There is also another property of this type of variables. Consider a vector of department numbers:
c(10,20,30,40,50)
These are just numbers representing the department. There is nothing more to it. For instance, you
can't say that department 30 is bigger than department 20, nor can you say that there is an
average of the department numbers. These numbers are merely descriptive. If R knows about this
property, it's better for memory management and processing, because these will not just be usual
numbers; instead they will be specific values. In R, these are called factors. You can
create a factor using the factor()
function on any vector.
Here is how you can create a factor of department numbers.
> dept <- factor(c(10,20,30,40,50))
If you check the data type of this variable, you can see it's a factor.
> class(dept)
[1] "factor"
This is what you see if you examine the contents:
> dept
[1] 10 20 30 40 50
Levels: 10 20 30 40 50
Note the contents. They are called levels of the factor. The levels are individual distinct values in the factor. So, if you pass the same value multiple times, it will be in the factor but it will not be listed multiple times in levels. Here is a factor with multiple occurrences of the same value.
> dept <- factor(c(10,10,10,20,20,30,40,50,50))
> dept
[1] 10 10 10 20 20 30 40 50 50
Levels: 10 20 30 40 50
You can also pull up the levels of a factor explicitly by using the level()
function:
> levels(dept)
[1] "10" "20" "30" "40" "50"
Numbers like those in this list probably don't make much sense. So you need to assign descriptive
levels. The levels()
function accomplishes that.
> levels(dept) <- c("Marketing","Sales","Finance","Operations","IT")
> dept
[1] Marketing Sales Finance Operations IT IT
Levels: Marketing Sales Finance Operations IT
See how the factor levels are different now? You can access the elements of factor the same way as you access matrices.
> dept[2]
[1] Sales
Levels: Marketing Sales Finance Operations IT
In this example, the levels are merely descriptive; there is no comparative relationship among them. For instance, the Marketing department is not greater than or smaller than, say, the Finance department. If you force such a comparison, you will get an error:
> dept[2] > dept [1]
[1] NA
Warning message:
In Ops.factor(dept[2], dept[1]) : '>' not meaningful for factors
But sometimes there may be a comparative relationship. Take for instance a factor containing all the
titles in a company. The title "president" is higher than "vice president,"
which is higher than "director," and so on. This is called an ordered factor. To
indicate the factor is ordered, you have to include the ordered
parameter
and set it to T
when creating the factor, as shown below. Here, we also
added the labels.
> title <- factor(c('associate'=1,'manager'=2,'director'=3,'vp'=4,'president'=5), ordered = T)
> title
associate manager director vp president
1 2 3 4 5
Levels: 1 < 2 < 3 < 4 < 5
Note how the Levels
attribute is shown now. 1 is less than 2, which is
less than 3, and so on. In this case, you can compare factors. Here is a check:
> title["vp"] > title['manager']
[1] TRUE
> title["vp"] > title['president']
[1] FALSE
In large datasets, the factors will be quite large, which makes visual comparison difficult. This approach of declaring ordered factors comes in handy at that time.
While you can see the obvious advantage of vectors, there is an important limitation: all the
elements must be of the same data type, which can be pretty huge based on the use case. So,
here comes a similar but different collection—list—which is like a vector but
the elements can be of any data type. You define a list using the function list()
.
v1 <- list(1,'a',T)
Just to make sure you have created it with elements of different data types, display the v1
variable:
> v1
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
Does the output look familiar? If you notice, the double square bracket notation, [[]]
, came from the vector representation. The elements are all of a
different type, as you can see. But under the covers, it's just a vector of multiple vectors.
You can confirm that by checking for list and vector:
> is.list(v1)
[1] TRUE
> is.vector(v1)
[1] TRUE
The above can be written as follows:
> v1 <- c(c(1),c('a'),c(T))
> v1
[1] "1" "a" "TRUE"
You can create two variables and compare them:
> v1 <- list(1,'a',T)
> v2 <- c(c(1),c('a'),c(T))
> v1 == v2
[1] TRUE TRUE TRUE
They are the same. Because list is a vector of vectors, you can also define sort of a
multidimensional representation of data. I call is "sort of" just because there are better
ways to handle multiple dimensions. Here are three vectors, of numbers, characters, and logical data
types, named n1
, c1
, and l1
respectively.
n1 <- c(1,2,3,4,5)
c1 <- c("First","Second","Third")
l1 <- c(T,F,T,T,F)
We can create a list from all these, as shown below:
list1 <- list(n1,c1,l1)
If you display the list, you will see its contents:
> list1
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] "First" "Second" "Third"
[[3]]
[1] TRUE FALSE TRUE TRUE FALSE
If you want to address the first element of the list, you will need to access it by its position:
> list1[1]
[[1]]
[1] 1 2 3 4 5
This will return a vector, as expected. If you want to access the first element of this vector, you will need to access it by index.
> list1[[1]][1]
[1] 1
But positional indexes can be difficult to use. It can be made easier. Remember the naming of the
elements in vectors? The same can be done in lists too, using the same function names()
and passing a vector of values.
Suppose you want to name them "Numbers," "Spelled," and "Booleans," respectively. You can use the following:
names(list1) <- c("Numbers", "Spelled", "Booleans")
Now, if you display the variable, you will see different headers for the elements.
> list1
$Numbers
[1] 1 2 3 4 5
$Spelled
[1] "First" "Second" "Third"
$Booleans
[1] TRUE FALSE TRUE TRUE FALSE
This helps in accessing individual elements of the list.
> list1 [1]
$Numbers
[1] 1 2 3 4 5
To access a specific element of the list, you can use the label as well.
> list1 ["Numbers"]
$Numbers
[1] 1 2 3 4 5
If you ever want to change lists to vectors, simply use the unlist()
function:
> unlist(list1)
Numbers1 Numbers2 Numbers3 Numbers4 Numbers5 Spelled1 Spelled2 Spelled3
"1" "2" "3" "4" "5" "First" "Second" "Third"
Booleans1 Booleans2 Booleans3 Booleans4 Booleans5
"TRUE" "FALSE" "TRUE" "TRUE" "FALSE"
Now each element can be addressed as in a vector. For example, list1[13]
references
the thirteenth element. Sometimes we might need to do
this "flattening" of the output, as you will learn later in the series.
A matrix is a two-dimensional representation of data elements; but of the same data type. Remember,
it has to be exactly two dimensions—rows and columns—not three dimensions. You create a
matrix using the matrix()
function.
In the following example, we create a matrix with 20 elements (1:20
, if
you remember from Part 2, creates a sequence of 20 numbers from 1 to 20).
> m1 <- matrix(1:20)
> m1
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
[7,] 7
[8,] 8
[9,] 9
[10,] 10
[11,] 11
[12,] 12
[13,] 13
[14,] 14
[15,] 15
[16,] 16
[17,] 17
[18,] 18
[19,] 19
[20,] 20
Note how it created a single column matrix, with 20 rows. If you want rows and columns, you need to
specify that. The nrow
parameter specifies the number of rows of the
resultant matrix.
> m1 <- matrix(1:20, nrow=4)
> m1
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
Note there are four columns because you mentioned that while creating the matrix. R distributed the values and ended up with five columns.
What if you had wanted a fixed number of columns instead? You can request that with the ncol
parameter.
> m1 <- matrix(1:20, ncol=4)
> m1
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
You can specify both ncol
and nrow
. If the
values are more than required, R will discard them. Note how in the following example, only 16 of
your initial 20 values made it to the matrix of four rows and four columns.
> m1 <- matrix(1:20, nrow=4, ncol=4)
> m1
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
By the way, matrices can be generated with discrete values as well, not just sequential values. Here is an example.
> m1 <- matrix(c('a','e','i','o','u','1'), nrow=3)
> m1
[,1] [,2]
[1,] "a" "o"
[2,] "e" "u"
[3,] "i" "1"
Note how the data has been distributed first down and then across. What if that's not what you
want? You can tell the matrix to be across rows instead of columns. Another optional
parameter, byrow
, when set to TRUE
,
accomplishes that.
> m1 <- matrix(c('a','e','i','o','u','1'), nrow=3, byrow=T)
> m1
[,1] [,2]
[1,] "a" "e"
[2,] "i" "o"
[3,] "u" "1"
Now that you know how a matrix is defined, you might notice some annoying things about how it is
shown. The first thing you might notice is the labeling. The rows and columns do not have headers.
Instead they have just a default header with some commas and numbers, which doesn't add a lot of
value. To add labels, you can use another parameter called dimnames
. The
parameter takes a vector containing the row labels and another containing the column labels.
> m1 <- matrix(c(1,2,3,4,5,6,7,8,9,10), nrow=2, byrow = T, dimnames = list(c("Row1","Row2"),c("Col1","Col2","Col3","Col4","Col5")))
> m1
Col1 Col2 Col3 Col4 Col5
Row1 1 2 3 4 5
Row2 6 7 8 9 10
It's not just at the declaration time that you have the opportunity to mention the labels. If you already have a matrix, you can add labels later. Or, if you want to rename the labels, you can do that using the dimnames() function, as shown below:
> dimnames(m1) <- list(c("R1","R2"))
> m1
[,1] [,2] [,3] [,4] [,5]
R1 1 2 3 4 5
R2 6 7 8 9 10
Another way to change the column name is with the colnames()
function:
colnames(m1) <- c("Col1","Col2","Col3","Col4","Col5")
Now if you display m1
, you'll see this:
> m1
Col1 Col2 Col3 Col4 Col5
R1 1 2 3 4 5
R2 6 7 8 9 10
Similarly, another way to change the row names is with the rownames()
function. Let's use that to change the names of the rows back to Row1, Row2, and so on.
rownames(m1) <- c("Row1", "Row2")
Well, that's a lot about naming the rows and columns of the matrix. But a matrix is useless without some way to access the data. It's super easy. Just use the x and y coordinates. Always remember: rows and then columns, not other way around. For instance, to select the element in the first row and second column, just use this:
> m1[1,2]
[1] 2
If you want to select multiple columns, just use the range notation. For instance, to select columns
2 until 3, use 2:3
, as shown below:
> m1[1,2:3]
Col2 Col3
2 3
If you want to select multiple but discrete columns, not a range, just use the column numbers as a
vector. For instance, to select columns 2, 3, and 4, use c(2,3,4)
, as
shown below:
> m1[1,c(2,3,4)]
Col2 Col3 Col4
2 3 4
If you want to select all the columns of row 1, you can of course use c(1,2,3,4,5)
; but what if you don't know the number of columns? No
worries. you can just omit the column reference:
Note there is a comma and nothing after the comma— m1[1,]
, not m1[1]
—such as you would do with an implied second parameter. If you
don't have a value as a parameter, R assumes all the columns. If you don't place a comma, R
picks up only one element starting from the left top-most going downward, for instance:
> m1[2]
[1] 6
So, be careful about the placement of values and the comma. In PL/SQL, if you have a comma, it means you are supplying the next parameter. It can be null, but it must be supplied. If you want the next parameter to be the default, you simply don't put the comma. If you put the comma and then don't mention the value of the next parameter, you will get a syntax error. In R, you will not get a syntax error and the behavior will be very different. Likewise, you can omit the first parameter as well, which then defaults to all the rows of the column.
For instance, the following will bring up all the rows of column 1.
> m1[,1]
Row1 Row2
1 6
> m1[,2]
Row1 Row2
2 7
But, note that the output comes as rows, even though it's from a column. At least the labels are
correct: Row1
, Row2
, and so on. If you select
multiple columns, for example, columns 1 until 3, you will see the output as expected.
> m1[,1:3]
Col1 Col2 Col3
Row1 1 2 3
Row2 6 7 8
But, thinking like database professionals, we probably won't like to address them as numbers.
Because we have the rows and columns named, can't we address them using the names, as you would
do in a table? The answer is, of course you can. Here is how we access row Row1
and column Col2
:
> m1["Row1","Col2"]
[1] 2
Another way to create a matrix is to use the rbind()
function, which adds
a row to the matrix.
> row1 <- c(11,12,13,14)
> row2 <- c(21,22,23,24)
> row3 <- c(31,32,33,34)
> m1 <- rbind(row1,row2,row3)
> m1
[,1] [,2] [,3] [,4]
row1 11 12 13 14
row2 21 22 23 24
row3 31 32 33 34
> m2 <- cbind(row1,row2,row3)
> m2
row1 row2 row3
[1,] 11 21 31
[2,] 12 22 32
[3,] 13 23 33
[4,] 14 24 34
What if you use less number of values in rbind()
? Let's see:
> row4 <- c(41,42)
> m1 <- rbind(row1,row2,row3, row4)
> m1
[,1] [,2] [,3] [,4]
row1 11 12 13 14
row2 21 22 23 24
row3 31 32 33 34
row4 41 42 41 42
Row4
didn't have enough columns. It had only two, but the total
columns needed (four) is a multiple of two. Therefore, the available values simply got repeated.
However, if you give a nondivisible value as the number of elements, you will get an error:
> row4 <- c(41,42,43)
> m1 <- rbind(row1,row2,row3, row4)
Warning message:
In rbind(row1, row2, row3, row4) :
number of columns of result is not a multiple of vector length (arg 4)
But why would you even want to create values as matrices? Ease of operations, of course. R allows you to operate on matrices as a whole. Here is an example.
Create a matrix of 20 numbers: four rows and five columns:
> m1 <- matrix(1:20, nrow=5)
> m2 <- matrix(1:20, nrow=5)
> m2
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
If you want to multiply 2 by every element of this matrix, just multiply.
> m2 * 2
[,1] [,2] [,3] [,4]
[1,] 2 12 22 32
[2,] 4 14 24 34
[3,] 6 16 26 36
[4,] 8 18 28 38
[5,] 10 20 30 40
In PL/SQL and other languages such as C or Java, you would have had to write a loop visiting all the elements and multiplying. Not in R. You just add, subtract, multiply, raise to the power of, and so on, just as with any regular unary operation. You can even multiply a matrix by another matrix:
> m1
Col1 Col2 Col3 Col4
Row1 11 12 13 14
Row2 21 22 23 24
Row3 31 32 33 34
Row4 41 42 41 42
> m2
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
> m1 * m2
Col1 Col2 Col3 Col4
Row1 11 60 117 182
Row2 42 132 230 336
Row3 93 224 363 510
Row4 164 336 492 672
Let's stop the discussion on matrices for now and discuss the next important collection in R: arrays.
Matrices can accomplish most of the data analysis tasks. Most of the data you will get is in two dimensions. But what if your dataset comes in more than two dimensions? An array comes to rescue here. It is a multidimensional matrix.
Here is how you can define a one-dimensional array.
> v1 <- array(data=1:30)
> v1
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
As you can see, it looks like a vector; it's not of much value. An optional parameter creates a
multidimensional array. This parameter—dim
—allows you to
define multidimensional arrays:
> v2 <- array(1:30, dim=c(5,5))
> v2
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
But this also looks like another familiar structure: the matrix. So it's not of much value. Let's define a third dimension:
> v3 <- array(1:30, dim=c(5,5,5))
> v3
, , 1
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
, , 2
[,1] [,2] [,3] [,4] [,5]
[1,] 26 1 6 11 16
[2,] 27 2 7 12 17
[3,] 28 3 8 13 18
[4,] 29 4 9 14 19
[5,] 30 5 10 15 20
, , 3
... output truncated ...
As you can see, it created a three-dimensional data structure. We provided only 30 values (as shown
in the sequence 1:30
), not enough for the 125 values needed for the array
(5 x 5 x 5 = 125); so R repeated the values as much as needed to fill the array.
How do we access the elements of the array? The exact way we addressed them in matrix: variable[rownumber, column number, third dimension]
. You start with the
row, then the column, and finally all the other dimensions defined in the array.
Here's how to refer to the earlier output:
> v3[2,2,1]
[1] 7
Note that the third dimension is printed in the output first, as headings are when you display all the values of a list variable. There is a matrix for each of the third dimensions. Like the matrix, you can omit the exact position of a dimension and you will get all the values for that dimension, for instance,
> v3[1,1,]
[1] 1 26 21 16 11
Here we omitted the third dimension; so we get the first row and first column of all the matrices for
all values of the third dimensions, that is, v3[1,1,1]
, v3[1,1,2]
, v3[1,1,4]
, and so on. Just as with a
matrix, you can add and multiply values directly to the array.
For example, this
v3 * 10
will multiply 10 by all the elements of the array.
You can multiply arrays by other arrays as well. If you want a sum of all the first row and first column from all the third dimensions, you can use this:
> sum(v3[1,1,])
[1] 75
Just as with a matrix, you can name the dimensions of the array as well.
colnames(v3) = c("Col1","Col2","Col3","Col4","Col5")
rownames(v3) = c("Row1","Row2","Row3","Row4","Row5")
The following is a special naming for the third dimension, which you didn't see with matrix:
dimnames(v3)[[3]] <- c("Third1","Third2","Third3","Third4","Third5")
Now if you print v3
, you will see the names:
> v3
, , Third1
Col1 Col2 Col3 Col4 Col5
Row1 1 6 11 16 21
Row2 2 7 12 17 22
Row3 3 8 13 18 23
Row4 4 9 14 19 24
Row5 5 10 15 20 25
, , Third2
Col1 Col2 Col3 Col4 Col5
Row1 26 1 6 11 16
Row2 27 2 7 12 17
Row3 28 3 8 13 18
Row4 29 4 9 14 19
Row5 30 5 10 15 20
, , Third3
... output truncated ...
And, just as with a matrix, you will be able to address the elements by their dimension names instead of the positions.
> v3["Row1","Col2","Third3"]
[1] 26
So, as you can see, arrays are just like matrices, but with multiple dimensions that you can name and address elements by.
Being a database professional, you probably already know that data comes in all types, not just in
one type. For instance, take employee data. You probably have employee ID (EmpID
), which is a number; the name (Name
),
which is characters, and another column called AtHQ
, which is a logical
value TRUE
or FALSE
that shows whether the
employee is located at the company headquarters. This is what you typically see in database tables,
spreadsheets, and even in text files.
EmpID |
Name |
AtHQ |
---|---|---|
1 | John Smith | T |
2 | Jane Doe | F |
In PL/SQL, you would have to define a record type and create a collection on that record. In Oracle Database, you would have to create a table with those columns and specified data types. In R, the collection is called a data frame.
A data frame is just like a matrix, but with a very important difference: the elements can contain different data types. Recall that all the elements in a matrix must be of the same data type. Because a data frame removes that limitation, it can be used in many data analysis cases.
Let's create a data frame for the above information. You create a data frame via the data.frame()
function.
> df1 <- data.frame(1,"John Smith",T)
If you show the data frame, you can see that the column labels are not exactly what you intended.
> df1
X1 X.John.Smith. T
1 1 John Smith TRUE
R simply puts whatever it feels is right. Let's put appropriate labels for the columns.
> colnames(df1) <- c("EmpId","Name","AtHQ")
> df1
EmpId Name AtHQ
1 1 John Smith TRUE
When you want to add rows to the data frame, just create a new data frame and add it to the other one
using the rbind()
function.
df2 <- data.frame(2,"Jane Doe",F)
colnames(df2) <- c("EmpId","Name","AtHQ")
df1 <- rbind(df1,df2)
Now if you check df1
, you'll see this:
> df1
EmpId Name AtHQ
1 1 John Smith TRUE
2 2 Jane Doe FALSE
Assigning row and column names makes accessing the elements of the data frame immensely easy. To show
all the values of column EmpID
, here is what you need to write:
> df1["EmpId"]
EmpId
1 1
2 2
But in a data frame, you have another method of writing the column values. It's the name of the data frame followed by the column name separated by a "$" sign. Here is the example:
> df1$EmpId
[1] 1 2
Note that while this also gets the employee IDs, the values come back as a vector. In the previous approach, the value returned was another data frame. This might not make much difference in your work, but it might. So, keep that mind.
Finally, you can access the rows of the data frame using the "[]" notation.
> df1[1,]
EmpId Name AtHQ
1 1 John Smith TRUE
Here, we retrieved all the columns of row 1. If you want only a selected column, for example, EmpId
, you can give it as an index:
> df1[1,"EmpId"]
[1] 1
What about row names? Yes, you can name rows in a data frame, using the same rownames()
function you saw with matrix:
> rownames(df1) <- c("Employee1","Employee2")
> df1
EmpId Name AtHQ
Employee1 1 John Smith TRUE
Employee2 2 Jane Doe FALSE
However, naming the rows doesn't help because rows are merely observations (which are analogous to records in a table) of multiple variables (which are columns). It makes sense to name these variables; but naming rows might not make sense. When you add a row, you have to remember to add the row name for that as well.
Working with an Actual Data FrameIn the previous text, I just wanted to give you a flair of the various collections. Let's see some operations on an actual dataset, not made up values, as we saw before.
Fortunately for us, R comes with many built-in datasets. To find these datasets, just use the data()
function at the R prompt:
> data()
This will bring up a different window with the datasets. Here is a small excerpt from that window:
Data sets in package 'datasets':
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
... output truncated ...
We will use one dataset, airquality
, which shows the daily air quality
measurements in New York from May to September 1973. How do I know that? I didn't have to guess.
Just use help()
function to see what it is about. Enter help(airquality)
at the R command prompt and you will see a browser window
pop up to explain what the dataset is. In that help window, I see that it's a data frame with
the following properties:
A data frame with 154 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1--12)
[,6] Day numeric Day of month (1--31)
Let's see some of the example data:
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
Let's see how we can use this data frame for some analysis. First, I am tired of typing "airquality" every time. So I will create a nice little variable from the dataset.
> a <- airquality
From now on, I will just use "a" for the data. Recall from the first article in this series
that you can use the str()
function to find the structure of a data
frame.
> str(a)
'data.frame': 153 obs. of 6 variables:
$ Ozone : num 41 36 12 18 25 28 23 19 8 25 ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
The str()
function is analogous to the describe
command in SQL*Plus to describe the structure of a table. Consider the above output carefully. It
shows there are 153 "observations." There are six variables, analogous to columns in a
table. So this is similar to a table with six columns and 153 rows. The subsequent parts of the
output shows the columns, their data type, and some of the sample.
For my first analysis, I just want to look at the quick summaries of all the available attributes:
> summary(a)
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. : 168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
This is a very good starting point that tells me some of the important details of the dataset: the
minimum, the maximum, the median, the mean, the first quantile, and the third quantile for each
attribute (or column). It also shows how many values are not available, shown as NA
, which is equivalent of NULL in Oracle Database.
From the output, we know that 37 records have NA
in the Ozone
column and seven records have NA
in the
Solar.R
column. The other columns don't have NA
counts, which means they don't have NA
values. Interestingly, it does not show a key factor in data analysis—standard
deviation—but you can get it easily if you want, as you will see later.
Let's see some questions that might be asked and how we get the answers. We need to find out the
daily temperatures in the month of June. You can reference the cells of the data frame using the
absolute values (row m
and column n
), or you
can use the logical values for the cell. To get the logical value, you should use the comparison
operator. Because you are looking for the month of June, that is, 6
under
the Month
column, use Month==6
as the index.
And because you want just the Temp
column, you should use the $Temp
notation.
> a$Temp[a$Month==6]
[1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76
[21] 77 76 76 76 75 78 73 80 77 83
How about we get a bit specific, for example, June 30th? The same principle applies. You just need to extend the comparison operation to be a bit more restrictive.
> a$Temp[a$Month==6 & a$Day==30]
[1] 83
But mostly you will be interested in some kind of statistical inference on a group of data, for
example, the average (or mean) temperature in June, not just a single value. The mean()
function accomplishes that.
> mean(a$Temp[a$Month==6])
[1] 79.1
Or, the maximum temperature in June:
> max(a$Temp[a$Month==6])
[1] 93
Remember the summary
function that gives you all quick stats at a glance?
Won't you want to have the same here? We would love to slice the data for June and apply the
summary. Let's create a new variable, called june
, to hold data for
June.
> june <- a$Temp[a$Month==6]
> june
[1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76 77 76 76 76 75
[26] 78 73 80 77 83
But what exactly is june
?
> class(june)
[1] "integer"
It says it's an integer, but we wanted a data frame. Check if it is a data frame:
> is.data.frame(june)
[1] FALSE
So it's not a data frame. Well, what is it? Could it be a vector?
> is.vector(june)
[1] TRUE
It is a vector. But we want a data frame, so we will convert it to one.
> june <- data.frame(june)
> class(june)
[1] "data.frame"
Now that we have our data frame, we can apply the summary()
function.
> summary(june)
june
Min. :65.00
1st Qu.:76.00
Median :78.00
Mean :79.10
3rd Qu.:82.75
Max. :93.00
Let's get the standard devision of this dataset using the sd()
function.
> sd(june$june)
[1] 6.598589
What if you want to get all the data, not just Temp
, of the data frame for
the month of June? In SQL, you would write something like this:
select *
from df1
where Month = 6;
In R, it's also trivial. There are two ways to do it. I will start with the first—and my
favorite—method, because it's just plain easy to read. It's by using the subset()
function.
> june<-subset(a, a$Month==6)
> june
Ozone Solar.R Wind Temp Month Day
32 NA 286 8.6 78 6 1
33 NA 287 9.7 74 6 2
34 NA 242 16.1 67 6 3
35 NA 186 9.2 84 6 4
36 NA 220 8.6 85 6 5
37 NA 264 14.3 79 6 6
38 29 127 9.7 82 6 7
39 NA 273 6.9 87 6 8
40 71 291 13.8 90 6 9
41 39 323 11.5 87 6 10
42 NA 259 10.9 93 6 11
43 NA 250 9.2 92 6 12
44 23 148 8.0 82 6 13
45 NA 332 13.8 80 6 14
46 NA 322 11.5 79 6 15
47 21 191 14.9 77 6 16
48 37 284 20.7 72 6 17
49 20 37 9.2 65 6 18
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
56 NA 135 8.0 75 6 25
57 NA 127 8.0 78 6 26
58 NA 47 10.3 73 6 27
59 NA 98 11.5 80 6 28
60 NA 31 14.9 77 6 29
61 NA 138 8.0 83 6 30
The other approach is this:
> a[a$Month==6,]
Note the expression carefully; there is a comma after the conditional expression. Why is that? Recall
from the previous discussion that to address an element you have to use the [RowNumber, ColumnNumber]
notation. RowNumber
is specified by a$Month==6
, but for
columns, we don't want to restrict anything. We want to select all the columns; therefore, we
left it as null.
Now you can see why I like the subset()
approach more. It's several
times more readable. However, subset()
doesn't work when you want to
assign values based on conditions. For instance, suppose you want to assign the value 25 to Ozone
when the value is NA
. subset
won't work; you have to leverage the subscript approach.
> a[is.na(a$Ozone),]
Ozone Solar.R Wind Temp Month Day
5 NA NA 14.3 56 5 5
10 NA 194 8.6 69 5 10
25 NA 66 16.6 57 5 25
26 NA 266 14.9 58 5 26
27 NA NA 8.0 57 5 27
32 NA 286 8.6 78 6 1
33 NA 287 9.7 74 6 2
34 NA 242 16.1 67 6 3
35 NA 186 9.2 84 6 4
36 NA 220 8.6 85 6 5
37 NA 264 14.3 79 6 6
39 NA 273 6.9 87 6 8
42 NA 259 10.9 93 6 11
43 NA 250 9.2 92 6 12
45 NA 332 13.8 80 6 14
46 NA 322 11.5 79 6 15
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
56 NA 135 8.0 75 6 25
57 NA 127 8.0 78 6 26
58 NA 47 10.3 73 6 27
59 NA 98 11.5 80 6 28
60 NA 31 14.9 77 6 29
61 NA 138 8.0 83 6 30
65 NA 101 10.9 84 7 4
72 NA 139 8.6 82 7 11
75 NA 291 14.9 91 7 14
83 NA 258 9.7 81 7 22
84 NA 295 11.5 82 7 23
102 NA 222 8.6 92 8 10
103 NA 137 11.5 86 8 11
107 NA 64 11.5 79 8 15
115 NA 255 12.6 75 8 23
119 NA 153 5.7 88 8 27
150 NA 145 13.2 77 9 27
Let's assign a value of 25 to all the cells where Ozone
is NA
.
> a[is.na(a$Ozone),]$Ozone <- 25
Let's test only one row: row 5, which was NA
earlier:
> a[5,]
Ozone Solar.R Wind Temp Month Day
5 25 NA 14.3 56 5 5
Now it's no longer NA
; it's 25
.
Let's confirm there are no NAs in the Ozone
column.
> a[is.na(a$Ozone),]
[1] Ozone Solar.R Wind Temp Month Day
<0 rows> (or 0-length row.names)
What if you want to select only a specific column from the data frame, instead of all the columns, based on some condition? For instance, you want to select temperatures for the month of June.
In SQL, it would be something like this:
select temp
from dataframe
where Month = 6;
In R, you can build it upon the previous expression:
> subset(a, a$Month==6)$Temp
[1] 78 74 67 84 85 79 82 87 90 87 93 92 82 80 79 77 72 65 73 76 77 76 76 76 75 78 73 80 77 83
Note how we used the previous expression of subset
, which yields a data
frame containing all the columns from "a"—but only where Month
matches 6
—and we added $Temp
to it to select
only that column. As a further extension, if you want
to get the minimum temperature in June, in SQL, you would write this:
select min(temp)
from dataframe1
where month = 6;
In R, you'd do this:
> min(subset(a, a$Month==6)$Temp)
[1] 65
By the way, the condition clause can also contain multiple conditions. For instance, if you want to select temperature on June 14, in SQL you would write something like this:
select Temp
from dataframe1
where Month = 6
and Day = 14;
In R, you would use the "&" character in place of AND
:
> subset(a, a$Month==6 & a$Day==14)$Temp
[1] 80
There is an even simpler way for SQL-savvy developers. A package called sqldf
makes the task as easy as writing SQL code.
> install.packages("sqldf")
Then load the package:
> library(sqldf)
Using the library, we can rewrite our code as follows:
> sqldf('select Temp from a where Month=6 and Day = 14')
Temp
1 80
Well, that makes is super easy for SQL-literate readers. It returns a data frame, even if it's just one value. But as you saw earlier, you can compare that with any atomic variable. Here is an example:
> v1 <- sqldf('select Temp from a where Month=6 and Day = 14')> if (v1 > 75) { print ("This was a hot day day")}
[1] "This was a hot day day"
Let's go back to the subset approach. What if you wanted to select a few columns instead of all the columns? For instance, because you are selecting for the month of June only, you don't want to select the month number again. In addition, you might want to have the day in the first column. In SQL, you would write this:
select day, ozone, solar.r, temp
from dataframe
where month=6;
In R, you can use the select
parameter of the substr()
function.
> subset(a, a$Month==6, select=c(Day,Ozone, Solar.R, Temp))
Day Ozone Solar.R Temp
32 1 25 286 78
33 2 25 287 74
34 3 25 242 67
35 4 25 186 84
36 5 25 220 85
37 6 25 264 79
38 7 29 127 82
39 8 25 273 87
40 9 71 291 90
41 10 39 323 87
42 11 25 259 93
43 12 25 250 92
44 13 23 148 82
45 14 25 332 80
46 15 25 322 79
47 16 21 191 77
48 17 37 284 72
49 18 20 37 65
50 19 12 120 73
51 20 13 137 76
52 21 25 150 77
53 22 25 59 76
54 23 25 91 76
55 24 25 250 76
56 25 25 135 75
57 26 25 127 78
58 27 25 47 73
59 28 25 98 80
60 29 25 31 77
61 30 25 138 83
Note that we had to pass the list columns as a vector (recall that the c
function is used to create a vector). By the way, we just retrieved all the columns except one:
Month
. So instead of writing a long string of column names, we can simply
write all columns except Month
. There is no equivalent in SQL. In R, you
can write this:
> subset(a, a$Month==6, select=-Month)
Ozone Solar.R Wind Temp Day
32 25 286 8.6 78 1
33 25 287 9.7 74 2
34 25 242 16.1 67 3
35 25 186 9.2 84 4
36 25 220 8.6 85 5
... output truncated...
With the "[]" notation, you can achieve the same objective by writing this:
> a[a$Month==6,c("Day","Ozone","Solar.R","Temp")]
Day Ozone Solar.R Temp
32 1 25 286 78
33 2 25 287 74
34 3 25 242 67
... output truncated ...
Here you learned how to manipulate a data frame using the basic R syntax. Earlier in this article,
you saw briefly how the sqldf
function makes queries SQL-like, which
makes the learning curve for database professionals amazingly smaller. Actually, sqldf
is designed for database access from within R. We will explore sqldf
in more detail later in the series, in the installment dedicated to
database access.
Here is a quick recap of what you learned in this article.
To check for the type of collection for a variable x
, use the function
is.vector(x)
to check if it's a vectoris.list(x)
to check if it's a listis.factor(x)
to check if it's a factoris.data.frame(x)
to check if it's a data frameAll data types can be operated on as a whole, for example, to multiply 2 by all the elements of a
vector, matrix, or array, just call 2 * variable_name
.
While defining the collections, you can optionally supply column headers, or labels, as shown below:
list1 <- list(col1=1,col2="A",col3=T)
vector1 <- c(col1=1,col2=2,col3=3)
You can address the individual elements of the collection by position, for example, list1[1]
.
Or, you can access them by the column label, if that is defined: list1["col1"]
.
By default, the above will also show the column label. If you want the value alone, you use double
square brackets: list1[["col1"]]
.
For two-dimensional objects such as matrices and data frames, you need to use the [row,column]
format, for example, matrix1[1,3]
to get the first row, third column.
If you want to get the entire first row, that is, all the columns, use matrix1[1,]
.
For lists, matrices, and data frames you can use the $ notation if the column labels have been
created, for instance, matrix1$col1
.
To add row and column names to an exiting collection, you use the rownames()
and
colnames()
functions,
respectively.
> matrix1
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 6 11 16 21 26
[2,] 2 7 12 17 22 27
[3,] 3 8 13 18 23 28
[4,] 4 9 14 19 24 29
[5,] 5 10 15 20 25 30
> colnames(matrix1) <- c("col1","col2","col3","col4","col5","col6")
> rownames(matrix1) <- c("row1","row2","row3","row4","row5")
To add rows to a matrix or a data frame, you need to use the rbind()
function.
To add columns, you need to use the cbind()
function.
To selectively get the values from a collection, you use the logical operator within square brackets. For instance, to get all values in a matrix greater than 20, you use this:
> matrix1 > 20
col1 col2 col3 col4 col5 col6
row1 FALSE FALSE FALSE FALSE TRUE TRUE
row2 FALSE FALSE FALSE FALSE TRUE TRUE
row3 FALSE FALSE FALSE FALSE TRUE TRUE
row4 FALSE FALSE FALSE FALSE TRUE TRUE
row5 FALSE FALSE FALSE FALSE TRUE TRUE
> matrix1[matrix1 > 20]
[1] 21 22 23 24 25 26 27 28 29 30
That's it. Now it's time to test your understanding using the quiz.
Below are 10 questions followed by the answers.
Question 1: What will be the result of the following code? If this will result in a syntax error, mention that.
v1 <- 100
v2 <- c(100)
if (v1==v2) {
paste("They are same")
} else {
paste("They are different")
}
Question 2: What will be the output of the following code? If you think it will
result in a syntax error, mention that. Hint: the c()
function creates a
vector and a vector can contain values of only one data type.
> v2 <- c(1,T)
> v2[2]
Question 3: Refer to the dataset airquality
. Write an R
expression that pulls five rows at random and displays all the columns.
Question 4: What will be the output of the following expression? If this will produce an error, say so.
v1 <- list(a=1:10, b=1:20)
Question 5: I want to create a two-column matrix with column labels "a" and "b" and with 10 rows. Here is what I proposed. Will this accomplish what I need or produce a syntax error?
v1 <- matrix(c(a=1:10, b=1:10))
Question 6: You need to create a three-dimensional data type of all integers. What's the best collection type to use to do that?
Question 7: You want to check if v1
is a vector. It has
been defined earlier (although you are not aware of that) as follows:
> v1 <- c(1:10)
So you give the following command:
> class(v1)
Will that give you a confirmation of how the variable was defined?
Question 8: A variable v1
has been defined as follows:
v1 <- 7
However, you are not aware of that. To check if it's a vector, you use the following command:
> is.vector(v1)
[1] TRUE
Why does it show as a vector? It's just an integer.
Question 9: Here is a data frame.
> df1
col1 col2 col3 col4 col5 col6
row1 1 6 11 16 21 26
row2 2 7 12 17 22 27
row3 3 8 13 18 23 28
row4 4 9 14 19 24 29
row5 5 10 15 20 25 30
You want to extract row3
and two columns: col1
and col2
. You gave the following expression:
> df1[row3,("col1","col2")]
Will it produce the desired result? If it will result in a syntax error, mention that.
Question 10: You defined a factor that is supposed to represent some sort of level of seniority among employees.
f1 <- factor(c(1,2,3,4))
So, 1 is less in seniority than 2 and so on. However, when you try to determine the minimum seniority, you get this:
> min(f1)
Error in Summary.factor(1:4, na.rm = FALSE) :
'min' not meaningful for factors
What happened? How can you eliminate the error?
Answer 1: It will not produce a syntax error. Recall that behind the scenes, all data types are essentially vectors. So you will get the message "They are same."
Answer 2: Note it carefully. This is a vector, as the c()
function would create. You have assigned a number and a logical value (T
).
Elements of multiple data types are not allowed in vectors. So you can
assume that it will fail with a syntax error. However, a vector does not actually fail. R will
convert all the elements into a superset data type that can accommodate all the elements and make
them of a single data type. In this case, it will be of type number, because logical values will be
assigned a value of 0 or 1 if they are false or true, respectively. So, this vector will be stored
as c(1,1)
and v2[2]
will yield 1
.
Answer 3: To sample five values from a set of known rows, you have to use the sample()
function, as shown below.
sample(total_no_of rows, 5)
To get the total number of rows, you have to use the nrow()
function. So,
here is the expression to get five random values from the value of the total number of rows:
> sample(nrow(a),5)
[1] 60 56 101 16 70
This gives a vector containing the index of the rows of the dataframe. So, we can use that to get the rows from the data frame.
> a[sample(nrow(a),5),]
Ozone Solar.R Wind Temp Month Day
24 32 92 12.0 61 5 24
43 25 250 9.2 92 6 12
144 13 238 12.6 64 9 21
40 71 291 13.8 90 6 9
12 16 256 9.7 69 5 12
Answer 4: It will not produce error. Remember lists can contain items of different data types and structures. So this will be a two-element list with the elements being a 10-number array and a 20-number array.
Answer 5: It will not produce a syntax error, but it will not yield what you needed.
The expression c(a=1:10,b=1:10))
will merely create a 20-element vector.
Here is a demonstration:
> v1 <- c(a=1:10, b=1:10)
> v1
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
So, the expression in the question is equivalent to this:
matrix(1 through 10 and 1 through 10 again)
It will be a single-column matrix, because rows are used first. To create a two-column matrix, I need to use this:
> v1 <- matrix(c(a=1:10, b=1:10), ncol=2)
> v1
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
[7,] 7 7
[8,] 8 8
[9,] 9 9
[10,] 10 10
And then I can name the columns using the following:
> colnames(v1) <- c("a","b")
Answer 6: The answer is simple: it's an array collection type. Matrix allows only two dimensions.
Answer 7: No; it will not. class(v1)
will show integer
, the data type of the elements of the vector. To know if it's a
vector, you have to use is.vector()
function.
> is.vector(v1)
[1] TRUE
Answer 8: The atomic data types are also vectors under the covers. So v1
is a vector of one element.
Answer 9: It will result in a syntax error. The correct syntax is this:
df1[c("row3"),c("col1","col2")]
Note the two columns are passed as a vector.
Answer 10: The factor has to be ordered. The default is unordered and, therefore, it is not possible to have a comparative relationship. This should have been defined as follows:
> f1 <- factor(c(1,2,3,4), ordered = T)
> min(f1)
[1] 1
Levels: 1 < 2 < 3 < 4
Arup Nanda arup@proligence.com has been an Oracle DBA since 1993, handling all aspects of database administration, from performance tuning to security and disaster recovery. He was Oracle Magazine's DBA of the Year in 2003 and received an Oracle Excellence Award for Technologist of the Year in 2012. Arup Nanda has been an Oracle DBA since 1993, handling all aspects of database administration, from performance tuning to security and disaster recovery. He was Oracle Magazine's DBA of the Year in 2003 and received an Oracle Excellence Award for Technologist of the Year in 2012.