
Oracle R Advanced Analytics for Hadoop: Part 2

In this article, which is Part 2 of a series, we will look at some of the more advanced features of Oracle R Advanced Analytics for Hadoop, including its advanced analytics and machine learning functions and how to use Spark. Oracle R Advanced Analytics for Hadoop is a component of Oracle Big Data Connectors and provides a set of R functions that allow you to connect to and process data stored in the Hadoop Distributed File System (HDFS), accessed transparently through Hive, as well as data stored in Oracle Database.

In Part 1 of this series, we looked at some of the more typical use cases for Oracle R Advanced Analytics for Hadoop, including working with Oracle Database, HDFS, and Hive and initiating map-reduce jobs. Oracle R Advanced Analytics for Hadoop also provides a number of highly scalable machine learning algorithms and utilizes some of the Apache Spark machine learning algorithms for faster in-memory distributed machine learning. These features are the focus of this article.

Analytical and Machine Learning Features in Oracle R Advanced Analytics for Hadoop

When using Oracle R Advanced Analytics for Hadoop, you also have access to the wide range of analytic functions available in the many thousands of R packages. Covering all of those is beyond the scope of this article, but when we look closer at the analytic functions specific to Oracle R Advanced Analytics for Hadoop, we find the functions listed in Table 1. To produce this list yourself, run the following R command once the ORCH package has been loaded.


# List the ORCH-specific analytic and machine learning functions
apropos("^orch")

Table 1: Statistical and analytic functions available in Oracle R Advanced Analytics for Hadoop

Function Name
orch.cor
orch.cov
orch.glm
orch.glm2*
orch.glm.control
orch.kmeans
orch.lm
orch.lm2*
orch.lmf
orch.multivar
orch.neural
orch.neural2*
orch.nmf
orch.predict
orch.princomp
orch.sample
orch.scale

Note: * These are Spark-enabled versions of the functions.

With each release of Oracle R Advanced Analytics for Hadoop, you will find that the list of analytic and machine learning functions increases. These functions have been specifically tuned to work in big data environments with data in HDFS and Hive, allowing them to scale out using map-reduce jobs and to make more efficient use of memory.

The following example illustrates the creation of a linear regression model, using the orch.lm() function, using the on-time flight dataset.


# Attach the dataset containing the details of flights
# Data file is located in HDFS
ontime_DS <- hdfs.attach("/user/oracle/ontime_s")

# Create a linear regression model on this dataset to 
# predict the possible flight delay time
# 
# Map-Reduce is used to scale the processing to create the model
# using 4 mappers and 2 reducers 
lm_model <- orch.lm(ARRDELAY ~ DISTANCE + DEPDELAY, 
                  dfs.dat = ontime_DS,
                  numMappers = 4, 
                  numReducers = 2)

# Display the summary details of the LM model
summary(lm_model)

As you can see, these Oracle R Advanced Analytics for Hadoop analytic and machine learning functions are easy to use and highly scalable. Make sure you check the documentation for each function to ensure that you are using it to full effect.
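
For example, the orch.predict() function listed in Table 1 can be used to score a new dataset with the model created above. The following is a minimal sketch: the HDFS path is hypothetical and the argument names are assumptions, so check the orch.predict() documentation for the exact signature in your release.

# Attach a new dataset of flights to be scored
# (this HDFS path is hypothetical)
ontime_new <- hdfs.attach("/user/oracle/ontime_new")

# Score the new data with the linear regression model built above
# (argument names are assumptions; see the orch.predict() documentation)
scores <- orch.predict(lm_model, newdata = ontime_new)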

Spark Machine Learning Feature in Oracle R Advanced Analytics for Hadoop

Over the past few releases of Oracle R Advanced Analytics for Hadoop, Oracle has been increasing its support for Spark. This makes it easier to access and use the various machine learning functions available in Spark and to benefit from their in-memory efficiency. Additionally, some of the HDFS functions have been updated to allow data to be easily transferred from Spark RDDs into HDFS. Similarly, these Spark-based functions can be run on data stored in HDFS and Hive. Table 2 lists the Spark-enabled functions in Oracle R Advanced Analytics for Hadoop (version 2.7.1).

Table 2: Spark-enabled functions available in Oracle R Advanced Analytics for Hadoop

Function Name
orch.lm2
orch.glm2
orch.ml.gmm
orch.ml.linear
orch.ml.lasso
orch.ml.ridge
orch.ml.logistic
orch.ml.dt
orch.ml.random.forest
orch.ml.svm
orch.ml.kmeans
orch.ml.pca

It is expected that the list of functions in Table 2 will be expanded with each subsequent release of Oracle R Advanced Analytics for Hadoop.

Additionally, some Oracle R Advanced Analytics for Hadoop functions have been updated to support these Spark-based algorithms. They include an updated predict() function for scoring new datasets using Spark-based models. The orch.save.model() function saves the details of a Spark-based model to a file in HDFS, allowing the model to be kept for later use or shared with other data scientists. The orch.load.model() function can then be used to reload the Spark model back into your environment.

To enable access to Spark from your R and Oracle R Advanced Analytics for Hadoop environment, you must have Spark installed and the necessary environment variables set to make it accessible. This can be configured by editing the Renviron.site file to ensure that the SPARK_HOME and SPARK_JAVA_OPTS environment variables are set and that the necessary Spark directories are included in the CLASSPATH. Some of this setup depends on your environment. The working environment for the articles in this series is the Oracle VM VirtualBox prebuilt virtual machine called Oracle Big Data Lite VM (see Part 1 for more information).
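
As an illustration, the entries in Renviron.site will look something like the following sketch. The paths and values shown are assumptions, so adjust them to match where Spark is installed in your environment.

# Example Renviron.site entries (the paths and values are assumptions)
SPARK_HOME=/usr/lib/spark
SPARK_JAVA_OPTS=-Xmx512m
CLASSPATH=${SPARK_HOME}/lib/*:${CLASSPATH}

From within R, you can confirm that the variables are visible by running Sys.getenv(c("SPARK_HOME", "SPARK_JAVA_OPTS", "CLASSPATH")).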

The first step is to create a Spark connection, which can be set up to use YARN or to run in standalone mode. The following example illustrates the spark.connect() function. This function has four parameters:

  • The first parameter, called the master, defines whether you are going to use YARN or standalone mode.
  • The second (optional) parameter is a name that helps centralize logging for the session on the Spark master. By default, this is set to ORCH and does not need to be specified in the call to the function.
  • The third parameter, memory, defines the amount of memory to allocate per Spark worker for this Spark context.
  • The fourth parameter, dfs.namenode, points to the HDFS NameNode server so that Spark can exchange information with HDFS.

# Load the ORCH R package 
library(ORCH) 

# Create the Spark connection using Yarn 
spark.connect("yarn-client", 
               memory="512m", 
               dfs.namenode="bigdatalite.localdomain")

After the Spark connection is set up, you can proceed to process the data and run the Spark-enabled algorithms that you need to use. The following example illustrates using the Spark algorithm orch.glm2() to fit a model for the kyphosis dataset that is part of the rpart R package.


# Load the rpart package to allow access to the kyphosis dataset
# Create a local copy of the dataset
library(ORCH)
library(rpart)
k_dataset <- kyphosis

# Write the dataset to HDFS. 
# It will be this dataset that will be used with Spark
k_hdfs <- hdfs.put(k_dataset)
# List the contents of the default directory in HDFS 
# and verify the file exists
hdfs.ls()

# Call the Spark-enabled GLM2 function to generate the 
# machine learning model
sparkModel <- orch.glm2(Kyphosis ~ Age + Number + Start, 
                        dfs.dat = k_hdfs)

 ORCH GLM: processed 1 factor variables, 0.365 sec 
 ORCH GLM: created model matrix, 2  partitions, 0.398 sec 
 ORCH GLM: iter  1,  deviance   1.12289843250711020E+02,  elapsed time 0.216 sec 
 ORCH GLM: iter  2,  deviance   6.64219993846240600E+01,  elapsed time 0.304 sec 
 ORCH GLM: iter  3,  deviance   6.18628545282569460E+01,  elapsed time 0.277 sec 
 ORCH GLM: iter  4,  deviance   6.13897990884807400E+01,  elapsed time 0.313 sec 
 ORCH GLM: iter  5,  deviance   6.13799331446360300E+01,  elapsed time 0.460 sec 
 ORCH GLM: iter  6,  deviance   6.13799272764552550E+01,  elapsed time 0.214 sec

The GLM2 Spark model can be saved to HDFS using the orch.save.model() function. This function takes the model as its first parameter and the name of the file in HDFS as its second parameter.


orch.save.model(sparkModel, "sparkmodel_hdfs", overwrite=TRUE)

When you want to reuse the saved model, you can use the orch.load.model function to load the model details back into your R environment, for example:


modelReloaded <- orch.load.model("sparkmodel_hdfs")
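
The reloaded model can then be used for scoring in the same way as the original model, using the updated predict() function mentioned earlier. This is a minimal sketch: the argument names are assumptions, so consult the documentation for the exact signature in your release.

# Score the HDFS-resident dataset using the reloaded Spark-based model
# (argument names are assumptions; see the ORCH documentation)
scored <- predict(modelReloaded, newdata = k_hdfs)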

When you are finished performing your analytics and machine learning using Spark, you can close the connection using the spark.disconnect() function. This function does not delete the current Spark context; instead, it marks the context as inactive, and within a short time all of its resources are freed up by the R environment and the Java garbage collector.


# Disconnect from Spark
spark.disconnect()

Summary

In this article, we looked at using Oracle R Advanced Analytics for Hadoop to perform advanced analytics, including using some of its machine learning algorithms and using Spark to enable in-memory machine learning.

In Part 1 of this series, we looked at how to work with data in Oracle Database, HDFS, and Hive and how to initiate map-reduce jobs. If you have not already read it, make sure to check it out.

About the Author

Oracle ACE Director Brendan Tierney is an independent consultant (Oralytics) and lectures on data science, databases, and big data at the Dublin Institute of Technology (now Technological University Dublin). He has 24+ years of experience working in the areas of data mining, data science, big data, and data warehousing. As a recognized data science and big data expert, Tierney has worked on projects in Ireland, the UK, Belgium, Holland, Norway, Spain, Canada, and the US. He is active in the UK Oracle User Group (UKOUG) community and one of the user group leaders in Ireland. Tierney has also been editor of the UKOUG Oracle Scene magazine, is a regular speaker at conferences around the world, and writes for several publications. In addition, he has published four books, three with Oracle Press/McGraw-Hill (Predictive Analytics Using Oracle Data Miner, Oracle R Enterprise: Harnessing the Power of R in Oracle Database, and Real World SQL and PL/SQL: Advice from the Experts) and one with MIT Press (Essentials of Data Science).
