Docker container with R and ODBC

If you are using R to perform statistical analysis, chances are that at least some of the data you need is stored in a database. While some databases such as SQL Server support native R scripts, others do not (for example, support for R in Azure SQL Database is in preview at the time of this writing). Additionally, your R script might need access to other data sources besides your SQL database (maybe the data in the SQL database is only used to enrich the main data source with additional context metadata).

The first obvious choice might be running your R analysis in a virtual machine. However, if that analysis does not require a lot of CPU and memory, a VM might not be the optimal choice. Besides, you would need to take care of starting the VM, running the R script and shutting the VM down again, not to mention storing the R logs somewhere so you can check whether the script ran successfully.

In certain scenarios you might want to use a Docker container instead of a full-blown virtual machine: in a Docker container you can install all prerequisites (R of course, plus ODBC drivers to access the database, plus any R packages you might need), with the advantage that Docker containers are very easy to start and stop, for example using Container-as-a-Service offerings such as Azure Container Instances, which is what we will be using in this post.

Now that we have described the use case, let us roll up our sleeves. First things first, we need a container with ODBC and R. You could use the predefined Docker images in the rocker repository; frankly, had I not needed ODBC, that would have been much easier. However, ODBC requires some extra software at the OS layer (the ODBC drivers), so I decided to build the image from a standard Ubuntu 18.04. This is not exactly best practice for Docker containers, since the image probably contains many more packages than I need; building it from something more streamlined such as Alpine might be a future improvement.

The following Dockerfile installs some OS prerequisites, open source R, version 17 of the ODBC driver for SQL Server, and finally the odbc R package:

FROM ubuntu:18.04
LABEL maintainer="jose.moreno@microsoft.com"

# Install tzdata and configure the timezone
# We do this first to make sure tzdata will not block the R installation waiting for terminal input
# Note ENV is used instead of "RUN export": each RUN statement runs in its own shell, so an exported variable would not persist to later build steps
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update -y && apt-get install -y tzdata
RUN ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
RUN dpkg-reconfigure --frontend noninteractive tzdata

# Install R
# Some of these packages such as apt-utils, apt-transport-https or gnupg2 are required so that the R repo can be added and R installed
# Note that the R repo is specific for the Linux distro (Ubuntu 18.04 aka bionic in this case)
# Other packages such as curl will be used later to install ODBC
RUN apt-get update -y && apt-get install -y build-essential curl libssl1.0.0 libssl-dev gnupg2 software-properties-common dirmngr apt-transport-https apt-utils lsb-release ca-certificates
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
RUN add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
RUN apt-get update -y && apt-get install -y r-base

# See about installing ODBC drivers here: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-2017
# Note that the driver version installed needs to match the version used in the code
# In this case for Ubuntu 18.04: ODBC SQL driver 17
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
RUN curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
RUN apt-get update -y
RUN ACCEPT_EULA=Y apt-get install -y msodbcsql17 unixodbc-dev mssql-tools

# Install R packages (the odbc R package must be installed AFTER installing ODBC in the OS)
# The CRAN mirror is specified explicitly, since a non-interactive session cannot prompt for one
RUN Rscript -e "install.packages('odbc', repos='https://cloud.r-project.org')"

# Copy and execute R script
RUN mkdir /home/analysis
COPY transformdata.r /home/analysis/transformdata.r
CMD R -e "source('/home/analysis/transformdata.r')"

As you probably guessed looking at the Dockerfile, I ran into some problems with the R installation: it got stuck at the point where tzdata was being installed, expecting terminal input. Hence I install tzdata in non-interactive mode before installing R.

Second, you have probably also noticed that installing the ODBC driver for SQL Server on Ubuntu is non-trivial, but there is some documentation out there (see the comments in the Dockerfile itself).
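If you ever want to double check that the driver registered correctly, unixODBC can list the drivers it knows about. For example, from a shell inside the container (the odbcinst utility ships with unixODBC, which is pulled in as a dependency of msodbcsql17):

odbcinst -q -d

This should print a line like [ODBC Driver 17 for SQL Server], which is exactly the driver name the R script needs to reference.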

The last step copies and executes an R script. This R script is pretty basic: it reads the records of a table in a database (‘Categories’), and stores the row count with a timestamp in another table in the same database (‘Rtest’). Here is the transformdata.r script I use for this test:

# Script used to test ODBC read/write connection to a SQL Server DB from a Docker container
# Requires the DBI and odbc packages
# Takes the DB connection parameters out of environment variables:
#  - DB_SERVER
#  - DB_NAME
#  - DB_USER
#  - DB_PASSWORD
# Created by jose.moreno@microsoft.com
print(paste0(Sys.time(), ': Importing odbc package...'))
library(DBI)   # Generic database interface (dbConnect, dbFetch, dbExecute...)
library(odbc)  # ODBC backend for DBI
print(paste0(Sys.time(), ': Initiating connection to server ', Sys.getenv('DB_SERVER'), ' for database ', Sys.getenv('DB_NAME'), ' with user ', Sys.getenv('DB_USER')))
con <- dbConnect(odbc(),
  Driver = "ODBC Driver 17 for SQL Server",
  Server = Sys.getenv('DB_SERVER'),
  Database = Sys.getenv('DB_NAME'),
  UID = Sys.getenv('DB_USER'),
  PWD = Sys.getenv('DB_PASSWORD'),
  Port = 1433)
print(paste0(Sys.time(), ': Reading from database...'))
result1 <- dbSendQuery(con, 'SELECT * FROM Categories')
data <- dbFetch(result1)
dbClearResult(result1)  # Release the result set before sending further statements
rowno <- nrow(data)
print(paste0(Sys.time(), ': Accessed database, retrieved ', rowno, ' records'))
now <- Sys.time()
query <- paste0('INSERT INTO Rtest (Date, CatCount) VALUES (\'', now, '\', ', rowno, ');')
dbExecute(con, query)  # dbExecute is the DBI function for statements that return no result set
print(paste0(Sys.time(), ': Appended to database: ', rowno))
dbDisconnect(con)

Pretty straightforward: it connects to a database (the connection parameters are expected in environment variables), reads the contents of the table ‘Categories’ into a data frame, counts the rows of the data frame and writes a record to the table ‘Rtest’.
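Incidentally, since all connection parameters come from environment variables, if you have R and the ODBC driver installed locally you can iterate on the script without building any image. A quick local test could look like this (placeholder values, replace with your own):

DB_SERVER=yourdbserver.database.windows.net \
DB_NAME=yourdatabase \
DB_USER=yourdbuser \
DB_PASSWORD='yourpassword' \
Rscript transformdata.r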

Now we have all the components of our container. If you put the Dockerfile and the transformdata.r script into their own directory, you can build the Docker image. For that I typically use the build functionality of Azure Container Registry, but you could of course use your locally installed Docker software:

az acr build -r yourACRname -t rodbc:1.0 .

Where yourACRname is the name of your Azure Container Registry (you do not need to specify a resource group, because ACR names in Azure are globally unique). The build process runs for around 5 minutes, after which you should have a container image in your Azure Container Registry. Let’s run it!
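If you would rather smoke-test the image on your own machine before going to Azure, you can pull it from the registry and supply the connection parameters as environment variables. A sketch, assuming you have Docker installed locally (placeholder values again):

az acr login -n yourACRname
docker pull youracrname.azurecr.io/rodbc:1.0
docker run --rm \
    -e DB_SERVER=yourdbserver.database.windows.net \
    -e DB_NAME=yourdatabase \
    -e DB_USER=yourdbuser \
    -e DB_PASSWORD='yourpassword' \
    youracrname.azurecr.io/rodbc:1.0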

The easiest way of running containers in Azure is the Azure Container Instances service: you provide a source image, some parameters such as the required CPU, RAM and environment variables, and you are ready to rumble. We need some preparation though: creating a resource group and identifying the parameters of our database. Let us do that:

# Create a resource group for the container instance
az group create -n rodbc -l westeurope
# Retrieve the database password from an existing Azure Key Vault secret
kvname=yourKeyvaultName
secretname=databasePassword
dbpwd=$(az keyvault secret show -n $secretname --vault-name $kvname --query value -o tsv 2>/dev/null)
# Connection parameters for your Azure SQL Database
dbserver=yourdbserver.database.windows.net
dbname=yourdatabase
dbuser=yourdbuser
# Credentials of the Azure Container Registry admin account
acr_usr=$(az acr credential show -n yourACRname -g yourACRresourcegroup --query username -o tsv)
acr_pwd=$(az acr credential show -n yourACRname -g yourACRresourcegroup --query passwords[0].value -o tsv)

The previous variables assume that you already have a database with configured credentials, where the ‘Categories’ and ‘Rtest’ tables exist. You can create an Azure SQL Database with the Azure CLI and the tables with standard SQL commands; a minimal sketch follows, but let me know in the comments if you run into trouble with that.
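For reference, something along these lines should work, assuming you have the Azure CLI and the sqlcmd utility at hand (sqlcmd is part of the same mssql-tools package we installed in the Dockerfile); the Categories schema below is just an example, any table with a few rows will do:

# Create the logical server and the database
# (you might also need a firewall rule, see az sql server firewall-rule create)
az sql server create -n yourdbserver -g rodbc -l westeurope \
    --admin-user yourdbuser --admin-password $dbpwd
az sql db create -g rodbc -s yourdbserver -n yourdatabase
# Create the two tables and a bit of sample data
sqlcmd -S $dbserver -d $dbname -U $dbuser -P $dbpwd -Q "
    CREATE TABLE Categories (CategoryId INT IDENTITY PRIMARY KEY, CategoryName NVARCHAR(50));
    CREATE TABLE Rtest (Date DATETIME, CatCount INT);
    INSERT INTO Categories (CategoryName) VALUES ('test');"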

Finally, the last two commands retrieve the authentication credentials for our Azure Container Registry, which we will need when deploying the container.
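Note that az acr credential show only works if the admin account of your registry is enabled, which it is not by default. If needed, you can enable it with a single command:

az acr update -n yourACRname --admin-enabled true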

Now we can start up our container with the required environment variables:

az container create -n rodbc -g rodbc \
    --registry-login-server=youracrname.azurecr.io \
    --registry-username=$acr_usr --registry-password=$acr_pwd \
    --restart-policy Never \
    --environment-variables \
          DB_SERVER=$dbserver \
          DB_NAME=$dbname \
          DB_USER=$dbuser \
          DB_PASSWORD=$dbpwd \
    --image=youracrname.azurecr.io/rodbc:1.0

For readability I have split the command across multiple lines, but if you have problems with copying and pasting you can of course collapse it into a single line. See the --cpu and --memory options of the az container create command, should your container need more (or less) hardware resources than the default 1 CPU core and 1.5 GB of RAM.
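For example, a container with two CPU cores and 4 GB of RAM (arbitrary values, just for illustration) would add these two options to the command above:

    --cpu 2 \
    --memory 4 \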

After a few seconds your container will be up and running:

$ az container list -g rodbc -o table
Name    ResourceGroup    Status     Image                             CPU/Memory       OsType    Location
------  ---------------  ---------  --------------------------------  ---------------  --------  ----------
rodbc   rodbc            Succeeded  youracrname.azurecr.io/rodbc:1.0  1.0 core/1.5 gb  Linux     westeurope

Lastly, let us have a look at the container logs:

az container logs -g rodbc -n rodbc
[...]
> source('/home/analysis/transformdata.r')
[1] "2019-05-29 01:55:48: Importing odbc package..."
[1] "2019-05-29 01:55:48: Initiating connection to server yourdbserver.database.windows.net for database yourdatabase with user yourdbuser"
[1] "2019-05-29 01:55:48: Reading from database..."
[1] "2019-05-29 01:55:48: Accessed database, retrieved 18 records"
[1] "2019-05-29 01:55:48: Appended to database: 18"

Excellent, our container used the odbc R package to access the Azure SQL database to read and write data! This particular use case might become less relevant once Azure SQL Database natively supports R, but having this container image might prove handy in similar scenarios where you need a Docker container with R and some other specific OS dependencies.

Thanks for reading!
