Importing ADLS gen2 from Azure ML Studio

I recently got an interesting question: how to access data stored in Azure Data Lake Storage gen2 from Azure Machine Learning Studio?

Alright, there is some stuff to unpack there. Azure Data Lake Storage gen2 is a new iteration of Azure Data Lake Storage that builds on the Azure Blob Storage engine and adds a hierarchical namespace. Its APIs are different from those of ADLS gen1, and somewhat different from those of native Azure Blob Storage.

Azure Machine Learning Studio has been around for a while, and it is essentially a canvas where you can use Lego-like components to create Machine Learning experiments, build ML models, and deploy them as web services. Even though the newer Azure Machine Learning Service offers more functionality (IMHO), Azure ML Studio is still used by people who would like a zero-code ML experience.

So let us assume you are an Azure ML Studio user, and your data is stored in ADLS gen2. When you try to import that data into an Azure ML Studio experiment, you will find that ADLS gen2 is not in the list of supported data sources in the “Import Data” module. “Azure Blob Storage” is in the list, but you will not be able to import ADLS gen2 data over that connector.

Luckily enough, you can use Python scripts to import data as well. If you have a look at the REST API reference for ADLS gen2, you will see that it is very close to the Azure Blob Storage API, and that the authentication mechanism is the one described in the documentation article “Authorize with Shared Key”.
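
In a nutshell, every request must carry an Authorization header of the following shape, where the signature is an HMAC-SHA256 over a canonicalized representation of the request, keyed with the base64-decoded storage account key (the account name below is a placeholder):

Authorization: SharedKey yourstorageaccount:&lt;base64-encoded HMAC-SHA256 signature&gt;

The code further down builds exactly that header.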

The problem is that this authentication mechanism is rather convoluted, so most people use other authentication schemes to access Azure Blob Storage (such as AAD authentication, which has recently become supported; more on that in the sketch below). As a consequence, you will not find many examples of it out there.
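
For comparison, this is roughly what the AAD route looks like outside of ML Studio. It is only a sketch: it assumes the azure-identity package (which is not part of the ML Studio Python environment) and a service principal that has been granted data access on the storage account (for example the Storage Blob Data Reader role), and all of the identifiers below are placeholders:

import requests
from azure.identity import ClientSecretCredential

# Placeholder service principal credentials
credential = ClientSecretCredential(
  tenant_id='<your-tenant-id>',
  client_id='<your-client-id>',
  client_secret='<your-client-secret>')

# Ask AAD for a token scoped to Azure Storage
token = credential.get_token('https://storage.azure.com/.default')

headers = {
  'Authorization': 'Bearer ' + token.token,
  'x-ms-version': '2018-03-28'
}
url = 'https://yourstorageaccount.dfs.core.windows.net/adlsfilesystem/yourfolder/yourfile.csv'
print(requests.get(url, headers=headers).text[:200])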

Hence I decided to write this post with one example. Back to our problem, you can use an “Execute Python Script” module in Azure ML Studio. I tested the “Anaconda 4.0 / Python 3.5” version, with this code:

import pandas as pd
import io
import requests
import datetime
import hmac
import hashlib
import base64

def azureml_main(dataframe1 = None, dataframe2 = None):
  # Fill in your own values: the storage account name and one of its access keys
  storage_account_name = ''
  storage_account_key = ''
  api_version = '2018-03-28'
  request_time = datetime.datetime.utcnow().strftime('%a, %d %b %Y %H:%M:%S GMT')
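  # The path starts with the ADLS gen2 filesystem (container) name, then folder and file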
  filepath='/adlsfilesystem/yourfolder/yourfile.csv'

  # Dictionary with the parameters to be signed. For a GET most of them are blank; the full list is kept here for reference
  string_params = {
    'verb': 'GET',
    'Content-Encoding': '',
    'Content-Language': '',
    'Content-Length': '',
    'Content-MD5': '',
    'Content-Type': '',
    'Date': '',
    'If-Modified-Since': '',
    'If-Match': '',
    'If-None-Match': '',
    'If-Unmodified-Since': '',
    'Range': '',
    'CanonicalizedHeaders': 'x-ms-date:' + request_time + '\nx-ms-version:' + api_version + '\n',
    'CanonicalizedResource': '/' + storage_account_name + filepath
  }
  # String out of previous parameters
  string_to_sign = (string_params['verb'] + '\n'
    + string_params['Content-Encoding'] + '\n'
    + string_params['Content-Language'] + '\n'
    + string_params['Content-Length'] + '\n'
    + string_params['Content-MD5'] + '\n'
    + string_params['Content-Type'] + '\n'
    + string_params['Date'] + '\n'
    + string_params['If-Modified-Since'] + '\n'
    + string_params['If-Match'] + '\n'
    + string_params['If-None-Match'] + '\n'
    + string_params['If-Unmodified-Since'] + '\n'
    + string_params['Range'] + '\n'
    + string_params['CanonicalizedHeaders']
    + string_params['CanonicalizedResource'])
  # You can uncomment the following line for some troubleshooting
  #print('String to sign:', string_to_sign)
  # Now we can build the signature: HMAC-SHA256 of the string above, keyed with the base64-decoded account key, then base64-encoded again
  signed_string = base64.b64encode(hmac.new(base64.b64decode(storage_account_key), msg=string_to_sign.encode('utf-8'), digestmod=hashlib.sha256).digest()).decode()

  # Let us build the HTTP GET request, starting with the HTTP headers
  headers = {
    'x-ms-date' : request_time,
    'x-ms-version' : api_version,
    'Authorization' : ('SharedKey ' + storage_account_name + ':' + signed_string)
  }
  # Now the URL, and let us send it out
  url = ('https://' + storage_account_name + '.dfs.core.windows.net' + filepath) 
  s = requests.get(url, headers = headers).content
  c=pd.read_csv(io.StringIO(s.decode('utf-8')))

  # The Execute Python Script module expects its outputs in a sequence, hence the trailing comma (a one-element tuple) in the next line
  return c,
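
If you want to sanity-check the script outside of the “Execute Python Script” module, for example in a local Python 3 environment with pandas and requests installed, you can paste the code above into a script, fill in the account name and key, and call the function directly (note the unpacking of the one-element tuple):

df, = azureml_main()
print(df.head())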

I think the code is self-explanatory. The main challenge I found was figuring out how to build the string to sign, since the parameters differ depending on which operation you are performing against the API. Fortunately, for a simple GET it is not too complicated.
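
For reference, this is what the string to sign ends up looking like for the GET above; the twelve blank header values collapse into consecutive newline characters, and the date and account name are placeholders:

GET\n\n\n\n\n\n\n\n\n\n\n\nx-ms-date:Thu, 17 Jan 2019 10:00:00 GMT\nx-ms-version:2018-03-28\n/yourstorageaccount/adlsfilesystem/yourfolder/yourfile.csv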

So now you have no excuse for not using ADLS gen2 from ML Studio. Have fun!


2 thoughts on “Importing ADLS gen2 from Azure ML Studio”

  1. Prakash

    Able to read the CSV files properly. Tried to read sas7bdat files in the same way (through pandas read_sas) and it is not working out. Please suggest on this.

    1. Hey Prakash, thanks for reading! Sorry, this post is 3 years old, and in the meantime I have to admit that I have not followed this space very closely. So instead of trying to guess and misguiding you, I would suggest sending that question through any other channel available to you (for example, did you know that you can engage FastTrack for Azure for your projects?).
