Apache Sqoop

Sunday Jan 12, 2014

Using Oracle Wallets with Sqoop Jobs

How to Secure Apache Sqoop Jobs with Oracle Wallet

Apache Sqoop is a tool that transfers data between the Hadoop ecosystem and enterprise data stores. Sqoop does this by providing methods to transfer data to HDFS or Hive (using HCatalog). Oracle Database is one of the databases supported by Apache Sqoop.

With Oracle Database, the database connection credentials are stored in Oracle Wallet. Oracle Wallet can act as the store of keys and secrets such as authentication credentials. This post describes how Oracle Wallet Manager adds a secure authentication layer for Sqoop jobs.

In order to connect to an external database, Sqoop users must provide a set of credentials specific to that data store. These credentials are typically in the form of user name and password.

Some enterprises may already be standardized on credential management tools provided by their enterprise DBMS vendor. These companies may not be comfortable with any of Sqoop’s three authentication methods:

Password provided on the command line

Password read from the console during the interactive execution of a Sqoop job.

Password provided on a secure file system that only the user can access.

With Oracle Database, the database connection credentials can be securely stored in an Oracle Wallet. The use of Oracle Wallet can allay the security concerns mentioned above because it provides a secure client-side software container allowing secure storage of authentication and signing credentials.

With support for Oracle Wallet, Sqoop jobs no longer need to embed usernames and passwords. This reduces the risk of exposing credentials and eases enforcement of authentication policies, since application code and scripts need not change whenever usernames or passwords change.

Instructions for Using Oracle Wallets for External Password Store

Please note that the following steps were tested with Oracle 11gR2 database. We expect these instructions should be applicable to other versions of Oracle DB, but they were not tested on versions other than Oracle 11gR2.

At a high-level, these are the steps (which we describe in detail below):

Create an Oracle client-side wallet

Create tnsnames.ora and sqlnet.ora files

Add the database access credentials to the Oracle wallet

Test the Oracle wallet

Use the Oracle wallet for Sqoop jobs

Modify the JDBC URL for the connection to use the wallet

Run the Sqoop job with Oracle wallet

Create an Oracle Client-Side Wallet

The first step is to create an Oracle wallet to hold the credentials used by Sqoop jobs. Oracle provides various tools to manage the Oracle Wallet. This document describes one of those: the mkstore command line tool. The mkstore command can be used to create the wallet:

mkstore -wrl <wallet_location> -create

</o:p>

That command line creates a client side wallet at the location <wallet_location>. Substitute <wallet_location> to your valid directory location.

It will be an auto-logon wallet. So before a client process can use the wallet, the ownership of the wallet directory and files (owner and group) must match the process uid and gid.

Make sure that the password matches the database minimum requirements.

Create tnsnames.ora and sqlnet.ora Files

The Oracle database uses a few network configuration files. The files tnsnames.ora and sqlnet.ora are used for configuring client side network access.

tnsnames.ora

The configuration file tnsnames.ora has client side local naming parameters. The contents of tnsnames.ora are used to map the network service name to an Oracle database connection descriptor.

Here is an example of a tnsnames.ora file:

W_ORCL =

( DESCRIPTION =

(ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.64.51)(PORT = 1521))

(CONNECT_DATA =

(SERVER = DEDICATED)

(SERVICE_NAME = orcl)

)

)

In the above tnsnames.ora file, the network service name W_ORCL is mapped to a connection descriptor accessing a database instance named ORCL running on a host with an IP address 192.168.64.51.

sqlnet.ora

The configuration file sqlnet.ora is used to specify client side network configuration properties that influence the connection profiles. This file configures tracing, authentication, routing, and advanced security parameters.

Here is an example of a sqlnet.ora file:

WALLET_LOCATION =

(SOURCE =

(METHOD = FILE)

(METHOD_DATA =

(DIRECTORY = <full path to file>)

)

)

SQLNET.WALLET_OVERRIDE = TRUE

SSL_CLIENT_AUTHENTICATION = FALSE

The tnsnames.ora and sqlnet.ora files can be placed in separate directories or they can both be placed in the wallet directory.

Add the Database Access Credentials to the Oracle Wallet

Now that we’ve created an Oracle Wallet, we need to populate it with the database access credentials. We can use the -createCredential option to the mkstore command to do this:

mkstore -wrl <wallet-location> -createCredential <DBSvc> <DBUser> <DBPassword>

<wallet-location> is the directory where the wallet files will be stored

<DBSvc> is the service name defined in tnsnames.ora

<DBUser> is the username to access the database

<DBPassword> is the database password for the user <DBUser>

For example, with the sample service defined in tnsames.ora, and a DB user “sqoop” using password “sqooppwd”, the command invocation would be:

mkstore -wrl <wallet-location> -createCredential w_orcl sqoop sqooppwd

This will prompt for the wallet password, which should be same as the one used when creating the wallet.

Test the Oracle Wallet

A quick way to test the password is to attempt a DB session with SQLPLUS.

If the tnsnames.ora and sqlnet.ora settings are not in the default location ($ORACLE_HOME/network/admin), then you can provide the location of these files using the TNS_ADMIN environment variable.

Assuming the tnsnames.ora and sqlnet.ora are under $HOME/wallet_test, do the following:

</o:p>

export TNS_ADMIN=$HOME/wallet_test

sqlplus /@<svc_name> # in our case w_orcl

</o:p>

This command should successfully establish the connection. If it does not, check the tnsnames.ora contents to make sure the DB host, port or service name are correct.

If the service name in tnsnames.ora changes, then the wallet entry has to be created/modified.

Use the Oracle Wallet for Sqoop Jobs

After successfully validating the wallet, it can be used for Sqoop jobs. There are a few steps for providing the wallet with Sqoop:

● Provide the wallet to the Sqoop launcher

● Provision the wallet to the mapper tasks

● Make command line changes for the Sqoop launcher program

● Specify the location of the wallet and Oracle configuration files to the mapper tasks

● Specify the option to localize the files to the mapper tasks

● Modify the JDBC URL

Provide the Wallet to the Sqoop Launcher

For the Sqoop client program and the mappers launched by the Sqoop job, we have to specify the location for the wallet and the tnsnames.ora and sqlnet.ora files.

Add the following files to the $SQOOP_HOME/lib directory (typically SQOOP_HOME is set to /usr/lib/sqoop):

oraclepki.jar

osdt_cert.jar

osdt_core.jar

ojdbc6.jar

These jars are available as part of an Oracle Database installation.

Provision the Wallet to the Mapper Tasks

Copy the contents of the wallet directory from the database host along with the tnsnames.ora and sqlnet.ora files to a folder, such as $HOME/wallet, as the operating system user who is launching the Sqoop command.

The wallet directory will have the following files after this step:

cwallet.sso

ewallet.p12

sqlnet.ora

tnsnames.ora

The first two are the wallet files and the next two are the Oracle client network configuration files that we saw before.

Make sure the wallet directory and the files in that directory are owned by the operating system user and the group ownership is the same as the Sqoop user group.

Make Command Line Changes for the Sqoop Launcher Program

Since the wallet location and the TNS locations are different from the defaults, we have to override the location of the wallet and the tnsnames.ora and sqlnames.ora files.

Oracle JDBC exposes two properties for this purpose:

oracle.net.tns_admin – Location of the tnsnames.ora and sqlnet.ora files

oracle.net.wallet_location – Location of the wallet files, cwallet.sso and ewallet.p12

In this case we will set both these properties to $HOME/wallet.

Setting these two system properties for Sqoop can be achieved by setting the system environment variable HADOOP_OPTS which will be used for setting additional Java options to the JVM.

export HADOOP_OPTS= "-Doracle.net.tns_admin=$HOME/wallet -Doracle.net.wallet_location=$HOME/wallet_test"

</o:p>

Specify the Location of the Wallet and Oracle Configuration Files to the Mapper Tasks

While the client program uses the DB for retrieving metadata, the mapper tasks do the actual data transfer. To provide the wallet files to the mapper tasks, we specify them as part of the –files tool option to localize to each mapper. Also, we have to use the Hadoop configuration properties to pass specific Java command line options (similar to HADOOP_OPTS for the launcher).

Specify the Option to Localize the Files to the Mapper Tasks

This can be achieved by adding the –files option to the Sqoop command line.

Please note that Hadoop options have to come before any Sqoop options. The option and option argument can be specified as follows:

-files $HOME/wallet/cwallet.sso,\

$HOME/wallet/ewallet.p12,\

$HOME/wallet/sqlnet.ora,\

$HOME/wallet/tnsnames.ora

We also have to tell the Mapper tasks the location of wallet and TNS files. We use the hadoop configuration parameter mapred.child.java.opts to provide additional Java options to the mappers.

Note that we use the current directory as the location (‘.’). This is because the files provided by the –files option will be localized to the current directory for each mapper task.

-D mapred.map.child.java.opts=’-Doracle.net.tns_admin=. -Doracle.net.wallet_location=.'

If there are additional Java options that need to be provided for the mappers, they should be added to the Java options given below. For example, to add the system property -Djava.security.egd=file:/dev/./urandom, the options would be specified as:

-D mapred.map.child.java.opts=’-Doracle.net.tns_admin=. -Doracle.net.wallet_location=. –Djava.security.egd=file:/dev/./urandom'

Modify the JDBC URL

The JDBC URL provided should use the format jdbc:oracle:thin:@SVC where the SVC is the service name used for accessing the database.

This is added as part of the createCredential command above and should also be in the tnsnames.ora file.

Run the Sqoop Job with Oracle Wallet

Please make sure that Apache Sqoop v1.4.5 is used or (if using a prior version of Sqoop) install the OracleManager fix for wallet support.

When we use wallet-based authentication, we don’t provide the username or password for establishing the JDBC connection.

Sqoop code has been enhanced to properly handle this case (please see SQOOP</spancolor:>-1224</spancolor:> for more information). Use Oracle session user when the database user name is not explicitly provided.

This will be part of Sqoop 1.4.5 and later versions. For versions of Sqoop prior to 1.4.5, please make sure that the Sqoop product has this patch applied.

Run the Sqoop Job

Now we are ready to use the wallet with the Sqoop job.

Here is an excerpt from a script that can be used after the wallet and Oracle client configuration setup are complete. This combines all of the steps previously outlined in this post.

export HADOOP_OPTS= "-Doracle.net.tns_admin=$PWD/wallet -Doracle.net.wallet_location=$PWD/wallet "

sqoop {import|export} \

-D mapred.map.child.java.opts= '-Doracle.net.tns_admin=. -Doracle.net.wallet_location=.' \

-files $HOME/wallet/cwallet.sso,\

$HOME/wallet/ewallet.p12,\

$HOME/wallet/sqlnet.ora,\

$HOME/wallet/tnsnames.ora \

-connect ‘jdbc:oracle:thin:@w_orcl’ \

<<rest of sqoop options>>

Thursday Sep 19, 2013

Connectors and drivers in the world of Apache Sqoop

Apache Sqoop is a tool for highly efficient data transfers between relational databases and the entire Hadoop ecosystem. One of the significant benefits of Sqoop is that it’s easy to use and can work with a variety of systems both inside and outside of the Hadoop ecosystem. With one tool, Sqoop, you can import or export data from all databases supporting the JDBC interface using the same command line arguments exposed by Sqoop. Additionally, Sqoop was designed in modular fashion, allowing you to plug in specialized additions to optimise transfers for particular database systems.

While some users of various projects within the Hadoop ecosystem use the words "connector" and "driver" interchangeably, these words mean completely different things in context of Sqoop. As both connector and driver are needed for every Sqoop invocation, we see a lot of confusion in the use and understanding of these concepts. This blog post will explain the difference between them and how Sqoop uses these concepts to transfer data between Hadoop and other systems.

[Read More]

Wednesday Nov 07, 2012

What's new in Apache Sqoop 1.4.2

Sqoop 1.4.2 was released in August 2012. As this was an extremely important release for the Sqoop community - our first release as an Apache Top Level project - I would like to highlight the key features and fixes of this release. The entire change log can be viewed on our JIRA and actual bits can be downloaded from the usual place. [Read More]

Monday Apr 09, 2012

Sqoop Graduation Meetup

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption. Not only was this Sqoop's second Meetup but also a celebration for Sqoop's graduation from the Incubator, cementing its status as a Top-Level Project in the Apache Software Foundation.[Read More]

Monday Apr 02, 2012

Apache Sqoop Graduates from Incubator

Sqoop recently graduated from Apache Incubator to become a Top-Level Project. This post describes a brief history of Sqoop from its inception to becoming a TLP.[Read More]

Thursday Jan 12, 2012

Apache Sqoop: Highlights of Sqoop 2

The popularity of Apache Sqoop (incubating) in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use-cases as well as become easier to manage and operate. Sqoop 2 addresses these issues and its high-level design overview is detailed in this post.

[Read More]

Friday Dec 30, 2011

What's New in Apache Sqoop 1.4.0-incubating

Apache Sqoop recently celebrates its first incubator release, version 1.4.0-incubating. This blog post highlights some of the newly added features for this release.

[Read More]

Monday Nov 21, 2011

Inaugural Sqoop Meetup

Earlier this month over 30 people attended the inaugural Sqoop Meetup on the eve of Hadoop World in NYC. Faces were put to names, troubleshooting tips were swapped, and stories were topped. This post summarizes the Meetup, posts pictures from it, and links to the three presentations.

[Read More]

Wednesday Oct 05, 2011

Apache Sqoop - Overview

This post provides a high-level overview of Apache Sqoop (incubating). It discusses the general problem addressed by Sqoop and provides simple examples on how to use it. This post is written by Arvind Prabhakar, who is a Sqoop committer.

[Read More]

Calendar

Search

Hot Blogs (today's hits)

Tag Cloud

Categories

Feeds

Links

Navigation