Apache Sqoop

Thursday September 19, 2013

Connectors and drivers in the world of Apache Sqoop

Apache Sqoop is a tool for highly efficient data transfers between relational databases and the entire Hadoop ecosystem. One of the significant benefits of Sqoop is that it’s easy to use and can work with a variety of systems both inside and outside of the Hadoop ecosystem. With one tool, Sqoop, you can import or export data from all databases supporting the JDBC interface using the same command line arguments exposed by Sqoop. Additionally, Sqoop was designed in modular fashion, allowing you to plug in specialized additions to optimise transfers for particular database systems.

While some users of various projects within the Hadoop ecosystem use the words "connector" and "driver" interchangeably, these words mean completely different things in context of Sqoop. As both connector and driver are needed for every Sqoop invocation, we see a lot of confusion in the use and understanding of these concepts. This blog post will explain the difference between them and how Sqoop uses these concepts to transfer data between Hadoop and other systems.

Driver

The word driver in Sqoop refers to a JDBC Driver. JDBC is a standard Java API for accessing relational databases and some data warehouses. The Java language only prescribes what classes and methods this interface contains and the JDK does not have any default implementation. Each database vendor is responsible for writing their own implementation that will communicate with the corresponding database with its native protocol. As drivers are created by each database vendor, they are usually offered with restrictive licenses that prohibits them to be shipped with the Sqoop distribution. As a result each user needs to download the drivers separately and install them into Sqoop prior to its use.

Connector

Structured Query Language (SQL) is a programing language designed for communication with relational database systems. Even though that there is a standard prescribing how the language should look like almost every database has its own dialect of SQL. The basics are usually working the same across all databases, but  some edge conditions might be implemented differently.  Also as SQL is a very general query processing language, it might not be always the optimal way for importing or exporting data out of the database server.


Sqoop connectors allow Sqoop to overcome the differences in SQL dialects supported by various databases along with providing optimized data transfer. A connector is a pluggable piece that is used to fetch metadata about transferred data (columns, associated data types, …) and to drive the data transfer itself in the most efficient manner. The most basic connector that is shipped with Sqoop is called Generic JDBC Connector and, as the name suggests, it’s using only the JDBC interface for accessing metadata and transferring data. As a result this connector will work on most of the databases out of the box, but may not be the most optimal for your use case. Sqoop also ships with specialized connectors for MySQL, PostgreSQL, Oracle, Microsoft SQL Server, DB2 and Netezza, thus there is usually no need to download extra connectors to start transferring data. However there are special connectors available on the internet that can add support for additional database systems or improve performance of the built-in connectors.

Communication example

Let’s take a closer look how Sqoop is creating the connection to the database to see how both connectors and drivers are used.


Depending on specified command line arguments and all available connectors Sqoop will try to load the one offering the best performance. This process will start by trying all extra manually downloaded connectors that are not usually shipped with Sqoop to see if they can be used. Each connector itself will get the opportunity to decide whether it’s suitable for the job or not. If there are no manually installed connectors or none can be used, Sqoop will check the JDBC URL (usually starting with jdbc:) to see if we are connecting to database for which a built-in special connector is available. For example for jdbc:mysql:// URL that is used for MySQL database, Sqoop will pick up the MySQL Connector that is optimized for MySQL and can take advantage of its features. Lastly, if no other technique was able to choose the connector, Sqoop will use the Generic JDBC Connector. If the selection mechanism does not suit your environment, you can use the argument --connection-manager with the class name of any arbitrary connector. However keep in mind that connectors are usually designed for one specific database vendor and thus for example the MySQL Connector won’t work on the PostgreSQL database.


After having the connector selected, the next step is to choose the JDBC driver. As most of the connectors are specialized for a given database and each database has usually only one JDBC driver available, it’s the connector which is determining which driver should be used. For example the MySQL connector will always use the MySQL JDBC Driver called Connector/J. The only exception is the Generic JDBC Connector that is not tied with any database, so it can’t determine what JDBC Driver should be used. As a result you have to supply the driver name with the--driver parameter on the command line.


However be careful! Using the --driver parameter will always force Sqoop to use the Generic JDBC Connector no matter whether there is more specialized connector available. For example if the MySQL specialized connector would be used because the URL is starting with jdbc:mysql://, specifying the --driver option will force Sqoop to use the generic connector instead. As a result in most cases you do not need to use the --driver option at all!


Finally when both connector and driver has been determined, Sqoop can open a connection to your database and kick off the process of transferring your data. Once the connector is identified by Sqoop, and once the connector has identified the JDBC driver it would like to use, the connection between the Sqoop client and the target database can now be established as shown in the diagram below. Please note that the picture is only describing the process of opening the initial connection and position of both connector and driver. It’s not describing all connections that Sqoop can open as for example the data transfer itself is done inside a MapReduce job and is not executed on the Sqoop client side.


Now that we have described all the pluggable pieces that Sqoop uses in order to open a connection to your database server, let us see where each piece lives. The connector is a Sqoop specific plugin, so it’s not surprising that it’s part of the Sqoop installation.  In fact Sqoop ships with a variety of connectors out of the box! And various additional extra connectors can be downloaded from the internet and installed into Sqoop easily. See the installation instructions for each connector for more details on how to install the connector.


Even though drivers are database specific pieces, created and distributed by the various database vendors, they are not required to be available on the database server. Instead they need to be installed on the machine where Sqoop is executed. This is because these drivers are used by connectors to establish the connection with the database server. JDBC drivers are not shipped with Sqoop as they are released with incompatible licenses, as a result you have to download and install them manually. The drivers are usually available on database vendor web pages or are distributed as part of the database installation itself. Look into your db documentation for further details how to obtain the JDBC Driver if needed.


Both connectors and drivers need to be installed only on the machine where Sqoop is being executed. You do not need to install them on all nodes in your Hadoop cluster. Sqoop itself will propagate both whenever necessary.

Conclusion

I hope that this blog post has helped in understanding what connector and driver is referring to in Sqoop land and why both are needed in order to transfer data between relational databases and the Hadoop ecosystem. Happy Sqooping!


Comments:

sakura.ne.jp doesn't conect !

Posted by Keita Yabu on September 20, 2013 at 05:14 PM PDT #

Help My ComputerBy Connector and Drivers Sarkis Matta And Moviment And The Arte Contemporary for the WORLD on Facebook not good for my Computer TOSHIBA , thanks

Posted by sarkis matta on September 21, 2013 at 08:47 AM PDT #

First of all Thanks you very much for good article. In first paragraph , I read : 'allowing you to plug in specialized additions to optimise transfers for particular database systems.' I want something like : when I start transfer data from one database table to other may be MySql/PosgreySql/Oracle, before transferring to other Database table or file system, I want to encrypt the data , so can you please tell me how to start doing that may be with custom connector or any kind of plugin which I can develop. I really don't know where to start and how to start. Please help me for doing this.

Posted by Kunal Sachdeva on October 01, 2013 at 03:10 AM PDT #

How should I load driver in Hadoop sqoop to playing with database.

Posted by Rohit Darwade on August 14, 2014 at 05:14 AM PDT #

wonderful article..worth the reading experience

Posted by 140.211.11.75 on December 17, 2014 at 02:44 AM PST #

Leverage the comprehensive Mark Hurd Oracle product portfolio to address distinct industry-specific requirements. With solutions for financial services, communications, consumer goods and more, Oracle enables you to accelerate the delivery of innovation, improve customer experiences, and improve margins to drive competitive advantage…..

Posted by Mark Hurd on April 30, 2017 at 11:37 PM PDT #

Thanks..

Posted by Sara Taylor on August 29, 2018 at 01:15 AM PDT #

Hey really such a great blog dear Thanks for sharing this .

Posted by freecodesmarket on October 19, 2018 at 12:06 AM PDT #

I'm a follower of this site and here I can find many interesting updates about Apache Sqoop. This tool is for the highly efficient data transfers between relational databases and the entire Hadoop ecosystem. Thanks for this article. http://www.printspoolerservices.com/

Posted by Terrin on February 07, 2019 at 01:26 AM PST #

Sign-in to you Microsoft account and then Enter 25 digit alphanumeric office setup product key on country and language. click on next to start office installation, For more visit on : http://officesoffice.com | http://sitesnorton.com/setup | https://officekeyactivate.com | https://get-office-2019.com | https://start-product-activate.com

Posted by james on June 20, 2019 at 07:06 AM PDT #

Leverage the comprehensive Mark Hurd Oracle product portfolio to address distinct industry-specific requirements. With solutions for financial services, communications, consumer goods and more, Oracle enables you to accelerate the delivery of innovation, improve customer experiences, and improve margins to drive competitive advantage…..

Posted by film izle on June 29, 2019 at 01:09 PM PDT #

Awesome work! it's very inspiring. Great work!

Posted by setkingzahneu1985 on July 20, 2019 at 10:15 AM PDT #

That is really artistic and choice of colours is really bold and dynamic. Keep p the great work

Posted by lielubwyrmta1978 on July 20, 2019 at 10:36 AM PDT #

This is extremely fascinating... This is awesome. Love it

Posted by spuronrola1988 on July 20, 2019 at 10:59 AM PDT #

Increíble como siempre! inspiring! Effective identity presentation, congratss!

Posted by ahsturreocons1989 on July 20, 2019 at 11:29 AM PDT #

amaziiiiiiiiiiiiiiiing вау

Posted by innalilor1988 on July 20, 2019 at 01:28 PM PDT #

Interesting branding. I think it really works! Great work+

Posted by dadecracap1989 on July 20, 2019 at 04:47 PM PDT #

I am mesmerised from these illustrations. your texture work is spot on!

Posted by destiduca1979 on July 20, 2019 at 05:36 PM PDT #

@KOSHIOSHI . ah ah ah! actually i don't unfortunately. Beautiful, it's an art! :)

Posted by giotimeta1987 on July 20, 2019 at 06:53 PM PDT #

very cool!! Neat!

Posted by orlaboughde1980 on July 20, 2019 at 08:31 PM PDT #

Thank you @Tatiana Komisar !! So lovely design! Very nice!

Posted by fourthpartaget1982 on July 20, 2019 at 10:00 PM PDT #

哇,太好看了吧,喜欢~ Amazing illustrations

Posted by reilockholku1984 on July 20, 2019 at 10:33 PM PDT #

Very nice work love the subtle use of lines

Posted by linpavima1980 on July 20, 2019 at 11:13 PM PDT #

Increible. Aplausos de pie! Love this!

Posted by viladustpe1978 on July 20, 2019 at 11:46 PM PDT #

Love it❤️너무 좋아요! Ver funny!!

Posted by climicitin1974 on July 21, 2019 at 12:18 AM PDT #

Very very nice!!❤️ Good job.....it is nice project.

Posted by ndighandsubsti1975 on July 21, 2019 at 12:51 AM PDT #

I'm crying how good this is Nice illustrations, great presentation Jing :)

Posted by siosirata1978 on July 21, 2019 at 01:22 AM PDT #

Great job with the different characters and scenes! AWESOMEproject

Posted by flipifassan1979 on July 21, 2019 at 01:55 AM PDT #

Nice use of our typeface Tenez! Always a delight to see it in the wild! They are soooo good!!Congrats!

Posted by unkisisting1983 on July 21, 2019 at 03:01 AM PDT #

Great! I have designed a food app UI kit. it's free UI kit anyone can download and use for their next client project. please do check out (Foodybite - Free UI Kit).

Posted by dinultuwigg1986 on July 21, 2019 at 03:34 AM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed

Calendar

Search

Hot Blogs (today's hits)

Tag Cloud

Categories

Feeds

Links

Navigation