Apache Sentry architecture overview
Apache Sentry architecture overview
Apache Sentry is an authorization module for Hadoop that provides the granular, role-based authorization required to provide precise levels of access to the right users and applications.
It currently works out of the box with Apache Hive/Hcatalog, Apache Solr and Cloudera Impala. In future this could be extended to many Hadoop ecosystem components like HDFS and HBase.
This document provides high level architecture of Apache Sentry and integration with hive.
What is Apache Sentry
While Hadoop has strong security at the filesystem level, it lacked the granular support needed to adequately secure access to data by users and BI applications. This problem forces users to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.
Sentry provides the ability to control and enforce access to data and/or privileges on data for authenticated users. It offers fine-grained access control to data and metadata in Hadoop. In its initial release for Hive and Impala, Sentry allows access control at the server, database, table, and view scopes at different privilege levels including select, insert, and all. The column level security can be implemented by creating a view of subset of allowed columns. One can restrict the base table and grant privileges on the view so that the columns with sensitive data doesn’t have to be exposed to the unauthorized users.
Sentry supports ease of administration through role-based authorization; you can easily grant multiple groups access to the same data at different privilege levels. For example, for a particular data set you may give your fraud detection team rights to view all columns, your analysts rights to view only non-sensitive or non-PII (personally identifiable information) columns, and your ingest processing pipeline rights to insert new data into HDFS.
How does Sentry work
The goal of Apache Sentry is to address the authorization requirement. It’s a policy engine that can be used by a data processing tool to validate access. It’s highly extensible to support any arbitrary data model. Currently it support the relational data model for Apache Hive and Cloudera Impala, as well as hierarchical data model used by Apache Solr.
Sentry provides means of defining and persisting the policies for accessing resources. Currently the policies can be stored in flat files or DB backed storage that can be accessed using a RPC service. The data processing tool (eg Hive) identifies the user request to access a piece of data in certain mode, eg read a data row from a table or drop a table. The tool then asks Sentry to validate this access. Sentry builds map of privileges allowed for the requesting user and then determines whether the given request should be allowed. The requesting tool then allows or prohibit the user access based on Sentry’s decision.
Following are the actors that play part in Sentry authorization
- Users and Groups
A resource is an object that you want to regulate access to. In the relational model a resource can be Server, Database, Table or URI (ie HDFS or local path).
By default Sentry does not allow access to any resource unless explicitly granted. A privilege is essentially a rule that grant access to a resource. It spells out how a given resource is allowed to be accessed. For example, a table called customer_info from a database called sales is allowed to access in read mode.
Role is a collection of privileges. This is template to combine multiple privileges required for a logical role in the data processing. For example, a data analyst in your organisation requires read and write access to sales table, read access to customer table and full access to sandbox database. The notion of roles allow one to club all these rules under a single template which can be assigned to an analyst in one shot. Moreover this allows you to maintain the analyst permissions in future. For example, if analysts need change access to customer table from read mode to write mode, you can simply make that one change in the analyst role which will reflect for all analyst.
A group is a collection of users. Sentry group mapping is extensible. By default Sentry leverages Hadoop’s group mapping (which in turn can be OS groups or LDAP groups). Sentry allows you to associate roles to groups. The notion of groups further simplifies the administration. You can combine a number of users into a single group. For example, Bob, John and Kim are analyst in your organisation. You can put all of them into a single group called analyst. Then you can grant the analyst role (discussed in previous section) to this group analyst. This saves the trouble of assigning the roles to each users. If Bob moves out of analyst role, you can simply remove him from analyst group to restrict this access as analyst. Also if John takes an additional role of a manager, then you can simply add him to manager role to grant him all managerial access to.
Note that Sentry only supports this template based policy granting. You can’t grant a privilege directly to a user or group. You are required to combine privileges under roles and a role can only be granted to a group, not directly to a user.
As mentioned before, Sentry policy engine is a plugin invoked by downstream tool like Hive. The binding module is the bridge between the invoking tool and Sentry authorization. This layer takes the authorization request in the requestors native format and converts that into a auth request that can be handled by Sentry policy engine. For example consider consider following hive query,
INSERT INTO TABLE report_db.monthly_sales
SELECT customer_name, transaction_date, amount FROM
prod_db.customer JOIN prod_db.transaction
ON (customer.id = transaction.cid)
This query needs write access to table montly_sales from database report_db, read access on tables customer and transaction from prod_db. It’s the responsibility of the binding layer to extract this information from Hive’s compiler structure and pass it down to the policy engine.
This is the core of Sentry’s authorization. The policy engine gets the requested privileges from the binding layer and the required privileges from the provider layer. It looks at the requested and required privileges and makes the decision whether the action should be allowed.
The provider is an abstraction for making the authorization metadata available for the policy engine. This allows the metadata to be pulled out of the underlying repository independent of the way that metadata is stored.
Currently Sentry support file based storage and DB based storage out of the box.
File based provider
The File based provider stores metadata in a ini format file. The file can reside on a local file system or HDFS. The policy file contains a group section that contains group to role mapping. The roles section contains role to privilege mapping. Here an example of a policy file
# Assigns each Hadoop group to its set of roles
manager = analyst_role, junior_analyst_role
margin: 0px; font-family: Arial; line-height: 1; padding-top: 0pt; padding-bottom: 0pt; widows: 2; orphans: 2; direction: ltr;">analyst = analyst_role
admin = admin_role
analyst_role = server=server1->db=analyst1, \
# Implies everything on server1.
admin_role = server=server1
DB based provider
The file provider makes it hard to modify programmatically, has race conditions when modifying, and is tedious to maintain. The products like Hive and Impala need to support industry standard SQL interface to administer the authorization policies which requires a programmatic way to manage it.
The Sentry policy store and Sentry Service persist the role to privilege and group to role mappings in an RDBMS and provide programmatic APIs to create, query, update and delete it. This enables various Sentry clients to retrieve and modify the privileges concurrently and securely.
Sentry Policy Store works with a number of back-end databases (MySQL, Postgres etc). It uses ORM library DataNucleus to read and write to the database.
Sentry Service supports Kerberos authentication. Other authentication mechanisms can be added subsequently, if needed. You can further restrict the connection by specifying a list of users that are allowed to connect to service.
Currently Sentry service supports trusted authorization. The users are that connect to the service are essentially super users (eg. hive or Impala). The connecting user can specify the effective user for the each RPC request. The admin users that are allowed to execute a request is configurable. For example, service user hive connect to Sentry store and submit a create role request on behalf of user Bob. If Bob is not configured as an admin user, this request will be rejected.
The current RPC interface supported by Sentry service is available athttps://github.com/apache/incubator-sentry/blob/master/sentry-provider/sentry-provider-db/src/main/resources/sentry_policy_service.thrift
Sentry Hive integration
Sentry policy engine is plugged into Hive via semantic hook. HiveServer2 executes this hook after the query is successfully compiled.
The hooks gets the list of objects the query is try to access in read and write mode. The Sentry Hive binding converts this into authorization request based on the SQL privilege model.
The policy manipulation is handled in two steps. During the query compilation, Hive invokes Sentry’s authorization task factory that generates Sentry specific task which is executed during query processing. This task invokes the Sentry store client which sends RPC request to Sentry service for making authorization policy changes.
Sentry is integrated into Hive Metastore via pre-listener hooks. The metastore executes this hook prior to executing the metadata manipulation request. The metastore binding creates a Sentry authorization requests for the metadata modification request coming for the metastore/HCatalog client.