Ozone - The journey so far
Shashikant Banerjee, Mukul Kumar Singh
Apache Hadoop Ozone is a highly scalable, redundant, distributed object store. Ozone is designed to work well with existing Apache Hadoop ecosystem applications such as Hive and Spark. It is also designed for ease of operation and scales to thousands of nodes and billions of objects in a single cluster. Ozone supports a Hadoop Compatible File System interface as well as the S3 protocol.
The Apache Hadoop Distributed File System (HDFS) has been the de facto file system for big data and works best when most of the files are large, in the range of tens to hundreds of MBs. HDFS suffers from the well-known small files limitation and struggles when the file count goes beyond 300 million. Ozone was designed to address HDFS’s limitations while working seamlessly with existing Hadoop applications.
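A rough back-of-the-envelope estimate shows why file count, rather than data volume, becomes the bottleneck. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block) and one block per small file. Both figures are assumptions for illustration and do not come from this post.

```python
# Rough NameNode heap estimate for a namespace made of small files.
# ASSUMPTIONS (not from this post): ~150 bytes of heap per namespace
# object, and every small file occupying exactly one block.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_gb(file_count, blocks_per_file=1):
    """Return the approximate metadata heap in GB (10**9 bytes)."""
    # Each file contributes one file object plus one object per block.
    objects = file_count * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 10**9


if __name__ == "__main__":
    for files in (100_000_000, 300_000_000, 1_000_000_000):
        heap = estimate_namenode_heap_gb(files)
        print(f"{files:>13,} files -> ~{heap:.0f} GB heap")
```

Under these assumptions, 300 million single-block files already need on the order of 90 GB of heap in a single NameNode process, regardless of how little data the files actually hold.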
Ozone releases are named after US national parks. With every minor release, we move to the next letter for the release name and choose a new national park starting with that letter.
- Support for OzoneFileSystem
- Integration with Hive, Spark, YARN, and MapReduce
- Data pipeline handling and recovery
This was the first release of Ozone, and it came out on Oct 1, 2018. The release had basic Ozone functionality working: users could create Ozone volumes, buckets, and keys via the Ozone shell interface. The Ozone Filesystem interface was also functional, so YARN, Hive, Spark, and MapReduce applications could work against Ozone FS seamlessly. The release supported both REST and RPC protocols and shipped Java client libraries for each. Data pipeline handling and recovery in case of node failures were built into the system as well.
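The volume, bucket, and key hierarchy can be pictured as a three-level path. The following is a minimal sketch of how such a path splits into its levels. `parse_key_path` and `OzoneKeyPath` are hypothetical helpers written for illustration, not part of any Ozone client library, and the names `vol1` and `bucket1` are made up.

```python
from typing import NamedTuple


class OzoneKeyPath(NamedTuple):
    """Hypothetical container for the three namespace levels."""
    volume: str
    bucket: str
    key: str


def parse_key_path(path: str) -> OzoneKeyPath:
    """Split '/volume/bucket/key...' into volume, bucket, and key.

    Everything after the bucket is treated as the key name, so a key
    may itself contain '/' characters.
    """
    parts = path.lstrip("/").split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected /volume/bucket/key, got {path!r}")
    return OzoneKeyPath(*parts)


if __name__ == "__main__":
    print(parse_key_path("/vol1/bucket1/logs/2018/app.log"))
```

Treating the remainder of the path as a single key name reflects the flat object-store view, in which directory-like prefixes are just part of the key.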
- Support for S3 protocol via S3 Gateway
The next release came out on Nov 22, 2018. It added a new S3-compatible REST server, so Ozone can be used from any S3-compatible tool such as the AWS CLI or the AWS Java SDK. For example, with this release a user could create buckets using the AWS CLI, or use Goofys, an S3 FUSE driver, to mount any Ozone bucket as a POSIX file system. It also supported a minimally complete S3 API set, including GET, HEAD, and DELETE operations on buckets as well as objects.
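As an illustration of pointing a standard S3 tool at Ozone, the sketch below builds AWS CLI invocations that target an S3 Gateway instead of AWS. The endpoint address and the bucket name are assumptions, and the helper functions are hypothetical. The code only assembles the command lines; running them requires the AWS CLI and a reachable gateway.

```python
import shlex

# ASSUMPTION: the S3 Gateway listens here. 9878 is its usual HTTP port,
# but the host and port depend on your own cluster configuration.
S3G_ENDPOINT = "http://localhost:9878"


def aws_create_bucket_cmd(bucket, endpoint=S3G_ENDPOINT):
    """Build the argv for `aws s3api create-bucket` against the gateway."""
    return ["aws", "s3api", "create-bucket",
            "--bucket", bucket,
            "--endpoint-url", endpoint]


def aws_head_bucket_cmd(bucket, endpoint=S3G_ENDPOINT):
    """Build the argv for `aws s3api head-bucket` against the gateway."""
    return ["aws", "s3api", "head-bucket",
            "--bucket", bucket,
            "--endpoint-url", endpoint]


if __name__ == "__main__":
    # Print the commands; pass the lists to subprocess.run to execute them.
    for cmd in (aws_create_bucket_cmd("bucket1"), aws_head_bucket_cmd("bucket1")):
        print(shlex.join(cmd))
```

The only Ozone-specific part is the `--endpoint-url` option, which redirects the otherwise unmodified S3 client to the gateway.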
The release incorporated significant OzoneFilesystem stability improvements. Ozone data pipeline handling and recovery procedures were improved and streamlined.
- Support for Apache Hadoop Security (Authentication/Authorization/Encryption)
- Support for Auditing
The next release in the chain was 0.4.0, which came out on May 7, 2019. This release was primarily driven by the security feature of Ozone. Kerberos-based authentication support for Ozone was added. It had support for Hadoop Delegation Tokens and Block Tokens, whose purpose is to prevent unauthorized access while keeping the protocol lightweight and without sharing the secret over the wire. Similarly, S3Tokens were added, which are meant to be used for every S3 client request.
Certificate infrastructure for Ozone was plugged in to provide certificate-based authentication for Ozone service components. Transparent Data Encryption (TDE) support also came in as part of this release, allowing data blocks to be encrypted at rest, along with Apache Ranger support to control authorization. YARN, Hive, and Spark were able to work seamlessly in a secure Ozone cluster environment.
This release also included support for the Audit Log, which functionally completes the security ecosystem in Ozone. It came with a custom audit parser: a SQLite-based command-line utility to parse and query audit logs with predefined templates and options for a custom query.
Support for the S3A filesystem and for the S3 Gateway multipart upload API were also major highlights of this release.
- Kubernetes Integration
- Support for Native as well as Ranger ACLs
The next release made its way out on Oct 13, 2019, and with it came native Kubernetes (K8s) support for Ozone. The Ozone distribution package contains all the resource files required to deploy Ozone on Kubernetes, which makes Ozone a first-class citizen on Kubernetes clusters.
This release also plugged in support for native ACLs, which can be used independently or along with Apache Ranger. If Apache Ranger is enabled, then ACLs will be checked first with Ranger and then Ozone’s internal ACLs will be evaluated. Ozone ACLs are a superset of POSIX and S3 ACLs.
Crater Lake (0.5.0)
- Support for Topology awareness
- Support for GDPR Right to Erasure
- Scale testing up to 1 billion objects
The next release made its way out on Mar 24, 2020. This was the first Beta release for Ozone. Network topology awareness for block placement in Ozone was added in this release. Support for GDPR (Right to Erasure) was also one of the highlights of the release. With major stability and performance improvements in the IO pipeline, Ozone was tested and verified to work seamlessly at a scale of more than 1 billion keys.
- HA support for Ozone Manager
- Ozone OFS (New Filesystem scheme)
- Support for SSL
This is the latest release of Ozone and is the GA release. This release went out on Sep 2, 2020. It represents a point of API stability and quality that we consider production-ready. This release also added support for Recon, a management and administrative daemon inside Ozone that enables continuous monitoring of an Ozone cluster. It also periodically checks the correctness and stability of an Ozone cluster and reports any anomalies.
The Ozone journey thus far has been equally exciting and challenging due to the goals that we as a community had set out to achieve. Testing and integration with diverse tools and products has helped in enhancing and stabilizing the system. Ozone is now truly production-ready to be used along with other Hadoop applications.
Some of the important features in the upcoming releases:
- Support for upgrades and introduction of layout format in OM, SCM and DNs
- Ozone File System improvements for Directory delete and rename
- Decommissioning of Datanodes
- Support for large capacity disks on datanodes
Some of the features that are further ahead in the roadmap are:
- SCM High Availability
- Erasure Coding in Ozone to provide storage efficiency
- NFS to support new POSIX-related workloads
More detailed information regarding Apache Ozone releases and roadmap can be found here: https://cwiki.apache.org/confluence/display/HADOOP/Ozone+Road+Map.
Posted at 03:14AM Oct 07, 2020 by Mukul Kumar Singh in Technology