Entries tagged [hadoop]

Thursday Sep 03, 2015

Not just yet another release of Apache Bigdata stack: Bigtop 1.0 delivers fast data

Bigtop 1.0 is out! And this time around it isn't just about providing our users with the latest and most stable versions of the data crunching software, but also about a paradigm shift from big data towards fast data.
[Read More]

Wednesday Feb 11, 2015

20x faster mapreduce with gridgain-hadoop accelerator

This article shows you how to speed up your existing MapReduce code using the new Hadoop Accelerator by GridGain. The Accelerator is now available as part of Apache Ignite (incubating).
[Read More]

Tuesday Dec 02, 2014

Release of Apache Bigtop 0.8.0

This new release brings tons of new features and fixes for the beloved, 100% community-driven and open source big data distribution.

Among the new features:

  • New Gradle-based build system, with better integration into the underlying Bigtop platform. Going forward, it will allow us to improve the build experience as well as simplify it for new and experienced users.
  • The Groovy runtime was added directly to the stack, which will allow us to speed up initial HDFS cluster deployment
  • BigPetStore data analytics blueprints
  • JDK/OpenJDK 7 support. Yes, we have completely moved away from JDK6
  • Support for latest Ubuntu, OpenSUSE, and CentOS
  • Removing support for some outdated distros like CentOS5, Fedora17, and Ubuntu Lucid
  • The release fixes close to 200 bugs and adds a lot of new features focused on the usability and stability of the software stack

Of course this release brings its set of upgrades:

  • Apache Hadoop 2.4.1
  • Apache Giraph 1.1.0
  • Apache Flume 1.5.0.1
  • Apache Pig datafu 1.0.0
  • Apache Crunch 0.10.0
  • Apache HBase 0.98.5
  • Apache Hive 0.13.0
  • Hue 3.6.0
  • Pig 0.12
  • Apache Mahout 0.9
  • Apache Solr 4.6.0
  • Spark 0.9.1

And as usual, Apache Bigtop supports and provides convenience artifacts for a wide array of GNU/Linux distributions:

  • CentOS 6
  • Fedora 18
  • Ubuntu 12.04 (Precise)
  • Ubuntu 13.04 (Quetzal)
  • Ubuntu 14.04 (Trusty Tahr)
  • OpenSUSE 13.1
  • SLES11

Convenience repositories are available from http://www.apache.org/dist/bigtop/bigtop-0.8.0/repos/

Overall, Apache Bigtop 0.8.0 is a great release that continues what the previous ones brought to the field:

  • Stability
  • Reliability
  • Great set of features

As both a developer and a user of the distribution, I found that everything felt right and worked out of the box.

I would like to take the opportunity to thank all the members of the Apache Bigtop community for all the effort put into such a great community and distribution. I am looking forward to the coming 0.9.0 release, which will make the stack more focused on in-memory analytics with the addition of Apache Ignite (incubating) and other improvements.

I would also like to encourage everyone to give our latest release a try, and not to hesitate to come in, participate, and give feedback on our mailing lists (http://bigtop.apache.org/mail-lists.html).

[Read More]

Thursday Aug 28, 2014

Testing Apache Tez with Apache BigTop's new Testing Infrastructure

Apache BigTop's utilities can be consumed and reused by any hadoop distribution, not just BigTop itself. Puppet recipes, RPM specifications, and so on can save vendors weeks or months of time if borrowed from BigTop rather than maintained in house. However, until recently, the tests were somewhat difficult to customize and hack on.

So, after a lot of stimulating debate, we finally settled on our new test infrastructure. For those interested in building an integration test framework, especially for a Java-based application where Java/Scala/Groovy-based API calls will be important to run in an integration context, gradle-based tests can be very powerful.

  • You can organize gradle source trees easily, without any requirement for complex package hierarchies.
  • You can dynamically add source sets without a lot of boilerplate, meaning the tests can easily be extended and hacked on by new engineers, devops folks, etc.
  • You can still test low level java  functionality easily, by adding java libraries to the classpath at runtime, without needing to compile jars and manage a whole maven style project.
  • The test interface is easy to customize with arguments.  You can parse arguments however you want.
  • Gradle combines the power of groovy into a declarative language for builds
  • Using something like gradle-wrapper, you can make your java based tests easy to consume by anyone, even folks outside the java community.

As an example of how to use gradle for integration tests, I'll demonstrate how we retooled the BigTop tests. 

You can check out the new tests by cloning bigtop, and going into the bigtop-tests/smoke-tests directory.

First, let's take a look at the overall directory structure of the testing suite.

[bigtop@sandbox smoke-tests]$ tree
├── build.gradle
├── flume
│   ├── build.gradle
│   ├── conf
│   │   └── flume.conf
│   ├── log4j.properties
│   └── TestFlumeNG.groovy
├── hive
│   ├── build.gradle
│   └── log4j.properties
├── mahout
│   ├── build.gradle
│   └── log4j.properties
├── mapreduce
│   └── build.gradle

Testing Apache Tez with Apache BigTop

So, what better way to demonstrate the flexibility of the BigTop testing suite than to use it to test another tool, native to another distribution : Apache Tez on Hortonworks HDP !

The code for these tests is in this jira which also has the patch  to add a simple Tez test to bigtop attached to it.

How it works

It's pretty simple... Above we can see that each ecosystem component has a "build.gradle" file. The build.gradle file contains a few dependencies, and the names of the classes which it will be calling for tests. There is also a top-level build.gradle file. The job of this file is to send global parameters to the sub tests; it does no testing of its own. We do this using the "subprojects" directive. Finally, there is a settings.gradle file that parses our input arguments to decide which tests to run.

So, how do we extend these tests?  Easy !

  1. Pick any existing test as a template (for example, pig/).
  2. Create a new directory, for example "tez/", and copy the template's files into it.
  3. Customize the environment variables you want defined for your test in build.gradle
  4. Customize the unit testing script (which uses itest and junit for assertions and running bash commands)
  5. Run your new tests : gradle clean compileGroovy -Dsmoke.tests=tez --info
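The copy-and-rename part of these steps can be sketched in a few lines of shell. This is only an illustration under stated assumptions: a scratch directory stands in for bigtop-tests/smoke-tests, and the file names TestPig.groovy and TestTez.groovy are hypothetical placeholders rather than the actual files in the repository. The gradle command from step 5 is printed instead of executed, since running it needs a real checkout and cluster.

```shell
set -eu

# Scratch stand-in for bigtop-tests/smoke-tests (contents are hypothetical).
smoke_dir="$(mktemp -d)"
mkdir -p "$smoke_dir/pig"
printf '// dependencies and test class names go here\n' > "$smoke_dir/pig/build.gradle"
printf '// assertions for pig go here\n' > "$smoke_dir/pig/TestPig.groovy"

# Steps 1 and 2: copy an existing component's files into a new directory.
template="pig"
component="tez"
cp -r "$smoke_dir/$template" "$smoke_dir/$component"

# Steps 3 and 4 are manual edits; here we only rename the test script.
mv "$smoke_dir/$component/TestPig.groovy" "$smoke_dir/$component/TestTez.groovy"

# Step 5: the command quoted in the post, printed rather than run.
run_cmd="gradle clean compileGroovy -Dsmoke.tests=$component --info"
echo "cd $smoke_dir && $run_cmd"
```

In a real checkout you would skip the scratch setup, run the copy from inside bigtop-tests/smoke-tests, and then edit the copied build.gradle and test script by hand.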

There it is: in slightly under 100 lines of code, by adding two simple files, we were able to add a new test to the bigtop test suite. Note that we didn't have to edit a single existing file to run this test. Rather, we just dumped some groovy scripts into a new directory, and gradle discovered them, ran the tests for us, and created a nice little html report as well, which is now available in ./build/reports/tests/index.html. Gradle also injected the inherited dependencies for us, and did some basic sanity checking as well.

We can see that the original test ran a MapReduce job, but after turning tez on, our job UI indeed shows that we can now test Tez using BigTop's test framework.

Thursday May 29, 2014

BigTop hackathon at Hadoop Summit !

Time for another hackathon.

There are a lot of companies who contribute to BigTop: Pivotal, Cloudera, Red Hat, WanDisco, Amazon, and so on... if I left yours out, feel free to leave a comment below and I'll update this post. And today, we are proud to announce that Red Hat is hosting the next bigtop hackathon, immediately following Hadoop Summit 2014 in San Jose. Hadoop is about a lot more than just source code - it's about packaging, deployment, configuration, and so on. And BigTop has embraced the difficult task of tying all this together.

Apache BigTop makes Hadoop deployment transparent

All source code is complex, regardless of the language. But what's even more complex is the deployment of code on a distributed system. While vendors have come a long way in making it easy to DEPLOY hadoop with black box administrative or cloud tools, nobody has really opened up hadoop by building a culture around its deployment.

Enter Apache BigTop. 

BigTop contains all the stuff that's not in the hadoop docs. For example:

  • Puppet modules for installation and configuration of hadoop without using a tarball.
  • Vagrant recipes for deployment-from-zero.
  • Smoke tests for fine grained testing.
  • The intersection of Java and RPM.
  • bigpetstore app for demonstrating to the business community how to actually use hadoop, gradle, pig, and google's javascript visualization widgets tools to build, test, and deploy a reference "hadoop app".

BigTop is working to embrace HCFS

The community has put a lot of work into testing different hadoop stacks on different file systems (https://wiki.apache.org/hadoop/HCFS/Progress), and the bigtop community has embraced this effort, at the cost of having to support a generic filesystem deployment and of a lot of JIRA reviewing. For example, with the recent BIGTOP-952 and BIGTOP-1200 JIRAs, we're now packaging HDFS-independent artifacts into BigTop. That paves the way for more competition, more choice, and more hadoop hacking, which ultimately translates to a better end-user experience around hadoop.

BigTop builds OS+Admin friendly packages for emerging ecosystem projects, fast !

If you compare apache bigtop with other hadoop vendor distributions, you'll find that it is the bleeding edge. For example, you can watch this recent video demonstration of spinning up Storm on BigTop, from ApacheCon 2014 (https://www.youtube.com/watch?v=VZzJxsMJahc), to see just how easy it is to deploy spark out of the box using BigTop's deployment recipes. As new projects come forward in the upstream, the first place to put them is into apache BigTop. This means that if you want to try out a new animal in hadoop's stack, you can easily do so with the bigtop stack. And again: the infrastructure around vagrant makes it easy to build maintainable VM workflows around hadoop app and distribution development tasks, which can easily be modified to include or exclude whichever bleeding edge packages you like. Think of bigtop's approach to packaging and deployment as a lower-level version of apache ambari.

Sounds interesting? Come to the hackathon in Mountain View after Hadoop Summit !

So this is all prelude to the BigTop hackathon that we are hosting at Red Hat. The focus will be on hacking - not presentations. But that doesn't mean you have to be an expert to get involved. Coming to this hackathon will give you a chance to pair program with the BigTop committers, and try your hand at working directly on a JIRA. I think most would agree that hacking around on apache bigtop is an excellent introduction to the hadoop ecosystem.

The NEXT HACKATHON will be from June 6th to June 9th at the Red Hat offices in Mountain View, California. For more details, ping us on the BigTop mailing list and check the meetup URL: http://www.meetup.com/Bay-Area-Bigtop-Meetup/events/184893732/


See you there ! 

Thursday May 01, 2014

Getting involved with BigTop packaging

To get a feel for what bigtop's packaging of hadoop components is all about, I suggest checking out Roman's puppetcon bigtop talk from a few years back.

The thrust of this talk is that we need to bring uniformity to the hadoop ecosystem, and ease of use to end users of hadoop. To me, an important first step down this path is bringing the Java community in line with what packaging is really all about and why it makes it easier to maintain complex systems.

As a Java/Maven guy, wrapping my head around "packaging" has been a little tricky... And according to Stephen R. Covey, change begins on the inside :).

How would YOU package hadoop as an RPM ?

The thought of this is pretty daunting, and it's really interesting to see how this is solved in bigtop. I've begun documenting my current adventures into the world of RPMs, packaging, and BigTop. I've just begun to scratch the surface of all of the services, users, binaries, and security features associated with a basic RPM hadoop installation, and it will probably be a while before I fully understand how it all really works.

So in the meantime, let's learn about hadoop packaging with a simpler project... Apache Mahout.

Here are the packaging resources for mahout inside of bigtop:

common/mahout/
├── do-component-build
└── install_mahout.sh

...
bigtop-packages/rpm/mahout/SPECS/mahout.spec

Above you can see that there are three main components to packaging of mahout.

1) The "do-component-build" file.

2) The "install_mahout.sh" file.

3) The rpm file "mahout.spec", which actually uses these two components to do its work.

The do-component-build builds the raw mahout artifact directly from source.  You can see the java specific details of mahout compilation in there. 


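# Abort on the first error and echo each command as it runs
set -ex

# Source bigtop.bom, which is expected to define HADOOP_VERSION
. `dirname $0`/bigtop.bom

# Build the mahout distribution tarball from source, skipping the tests
mvn clean install -Dmahout.skip.distribution=false -DskipTests -Dhadoop2.version=$HADOOP_VERSION "$@"
# Unpack the tarball into build/, dropping its top-level directory
mkdir build
for i in distribution/target/mahout*.tar.gz ; do
  tar -C build --strip-components=1 -xzf $i
done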
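To see what the unpack loop above does, here is a small self-contained sketch that builds a throwaway tarball and extracts it the same way. The directory name mahout-distribution-0.0 and the file inside it are made up for the demo; the point is only that --strip-components=1 drops the versioned top-level directory, so files land directly under build/.

```shell
set -eu

scratch="$(mktemp -d)"
cd "$scratch"

# Fake distribution tarball with a single versioned top-level directory
# (the name mahout-distribution-0.0 is invented for this demo).
mkdir -p distribution/target stage/mahout-distribution-0.0/bin
echo 'echo mahout' > stage/mahout-distribution-0.0/bin/mahout
tar -C stage -czf distribution/target/mahout-distribution-0.0.tar.gz mahout-distribution-0.0

# Same unpack loop as in do-component-build: strip the first path
# component so the contents end up directly under build/.
mkdir build
for i in distribution/target/mahout*.tar.gz ; do
  tar -C build --strip-components=1 -xzf "$i"
done

# List what was extracted: bin/ sits at the top of build/.
ls build
```

Without --strip-components=1 the same extraction would produce build/mahout-distribution-0.0/bin/mahout, and later packaging steps would have to know the version number to find anything.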


Meanwhile, install_mahout.sh contains the actual logic of how and where mahout jars will go, and a snippet that writes out the mahout startup shell script /usr/bin/mahout.


# Copy in the /usr/bin/mahout wrapper
install -d -m 0755 $PREFIX/$BIN_DIR
cat > $PREFIX/$BIN_DIR/mahout <<EOF
#!/bin/bash

# Autodetect JAVA_HOME if not defined
. /usr/lib/bigtop-utils/bigtop-detect-javahome

# FIXME: MAHOUT-994
export HADOOP_HOME=\${HADOOP_HOME:-/usr/lib/hadoop}
export HADOOP_CONF_DIR=\${HADOOP_CONF_DIR:-/etc/hadoop/conf}

export MAHOUT_HOME=\${MAHOUT_HOME:-$INSTALLED_LIB_DIR}
export MAHOUT_CONF_DIR=\${MAHOUT_CONF_DIR:-$CONF_DIR}
# FIXME: the following line is a workaround for BIGTOP-259
export HADOOP_CLASSPATH="`echo /usr/lib/mahout/mahout-examples-*-job.jar`":\$HADOOP_CLASSPATH
exec $INSTALLED_LIB_DIR/bin/mahout "\$@"
EOF
chmod 755 $PREFIX/$BIN_DIR/mahout
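One detail in the snippet above is easy to miss: because the here-document delimiter is unquoted, plain variables such as $INSTALLED_LIB_DIR are expanded while the wrapper is being written, whereas anything written as \$ is left for the wrapper to evaluate when it runs. The following sketch shows the same mechanism in isolation; the scratch path and the value /usr/lib/mahout are assumptions chosen for the demo, not values taken from the packaging scripts.

```shell
set -eu

scratch="$(mktemp -d)"
INSTALLED_LIB_DIR="/usr/lib/mahout"   # assumed value, for the demo only

# Unquoted EOF: $INSTALLED_LIB_DIR is expanded now, while \$ survives as a
# literal $ that the generated wrapper evaluates when it is run.
cat > "$scratch/wrapper" <<EOF
#!/bin/bash
export MAHOUT_HOME=\${MAHOUT_HOME:-$INSTALLED_LIB_DIR}
echo "\$MAHOUT_HOME"
EOF
chmod 755 "$scratch/wrapper"

# Show the generated file: the default path is baked in, the lookup is not.
cat "$scratch/wrapper"

# Run it with and without MAHOUT_HOME set in the caller's environment.
default_home="$(unset MAHOUT_HOME; bash "$scratch/wrapper")"
override_home="$(MAHOUT_HOME=/opt/mahout bash "$scratch/wrapper")"
echo "default=$default_home override=$override_home"
```

This is why the generated /usr/bin/mahout can carry install-time paths while still letting an administrator override them through the environment.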


Anyways, hope this quick tour helps those who are trying to get involved with the bigtop packaging process.  It took me a few days to understand how it all works, because after all, packaging software is an intrinsically complex task.  But thankfully, there are TONS of examples of how to package all the different players of the hadoop ecosystem underneath bigtop-packages/src which can easily help you get started.


Wednesday Mar 26, 2014

Bigtop events at ApacheCon 2014, Denver, CO

Details of the Bigtop meetup and hackathon during ApacheCon 2014
[Read More]

Friday Apr 19, 2013

BigTop: the way to grow open Hadoop stack acceptance

BigTop is stepping up in its role as the foundation of a standard Hadoop-based data analytics stack, essentially bringing most of the commercial offerings onto a standard footing.
[Read More]

Sunday Jul 08, 2012

What is Bigtop, and Why Should You Care?

Ever since Apache Bigtop entered incubation, we've been answering a very basic question: what exactly is Bigtop, and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that "Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem". That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation's (ASF) Hadoop ecosystem projects, yet it doesn't really help you understand the aspirations of Bigtop that go beyond what the ASF has traditionally done.

[Read More]

Monday Apr 02, 2012

Bigtop presents full stack based on Apache Hadoop 1.0

The first ever full stack based on Hadoop 1.0 has just been released. It includes all data analytics components like Hive, HBase, Pig, Mahout, and many more. The release is available for immediate download from all ASF mirrors for all major Linux distributions: Ubuntu, Fedora, CentOS, and SUSE.
[Read More]

Thursday Feb 09, 2012

All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.

Lining up versions of Hadoop and making sense of all of them and their relations can be quite difficult.

This article attempts to address the moot points and help you understand the "bigger picture" - literally.

[Read More]

Wednesday Dec 28, 2011

Conception and validation of Hadoop BigData stack.

What is the BigTop project? What are its goals, and how is it going to achieve them? What are the roots and founding ideas of the project?

I think you'll find the answers to these questions in what will hopefully become a series of helpful posts for IT professionals dealing with Hadoop stack deployment and adoption.

[Read More]
