The Apache Software Foundation Blog

Wednesday November 09, 2011

The Apache Software Foundation Announces Apache Tika™ v1.0

Standards-based, Content and Metadata Detection and Analysis Toolkit Powers Large-scale, Multi-lingual, Multi-format Repositories at Adobe, the Internet Archive, NASA Jet Propulsion Laboratory, and more.

9 November 2011 —FOREST HILL, MD— The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of nearly 150 Open Source projects and initiatives, today announced Apache Tika v1.0, an embeddable, lightweight toolkit for content detection and analysis.

"The Apache Tika v1.0 release is five years in the making, providing numerous improvements and new parsing formats," said Chris Mattmann, Apache Tika Vice President, Senior Computer Scientist at NASA Jet Propulsion Laboratory, and University of Southern California Adjunct Assistant Professor of Computer Science. "From a toolkit perspective, it's easy to integrate, and provides maximum functionality with little configuration."

With the amount of information available on the Internet growing every day, automatic information processing and retrieval are urgently needed to understand content across cultures, languages, and continents.

Apache Tika is a one-stop shop for identifying, retrieving, and parsing text and metadata from over 1,200 file formats including HTML, XML, Microsoft Office, OpenOffice/OpenDocument, PDF, images, ebooks/EPUB, Rich Text, compression and packaging formats, text/audio/image/video, Java class files and archives, email/mbox, and more.

Tika entered the Apache Incubator in 2007, became a sub-project of Apache Lucene in 2008, and graduated as an ASF Top-level Project (TLP) in April 2010. Apache Tika has been tested extensively in repositories exceeding 500 million documents across a variety of applications in industry, academia and government labs.

"At NASA, we leverage Apache Tika on several of our Earth science data system projects," explained Dan Crichton, Program Manager and Principal Computer Scientist, NASA Jet Propulsion Laboratory. "Tika helps us process hundreds of terabytes of scientific data in myriad formats and their associated metadata models. Using Tika with other Apache technologies such as OODT, Lucene, and Solr, we are able to automate, virtualize and increase the efficiency of NASA's science data processing pipeline."

Users and software applications use Apache Tika to explore the information landscape through flexible interfaces in Java, from the command line, through RESTful Web services, and from a multitude of other programming languages, including Python, .NET, and C++. Tika defines a standard application programming interface (API) and makes use of existing parser libraries such as Apache POI and PDFBox to detect and extract metadata and structured text content from various documents.
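
To make the detect-then-delegate idea concrete, here is a minimal sketch in plain Java. It is not Tika's implementation and does not use Tika's API: the class name DetectSketch, the three-entry signature table, and the stand-in extractor are all hypothetical and exist only for illustration. It shows one common way a toolkit can identify a format from the leading bytes of a file and then hand the content to a format-specific handler.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a hypothetical detector and dispatcher, not Tika code.
public class DetectSketch {
    // Hypothetical signature table: leading "magic" bytes -> media type.
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    // Hypothetical handlers: media type -> text extractor.
    private static final Map<String, Function<byte[], String>> EXTRACTORS = new HashMap<>();

    static {
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
        MAGIC.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        // Stand-in extractor; a real toolkit would delegate to a format library here.
        EXTRACTORS.put("application/pdf", data -> "[PDF text would be extracted here]");
    }

    // Returns the media type whose signature matches the start of the data,
    // or a generic binary type when nothing matches.
    static String detect(byte[] data) {
        for (Map.Entry<byte[], String> entry : MAGIC.entrySet()) {
            if (startsWith(data, entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    // Picks an extractor by detected type; in this sketch, types without a
    // registered extractor are simply decoded as UTF-8 text.
    static String extract(byte[] data) {
        Function<byte[], String> fallback = d -> new String(d, StandardCharsets.UTF_8);
        return EXTRACTORS.getOrDefault(detect(data), fallback).apply(data);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader)); // prints "application/pdf"
    }
}
```

Keeping detection separate from extraction means callers see a single entry point while each format can be served by whichever library handles it best.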

"We've used Apache Tika extensively for a wide range of content extraction tasks, including parsing almost 600 million pages and documents from a large web crawl," said Ken Krugler, Founder and President of Scale Unlimited. "It's proven invaluable as a simple yet robust solution to the challenges of extracting text and metadata from the jungle of formats you find on the web."

"Hippo CMS 7 uses Apache Jackrabbit to index content repositories containing as many as 500,000 documents," explained Arjé Cahn, CTO of Hippo. "We are exploring ways that Apache Tika can enhance access to metadata in our faceted navigation feature, which may result in a possible future patch."

Availability and Oversight

As with all Apache products, Apache Tika software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. Apache Tika source code, documentation, and related resources are available at

Apache Tika in Action!

Apache Tika v1.0 will be featured at ApacheCon's Content Technologies track on 10 November 2011. PMC Chair Mattmann will describe the modern genesis of the project and its ecosystem, as well as the newly launched Manning Publications book, "Tika in Action," co-authored by Mattmann and Jukka Zitting.

About The Apache Software Foundation (ASF)

Established in 1999, the all-volunteer Foundation oversees nearly one hundred fifty leading Open Source projects, including Apache HTTP Server, the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 350 individual Members and 3,000 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) not-for-profit charity, funded by individual donations and corporate sponsors including AMD, Basis Technology, Cloudera, Facebook, Google, IBM, HP, Matt Mullenweg, Microsoft, PSW Group, SpringSource/VMware, and Yahoo!. For more information, visit

"Apache", "Apache Tika", and "ApacheCon" are trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.

# # #


