Introducing the Trunk Data Platform: TOSIT’s Curated Open-Source Big Data Distribution

Since the merger of Cloudera and Hortonworks, the primary commercial Hadoop distribution for on-premises workloads has been CDP Private Cloud, which combines the best features of CDH and HDP. With HDP 3.1's End of Support (EOS) approaching in December 2021, Cloudera's clients are required to migrate to CDP. But what about clients who cannot upgrade regularly enough to keep up with EOS dates? Some clients are not interested in the cloud features offered by Cloudera and simply want to continue running their "legacy" Hadoop workloads. There is also continued interest in a free, unsupported Big Data distribution for non-business-critical workloads, which Hortonworks' HDP made possible. Lastly, there are concerns about the decrease in open-source contributions since the two companies merged. Trunk Data Platform (TDP) was created to address these issues: it has shared governance, is freely accessible, and is 100% open-source. TDP is supported by TOSIT, a French non-profit organization that promotes open-source software, whose members include industry leaders such as Carrefour, EDF, Orange, and the French Ministry for the Economy and Finance.

Trunk Data Platform (TDP) is built on well-known Apache projects from the Hadoop ecosystem, providing a secure and robust foundation for a variety of Big Data use cases. The components included in TDP are Apache ZooKeeper, Apache Hadoop, Apache Hive, Apache Tez, Apache Spark, Apache Ranger, Apache HBase, Apache Phoenix, Apache Phoenix Query Server, and Apache Knox. The component versions have been chosen to ensure compatibility with one another and are based on those shipped with HDP 3.1.5. A table summarizing the version of each component is maintained in TDP's main repository.

Building TDP involves compiling the source code of the underlying Apache projects. The complexity of these projects, their inter-dependencies, and the different programming languages they use make the build process challenging. To ensure reproducibility, TDP uses a Docker image that contains all the necessary tools and dependencies. Testing is a critical part of the TDP release process because it verifies that the components remain compatible with one another. Jenkins automates the building and testing of TDP, and the test reports are saved for each project.

Deploying TDP involves packaging the components into .tar.gz files and using an Ansible collection to manage the deployment and configuration of the TDP stack. The Ansible playbooks can be run manually or through TDP Lib, a Python CLI that adds capabilities such as deploying components in the correct order based on their dependencies and managing configuration versioning.
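To make the idea of dependency-ordered deployment concrete, here is a minimal Python sketch. It is not TDP Lib's actual implementation: the `DEPENDENCIES` map and the `deployment_order` function are hypothetical names, and the dependency edges are simplified assumptions chosen for illustration. The sketch relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) to produce an order in which every component comes after the components it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified dependency map (an assumption for illustration,
# not taken from TDP): each component maps to the set of components that
# must be deployed before it.
DEPENDENCIES = {
    "zookeeper": set(),
    "hadoop": {"zookeeper"},
    "tez": {"hadoop"},
    "hive": {"hadoop", "tez"},
    "spark": {"hadoop"},
    "hbase": {"hadoop", "zookeeper"},
    "phoenix": {"hbase"},
}


def deployment_order(dependencies):
    """Return the components in an order where each follows its dependencies.

    Raises graphlib.CycleError if the dependency map contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, component in enumerate(deployment_order(DEPENDENCIES), start=1):
        print(f"{step}. deploy {component}")
```

Several valid orders usually exist (for example, Spark and Tez can be swapped in this map), so a real tool would also need a deterministic tie-breaking rule if reproducible runs matter.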

TDP is still a work in progress. The roadmap includes expanding the list of components in the distribution and experimenting with new Apache Incubator projects. A Web UI, powered by TDP Lib, is also being designed to handle configuration management, service monitoring, and alerting. To get involved, start with the Getting started repository, or open pull requests and report issues in the Ansible collection or the other TOSIT-IO repositories. For any questions, contact david@adaltas.com.
