What is the problem we are trying to solve?

Hadoop is topical at the moment as it is the platform of choice for Big Data projects. Most companies are beginning to run Machine Learning and AI projects on their data to gain better business insights, both for their own internal needs and for their clients. The vast majority of these projects are carried out on a Hadoop platform, as it is the platform best able to handle the volumes of data required for analysis and the one most analytic tools have been developed for.

The problem here is that Hadoop is difficult. It is not one language or one operating system, but rather an ecosystem of disparate systems that work together in a distributed processing environment.

So apart from the complexity of having to master several languages (Python, Java, shell scripting, SQL, etc.) to program over a dozen different modules (Hive, Spark, Sqoop, etc.), you also have to understand how the data is stored and processed in a distributed processing environment. It is not easy to make this work out of the box.

Exposition of the problem

The first step in any Hadoop project is a Proof of Concept (POC). This determines whether there is value to be derived from the data and whether the project is worth taking through to production. POCs are straightforward for Hadoop specialists to design and plan, and they have a number of distinct stages:

  • Data Identification: Identify what data is required for use in the POC and ensure that it is available for the duration of the project.
  • Use Case: Establish the business use case to be implemented (e.g. using Machine Learning to identify customers for upselling opportunities, etc.) and what is required in terms of development and technology to prove this use case.
  • Data Platform: Decide on a platform of choice (on-premise or cloud) and the size of your Hadoop cluster (3, 4, 5 or more nodes).
  • Hadoop Distribution: Decide on which distribution you will use for the POC – Apache, Cloudera, Hortonworks or MapR.
  • Data Ingestion: The nature of the data (static or streaming, structured or unstructured) will determine the ingestion method. Options include Sqoop, Kafka, NiFi, StreamSets, etc.

  • Data Storage: The nature of the data and the type of processing you expect to carry out on it will determine the storage platform that is used. Options include HBase, Hive, MongoDB, Impala, etc.
  • Data Security: Do you develop for a Kerberos environment or not? It is certainly easier not to do so, but the work carried out on the POC will be of little use if you need to deploy Kerberos in production.
  • Data Transformation: Before you apply the use case solution to the data, you will typically need to combine and re-format the data to suit your processing requirements. This can be done using SQL, Spark or other options.
  • Data Governance: Finally, you may or may not decide to implement data governance on your POC, depending on the nature of your data and use case.
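The planning decisions above can be captured in a small record so that obvious risks are flagged before implementation starts. The following is only a minimal sketch: the `PocPlan` class, its field names and the checks in `warnings()` are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PocPlan:
    """Hypothetical record of the POC planning decisions (illustrative only)."""
    use_case: str
    data_sources: List[str]
    platform: str                 # "cloud" or "on-premise"
    node_count: int
    distribution: str             # e.g. "Cloudera"
    ingestion_tool: str           # e.g. "Sqoop"
    storage: str                  # e.g. "Hive"
    kerberos_in_poc: bool
    kerberos_in_production: bool
    transformation_engine: str = "Spark"
    data_governance: bool = False

    def warnings(self) -> List[str]:
        """Return planning risks; an empty list means none were detected."""
        found = []
        if not self.data_sources:
            found.append("No data sources identified for the POC.")
        if self.platform not in ("cloud", "on-premise"):
            found.append(f"Unknown platform: {self.platform!r}.")
        if self.node_count < 1:
            found.append("The cluster needs at least one node.")
        if self.kerberos_in_production and not self.kerberos_in_poc:
            found.append(
                "POC skips Kerberos but production requires it; "
                "POC work may need to be redone."
            )
        return found


# Example plan with assumed values for the upselling use case.
plan = PocPlan(
    use_case="Identify customers for upselling opportunities",
    data_sources=["crm_customers", "sales_orders"],
    platform="cloud",
    node_count=4,
    distribution="Cloudera",
    ingestion_tool="Sqoop",
    storage="Hive",
    kerberos_in_poc=False,
    kerberos_in_production=True,
)
for warning in plan.warnings():
    print(warning)
```

With the assumed values shown, the only risk reported is the Kerberos mismatch between the POC and production.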

Designing a POC plan is the easy part. It is the implementation of the POC where things can start to go wrong.

In our experience (we have been involved in the implementation of over 100 Hadoop projects at this stage), most teams do not have access to the full set of skills required to set up a POC successfully. When skills are missing, the project takes longer than expected and does not deliver the required results. The most common problems that we have encountered are:

  • Data Ingest: Writing Sqoop or Kafka code to move data from an EDW or a file into Hadoop is relatively straightforward. The problems occur when people don’t understand how data types must be converted and special characters handled, in both the code and the data, for the ingest to run successfully in Hadoop.
  • Cluster Security: Kerberos is tricky to implement, and developing applications for Kerberos environments is more difficult than for non-Kerberos environments. A lot of projects avoid Kerberos at the start in order to get up and running quickly, but this can be a false economy whose cost appears later in the project.
  • Data Transformation: SQL for Hive is difficult, especially if the queries are complex, and even for an experienced DBA it takes time to get up to speed with it.
  • Infrastructure: As simple as possible is best, and cloud solutions that can be deployed quickly offer major time savings over on-premise hardware.
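To make the Data Ingest problem above concrete, here is a minimal pure-Python sketch of the two adjustments that tend to be missed: mapping source column types to Hive types, and removing characters that break delimited text tables. The type pairs in `HIVE_TYPE_MAP` and the helper names are assumptions chosen for illustration, not the defaults of Sqoop, Kafka or any other tool.

```python
import re

# Illustrative source-to-Hive type mapping. These pairs are assumptions made
# for this sketch, not the defaults of any particular ingestion tool.
HIVE_TYPE_MAP = {
    "VARCHAR": "STRING",
    "VARCHAR2": "STRING",
    "CHAR": "STRING",
    "CLOB": "STRING",
    "INTEGER": "INT",
    "NUMBER": "DECIMAL(38,10)",
    "DATE": "TIMESTAMP",
}

# Characters that break delimited text tables in Hive: newline, carriage
# return and Ctrl-A (\x01), which is Hive's default field delimiter.
_HIVE_DELIMS = re.compile(r"[\n\r\x01]")


def to_hive_type(source_type: str) -> str:
    """Map a source column type such as 'varchar2(50)' to a Hive type.

    Unknown types fall back to STRING, the most permissive choice.
    """
    base = source_type.strip().upper().split("(")[0]
    return HIVE_TYPE_MAP.get(base, "STRING")


def clean_value(value, replacement=" "):
    """Replace delimiter characters in string values; pass others through."""
    if isinstance(value, str):
        return _HIVE_DELIMS.sub(replacement, value)
    return value


def clean_row(row: dict) -> dict:
    """Apply clean_value to every column of a row."""
    return {column: clean_value(value) for column, value in row.items()}


# Hypothetical record with an embedded newline and Ctrl-A character.
row = {"id": 7, "notes": "Called twice\nNo answer\x01"}
print(to_hive_type("varchar2(50)"))
print(clean_row(row))
```

Sqoop offers comparable handling through its `--hive-drop-import-delims` and `--hive-delims-replacement` import options. The sketch only shows what that kind of clean-up does to a single record.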

We have seen simple projects with non-complex data sources and data transformations take weeks or months to get up and running correctly, leading to major delays.

The biggest problem we have seen is that the original objective of the POC gets subsumed in the building of a Hadoop environment to support it. The purpose of the POC is usually to determine whether there is a business case for the data use case being investigated. It is not to develop a Hadoop cluster. That belongs in the next phase of the project, once the use case has been proven and accepted.

How to solve the problem

It is possible to deliver a Hadoop POC within 1 month. This can be carried out by following the steps below.

  1. Hadoop Distribution: Select a distribution from one of the enterprise providers – Cloudera, Hortonworks or MapR.
  2. Infrastructure: Deploy on one of the major cloud infrastructure providers – Azure or AWS – and use a virtualised environment for the POC. The BM Cloudburst product will deploy a fully kerberised cluster on Azure in less than 1 hour, allowing you a platform to develop on.
  3. Use Case: Focus all of your energies on developing the application to substantiate the use case.

  4. Data Ingest: Use BM Data Ingest for ingestion of data onto your cluster. It has multiple connectors for different data sources and converts all of the data to work in Hadoop. It automatically generates the ingest code and has a drag-and-drop interface that can be easily understood and used by non-Hadoop experts. It is available to purchase on a monthly-use basis, and data can be ingested in less than 1 day.
  5. Data Transformation: Use BM Data Transformer to combine and manipulate the data so it is available on Hive for your use case. All transformations are carried out in Spark using an extensive library, with a simple, easy-to-use drag-and-drop interface requiring no Hadoop knowledge. All of the underlying code is generated automatically. Most data transformations can be created and deployed in minutes.

Following the above 5 steps will get a cluster deployed and operational with data ingested and manipulated within a matter of days, allowing you to spend the rest of the month working on your use case application.

Apart from being the fastest solution on the market for a Hadoop POC deployment, this approach offers substantial cost savings. It uses low-cost tools to automate the process and removes the need for specialist Hadoop skills. Using this methodology, any Data Science team can prove the business case for a Hadoop Big Data project without ever having to become Hadoop experts.

Download the Hadoop Proof of Concept book on how to create, deploy and develop a Hadoop proof of concept in less than 1 month and for under €15,000.