The Bluemetrix Team
Create, deploy & develop a Hadoop proof of concept in less than 1 month
What is the problem we are trying to solve?
Hadoop is topical at the moment because it is the platform of choice for Big Data projects. Most companies are beginning to run Machine Learning and AI projects on their data to gain better business insights, both for their own internal needs and for their clients. The vast majority of these projects are carried out on a Hadoop platform, as it is the platform best able to handle the volumes of data required for analysis and the one most analytics tools have been developed for.
The problem here is that Hadoop is difficult. It is not one language or one operating system, but rather an ecosystem of disparate systems that work together in a distributed processing environment.
So apart from the complexity of having to master several languages (Python, Java, scripting, SQL, etc.) to program over a dozen different modules (Hive, Spark, Sqoop, etc.), you also have to understand how the data is stored and processed in a distributed environment. It is not easy to make this work out of the box.
Download the eBook: Create, deploy & develop a Hadoop proof of concept in less than 1 month.
Exposition of the problem
The first step in any Hadoop project is a Proof of Concept (POC). This will determine if there is value to be derived from the data and if the project is worth pursuing into the production stage. POCs are very straightforward for Hadoop specialists to design and plan, and there are a number of distinct stages to them:
Data Identification: Identify what data is required for use in the POC and ensure that it is available for the duration of the project.
Use Case: Establish the business use case to be implemented (e.g. using Machine Learning to identify customers for upselling opportunities, etc.) and what is required in terms of development and technology to prove this use case.
Data Platform: Decide on your platform of choice, on-premise or cloud, and the size of your Hadoop cluster (3, 4, 5 or more nodes).
Hadoop Distribution: Decide on which distribution you will use for the POC – Apache, Cloudera, Hortonworks or MapR.
Data Ingestion: The nature of the data will determine the ingestion method. Static or streaming, structured or unstructured – the options could be Sqoop, Kafka, NiFi, StreamSets, etc. (see the ingestion sketch after this list).
Data Storage: The nature of the data and the type of processing you expect to carry out on it will determine the storage platform that is used; options include HBase, Hive, MongoDB, Impala, etc.
Data Security: Do you develop for a Kerberos environment or not? It is certainly easier not to do so, but the work carried out on the POC will be of little use if you need to deploy Kerberos in production.
Data Transformation: Before you apply the use case solution to the data, you will typically need to combine and re-format the data to suit your processing requirements. This can be done using SQL, Spark or other options (see the transformation and storage sketch after this list).
Data Governance: Finally, you may or may not decide to implement data governance on your POC, depending on the nature of your data and use case.
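By way of illustration, the sketch below shows one possible ingest path in PySpark: pulling a single table from an existing EDW over JDBC and landing it on HDFS. The connection URL, table name and credentials are placeholders, the matching JDBC driver is assumed to be available on the cluster, and depending on the data Sqoop, Kafka, NiFi or StreamSets may be the better fit.

```python
from pyspark.sql import SparkSession

# A minimal batch-ingest sketch: pull one table from an existing EDW into
# Hadoop via Spark's JDBC reader. The URL, table name and credentials are
# placeholders for illustration only.
spark = (
    SparkSession.builder
    .appName("poc-ingest")
    .enableHiveSupport()
    .getOrCreate()
)

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@edw-host:1521/ORCL")  # hypothetical source
    .option("dbtable", "SALES.CUSTOMERS")                   # hypothetical table
    .option("user", "poc_user")
    .option("password", "********")
    .option("fetchsize", "10000")
    .load()
)

# Land the raw data as Parquet on HDFS so the later stages work from one copy.
customers.write.mode("overwrite").parquet("/data/poc/raw/customers")
```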
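A similarly simplified sketch covers the transformation and storage stages: the landed data is combined and aggregated with Spark, then saved as a Hive table that Hive, Impala or the analytics tooling can query. The table names, column names and the upsell rule are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

# A simplified transformation/storage sketch: combine the landed data with
# Spark, then persist the curated result as a Hive table. All names and
# thresholds are placeholders.
spark = (
    SparkSession.builder
    .appName("poc-transform")
    .enableHiveSupport()
    .getOrCreate()
)

customers = spark.read.parquet("/data/poc/raw/customers")
orders = spark.read.parquet("/data/poc/raw/orders")

# Aggregate spend per customer and keep high-value accounts as upsell candidates.
upsell_candidates = (
    orders.groupBy("customer_id")
    .agg(
        F.sum("order_value").alias("total_spend"),
        F.countDistinct("product_id").alias("distinct_products"),
    )
    .join(customers, "customer_id")
    .filter(F.col("total_spend") > 10000)
)

# Store the output where Hive, Impala and analytics tools can query it directly.
upsell_candidates.write.mode("overwrite").saveAsTable("poc.upsell_candidates")
```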
Designing a POC plan is the easy part. It is the implementation of the POC where things can start to go wrong.
In our experience, most people do not have access to the full skill set across the technologies required to set up a POC successfully (we have been involved in the implementation of over 100 Hadoop projects to date). If the full skill set is not in place, the project takes longer than expected and does not deliver the required results. The most common problems we have encountered are:
Data Ingest: Writing Sqoop or Kafka code to move data from an EDW or a file into Hadoop is relatively straightforward – the problems occur when people don't understand the changes to data types and the handling of special characters that the code and data need before they will run successfully in Hadoop (see the cleanup sketch after this list).
Cluster Security: Kerberos is tricky to implement, and developing applications for Kerberos environments is more difficult than for non-Kerberos environments. A lot of projects avoid Kerberos at the start in order to get up and running quickly, but this can be a false economy whose cost appears later in the project.
Data Transformation: SQL for Hive is difficult, especially if the queries are complex, and even for an experienced DBA it takes time to get up to speed with it.
Infrastructure: As simple as possible is best, and cloud solutions that can be deployed quickly offer major time savings over on-premise hardware.
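To make the ingest point concrete, the sketch below shows the kind of cleanup a feed typically needs before it lands in Hive: stripping the embedded newline and delimiter characters that break Hive's default text row format (the job Sqoop's --hive-drop-import-delims option performs), and casting the source's generic numeric types to explicit Hive-friendly ones. The column names and target precision are placeholders.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DecimalType

# Sketch of the cleanup an ingest step typically needs before loading to Hive.
# Column names and the target precision are placeholders.
spark = (
    SparkSession.builder
    .appName("poc-clean")
    .enableHiveSupport()
    .getOrCreate()
)

raw = spark.read.parquet("/data/poc/raw/customers")

clean = (
    raw
    # Remove the characters Hive's default text format treats as row/field
    # delimiters (newline, carriage return, Ctrl-A) from free-text columns,
    # the same job Sqoop's --hive-drop-import-delims option performs.
    .withColumn("notes", F.regexp_replace("notes", "[\\n\\r\\u0001]", " "))
    # Cast the source's generic NUMBER column to an explicit Hive-friendly type.
    .withColumn("credit_limit", F.col("credit_limit").cast(DecimalType(18, 2)))
)

clean.write.mode("overwrite").saveAsTable("poc.customers_clean")
```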
We have seen simple projects, with uncomplicated data sources and transformations, take weeks or even months to get up and running correctly, leading to major delays.
The biggest problem we have seen is that the original objective of the POC gets subsumed by the work of building a Hadoop environment to prove it. The purpose of the POC is usually to determine whether there is a business case to support the data use case being investigated. It is not to develop a Hadoop cluster; that belongs in the next phase of the project, once the use case has been proven and accepted.