A lot has been said about the best big data platform and how it should be designed. Most models range from the overly complex to the financially unreachable, but today I want to share Andreas Kretz’s big data platform blueprint, which fits a wide range of scenarios because it covers the four common stages of a big data platform: Ingest, Store, Analyse and Display.
Because this model does not focus on a specific technology, it can be adapted to handle big data according to your needs.
–Extracted from iotdonequick.com:
The blueprint is focused on the four key areas: Ingest, store, analyse and display.
Having the platform split like this turns it into a modular platform with loosely coupled interfaces.
Why is it so important to have a modular platform?
If you have a platform that is not modular, you end up with something that is fixed or hard to modify. This means you cannot adjust the platform to the changing requirements of the company.
Because of modularity, it is possible to swap out any component if you need to.
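As a rough illustration of what "loosely coupled interfaces" can mean in code, here is a minimal Python sketch. The names `Store`, `InMemoryStore` and `Platform` are hypothetical and not part of the blueprint; the point is only that the platform depends on an interface, so the component behind it can be replaced.

```python
from __future__ import annotations

from typing import Protocol


class Store(Protocol):
    """Hypothetical interface that every storage component must satisfy."""

    def save(self, record: dict) -> None: ...

    def load_all(self) -> list[dict]: ...


class InMemoryStore:
    """One possible implementation; a stand-in for a real storage system."""

    def __init__(self) -> None:
        self._records: list[dict] = []

    def save(self, record: dict) -> None:
        self._records.append(record)

    def load_all(self) -> list[dict]:
        return list(self._records)


class Platform:
    """Depends only on the Store interface, not on a concrete class."""

    def __init__(self, store: Store) -> None:
        self.store = store

    def ingest(self, record: dict) -> None:
        self.store.save(record)


# Swapping the storage means passing a different object that has the same methods.
platform = Platform(InMemoryStore())
platform.ingest({"sensor": "car-1", "speed": 42})
```

Any other class with the same `save` and `load_all` methods could be passed to `Platform` without changing the platform code itself.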
Now, let's talk more about each key area.
Ingestion is all about getting the data in from the source and making it available to later stages. Sources can be anything from tweets and server logs to IoT sensor data, for example from cars.
Sources send data to your API Services. The API is going to push the data into a temporary storage.
The temporary storage allows other stages simple and fast access to incoming data.
A great solution is to use message queue systems like Apache Kafka, RabbitMQ or AWS Kinesis. Sometimes people also use caches like Redis for specialised applications.
A good practice is for the temporary storage to follow the publish-subscribe pattern. This way, APIs can publish messages and analytics can quickly consume them.
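To make the publish-subscribe idea concrete, here is a toy in-memory sketch in Python. It is not how Kafka, RabbitMQ or Kinesis work internally, and the class `TemporaryStorage` and the topic name `sensor-data` are made up for illustration. It only shows the shape of the pattern: publishers write to a topic, and every subscriber reads its own copy.

```python
from collections import defaultdict, deque


class TemporaryStorage:
    """Toy in-memory broker: each subscriber gets its own queue per topic."""

    def __init__(self):
        # topic -> {subscriber name: queue of pending messages}
        self._queues = defaultdict(dict)

    def subscribe(self, topic, subscriber):
        self._queues[topic].setdefault(subscriber, deque())

    def publish(self, topic, message):
        # Every subscriber of the topic receives the message in its own queue.
        for queue in self._queues[topic].values():
            queue.append(message)

    def consume(self, topic, subscriber):
        # Return the oldest pending message, or None if the queue is empty.
        queue = self._queues[topic][subscriber]
        return queue.popleft() if queue else None


broker = TemporaryStorage()
broker.subscribe("sensor-data", "analytics")
broker.subscribe("sensor-data", "archiver")

# The API service publishes once; both consumers can read independently.
broker.publish("sensor-data", {"car": "A1", "speed": 87})
```

Because the API only knows the topic, new consumers can be added later without touching the ingestion side.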
The store stage is the typical big data storage where you just store everything. It enables you to analyse the big picture.
Most of the data might seem useless for now, but it is of utmost importance to keep it. Throwing data away is a big no-no.
Why not throw something away when it is useless?
Although it seems useless for now, data scientists can work with the data. They might find new ways to analyse the data and generate valuable insight from it.
What kind of systems can be used to store big data?
Systems like Hadoop HDFS, HBase, Amazon S3 or DynamoDB are a good fit for storing big data.
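The "store everything" idea can be sketched on a local file system, as a small-scale stand-in for systems like HDFS or S3. The sketch below assumes an append-only layout of one JSON Lines file per source and day; that layout and the helper names `store_raw` and `read_raw` are illustrative choices, not something the blueprint prescribes.

```python
import json
import tempfile
from pathlib import Path


def store_raw(base_dir, source, day, record):
    """Append one raw record, unmodified, to <base_dir>/<source>/<day>.jsonl."""
    path = Path(base_dir) / source / f"{day}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path


def read_raw(base_dir, source, day):
    """Read back every record stored for one source and day."""
    path = Path(base_dir) / source / f"{day}.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    store_raw(base, "cars", "2024-01-01", {"car": "A1", "speed": 87})
    store_raw(base, "cars", "2024-01-01", {"car": "B2", "speed": 60})
    records = read_raw(base, "cars", "2024-01-01")
```

Nothing is filtered or overwritten on the way in, which keeps the raw data available for analyses nobody has thought of yet.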
The analyse stage is where the actual analytics is done, in the form of stream and batch processing.
Streaming data is taken from ingest and fed into analytics. Streaming analyses the “live” data and therefore generates fast results.
As the central and most important stage, analytics also has access to the big data storage. Because of that connection, analytics can take a big chunk of data and analyse it.
This type of analysis is called batch processing. It will deliver you answers for the big questions.
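The difference between the two modes can be shown with a small Python sketch. The functions `stream_average` and `batch_average` and the sample speeds are invented for illustration; real platforms would use tools such as the ones named in this article.

```python
def stream_average(events):
    """Streaming: update a running average as each event arrives."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        # A result is available immediately after every single event.
        yield total / count


def batch_average(stored_values):
    """Batch: compute one answer over the full stored dataset."""
    return sum(stored_values) / len(stored_values)


speeds = [80, 100, 90]

live_results = list(stream_average(speeds))  # one result per incoming event
overall = batch_average(speeds)              # one result for the whole chunk
```

The streaming version answers quickly with whatever it has seen so far, while the batch version waits for the full chunk of data and answers once.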
The analytics process, batch or streaming, is not a one-way process. Analytics can also write data back to the big data storage.
Writing data back to the storage often makes sense. It allows you to combine previous analytics outputs with the raw data.
Analytics insight can give meaning to the raw data when you combine them. This combination will often allow you to create even more useful insight.
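A minimal sketch of this write-back loop, assuming a plain dictionary as a stand-in for the big data storage. The keys `raw` and `avg_speed`, the helper functions and the sample records are all hypothetical.

```python
from collections import defaultdict

# Stand-in for the big data storage; a real platform would use a storage system.
storage = {
    "raw": [
        {"car": "A1", "speed": 80},
        {"car": "A1", "speed": 120},
        {"car": "B2", "speed": 60},
    ]
}


def average_speed_per_car(records):
    """First analytics pass: one average per car."""
    totals = defaultdict(lambda: [0, 0])  # car -> [sum of speeds, count]
    for r in records:
        totals[r["car"]][0] += r["speed"]
        totals[r["car"]][1] += 1
    return {car: s / n for car, (s, n) in totals.items()}


def enrich(records, averages):
    """Second pass: combine raw records with the stored analytics output."""
    return [
        {**r, "above_average": r["speed"] > averages[r["car"]]}
        for r in records
    ]


# Write the analytics result back next to the raw data ...
storage["avg_speed"] = average_speed_per_car(storage["raw"])
# ... and combine both to derive a new insight.
enriched = enrich(storage["raw"], storage["avg_speed"])
```

The second pass only works because the first result was written back: each raw reading gains meaning once it is compared with the stored average.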
A wide variety of analytics tools is available, ranging from MapReduce or AWS Elastic MapReduce to Apache Spark and AWS Lambda.
Displaying data is as important as ingesting, storing and analysing it. People need to be able to make data-driven decisions.
This is why it is important to have a good visual presentation of the data. Sometimes you have a lot of different use cases or projects using the platform.
It might not be possible for you to build the perfect UI that fits everyone. What you should do in this case is enable others to build the perfect UI themselves.
How to do that? By creating APIs to access the data and making them available to developers.
Either way, UI or API, the trick is to give the display stage direct access to the data in the big data cluster. This kind of access allows developers to use analytics results as well as raw data to build the perfect application.
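As a rough sketch of such a data access API, the function below returns both raw readings and an analytics result for one car as JSON. The storage layout, the function name `get_car_view` and the sample values are assumptions made for this example; a real API would sit behind a web framework and query the actual cluster.

```python
import json

# Hypothetical snapshot of what the big data storage might contain.
STORAGE = {
    "raw": [
        {"car": "A1", "speed": 80},
        {"car": "A1", "speed": 120},
        {"car": "B2", "speed": 60},
    ],
    "analytics": {"avg_speed": {"A1": 100.0, "B2": 60.0}},
}


def get_car_view(storage, car):
    """Return raw readings and analytics results for one car as a JSON string."""
    payload = {
        "car": car,
        "readings": [r for r in storage["raw"] if r["car"] == car],
        "avg_speed": storage["analytics"]["avg_speed"].get(car),
    }
    return json.dumps(payload)


response = get_car_view(STORAGE, "A1")
```

A dashboard or a third-party application can consume the same response, so each team is free to build the UI that fits its own use case.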