Analyzing data offers great benefits but is also one of the greatest challenges for an organization. Today's businesses generate massive amounts of digital data, which is cumbersome to store, transport, and analyze. Along with the concern for data size come concerns about the nature of the data, such as its variety and velocity. This diversity challenges computer scientists and programmers to develop tools and techniques that solve data-analytics problems efficiently. Building distributed systems and off-loading workloads to commodity clusters is one of the better approaches to solving the data problem. Frameworks like Hadoop, Spark, and Storm each solve one aspect of the problem. To gain maximum value from data with the least effort and expense, organizations want to run these individual frameworks on the same cluster of commodity servers. But distributing frameworks over commodity clusters solves just one of our problems; other concerns come into focus, such as effective cluster resource utilization, frequent changes of frameworks, data sharing between frameworks, and much more. Solutions like statically partitioning the cluster or allocating a set of virtual machines per framework are not able to achieve the desired result.
Traditional computation usually runs on a single machine: we submit a job to the machine and it returns the result. Based on the nature and complexity of our jobs, we determine our computer system requirements. This approach works well in most cases, but when we start doing large-scale computation, it goes beyond the capability of the system. Scaling up by installing new hardware (CPU, main memory, storage devices) solves the problem for a while, but sooner or later we again reach the edge of the system's capacity.
Distributing the workload among commodity systems solves this problem very well. In this approach we divide a job into smaller tasks and distribute those tasks over a commodity cluster; each machine uses its own CPU and main memory to compute its tasks efficiently, and a driver system later collects each processed task to compute the final result.
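The divide-distribute-collect pattern above can be sketched with Python's `multiprocessing` module, using worker processes as a single-machine stand-in for cluster nodes. The line-based chunking and the word-count job are illustrative assumptions, not part of any particular framework:

```python
from multiprocessing import Pool

def count_words(chunk):
    # Worker task: each "machine" counts the words in its own chunk.
    return len(chunk.split())

def run_job(document, workers=4):
    # Driver: divide the job into smaller tasks (one per line)...
    tasks = document.splitlines()
    with Pool(workers) as pool:
        # ...distribute the tasks across the pool of workers...
        partial_results = pool.map(count_words, tasks)
    # ...then collect the processed tasks to compute the final result.
    return sum(partial_results)

if __name__ == "__main__":
    doc = "hello world\nthis is a tiny example\ndistributed counting"
    print(run_job(doc))  # 9
```

Real frameworks like Hadoop and Spark follow the same shape, but with the scheduler, data movement, and failure handling done for you across many machines.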
A distributed system is a collection of independent computers that appears to its users as a single coherent system.
Scaling out our workload on smaller commodity clusters is a really efficient and cost-effective technique. Frameworks like Hadoop and Spark use commodity clusters for distributed processing and provide cost-effective, reliable solutions on heterogeneous datasets.
Data-storage frameworks like HBase, Cassandra, and Google Bigtable use distributed clusters for reliable data storage.
Modern application development is no longer about single-system applications. The application we are developing will sooner or later attract a huge number of users and generate heavy traffic and huge amounts of data. These applications need to run with zero downtime, and users expect a response in sub-seconds; if an application fails to respond within sub-seconds, it is labeled as non-performing.
Companies like Google, Facebook, and Twitter use distributed systems to run their applications efficiently. For some simple stats, Facebook's application sees approximately 1.5 billion logins daily and needs to process 600 TB of data daily to serve this massive number of users.
Building a distributed system
Before starting to build a distributed system, we need to answer one question: why build a distributed system at all?
Most of the time, a large application fails to respond to users because it hits the upper limits of CPU, memory, I/O, or storage on a single machine. This forces us to partition our application into smaller blocks. These partitioned blocks can run on different machines and communicate through messages to achieve the goal of the application.
The parts of a distributed application work independently and communicate through messages.
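As a minimal sketch of this message passing, here are two independent blocks exchanging messages over a channel. Two threads and an in-process queue stand in for two machines and the network; a real system would use sockets or a message broker, and the producer/consumer roles here are illustrative assumptions:

```python
import threading
import queue

def producer(outbox):
    # One application block: does its share of work, then sends messages.
    for n in range(3):
        outbox.put(f"task-{n} done")
    outbox.put(None)  # sentinel: no more messages

def consumer(inbox, results):
    # Another block: works independently, reacting to messages as they arrive.
    while True:
        msg = inbox.get()
        if msg is None:
            break
        results.append(msg)

channel = queue.Queue()  # stands in for the network between the two blocks
results = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # ['task-0 done', 'task-1 done', 'task-2 done']
```

Neither block calls into the other directly; the only coupling is the message format, which is what lets the blocks run on different machines.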
A distributed application can run on a local cluster within an organization or across multiple data centers around the world.
A distributed system should have the following key characteristics:
- Resource Sharing
- Fault Tolerance
Failures are a frequent issue in distributed systems. A distributed system should be able to detect failures and react smartly to overcome them.
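A common way to react to failure is to re-run the lost task, much as a scheduler re-launches a task on another machine. A minimal sketch of this retry policy, where `flaky_task` (an illustrative stand-in, not a real API) fails twice before succeeding:

```python
def run_with_retries(task, max_attempts=3):
    # React to failure by retrying the task, as a scheduler would
    # re-launch it on a healthy machine.
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError as err:
            last_error = err
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError(f"task failed on every attempt: {last_error}")

attempts = {"n": 0}

def flaky_task():
    # Simulates two machine failures before a successful run.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("machine lost")
    return "result"

print(run_with_retries(flaky_task))  # result (after two failed attempts)
```

Real frameworks layer more on top of this (timeouts, replicated state, re-scheduling on different hardware), but the core idea is the same: expect failure and have a recovery path.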
A software developer often needs a framework that provides access to low-level cluster details, like CPU and memory profiles, and offers facilities to communicate between machines.
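To give a taste of the low-level details such a framework would surface, Python's standard library can report a few facts about a single machine; aggregating these across a cluster is exactly the job the framework takes on. The report format here is an illustrative assumption:

```python
import os

def machine_resources():
    # Low-level details a scheduler would want from each machine.
    info = {"cpus": os.cpu_count()}
    try:
        # 1-, 5-, and 15-minute load averages (Unix only).
        info["load_avg"] = os.getloadavg()
    except (AttributeError, OSError):
        info["load_avg"] = None  # not available on this platform
    return info

print(machine_resources())
```

A cluster manager gathers reports like this from every node and uses them to decide where each task should run.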
In an upcoming post, we will learn how Apache Mesos helps us build a distributed system.