Apache Hadoop is an open source, Scalable, and Fault tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but is a platform for large data storage as well as processing. This Big Data Hadoop tutorial provides a thorough Hadoop introduction.
What is Hadoop Technology?
Hadoop is an open-source tool from the ASF – Apache Software Foundation. Open source project means it is freely available and we can even change its source code as per the requirements. If certain functionality does not fulfill your need then you can change it according to your need. Most of Hadoop code is written by Yahoo, IBM, Facebook, Cloudera.
It provides an efficient framework for running jobs on multiple nodes of clusters. Cluster means a group of systems connected via LAN. Apache Hadoop provides parallel processing of data as it works on multiple machines simultaneously.
Hadoop consists of three key parts –
Hadoop Distributed File System(HDFS) – It is the storage layer of Hadoop.
Map-Reduce – It is the data processing layer of Hadoop.
YARN – It is the resource management layer of Hadoop.
Apache Hadoop is not only a storage system but is a platform for data storage as well as processing. It is scalable (as we can add more nodes on the fly), Fault-tolerant (Even if nodes go down, data processed by another node).
Following characteristics of Hadoop make it a unique platform:
Flexibility to store and mine any type of data whether it is structured, semi-structured or unstructured. It is not bounded by a single schema.
Excels at processing data of complex nature. Its scale-out architecture divides workloads across many nodes. Another added advantage is that its flexible file-system eliminates ETL bottlenecks.
Scales economically, as discussed it can deploy on commodity hardware. Apart from this its open-source nature guards against vendor lock.
What is Hadoop Architecture?
Hadoop works in master-slave fashion. There is a master node and there are n numbers of slave nodes where n can be 1000s. Master manages, maintains and monitors the slaves while slaves are the actual worker nodes. In Hadoop architecture, the Master should deploy on good configuration hardware, not just commodity hardware. As it is the centerpiece of Hadoop cluster.
Master stores the metadata (data about data) while slaves are the nodes which store the data. Distributedly data stores in the cluster. The client connects to master node to perform any task. Now in this Hadoop for beginners tutorial for beginners, we will discuss different components of Hadoop in detail.
There are three most important Apache Hadoop Components. In this Hadoop tutorial, you will learn what is HDFS, what is Hadoop MapReduce and what is Yarn Hadoop. Let us discuss them one by one-
What is HDFS?
Hadoop HDFS or Hadoop Distributed File System is a distributed file system which provides storage in Hadoop in a distributed fashion.
In Hadoop Architecture on the master node, a daemon called namenode run for HDFS. On all the slaves a daemon called datanode run for HDFS. Hence slaves are also called as datanode. Namenode stores meta-data and manages the datanodes. On the other hand, Datanodes stores the data and do the actual task.
HDFS is a highly fault-tolerant, distributed, reliable and scalable file system for data storage. First Follow this guide to learn more about features of HDFS and then proceed further with the Hadoop tutorial.
HDFS is developed to handle huge volumes of data. The file size expected is in the range of GBs to TBs. A file is split up into blocks (default 128 MB) and stored distributedly across multiple machines. These blocks replicate as per the replication factor. After replication, it stored at different nodes. This handles the failure of a node in the cluster. So if there is a file of 640 MB, it breaks down into 5 blocks of 128 MB each (if we use the default value).
What is MapReduce?
In this Hadoop Basics Tutorial, now its time to understand one of the most important pillars if Hadoop, i.e. Hadoop MapReduce. The Hadoop MapReduce is a programming model. As it is designed for large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is the heart of Hadoop, it moves computation close to the data. As a movement of a huge volume of data will be very costly. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Hence, Hadoop MapReduce is a framework for distributed processing of huge volumes of data set over a cluster of nodes. As data stores in a distributed manner in HDFS. It provides the way to Map–Reduce to perform parallel processing.
What is YARN Hadoop?
YARN – Yet Another Resource Negotiator is the resource management layer of Hadoop. In the multi-node cluster, as it becomes very complex to manage/allocate/release the resources (CPU, memory, disk). Hadoop Yarn manages the resources quite efficiently. It allocates the same on request from any application.
On the master node, the ResourceManager daemon runs on the YARN then for all the slave nodes NodeManager daemon runs.
Learn the differences between two resource manager Yarn vs. Apache Mesos. Next topic in the Big Data Hadoop for beginners is a very important part of Hadoop i.e. Hadoop Daemons
With a steep rise in NoSQL databases, several organizations are making a transition from traditional databases to open-source, distributed and high performance ones like ‘Cassandra’. Since its origin at Facebook in 2008 by Prashant Malik and Avinash Lakshman, Cassandra has become Apache’s one of the most popular projects. And why not? With the unique capacity to deliver near real-time performance, Cassandra makes lives of Web Developers, Software Engineers and Data Analysts far easier than it was in the company of traditional RDBMS. The wonders Cassandra is creating in the Big Data industry is phenomenal!
Let’s look at some of the Cassandra advantages and why companies like Facebook, eBay, Reddit, Twitter, NetFlix, IBM and several others are using Cassandra?
Apache Cassandra Advantages:
Cassandra is Apache’s open-source project, this means it is available for FREE! Yes, you can download the application and use the way you want. Infact, its open-source nature has given birth to a huge Cassandra community where like-minded people share their views, queries, suggestions related to Big Data. Further, Cassandra can be integrated with other Apache open-source projects like Hadoop (with the help of MapReduce), Apache Pig and Apache Hive.
2. Peer to Peer Architecture:
Cassandra follows a peer-to-peer architecture, instead of master-slave architecture. Hence, there is no single point of failure in Cassandra. Moreover, any number of servers/nodes can be added to any Cassandra cluster in any of the datacenters. As all the machines are at equal level, any server can entertain request from any client. Undoubtedly, with its robust architecture and exceptional characteristics, Cassandra has raised the bar far above than other databases.
3. Elastic Scalability:
One of the biggest advantages of using Cassandra is its elastic scalability. Cassandra cluster can be easily scaled-up or scaled-down. Interestingly, any number of nodes can be added or deleted in Cassandra cluster without much disturbance. You don’t have to restart the cluster or change queries related Cassandra application while scaling up or down. This is why Cassandra is popular of having a very high throughput for the highest number of nodes. As scaling happens, read and write throughput both increase simultaneously with zero downtime or any pause to the applications.
4. High Availability and Fault Tolerance:
Another striking feature of Cassandra is Data replication which makes Cassandra highly available and fault-tolerant. Replication means each data is stored at more than one location. This is because, even if one node fails, the user should be able to retrieve the data with ease from another location. In a Cassandra cluster, each row is replicated based on the row key. You can set the number of replicas you want to create. Just like scaling, data replication can also happen across multiple data centres. This further leads to high level back-up and recovery competencies in Cassandra.
5. High Performance:
The basic idea behind developing Cassandra was to harness the hidden capabilities of several multicore machines. Cassandra has made this dream come true! Cassandra has demonstrated brilliant performance under large sets of data. Thus, Cassandra is loved by those organizations that deal with huge amount of data every day and at the same time cannot afford to lose such data.
6. Column Oriented:
Cassandra has a very high-level data model – this is column-oriented. It means, Cassandra stores columns based on the column names, leading to very quick slicing. Unlike traditional databases, where column names only consist of metadata, in Cassandra column names can also consist of the actual data. Thus, Cassandra rows can consist of masses of columns, in contrast to a relational database that consists of a few number of columns. Cassandra is endowed with a rich data model.
7. Tunable Consistency:
Characteristics like Tunable Consistency, makes Cassandra an incomparable database. In Cassandra, Consistency can be of two types – Eventual consistency and Strong consistency. You can adopt any of these, based on your requirements. Eventual consistency makes sure that the client is approved as soon as the cluster accepts the write. Whereas, Strong consistency means that any update is broadcasted to all machines or all the nodes where the particular data is situated. You also have the freedom to blend both eventual and strong consistency. For instance, you can go for eventual consistency in case of remote data centers where latency is quite high and go for Strong consistency for local data centers where latency is low.
Since its creation, Cassandra is famous for being a Schema-less/schema-free database in its column family. In Cassandra, columns can be created at your will within the rows. Cassandra data model is also famously known as a schema-optional data model. In contrast to a traditional database, in Cassandra there is no need to show all the columns needed by your application at the surface as each row is not expected to have the same set of columns.
It is because of the above reasons, Cassandra is in great demand among several companies, where MySQL is getting replaced by NoSQL databases. A database that was initially created to solve the inbox search issues at Facebook, has come a long way to solve Big Data problems. Today, Cassandra is used in diverse applications…whether it is streaming videos or supporting various business units or production applications.