What is MapReduce?
1) MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte data-sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
2) MapReduce programs run on Hadoop, an Apache open-source framework.
3) The data and the desired processing are divided into multiple tasks.
4) Applying the desired code to the divided data on each local machine is called Map.
5) To produce the desired output, all of these individual outputs have to be merged, or reduced, to a single output. This reduction of multiple outputs to a single one is performed by the Reducer. In Hadoop, as many output files are generated as there are reducers. The word-count sketch below walks through both phases.
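A minimal sketch of the classic word-count job, written against the Hadoop MapReduce Java API, makes the two phases concrete (the class name WordCount and the command-line input/output paths are illustrative). Each Mapper emits a (word, 1) pair for every word in its local split of the data, and each Reducer sums the counts it receives for a word:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on each node against its local split of the input
  // and emits (word, 1) for every word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: merges the individual map outputs by summing the
  // counts for each word; each reducer writes one output file.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this is typically submitted with hadoop jar wordcount.jar WordCount <input dir> <output dir>; the output directory then holds one part file per reducer, as noted in point 5 above.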
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and performing computation on it.
The framework is written mainly in Java, with some native code in C and some shell scripts.
In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
Components of Hadoop:
1) HDFS (Hadoop Distributed File System): HDFS was developed on the basis of Google's published GFS paper. Files are broken into blocks and stored on nodes across the distributed architecture (a client-side sketch follows this list).
2) YARN (Yet Another Resource Negotiator): handles job scheduling and cluster resource management.
3) MapReduce: a framework that helps Java programs perform parallel computation on data using key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the Reducer gives the desired result.
4) Hadoop Common: the Java libraries used to start Hadoop, which are also used by the other Hadoop modules.
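To make the HDFS item above concrete, here is a minimal sketch using the HDFS Java client API. It assumes a running cluster reachable through the default configuration; the class name HdfsExample and the path /user/demo/hello.txt are illustrative. The client writes a small file, which HDFS transparently splits into blocks and replicates across DataNodes, and then reads it back:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings from core-site.xml / hdfs-site.xml.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/hello.txt"); // illustrative path

      // Write: the file is split into blocks and replicated
      // across the cluster's DataNodes by HDFS itself.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.writeBytes("Hello, HDFS!\n");
      }

      // Read it back: the client fetches blocks from whichever nodes hold them.
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(file)))) {
        System.out.println(in.readLine());
      }
    }
  }
}

Note that this code does not choose a block size or replication factor; those come from the cluster configuration (the dfs.blocksize and dfs.replication settings).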