A Hadoop Ecosystem Overview
Posted on Wednesday December 4th 2013 by Jubair
Big data is more important now than ever. Over 90% of the data in today's databases has been accumulated in the past two years, generated by sources such as the New York Stock Exchange, machine logs, and social networks like Facebook. Previously, large datasets were stored on individual hard drives, which made both writing data and distributing it across networks slow. Capabilities for storing and analyzing data at this scale were highly limited.
Hadoop is a solution to this problem. It is an ecosystem for distributed storage and processing of large datasets. The system spreads a dataset across many computers with shared access, which greatly shortens analysis times. Additionally, data is replicated across machines, so a hardware failure on one node does not cause data loss.
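To make the replication idea concrete, here is a minimal Python sketch of placing copies of a data block on several nodes. This is illustrative only: the node names are hypothetical, and real HDFS uses a rack-aware placement policy rather than random choice.

```python
import random

def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes to hold copies of one block.

    Illustrative only: real HDFS placement is rack-aware, not random.
    HDFS defaults to a replication factor of 3.
    """
    return random.sample(nodes, replication)

# Hypothetical cluster of five storage nodes.
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas("block-0001", nodes)
```

With three copies on distinct nodes, any single machine can fail and the block remains readable from the surviving replicas.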
There are two major components of Hadoop: the Hadoop Distributed Filesystem (HDFS) and MapReduce. HDFS distributes and stores data across clusters of nodes. MapReduce provides a framework for processing that stored data in parallel across the cluster.
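The MapReduce model can be sketched in plain Python with the classic word-count example. This is a single-process sketch of the idea, not Hadoop's Java API: mappers emit key-value pairs, a shuffle step groups values by key, and reducers aggregate each group.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key (here, sum counts).
    return {word: sum(counts) for word, counts in grouped.items()}

# Two hypothetical input splits, each processed by its own mapper.
docs = ["big data is big", "hadoop stores big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 3, counts["data"] == 2
```

In a real Hadoop job, each `map_phase` call would run on the node holding that split of the data, so computation moves to the data rather than the other way around.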
There are a variety of tools and software components that interface with Hadoop, including processing tools, analytics platforms, and storage management systems, among others. A few of the major Hadoop tools are:
- Common: A set of utilities that supports the other Hadoop components by providing filesystem and I/O abstractions.
- Avro: A system that serializes data for storage and enables cross-language communication.
- Pig: A data-flow language and execution framework for parallel computation. Pig scripts compile to MapReduce jobs that run over data in HDFS.
- Hive: A data warehouse system that provides data summarization and an SQL-based query language.
- HBase: A distributed, scalable database that stores structured data in large tables.
- ZooKeeper: A highly available coordination service for distributed applications.
- Sqoop: A tool for transferring bulk data between Hadoop and connected relational databases.
- Oozie: A system for managing and scheduling Hadoop workflows, such as MapReduce, Hive, or Sqoop jobs.
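Pig's data-flow style is worth a quick illustration. The sketch below expresses a load-filter-group-count pipeline in plain Python; the records are hypothetical, and a real Pig script would state the same steps declaratively (LOAD, FILTER, GROUP, COUNT) and run them as MapReduce jobs on the cluster.

```python
from collections import Counter

# Hypothetical log records of (user, action) pairs, standing in for
# a relation LOADed from HDFS in a real Pig script.
records = [
    ("alice", "click"), ("bob", "view"),
    ("alice", "click"), ("carol", "click"),
]

# FILTER: keep only click events.
clicks = [(user, action) for user, action in records if action == "click"]

# GROUP by user, then COUNT the clicks in each group.
clicks_per_user = Counter(user for user, _ in clicks)
# clicks_per_user == {"alice": 2, "carol": 1}
```

The appeal of Pig (and Hive) is that the pipeline is written once at this level of abstraction, and the system handles parallelizing each step across the cluster.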
The Hadoop ecosystem continues to grow as more tools are built for distributed computing and big data processing. With systems such as Hadoop, big data operations become accessible even to small companies.