The Hadoop Distributed File System (HDFS) is a descendant of the Google File System, which was developed to solve the problem of big data processing at scale. HDFS is simply a distributed file system. This means that a single large dataset can be stored in several different storage nodes within a compute cluster. HDFS is how Hadoop is able to offer scalability and reliability for the storage of large datasets in a distributed fashion.
Motivation for a Distributed Approach
Why would a company choose a distributed architecture over the usual single-node architecture?
Fault Tolerance: A distributed architecture is more immune to failures due to the existence of replicas.
Scalability: HDFS can allow adding more nodes to a cluster. It is not limited. This allows a business to easily scale with demand.
Access speeds: A distributed architecture offers fast data retrieval from storage compared to traditional relational databases.
Cost effectiveness: At the scale of big data, a distributed approach is cheaper in terms of storage and retrieval computations since the extra processing required to read and write into a relational database would cost so much more.
With all its advantages, HDFS and Hadoop work best for use cases with very large amounts of data. If your use case has relatively small and manageable data, an HDFS would not be ideal.
Namenode: The master node. It manages access to files on the system and also manages the namespace of the file system.
DataNode: The slave node. It performs read/write operations on the file system as well as block operations such as creation, deletion, and replication.
Block: The unit of storage. A file is broken up into blocks, and different blocks are stored on different datanodes as per instructions of the namenode.
Several datanodes can be grouped into a single unit and referred to as a Rack. The rack design depends on the big data architect and what they wish to achieve. A typical architecture may have a main rack and some replication racks for purposes of redundancy in case of failures.
Basic Usage Commands
To use the Hadoop file system, you need some background knowledge in Linux or command line usage since most of the commands are similar to those in Linux.
To start up the HDFS, assuming you have Hadoop installed on your machine, run the command
The general pattern is to use basic Linux file system commands but preceded by the command
hadoop fs so they run on HDFS and not the local file system.
To create a new directory within the HDFS, use the command
hadoop fs -mkdir <directory_name>
To copy a file from your host machine onto HDFS, use the command
hadoop fs -put ~/path/to/localfile /path/to/hadoop/directory
The reverse of
get, which copies files from HDFS back to host file system.
hadoop fs -get /hadoop/path/to/file ~/path/to/local/storage
More Linux file system commands such as
ls, and others are also available in HDFS.