
Distributed File System

HDFS

Predecessors - GFS

Before the advent of the Hadoop Distributed File System (HDFS), the Google File System (GFS) served as a pioneering technology in the realm of scalable distributed storage. Developed by Google, GFS was designed to manage and process large datasets across thousands of inexpensive, commodity hardware components, a setup where frequent hardware failures were the norm. GFS excelled at handling large files, typically 100MB or larger, and was optimized for high sustained bandwidth, prioritizing bulk data processing over low-latency operations. This foundational work by Google on GFS inspired the creation of HDFS, which brought similar principles and innovations to the open-source community, enabling a wider range of organizations to harness the power of distributed data storage and processing.

How it works

Read

  1. Client Request Initiation: When an application needs to read a file, it initiates a request through the GFS client. The client sends the file name and the chunk index to the GFS master.

  2. Master Response: The GFS master, which maintains a table of file metadata, processes the request. It looks up the file name and chunk index, then returns the chunk handle (an identifier) and the IP addresses of all chunk servers that hold replicas of the requested chunk.

  3. Client Caching: The client caches the received information, using the file name and chunk index as the key. This cache helps minimize future interactions with the master for the same chunk.

  4. Fetching Data from Chunk Server: With the chunk handle and locations of replicas, the client sends a request to one of the chunk servers, typically the closest one, specifying the chunk handle and the byte range within that chunk.

  5. Data Transfer: The chunk server responds by sending the requested data directly to the client. This reduces the need for further interactions with the master unless the cached information becomes outdated or the file is reopened.

  6. Data Delivery: The client then forwards the received data to the application, completing the read operation.
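
HDFS follows the same read path, with the NameNode in the master's role and DataNodes in place of chunk servers. On a running cluster you can observe both the metadata lookup and the direct data fetch from the shell; this is only a rough HDFS analogue of the GFS flow above, and /data/sample.txt is a placeholder path.

```sh
# Ask the NameNode how the file maps to blocks and which DataNodes
# hold each replica (the metadata lookup of steps 1-2).
hdfs fsck /data/sample.txt -files -blocks -locations

# Read the file; the client streams the bytes directly from the
# DataNodes, not through the NameNode (steps 4-6).
hadoop fs -cat /data/sample.txt
```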

Write

  1. Client Lease Request: When an application needs to write data, the client first asks the GFS master which chunk server holds the current lease for the chunk and the locations of other replicas.

  2. Master Response: The master replies with the identity of the primary chunk server and the locations of secondary replicas.

  3. Data Push to Replicas: The client pushes the data to all the replicas. This data push ensures that all replicas have the data before the write operation is officially initiated.

  4. Write Request to Primary: After all replicas acknowledge receiving the data, the client sends a write request to the primary chunk server. This request includes the chunk handle and the specific data to be written.

  5. Forwarding Write Request: The primary chunk server assigns a serial number to the write operation and forwards the request to all secondary replicas, ensuring that the mutation order is consistent across all replicas.

  6. Mutation Application: Each secondary replica applies the mutations in the serial number order assigned by the primary.

  7. Completion Acknowledgment: Once the secondary replicas complete the operation, they reply to the primary chunk server.

  8. Final Client Response: The primary chunk server then replies to the client, confirming the successful write operation. If any errors occur at any of the replicas, they are reported back to the client.
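
HDFS organizes its write path the same way, except that the client streams data through a pipeline of DataNodes rather than contacting each replica itself. On a running cluster, the replication side of this is easy to observe from the shell; the paths below are placeholders.

```sh
# Copy a local file into HDFS; the client writes through a pipeline of
# DataNodes, and each block is replicated as it is written.
hadoop fs -put ./local-file.txt /data/local-file.txt

# Raise the replication factor to 3 and wait for re-replication to finish.
hadoop fs -setrep -w 3 /data/local-file.txt

# Confirm which DataNodes hold each block replica.
hdfs fsck /data/local-file.txt -files -blocks -locations
```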

HDFS (Hadoop Distributed File System)

HDFS is an open-source implementation of the ideas behind GFS. The two systems share many similarities in how they work: both are designed to manage large files across a distributed cluster of commodity hardware.

In terms of their architecture, both systems utilize a master-slave model, but the responsibilities and handling of metadata differ. The GFS master manages metadata and coordinates access to chunk servers, tracking the locations of chunks and handling leases for writes. On the other hand, the HDFS NameNode manages the filesystem namespace, regulates access to files by clients, and maintains the metadata for the entire filesystem, including the locations of blocks. This difference in handling metadata impacts how each system scales and manages file access.
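
On a live cluster, the NameNode's coordinating role is visible through the standard admin tools; the commands below assume nothing beyond a running HDFS installation (the report may require HDFS superuser privileges).

```sh
# List the host(s) acting as NameNode for this cluster.
hdfs getconf -namenodes

# Show the DataNodes the NameNode is tracking, with per-node capacity,
# usage, and block counts.
hdfs dfsadmin -report
```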

Another key distinction is the data block size and the handling of consistency. GFS uses a chunk size of 64 MB and offers a relaxed consistency model: record appends are atomic and mutations are applied in the same order on every replica, but changes may not be immediately visible to readers. HDFS uses a default block size of 128 MB (configurable) and a simpler single-writer model, in which data written by a client becomes visible to readers once the writer flushes it or closes the file. This gives HDFS a more predictable data access pattern than GFS.
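
The block size is a configuration value rather than a hard constant, and it can also be overridden per file at write time. A quick way to check and override it is sketched below; the file paths are placeholders, and the -D override assumes a reasonably recent Hadoop shell that accepts generic options.

```sh
# Print the cluster's default block size in bytes (134217728 = 128 MB).
hdfs getconf -confKey dfs.blocksize

# Write one file with a 256 MB block size (placeholder paths;
# 268435456 bytes = 256 MB).
hadoop fs -D dfs.blocksize=268435456 -put ./big-file.dat /data/big-file.dat
```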

When it comes to writing and appending data, GFS supports record appends as its primary mutation and also allows writes at an application-specified offset, although appends dominate its workloads. HDFS originally did not support appending to files at all; later versions introduced appends, but in-place updates are still not supported. This reflects their shared optimization for large streaming reads and writes, with GFS prioritizing high sustained bandwidth over low latency and HDFS focusing on the MapReduce programming model common in the Hadoop ecosystem.
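
The append path is exposed directly in the HDFS shell. The sketch below appends a local log to an existing HDFS file; both paths are placeholders.

```sh
# Append the contents of a local file to an existing file in HDFS.
hadoop fs -appendToFile ./more-records.log /data/events.log
```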

Practical Example

Using HDFS is quite straightforward once it has been set up properly. We interact with it through the HDFS CLI, whose commands closely mirror familiar Linux commands. For example, the ls command in Linux lists the files and folders in the current directory. In HDFS, we get the same behavior by prefixing the command with hadoop fs and passing ls as an option, as shown below.

```sh
hadoop fs -ls
```

A few more examples are creating a directory and putting a local file into HDFS.

```sh
hadoop fs -mkdir /hdfs/target
hadoop fs -put /users/path /hdfs/target
```
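
Beyond ls, mkdir, and put, the same pattern covers most day-to-day operations; the paths below are placeholders.

```sh
# Print a file's contents to stdout.
hadoop fs -cat /hdfs/target/path

# Copy a file from HDFS back to the local filesystem.
hadoop fs -get /hdfs/target/path ./local-copy

# Show space used under a directory, in human-readable units.
hadoop fs -du -h /hdfs

# Remove a directory and everything beneath it.
hadoop fs -rm -r /hdfs/target
```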

This interface is very convenient: anyone with some experience with Linux commands can pick up HDFS quite easily.