Tuesday, September 23, 2014

On GPFS

Notes from the GPFS paper [2002]. There is no open source version of GPFS.

GPFS is, by some measures, still the fastest parallel file system built for a shared-disk environment over a storage area network (SAN). Since any file system node can access any portion of the shared disks, GPFS requires a locking mechanism to support fully parallel access to both file data and metadata.

GPFS uses a centralized global lock manager in conjunction with a local lock manager on each file system node. The global lock manager hands out lock tokens to the nodes that request them.
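To make the token idea concrete, here is a minimal sketch of how a central manager and per-node local lock managers could interact. It is not GPFS code: the names TokenManager, Node, acquire, lock and revoke are made up for illustration, and it assumes only exclusive tokens and no failures. The point it shows is that once a node holds a token, repeated locks on the same object need no further messages to the central manager.

```python
class TokenManager:
    """Hypothetical central manager: tracks which node holds each token."""

    def __init__(self):
        self.holder = {}    # object id -> node currently holding the token
        self.messages = 0   # number of requests sent to the central manager

    def acquire(self, node, obj):
        self.messages += 1
        prev = self.holder.get(obj)
        if prev is not None and prev is not node:
            prev.revoke(obj)          # the conflicting holder gives up its token
        self.holder[obj] = node


class Node:
    """Hypothetical file system node whose local lock manager caches tokens."""

    def __init__(self, name, manager):
        self.name = name
        self.manager = manager
        self.tokens = set()

    def lock(self, obj):
        if obj not in self.tokens:    # contact the manager only on a miss
            self.manager.acquire(self, obj)
            self.tokens.add(obj)

    def revoke(self, obj):
        self.tokens.discard(obj)


manager = TokenManager()
a, b = Node("A", manager), Node("B", manager)
for _ in range(3):
    a.lock("file1")   # one request to the manager, then two local hits
b.lock("file1")       # one request; A's token is revoked
a.lock("file1")       # one request; B's token is revoked
```

Five lock calls cost three requests to the manager here. The two that were free are the repeated locks on node A while it still held the token.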

GPFS guarantees single-node-equivalent POSIX semantics for file system operations across the cluster, meaning a read on node A will see either all or none of a concurrent write on node B (read/write atomicity). There is one exception: the access time (atime in the metadata) is updated only periodically, because concurrent reads are very common and synchronizing atime on every read would be very expensive.

The paper describes two approaches to achieving the necessary synchronization:
1. Distributed locking: every FS operation acquires a read or write lock to synchronize with conflicting operations.
2. Centralized management: all conflicting operations are forwarded to a designated node, which performs the requested operation.

GPFS uses byte-range locking for updates to file data, and dynamically elected "metanodes" for centralized management of file metadata. The argument for using a different approach for data and for metadata is this:
(1) When lock conflicts are frequent (e.g. many nodes may access different parts of a file, but all need to access the same metadata), the overhead of distributed locking may exceed the cost of forwarding requests to a central node.
(2) If different nodes operate on different pieces of file data, distributed locking allows greater parallelism.
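Point (2) is easy to see with a small byte-range lock table. This is only a sketch under simplifying assumptions: all locks are exclusive write locks, ranges are half-open [start, end), and the names RangeLock, ByteRangeLocks and try_lock are hypothetical rather than anything from GPFS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLock:
    node: str
    start: int   # inclusive byte offset
    end: int     # exclusive byte offset


def overlaps(a, b):
    """Two half-open ranges overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


class ByteRangeLocks:
    """Hypothetical lock table for a single file."""

    def __init__(self):
        self.held = []

    def try_lock(self, node, start, end):
        req = RangeLock(node, start, end)
        for h in self.held:
            if h.node != node and overlaps(h, req):
                return False          # conflicts with another node's range
        self.held.append(req)
        return True


locks = ByteRangeLocks()
r1 = locks.try_lock("A", 0, 4096)       # A writes the first 4 KiB
r2 = locks.try_lock("B", 4096, 8192)    # B writes the next 4 KiB, no conflict
r3 = locks.try_lock("B", 2048, 6144)    # overlaps A's range, so it is refused
```

Nodes A and B can write to disjoint ranges of the same file at the same time. Only the overlapping request has to wait.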

Also, a smaller lock granularity means more overhead from frequent lock requests, whereas a larger granularity may cause more frequent lock conflicts. Thus byte-range locks are used for file data, and a per-file lock is used for metadata.
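The granularity tradeoff can be illustrated with a toy count. The function below is a made-up model, not a measurement: it assumes a file is locked in fixed-size units, that each write must lock every unit it touches, and that a unit touched by more than one node counts as one conflict. The name lock_cost and the workload are hypothetical.

```python
def lock_cost(writes, granularity):
    """Return (lock_requests, conflicting_units) for a given lock unit size."""
    owners = {}      # unit index -> set of nodes that need that unit
    requests = 0
    for node, offset, length in writes:
        first = offset // granularity
        last = (offset + length - 1) // granularity
        for unit in range(first, last + 1):
            requests += 1
            owners.setdefault(unit, set()).add(node)
    conflicts = sum(1 for nodes in owners.values() if len(nodes) > 1)
    return requests, conflicts


# Four nodes each write a disjoint 1 KiB region of the same file.
writes = [("A", 0, 1024), ("B", 1024, 1024), ("C", 2048, 1024), ("D", 3072, 1024)]

fine = lock_cost(writes, 256)      # (16, 0): many requests, no conflicts
coarse = lock_cost(writes, 4096)   # (4, 1): few requests, one contended unit
```

With 256-byte units the four writers never collide but issue 16 lock requests between them. With a single 4 KiB unit they issue only 4 requests, but all of them contend for the same lock.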

However, there could be a third approach, a middle solution: Panopticon.

