NFS Design and Write Optimization

May 04, 2017

This semester, as part of a project for my Operating Systems class, I ended up reading a lot about the design of the Network File System (NFS). I’ve always been fascinated by how file systems work, and the NFS’s distributed nature introduces an interesting layer of complexity. This post is split into three sections: the first describes a simple distributed file system, the second gives a brief overview of the NFS protocol, and the third discusses some techniques that have been used to improve the NFS’s write performance.

The NFS is a simple but mature distributed file system protocol developed by Sun Microsystems in 1984 [1]. Despite its age, it still makes for an interesting system to study because of its simplicity and widespread usage. The NFS protocol specifications are public, so anyone can implement it. The NFS was designed to allow local programs to access remote files in a manner that is transparent to the application [2]. Once an NFS file server is mounted onto a local machine, users of that machine can treat the directories and files on the server as if they were stored locally; the client exposes a POSIX-compliant API to make this possible. The NFS allows users to access their files from multiple clients (as long as those clients are connected to the file server) and gives administrators a central point of administration on the server.

A Simple Distributed File System

A distributed file system is much more complex than a local file system because there are many more things that can go wrong. Apart from the things that could go wrong in a local file system, such as file system corruption, a distributed file system has to deal with failing servers and unreliable networks. Here are three instances where a client might find its request failing [1]:

  1. Request from client to server is lost, e.g. dropped by the network.
  2. The server is down and therefore unable to complete the client’s request.
  3. Reply from server to client is lost in transit.

Distributed file systems also have to address consistency issues and file staleness. Storing files in a separate place, i.e. on the server, can result in users working on old versions of their files if they are not aware of the consistency semantics of the distributed file system in use. For example, suppose user A reads a file and then modifies it, but the changes have not yet reached the server, either because A has not pushed the update or because of a network delay. If user B reads the file during this window, B will unknowingly receive the older version of the file.

NFS Protocol Overview

We start with NFS version 2, since NFS version 1 was implemented internally within Sun Microsystems and was never released [3]. One of the NFS’s design philosophies was to have a dumb server but smart clients [3]. This was because NFSv2’s main design goal was to create a simple protocol with fast and simple server crash recovery [1]. In light of this, two main design choices were made: to make the NFS protocol stateless, and to make most operations idempotent. These two choices made it easier to design a system that is resilient to server failures, because they allow clients to retry requests without having to maintain shared state with the server. In this system, it is the clients that have the complicated task of managing state.

Once a client mounts the remote file system, user programs are able to access the files on the file server as if the files were stored locally. This removes the need for special access methods or path names that include the host name; clients only need to deal with host names once, at mount time [4]. The NFS provides POSIX-compliant system calls to the user program, and the client-side file system translates those system calls into the equivalent NFS protocol operations. The NFS protocol uses Remote Procedure Calls (RPCs) to communicate with the file server. This delegates the responsibility for dealing with network issues, for example handling retries and timeouts, to the RPC layer. An RPC is a blocking procedure call and closely mirrors a local procedure call.
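
To make the retry behaviour concrete, here is a minimal sketch of what such a client-side RPC wrapper might look like, assuming a connected UDP socket and an already-serialized request. The function name `nfs_call`, the timeout, and the retry count are illustrative, not taken from any real implementation.

```python
import socket
import time

def nfs_call(sock, request, timeout=1.0, max_retries=5):
    """Send a request and wait for the reply, retrying on timeout.

    Because the server is stateless and (most) operations are idempotent,
    the client can simply resend the same request when it hears nothing
    back, without knowing whether the request or the reply was lost, or
    whether the server was temporarily down.
    """
    sock.settimeout(timeout)
    for attempt in range(max_retries):
        sock.send(request)            # the request may be lost in transit
        try:
            return sock.recv(8192)    # the reply may also be lost
        except socket.timeout:
            time.sleep(timeout)       # simple fixed back-off before the next attempt
    raise IOError("NFS server not responding")
```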

Stateless Servers

The NFS server is stateless. Since most file system operations require some sort of state, for example when using file descriptors, the client-side file system in NFS manages this state locally. The NFS protocol expects a file handle: a unique identifier for a file consisting of a volume identifier, an inode number, and a generation number. The client-side file system maps local file descriptors to file handles and, depending on the operation, includes other information the server might need, such as the file offset. This allows the file server to be stateless, since each request carries all the information the server needs in order to serve it. Hence, the responsibility of managing state lies with the clients. Although the server is stateless, it can still cache information in memory to improve performance, as long as the system does not depend on this information to work correctly [5].
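
As a rough illustration of where this state lives, here is a sketch of the client-side bookkeeping; the class and field names are mine, not the protocol’s, but the contents of the file handle follow the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileHandle:
    """Opaque identifier the server hands out; it names a file, not an open session."""
    volume_id: int
    inode_number: int
    generation: int      # distinguishes reuse of the same inode number

class ClientOpenFile:
    """Client-side state behind one local file descriptor.

    The server keeps none of this: every request carries the file handle,
    the offset, and the byte count, so it can be served in isolation.
    """
    def __init__(self, handle: FileHandle):
        self.handle = handle
        self.offset = 0

    def next_read_request(self, count: int) -> dict:
        request = {"op": "READ", "handle": self.handle,
                   "offset": self.offset, "count": count}
        self.offset += count   # the client, not the server, advances the offset
        return request
```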

(Mostly) Idempotent Operations

Having most of the NFS protocol’s operations be idempotent allows clients to retry operations on the file server multiple times without worrying about corrupting files. Consider a write operation that is not idempotent, say one that appends at whatever position the server last left off: retrying it after a lost reply would write the same data into a second block. An idempotent write, by contrast, names the exact block or offset it targets, so executing it once or several times leaves the file in the same state. Moreover, since operations are idempotent, clients do not need complicated or special ways to deal with the different types of failure that could happen; they can simply retry an operation when it fails, regardless of how it failed. However, it is important to note that not all operations are idempotent. MKDIR and REMOVE, for example, are not.
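
The difference is easiest to see side by side. The toy classes below are not NFS code; they only contrast a server that keeps a write position (so a retried request lands in a new place) with one whose requests name their own offset (so a retried request is harmless).

```python
class StatefulServer:
    """Non-idempotent: the server remembers where the last write ended."""
    def __init__(self):
        self.blocks = []

    def write_next(self, data):
        self.blocks.append(data)     # retrying this request duplicates the data

class StatelessServer:
    """Idempotent: every request says exactly where its data goes."""
    def __init__(self):
        self.blocks = {}

    def write_at(self, offset, data):
        self.blocks[offset] = data   # retrying this request changes nothing

server = StatelessServer()
request = (4096, b"hello")
server.write_at(*request)
server.write_at(*request)            # a lost-reply retry leaves identical state
```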

Limitations

The NFS was designed for simplicity, but it has several limitations, mainly with regard to performance, consistency, and semantic difficulties [5].

  1. Performance. The NFS has slow write performance; this is given a more thorough treatment in the next section.
  2. Consistency. Servers in NFS do not keep track of which clients are using which files. Because of this, it is the responsibility of the clients to check whether their cached files have been modified. Clients do this by polling the server with a GETATTR request every time they would like to use their cached data. However, the result of this request is itself cached by the client for a few seconds to avoid overloading the server with GETATTR requests from multiple clients (see the sketch after this list). This works because, in the common case, a client is usually the only one using a file [1]. As a result, the system is eventually consistent, and there are windows of inconsistency during which clients might use stale data.
  3. Semantic Difficulties. There are certain features that are difficult to implement while still maintaining an NFS server’s stateless and idempotent properties. Furthermore, since not all NFS operations are idempotent, retries can produce confusing results; a retried MKDIR, for example, can return an error even though the directory was in fact created.
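
Here is a sketch of the client-side attribute cache described in the consistency item above. The 3-second timeout and the `server.getattr`/`server.read` stubs are illustrative assumptions; real clients use tunable timeouts and the actual GETATTR and READ protocol operations.

```python
import time

ATTR_TIMEOUT = 3.0     # illustrative; real clients make this tunable

class CachedFile:
    def __init__(self, handle):
        self.handle = handle
        self.data = None                      # cached file contents
        self.attrs = None                     # last attributes seen from the server
        self.attrs_fetched_at = float("-inf")

    def read(self, server):
        # Only ask the server for attributes if our cached copy of them is stale.
        if time.monotonic() - self.attrs_fetched_at > ATTR_TIMEOUT:
            attrs = server.getattr(self.handle)        # GETATTR round trip
            if self.attrs and attrs["mtime"] != self.attrs["mtime"]:
                self.data = None                       # file changed: drop cached data
            self.attrs = attrs
            self.attrs_fetched_at = time.monotonic()
        if self.data is None:
            self.data = server.read(self.handle)       # fetch fresh contents
        return self.data
```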

NFSv2’s security limitations made it rather easy for any user on a client to pretend to be another user and access that user’s files on the server [1]. Later implementations of NFS integrated the RPCSEC_GSS security protocol and Kerberos V5 to address some of these limitations [6].

Slow Writes

The WRITE operation in NFSv2 is a blocking operation: servers have to write data to persistent storage, i.e. disk, before replying to the client. If a server instead queued write operations in a volatile cache and replied to the client immediately, the write would not be guaranteed; the server might crash before it has a chance to save the data, or the write to disk might fail, for example because the disk ran out of space. The client would then be left thinking its data had been saved when in fact it was not. Writing to disk is slow, so this requirement results in a significant decrease in performance for NFS file servers. In some cases, this slow write performance is the major performance bottleneck in the system [1, 7], and it results in writes to NFS file servers being slower than writes to local disks on UNIX systems [5]. A variety of solutions, using both hardware and software approaches, have been developed to improve NFS’s write performance.
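
A server-side handler for such a synchronous write might look like the sketch below; `send_reply` is a placeholder for however the server returns its RPC result, and a real server would also flush the affected metadata, which this sketch skips.

```python
import os

def handle_write(fd, offset, data, send_reply):
    """Sketch of an NFSv2-style synchronous WRITE handler.

    The reply only goes out after the data has reached stable storage,
    so a successful reply really does mean the write is durable.
    """
    os.pwrite(fd, data, offset)    # write at the offset carried in the request
    os.fsync(fd)                   # force the data to disk before acknowledging
    send_reply({"status": "OK", "count": len(data)})
```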

Caching

One of the first approaches to improving performance was to cache on the client side. NFS clients use flush-on-close (also known as close-to-open) consistency semantics [1]: clients buffer write requests and flush all updates to the server when the file is closed, to ensure that other clients receive the updated file on subsequent opens. Such write buffering reduces an application’s write latency, since each write puts data in the client’s cache and succeeds immediately. While this improved write performance on the client, it still put unnecessary load on the server and its disk, since a significant number of files are deleted or overwritten shortly after they are created. In Ousterhout’s study of the UNIX 4.2 BSD file system, as much as 20-30% of newly-written information was deleted within 30 seconds, and about 50% was deleted within 5 minutes [8]. A more aggressive implementation could keep these short-lived files in client memory until they are deleted and avoid interacting with the server entirely.
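
The client-side half of this might look like the sketch below, where `server.write` stands in for a WRITE RPC; the class is an illustration of flush-on-close buffering, not code from any real NFS client.

```python
class BufferedNFSFile:
    """Client-side write buffering with flush-on-close semantics (a sketch)."""

    def __init__(self, handle, server):
        self.handle = handle
        self.server = server
        self.dirty = {}                # offset -> data, buffered in client memory

    def write(self, offset, data):
        self.dirty[offset] = data      # returns immediately, no network traffic

    def close(self):
        # All buffered updates reach the server before close() returns, so the
        # next client to open the file sees them.
        for offset, data in sorted(self.dirty.items()):
            self.server.write(self.handle, offset, data)
        self.dirty.clear()
```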

Non-volatile RAM (NVRAM)

The second approach also involves caching on the server, but with different hardware. With non-volatile RAM (NVRAM), servers can buffer write operations in memory. Since non-volatile memory survives a server reboot (the cached data is not lost) and writing to memory is much faster than writing to disk, servers can respond to clients as soon as the data is in NVRAM and transfer it to disk later, which improves write performance. If a crash occurs, the server commits the data still in NVRAM to disk after it reboots. However, using NVRAM as a cache of unwritten disk blocks makes it part of the disk subsystem; a disadvantage of this is that if the NVRAM fails, it can corrupt the file system in ways that fsck cannot detect or repair [9]. This technique of using NVRAM to achieve higher write speeds is seen in the Prestoserve NFS Accelerator products [9].
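
A rough sketch of the idea, with `nvram_log` standing in for battery-backed memory that survives a reboot and `disk` for the ordinary disk subsystem (both hypothetical interfaces):

```python
class NVRAMWriteCache:
    """Server-side write-behind cache backed by NVRAM (a sketch)."""

    def __init__(self, nvram_log, disk):
        self.nvram_log = nvram_log     # assumed persistent across reboots
        self.disk = disk

    def handle_write(self, handle, offset, data, send_reply):
        self.nvram_log.append((handle, offset, data))   # fast, crash-safe record
        send_reply({"status": "OK"})                    # reply before touching disk

    def flush(self):
        # Runs in the background during normal operation, and again during
        # recovery after a reboot to commit anything still sitting in NVRAM.
        while self.nvram_log:
            handle, offset, data = self.nvram_log.pop(0)
            self.disk.write(handle, offset, data)
```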

Custom File Systems

The NFS protocol does not require its file server to use a specific on-disk file system. While there are trade-offs in choosing a file system, having the choice allows us to use one that is optimized for our use case of faster writes. Examples of write-optimized file systems include the Log-structured File System (LFS), proposed by Mendel Rosenblum and John Ousterhout, and the Write Anywhere File Layout (WAFL), used by Network Appliance for its NFS server appliances. These file systems were designed at a time when the general-purpose Unix FFS, a predecessor of the extended file system (ext) family [1], was the common choice for an NFS server’s file system. Hence, most of the published performance comparisons are against FFS, and these designs achieved significant speedups in write throughput.
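
To give a flavour of the log-structured idea, the sketch below batches many small writes into one large sequential append. It ignores everything that makes LFS and WAFL real file systems (inode maps, segment cleaning, snapshots) and assumes a hypothetical `log_device` with an `append` method.

```python
class LogStructuredWriter:
    """Toy illustration of log-structured writing: small random writes are
    buffered and then written as one large sequential append to a log,
    instead of seeking around the disk to update blocks in place."""

    def __init__(self, log_device, segment_size=512 * 1024):
        self.log_device = log_device
        self.segment_size = segment_size
        self.segment = []              # buffered (file_id, offset, data) records
        self.buffered_bytes = 0

    def write(self, file_id, offset, data):
        self.segment.append((file_id, offset, data))
        self.buffered_bytes += len(data)
        if self.buffered_bytes >= self.segment_size:
            self.flush_segment()

    def flush_segment(self):
        self.log_device.append(self.segment)   # one large sequential write
        self.segment = []
        self.buffered_bytes = 0
```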

Conclusion

The NFS’s design choices have made it simple for the server to recover from crashes, at the cost of slow write speeds. Some of the approaches used to improve this include client-side caching, write gathering, specialized hardware such as NVRAM, and custom, write-optimized file systems.

When introduced, the NFS was successful because of its simplicity and robustness [5]. However, despite its simplicity, or perhaps because of it, the NFS has faced consistency and performance issues. While we have only seen a glimpse of what NFS has to offer, other distributed file systems have been designed since NFS’s first release, some of them for more specialized use cases. Despite these newer systems, the NFS continues to be used today as a file system for storing data over the network. Its longevity is a testament to its robustness and portability.

References


  1. Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. 0.91. Arpaci-Dusseau Books, May 2015.
  2. Russel Sandberg. The Sun Network File System: Design, Implementation and Experience. Tech. rep. In Proceedings of the Summer 1986 USENIX Technical Conference and Exhibition, 1986.
  3. Brian Pawlowski et al. “NFS Version 3: Design and Implementation”. In: Proceedings of the Summer 1994 USENIX Technical Conference. 1994, pp. 137–151.
  4. Russel Sandberg. The Sun Network File System: Design, Implementation and Experience. Tech. rep. In Proceedings of the Summer 1986 USENIX Technical Conference and Exhibition, 1986.
  5. John K. Ousterhout. “The Role of Distributed State”. In: In CMU Computer Science: a 25th Anniversary Commemorative. ACM Press, 1991, pp. 199–217.
  6. M. Eisler. NFS Version 2 and Version 3 Security Issues and the NFS Protocol’s Use of RPCSEC_GSS and Kerberos V5. RFC 2623. RFC Editor, June 1999.
  7. Chet Juszczak. “Improving the Write Performance of an NFS Server”. In: Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference. WTEC’94. San Francisco, California: USENIX Association, 1994, pp. 20–20. URL: http://dl.acm.org/citation.cfm?id=1267074.1267094.
  8. John K. Ousterhout et al. “A Trace-driven Analysis of the UNIX 4.2 BSD File System”. In: Proceedings of the Tenth ACM Symposium on Operating Systems Principles. SOSP ’85. Orcas Island, Washington, USA: ACM, 1985, pp. 15–24. ISBN: 0-89791-174-1. DOI: 10.1145/323647.323631. URL: http://doi.acm.org/10.1145/323647.323631.
  9. Dave Hitz, James Lau, and Michael Malcolm. “File System Design for an NFS File Server Appliance”. In: Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference. WTEC’94. San Francisco, California: USENIX Association, 1994, pp. 19–19. URL: http://dl.acm.org/citation.cfm?id=1267074.1267093.
