NFS Design and Write Optimization
May 04, 2017
This semester, as part of a project for my Operating Systems class, I ended up reading a lot about the design of the Network File System (NFS). I’ve always been fascinated by how file systems work and the NFS’ distributed nature introduces an interesting layer of complexity. This post is split into three sections. The first describes a simple distributed file system, the second gives a brief overview of the NFS protocol, and the third discusses some techniques that have been used to improve the NFS’ write performance.
The NFS is a simple but mature distributed file system protocol developed by Sun Microsystems in 1984 1. Despite its age, it still makes for an interesting system to study because of its simplicity and widespread usage. The NFS’ protocol specifications are public, so anyone can implement them. The NFS was designed to allow local programs to access remote files in a manner that is transparent to the application 2. Once an NFS file server is mounted onto a local machine, users of that machine can treat the directories and files on the server as if they were stored locally. This is done by exposing a POSIX-compliant API. The NFS allows users to access their files from multiple clients (as long as those clients are connected to the file server) and lets administrators manage those files centrally on the server.
A Simple Distributed File System
A distributed file system is much more complex than a local file system because there are many more things that can go wrong. Apart from the things that could go wrong in a local file system, such as file system corruption, a distributed file system has to deal with failing servers and unreliable networks. Here are 3 instances where a client might find its request failing 1.
- Request from client to server is lost, e.g. dropped by the network.
- The server is down and therefore unable to complete the client’s request.
- Reply from server to client is lost in transit.
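From the client’s point of view, the three failures above look the same: no reply arrives in time. The sketch below simulates that with a scripted fake server; the class name, the outcome labels, and the retry limit are all hypothetical and only meant to illustrate the idea, not how a real NFS client is written.

```python
from dataclasses import dataclass


class Timeout(Exception):
    """No reply arrived in time; the client cannot tell why."""


@dataclass
class ScriptedServer:
    # One entry per attempt: "lost_request", "server_down", "lost_reply" or "ok".
    # These labels are hypothetical and exist only for this simulation.
    script: list
    executed: int = 0  # how many times the server actually ran the request

    def call(self, request):
        outcome = self.script.pop(0) if self.script else "ok"
        if outcome in ("lost_request", "server_down"):
            raise Timeout(outcome)  # the request was never executed
        self.executed += 1  # the server did the work...
        if outcome == "lost_reply":
            raise Timeout(outcome)  # ...but the client never hears back
        return f"done: {request}"


def call_with_retry(server, request, max_attempts=5):
    """Retry until a reply arrives; returns (reply, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return server.call(request), attempt
        except Timeout:
            continue  # all three failures look identical, so just retry
    raise Timeout(f"gave up after {max_attempts} attempts")
```

Note that when only the reply is lost, the server has already executed the request, so a retry makes it run a second time. That is harmless only if running the request twice has the same effect as running it once.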
Distributed file systems also have to address consistency issues and file staleness. Because the authoritative copy of a file is stored on the server, users can end up working on old versions of their files if they are not aware of the consistency semantics of the distributed file system in use. For example, suppose user A reads a file and modifies it, but the change has not yet reached the server, either because A’s client has not sent the update or because the network delayed it. If user B reads the file in the meantime, B will unknowingly receive the older version of the file.
NFS Protocol Overview
We start with NFS version 2 since NFS version 1 was implemented internally within Sun Microsystems and was never released 3. One of the NFS’ design philosophies was to have a dumb server but smart clients 3. This was because NFSv2’s main design goal was to create a simple protocol with fast and simple server crash recovery 1. In light of this, two main design choices were made: to make the NFS protocol stateless and to make most operations idempotent. These two choices made it easier to design a system that is resilient to server failures because they allow clients to retry requests without having to maintain shared state with the server. In this system, it is the clients that have the complicated task of managing state.
Once a client mounts the remote file system, user programs are able to access the files on the file server as if the files were stored locally. This removes the need for access methods or path names that include the host name. Clients only need to deal with host names once 4. The NFS client provides POSIX-compliant system calls to the user program, and the client-side file system translates those system calls into the equivalent NFS protocol requests. The NFS protocol uses Remote Procedure Calls (RPCs) to communicate with the file server. This delegates the responsibility for dealing with network issues, such as retries and timeouts, to the RPC layer. An RPC is a blocking call that closely mirrors a local procedure call.
Stateless Servers
The NFS server is stateless. Since most file system operations require some sort of state, for example, when using file descriptors, the client-side file system in NFS manages this state locally. The NFS protocol expects a file handle: a unique identifier for the file consisting of a volume identifier, an inode number, and a generation number. The client-side file system maps local file descriptors to file handles and, depending on the operation, includes other information, such as the offset, that the file server might need. This allows the file server to be stateless, since it receives all the information it needs to serve a client’s request inside the request itself. Hence, the responsibility of managing state lies with the clients. Although the server is stateless, it can still cache information in memory to improve performance, as long as the system does not depend on this information to work correctly 5.
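To make the division of labour concrete, here is a minimal in-memory sketch. The `FileHandle` fields follow the description above; the `StatelessServer` and `Client` classes, their method names, and the way `open` takes a handle directly (rather than looking it up by path) are simplifying assumptions, not the real NFS interface.

```python
from typing import NamedTuple


class FileHandle(NamedTuple):
    volume_id: int
    inode: int
    generation: int


class StatelessServer:
    def __init__(self, files):
        # Maps FileHandle -> bytes. This is stored file data, not per-client state.
        self.files = files

    def read(self, handle, offset, count):
        # Everything needed to serve the request arrives in the request itself.
        data = self.files.get(handle)
        if data is None:
            raise LookupError("stale file handle")
        return data[offset:offset + count]


class Client:
    def __init__(self, server):
        self.server = server
        self.open_files = {}  # fd -> [handle, current offset], kept on the client
        self.next_fd = 3

    def open(self, handle):
        fd = self.next_fd
        self.next_fd += 1
        self.open_files[fd] = [handle, 0]
        return fd

    def read(self, fd, count):
        entry = self.open_files[fd]
        handle, offset = entry
        data = self.server.read(handle, offset, count)
        entry[1] = offset + len(data)  # the client, not the server, advances the offset
        return data
```

Because the offset lives in the client’s table, the server object can be replaced (as if it had rebooted) between two reads and the client carries on without noticing, as long as the stored data is the same.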
(Mostly) Idempotent Operations
Making most of the NFS protocol’s operations idempotent allows clients to retry operations on the file server multiple times without worrying about corrupting files. For example, if a write operation were not idempotent, executing a write might cause the next write to land in the next block. Retrying the same write several times would then overwrite a different block each time, whereas an idempotent operation has the same result whether it is executed once or multiple times. Moreover, since operations are idempotent, clients do not need complicated or special ways to deal with the different types of operation failures that could happen. Clients can simply retry an operation when it fails, regardless of how it failed. However, it is important to note that not all operations are idempotent. The MKDIR and REMOVE operations are not idempotent.
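The difference is easy to see with a toy example. The three helper functions below are hypothetical stand-ins operating on in-memory data, not NFS procedures: a write at an explicit offset can be repeated safely, while an append or a directory creation cannot.

```python
def write_at(buf: bytearray, offset: int, data: bytes) -> None:
    """Idempotent: the request says exactly where the data goes."""
    end = offset + len(data)
    if len(buf) < end:
        buf.extend(b"\0" * (end - len(buf)))  # grow the file if needed
    buf[offset:end] = data


def append(buf: bytearray, data: bytes) -> None:
    """Not idempotent: the result depends on how many times it has run."""
    buf.extend(data)


def mkdir(dirs: set, name: str) -> None:
    """Not idempotent: a retry after a lost reply fails with 'already exists'."""
    if name in dirs:
        raise FileExistsError(name)
    dirs.add(name)
```

Calling `write_at(buf, 1, b"XY")` twice leaves `buf` exactly as one call would, whereas calling `append(buf, b"XY")` twice adds the data twice, and a second `mkdir` for the same name raises even though the first one succeeded.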
Limitations
The NFS was designed for simplicity, but it has several limitations, mainly with regard to performance, consistency, and semantic difficulties 5.
- Performance. The NFS has slow write performance and this is given a more thorough treatment in the next section.
- Consistency. Servers in NFS do not keep track of which clients are using which files. Because of this, it is the responsibility of the clients to check if their cached files have been modified. Clients do this by polling the server every time they would like to use their cached data by issuing a GETATTR request. However, this request itself is cached by the client for a few seconds to avoid overloading the server with GETATTR requests from multiple clients. This works since the common case is that a client is usually the only one using a file 1. As a result, the system is eventually consistent and there are windows of inconsistency where stale data might be used by clients.
- Semantic Difficulties. There are certain features that are difficult to implement and still maintain an NFS server’s stateless and idempotent properties. Furthermore, there are also instances in which there is a semantic conflict, since not all NFS operations are idempotent, such as the case of the MKDIR operation.
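The attribute caching described under Consistency can be sketched roughly as follows. The `AttrCache` class, the 3-second timeout, and the use of a modification time as the only attribute are assumptions made for illustration; the text above only says the result is cached “for a few seconds”.

```python
import time


class AttrCache:
    def __init__(self, getattr_rpc, ttl=3.0, clock=time.monotonic):
        self.getattr_rpc = getattr_rpc  # callable: handle -> modification time
        self.ttl = ttl                  # assumed timeout, in seconds
        self.clock = clock              # injectable so the cache can be tested
        self.entries = {}               # handle -> (mtime, time fetched)

    def get_mtime(self, handle):
        now = self.clock()
        hit = self.entries.get(handle)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]  # answered locally; may be stale
        mtime = self.getattr_rpc(handle)  # one GETATTR round trip
        self.entries[handle] = (mtime, now)
        return mtime

    def cached_data_is_fresh(self, handle, mtime_when_cached):
        # Cached file data is reused only if the file has not changed since.
        return self.get_mtime(handle) == mtime_when_cached
```

The window of inconsistency is visible here: if another client changes the file within the timeout, `get_mtime` keeps returning the old value until the entry expires.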
NFSv2 also had security limitations: it was rather easy for any user on a client to pretend to be another user and access that user’s files on the server 1. Later implementations of NFS included integration with the RPCSEC_GSS security protocol and Kerberos V5 to address some of these limitations 6.
Slow Writes
The WRITE operation in NFS is a blocking operation. Servers have to write data to persistent storage, i.e. disk, before responding to the client. If a server queued write operations in its cache and replied to the client right away, the write would not be guaranteed. The server might crash before it has a chance to save the data, or the write to disk might fail, for example, if the disk ran out of space. The client would then be left thinking its data had been saved when in fact it had not. Writing to disk is slow, and this significantly decreases the performance of NFS file servers. In some cases, slow writes can be the major performance bottleneck in the system 1 7, making writes to NFS file servers slower than writes to local disks in UNIX systems 5. A variety of solutions, using both hardware and software approaches, have been developed to improve NFS’s write performance.
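A rough sketch of what “write to disk before replying” means on the server side is shown below. The `handle_write` function and its signature are hypothetical; the point is only that `os.fsync` is called before the function returns, which is where the latency comes from.

```python
import os


def handle_write(path, offset, data):
    """Hypothetical server-side WRITE handler: reply only after data is durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        view = memoryview(data)
        written = 0
        while written < len(view):  # os.write may write fewer bytes than asked
            written += os.write(fd, view[written:])
        os.fsync(fd)  # block until the OS reports the data is on stable storage
    finally:
        os.close(fd)
    return written  # only now is it safe to tell the client the write succeeded
```

Dropping the `os.fsync` call would make the function return much sooner, but a crash shortly afterwards could lose data the client believes is saved.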
Caching
One of the first approaches to improve performance was to use caching, but on the client side. NFS clients use flush-on-close (a.k.a. close-to-open) consistency semantics 1. Clients buffer write requests and flush all updates to the server when the file is closed, which ensures that other clients receive the updated file on subsequent opens. Such write buffering reduces an application’s write latency since each write puts data in the client’s cache and succeeds immediately. While this improved write performance on the client, it still put an unnecessary load on the server and its disk, since a significant number of files are deleted or overwritten shortly after they are created. In Ousterhout’s study of the UNIX 4.2 BSD file system, as much as 20-30% of newly-written information was deleted within 30 seconds, and about 50% was deleted within 5 minutes 8. A more optimized implementation could have clients keep these short-lived files in memory until they are deleted and avoid interacting with the server entirely.
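Flush-on-close can be sketched as a small client-side wrapper. `BufferedFile` and the `send_write` callback are hypothetical names; a real client would also have to handle flush failures and memory limits, which this sketch ignores.

```python
class BufferedFile:
    def __init__(self, send_write):
        self.send_write = send_write  # callable(offset, data): one WRITE request
        self.pending = []             # buffered (offset, data) pairs

    def write(self, offset, data):
        # Succeeds immediately: nothing is sent to the server yet.
        self.pending.append((offset, bytes(data)))
        return len(data)

    def close(self):
        # Flush-on-close: every buffered update goes to the server, in order.
        for offset, data in self.pending:
            self.send_write(offset, data)
        self.pending.clear()
```

With this shape, a file that is written and then deleted before `close` would never need to generate server traffic, which is the optimization the paragraph above hints at.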
Non-volatile RAM (NVRAM)
The second approach also involves caching, this time on the server and with different hardware. With non-volatile RAM (NVRAM), servers can buffer write operations in memory. Since non-volatile memory survives a server reboot (the cached data is not lost) and writing to it is faster than writing to disk, write performance improves: servers can respond to clients right after writing to non-volatile memory and transfer the data to disk at a later time. If a crash occurs, servers commit the data in NVRAM to disk after a reboot. However, using NVRAM as a cache of unwritten disk blocks makes it part of the disk subsystem. A disadvantage of this is that if the NVRAM fails, it can corrupt the file system in ways that fsck cannot detect or repair 9. This technique of using NVRAM to achieve higher write speeds is seen in the Prestoserve NFS Accelerator products 9.
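The idea can be modelled in a few lines. In this sketch a plain Python list stands in for the NVRAM and a dict stands in for the disk; the `NvramServer` class is hypothetical and is not a description of how Prestoserve works internally.

```python
class NvramServer:
    def __init__(self, nvram, disk):
        self.nvram = nvram  # list that outlives the server object (stand-in for NVRAM)
        self.disk = disk    # dict: block number -> bytes (stand-in for the disk)

    def write(self, block, data):
        # Fast path: record the write in non-volatile memory and reply at once.
        self.nvram.append((block, data))
        return "ok"

    def flush(self):
        # Run in the background, and once more after a reboot to replay leftovers.
        while self.nvram:
            block, data = self.nvram[0]
            self.disk[block] = data
            self.nvram.pop(0)  # drop the entry only after it has reached the disk
```

A crash is modelled by throwing the server object away: a new `NvramServer` built over the same `nvram` and `disk` can call `flush()` and recover every write that was acknowledged.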
Custom File Systems
The NFS protocol does not require that its file server use a specific file system. While there are trade-offs when choosing a file system, having a choice allows us to pick one that is optimized for our use case of needing faster writes. Some examples of write-optimized file systems include the Log-structured File System (LFS), proposed by Mendel Rosenblum and John Ousterhout, and the Write Anywhere File Layout (WAFL), used by Network Appliance for its NFS server appliances. These file systems were designed at a time when the general purpose Unix FFS, which was a predecessor of the extended file system (ext) 1, was more commonly used as the NFS file server’s file system. Hence, most of the performance comparisons were made against the FFS, and these file systems achieved a significant speedup in write speeds over it.
Conclusion
The NFS’ design choices have made it simple for the server to recover from crashes, at the cost of slow writes. The approaches used to improve write performance include caching, a write gathering technique, specialized hardware such as NVRAM, and custom, specialized file systems.
When introduced, the NFS was successful because of its simplicity and robustness 5. However, despite its simplicity, or perhaps because of it, the NFS has faced consistency and performance issues. While we have seen a glimpse of what NFS has to offer, other distributed file systems have been designed since NFS’s first release, some of them for more specialized use cases than the NFS. Despite newer file systems, the NFS continues to be used today as a virtual file system for users to store their data over the network. Its long usage is testimony to its robustness and portability.
References
- Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Version 0.91. Arpaci-Dusseau Books, May 2015.
- Russel Sandberg. The Sun Network File System: Design, Implementation and Experience. Tech. rep. In Proceedings of the Summer 1986 USENIX Technical Conference and Exhibition, 1986.
- Brian Pawlowski et al. “NFS Version 3: Design and Implementation”. In: Proceedings of the Summer 1994 USENIX Technical Conference. 1994, pp. 137–151.
- Russel Sandberg. The Sun Network File System: Design, Implementation and Experience. Tech. rep. In Proceedings of the Summer 1986 USENIX Technical Conference and Exhibition, 1986.
- John K. Ousterhout. “The Role of Distributed State”. In: CMU Computer Science: A 25th Anniversary Commemorative. ACM Press, 1991, pp. 199–217.
- M. Eisler. NFS Version 2 and Version 3 Security Issues and the NFS Protocol’s Use of RPCSEC_GSS and Kerberos V5. RFC 2623. RFC Editor, June 1999.
- Chet Juszczak. “Improving the Write Performance of an NFS Server”. In: Proceedings of the USENIX Winter 1994 Technical Conference. WTEC’94. San Francisco, California: USENIX Association, 1994, pp. 20–20. URL: http://dl.acm.org/citation.cfm?id=1267074.1267094.
- John K. Ousterhout et al. “A Trace-driven Analysis of the UNIX 4.2 BSD File System”. In: Proceedings of the Tenth ACM Symposium on Operating Systems Principles. SOSP ’85. Orcas Island, Washington, USA: ACM, 1985, pp. 15–24. ISBN: 0-89791-174-1. DOI: 10.1145/323647.323631. URL: http://doi.acm.org/10.1145/323647.323631.
- Dave Hitz, James Lau, and Michael Malcolm. “File System Design for an NFS File Server Appliance”. In: Proceedings of the USENIX Winter 1994 Technical Conference. WTEC’94. San Francisco, California: USENIX Association, 1994, pp. 19–19. URL: http://dl.acm.org/citation.cfm?id=1267074.1267093.
Hi! I’m Stacey. Welcome to my blog. I’m a software engineer with an interest in programming languages and web performance. I also like making 🍵, reading fiction, and discovering random word origins.