Distributed File Systems

 

A File System is a refinement of the more general abstraction of permanent storage; Databases and Object Repositories are other refinements of the same abstraction. A file system defines the naming structure, the characteristics of its files and the set of operations associated with them.

 

The classification of computer systems and the corresponding file system requirements are given below. Each level subsumes the functionality of the layers below in addition to the new functionality required by that layer.

 

[Figure: classification of computer systems and their corresponding file system requirements]

Distributed File Systems constitute the highest level of this classification. Multiple users on multiple machines connected by a network share a common file system, which must be efficient, secure and robust.

 

The issues of file location and file availability become significant. Ideally, file location should be transparent: files are accessed by a path name in the naming structure that is independent of the physical location of the file. It is of course simpler to statically bind names or subtrees of the namespace to machine locations, which the user must then supply to access the relevant files. This is unsatisfactory: users are required to remember machine names, and when these names are hardwired into applications it becomes inconvenient to relocate parts of the file system during ordinary administrative activity. Availability is achieved through replication, which introduces complications in maintaining consistency. Availability is important because a user may have no control over the physical location at which a file is stored and, when using that file from another site, should still have access irrespective of the status of the host on which the file resides.

 

File Systems and Databases

File Systems and Databases have many similar properties and requirements. However, there are also many conceptual differences.

 

Encapsulation: File systems view the data in files as a collection of uninterpreted bytes, whereas databases contain information about the type of each data item and its relationships with other data items. Because the data is typed, the database can constrain the values of certain items. A database is therefore a higher-level abstraction of permanent storage, as it subsumes a portion of the functionality that would otherwise have to be built by applications on top of file systems. In addition to enforcing constraints, it provides query and indexing mechanisms over the data.

 

Naming: File Systems organise files into directory hierarchies. The purpose of these hierarchies is to help humans deal with a large number of named objects. However, for very large file namespaces this method of naming can still be difficult for users to deal with. Databases allow associative access to data, that is, data is identified by content that matches search criteria.

 

The ratio of search time to usage time is the factor that determines whether access by name is adequate. If an item is used very frequently, the ratio is low and a file system is adequate. Not surprisingly, the usage patterns of a file system exhibit considerable temporal locality of this kind. Database usage patterns exhibit very little temporal locality.

 


File Systems and Databases (Continued)

Dealing with usage patterns: Distributed file systems and databases therefore use two different strategies for enhancing performance. Distributed file systems use data shipping, where data is brought to the point of use; it is cached there and commonly reused many times. Distributed databases use function shipping, where the computation is shipped to the site where the data is stored. Transporting the search function requires only a small amount of network bandwidth; the alternative would be to transport all of the data to be searched to the site where the search is performed.
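
The contrast can be made concrete with a small Python sketch. The names used here (FileServer, DatabaseServer and so on) are illustrative assumptions, and both "machines" live in one process; in a real system the fetched bytes, or the shipped search predicate, would cross the network.

    # Sketch contrasting data shipping (file systems) with function shipping (databases).

    class FileServer:
        def __init__(self, files):
            self.files = files                     # name -> bytes

        def fetch(self, name):
            # Data shipping: the whole file travels to the client.
            return self.files[name]

    class DatabaseServer:
        def __init__(self, rows):
            self.rows = rows                       # list of records

        def query(self, predicate):
            # Function shipping: only the small predicate travels; the search
            # runs where the data lives and only matching rows come back.
            return [r for r in self.rows if predicate(r)]

    # Client side
    fs = FileServer({"log.txt": b"error at 10:02\nok at 10:03\n"})
    data = fs.fetch("log.txt")                                     # bytes shipped to client
    errors = [line for line in data.splitlines() if b"error" in line]   # searched locally

    db = DatabaseServer([{"time": "10:02", "level": "error"},
                         {"time": "10:03", "level": "ok"}])
    matches = db.query(lambda r: r["level"] == "error")            # computation shipped to data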

 

Granularity of Concurrency: Databases are typically used by applications that involve concurrent read and write sharing of data at fine granularity by large numbers of users. In addition there are requirements for strict consistency of this data and atomicity of groups of operations (transactions). With file systems, sharing is more coarse grained, at the file level, and individual files are rarely shared for write access by multiple users although read sharing is quite common, for example, system executable files. It is this combination of application characteristics that makes it a lot more difficult to implement distributed databases than distributed file systems.

 

Empirical Observations

Many of the current techniques for designing distributed file systems have arisen from empirical studies of the usage of existing centralised file systems.

 

The size distribution of files is useful for determining the most efficient means of mapping files to disk blocks. It is found that most files are relatively small.

 

The relative and absolute frequency of different types of file operations has influenced the design of cache fetch algorithms and concurrency mechanisms. Most file access is sequential rather than random, and read operations are more common than writes. It is also noted that users rarely share files for writing.

 

Information on file mutability is also useful. It is found that most files that are written are overwritten, often repeatedly; consider, for example, temporary files generated by a compiler, or files that are being edited and saved again and again. This information can guide a cache write policy.

 

The type of a file may substantially influence its properties. Type specific file information is useful in file placement and in the design of replication mechanisms. Executable files are easily replicated as access is read only.

 

What is the size of the set of files referenced in a short period? This influences the size of the cache and the cache fetching policy. Most applications exhibit temporal locality within the file namespace, and this set is found to be small in general.

 

Problems with Empirical Data

Gathering empirical data requires modification of an operating system to monitor these parameters. The monitoring itself may have a small impact on system performance.

 

When interpreting the data, consideration must be given to the environment in which it was gathered. A study of a file system used in an academic environment may not be sufficiently general for other kinds of environments.

 

Another concern relates to the interdependency of the design studied and the data observed. The nature of a particular file system implementation may unduly influence the way in which users use the system. For example, if the caching system used whole-file transfer there would be a disincentive to create very large files.

 

Another issue is that most empirical studies are carried out on existing centralised file systems under the assumption that user behaviour and programming characteristics will not change significantly in a distributed system.


Mechanisms for Building Distributed File Systems

 

Mounting

Mount mechanisms allow different file namespaces to be bound together to form a single hierarchical namespace. The Unix operating system uses this mechanism. A special entry, known as a mount point, is created at some position in the local file namespace and is bound to the root of another file namespace. From the user's point of view a mount point is indistinguishable from a local directory entry and, once mounted, may be traversed using standard path names. File access is therefore location transparent for the user but not for the system administrator. The kernel maintains a structure called a mount table which maps mount points to the appropriate file systems. Whenever a file access path crosses a mount point, the access is intercepted by the kernel, which then obtains the required service from the remote server.
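
The interception step can be sketched as follows. The classes and the prefix-based lookup are illustrative assumptions rather than the actual Unix kernel data structures, but they show how a path that crosses a mount point is redirected to another (possibly remote) file system.

    # Illustrative sketch of mount-point interception during path resolution.

    class LocalFS:
        def read(self, path):
            return f"<local data for {path}>"

    class RemoteFS:
        def __init__(self, server):
            self.server = server
        def read(self, path):
            # In a real system this would be a request to the remote server.
            return f"<data for {path} fetched from {self.server}>"

    mount_table = {
        "/":          LocalFS(),
        "/usr/share": RemoteFS("serverA"),
        "/home":      RemoteFS("serverB"),
    }

    def resolve(path):
        # Longest matching mount point wins (a real implementation would match
        # whole path components rather than raw string prefixes).
        mount = max((m for m in mount_table if path.startswith(m)), key=len)
        rest = path[len(mount):] or "/"
        return mount_table[mount], rest

    fs, rest = resolve("/home/alice/notes.txt")
    print(fs.read(rest))          # transparently served by serverB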

 

Client machines can maintain mount information individually as is done in Sun's Network File System. Every client individually mounts all the required file systems at specific points in its local file space. Note that each client need not necessarily mount the file systems at the same points. This makes it difficult to write portable code which can locate files in the distributed file system irrespective of where it is executed. Movement of files between servers requires each client to unmount and remount the affected subtree. Note that to simplify the implementation, Unix does not allow hard links to be placed across mount points, i.e. between two separate file systems. Also, servers are unaware of where the subtrees exported by them have been mounted.

 

Mount information can instead be maintained at the servers, in which case it is possible for every client to see an identical namespace. If files are moved to a different server, the mount information need only be updated at the servers. Each server manages domains, or volumes, which are subtrees of the overall namespace. A client may initially direct a file request to any server; a server that does not manage the domain containing the required file directs the client to the server that does, and this information may be used to guide the client more efficiently in future accesses.
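
The referral idea can be sketched as below. The server names, domain map and method names are invented for illustration: a client sends a request to any server, a server that does not manage the relevant domain replies with a referral, and the client remembers the referral to guide later accesses.

    # Sketch of server-maintained location information with client referrals.

    class Server:
        def __init__(self, name, owned, location_map):
            self.name = name
            self.owned = owned                    # set of domain prefixes stored here
            self.location_map = location_map      # domain prefix -> server name

        def lookup(self, path):
            domain = max((d for d in self.location_map if path.startswith(d)), key=len)
            if domain in self.owned:
                return ("data", f"<{path} served by {self.name}>")
            return ("referral", self.location_map[domain])

    location_map = {"/src": "s1", "/doc": "s2"}
    servers = {"s1": Server("s1", {"/src"}, location_map),
               "s2": Server("s2", {"/doc"}, location_map)}
    referral_cache = {}                           # client-side memory of referrals

    def client_read(path, first_try="s1"):
        target = referral_cache.get(path, first_try)
        kind, payload = servers[target].lookup(path)
        if kind == "referral":
            referral_cache[path] = payload        # remember for future accesses
            kind, payload = servers[payload].lookup(path)
        return payload

    print(client_read("/doc/guide.txt"))          # first request is referred from s1 to s2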

 

Client Caching

Caching is the architectural feature that contributes most to performance in a distributed file system. Caching exploits temporal locality of reference: a copy of data stored at a remote server is brought to the client. Metadata such as directory contents, protection information and file status or location information also exhibits locality of reference and is a good candidate for caching.

 

Data can be cached in main memory or on the local disk. A key issue is the size of the cached units: whether entire files or individual file blocks are cached. Caching entire files is simpler, and most files are in fact read sequentially in their entirety, but files larger than the client cache cannot be fetched. Cache validation can be done in two ways: the client can contact the server before using the cached copy, or the server can notify clients when cached data becomes stale. The latter approach can reduce client-server traffic.
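
The two validation styles can be sketched as follows. The version numbers, callback registration and method names are assumptions made for illustration, not any particular system's protocol.

    # Sketch of client-driven validation versus server-driven invalidation.

    class Server:
        def __init__(self):
            self.files = {}         # name -> (version, data)
            self.callbacks = {}     # name -> set of clients to notify on change

        def read(self, name, client=None):
            if client is not None:
                self.callbacks.setdefault(name, set()).add(client)
            return self.files[name]

        def version(self, name):
            return self.files[name][0]

        def write(self, name, data):
            version = self.files.get(name, (0, b""))[0] + 1
            self.files[name] = (version, data)
            for c in self.callbacks.get(name, set()):
                c.invalidate(name)                  # server notifies caching clients

    class Client:
        def __init__(self, server):
            self.server = server
            self.cache = {}         # name -> (version, data)

        def read_check_on_use(self, name):
            # Style 1: ask the server whether the cached copy is still current.
            if name in self.cache and self.cache[name][0] == self.server.version(name):
                return self.cache[name][1]
            self.cache[name] = self.server.read(name)
            return self.cache[name][1]

        def read_with_callback(self, name):
            # Style 2: trust the cache until the server calls back.
            if name not in self.cache:
                self.cache[name] = self.server.read(name, client=self)
            return self.cache[name][1]

        def invalidate(self, name):
            self.cache.pop(name, None)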

 

A number of approaches to propagating changes back to the server are also possible. A change may be propagated when the file is closed, or propagation may be deferred for a set period of time, which is useful where files are overwritten frequently. A policy of deferred update is also useful where a host can become disconnected from the network and propagates its updates upon reconnection.
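
A minimal sketch of these write policies is given below; the flush interval and the class and method names are invented for illustration. Repeated overwrites simply replace the pending copy, so short-lived versions never reach the server.

    import time

    # Sketch of write-on-close and deferred write-back policies in a client cache.

    class Server:
        def __init__(self):
            self.files = {}
        def store(self, name, data):
            self.files[name] = data

    class WriteBackCache:
        def __init__(self, server, flush_after=30.0):
            self.server = server
            self.flush_after = flush_after        # seconds to defer propagation
            self.dirty = {}                       # name -> (data, time of last write)

        def write(self, name, data):
            self.dirty[name] = (data, time.time())

        def close(self, name):
            # Write-on-close: propagate when the file is closed.
            if name in self.dirty:
                self.server.store(name, self.dirty.pop(name)[0])

        def periodic_flush(self):
            # Deferred write-back: propagate data that has been idle long enough.
            now = time.time()
            for name, (data, when) in list(self.dirty.items()):
                if now - when >= self.flush_after:
                    self.server.store(name, data)
                    del self.dirty[name]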

Hints

Caching is an important mechanism for improving the performance of a file system; however, guaranteeing the consistency of cached items requires elaborate and expensive client/server protocols. An alternative approach to maintaining consistency is to treat cached data (mostly metadata) as hints. With this scheme, cached data is not expected to be completely consistent, but when it is, it can dramatically improve performance. For maximum performance benefit, a hint should nearly always be correct. To avoid negative consequences when a hint is wrong, the classes of applications that use hints must be restricted to those that can recover after discovering that the cached data is invalid; that is, the data should be self-validating upon use. Note that file data cannot be treated as a hint, because use of the cached copy will not reveal whether it is current or stale.

 

For example, after the name of a file or directory is mapped to a physical object at the server, the address of that object can be stored as a hint in the client's cache. If the address fails to map to the object on a subsequent access, it is purged from the cache and the client must consult the file server to determine the actual location of the object. File locations are unlikely to change frequently, so the use of hints can benefit access performance.
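
The pattern amounts to a cached value plus a fallback path, as in the sketch below. The function names and the name_service/fetch_from parameters are hypothetical stand-ins for the real lookup and transfer operations.

    # Sketch of a location hint that is validated by use and purged when wrong.

    location_hints = {}           # file name -> last known server address

    def read_file(name, name_service, fetch_from):
        hint = location_hints.get(name)
        if hint is not None:
            data = fetch_from(hint, name)
            if data is not None:                  # hint was right: fast path
                return data
            location_hints.pop(name, None)        # stale hint: purge it
        location = name_service(name)             # authoritative (slower) lookup
        location_hints[name] = location           # cache the new hint
        return fetch_from(location, name)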

 

Bulk Data Transfer

All data transfer across a network requires the execution of various layers of communication protocols. Data is assembled into and disassembled from packets, copied between the buffers of the various protocol layers, and transmitted in individually acknowledged packets over the network. For small amounts of data the transit time across the network is low, but there are relatively high latency costs involved in establishing peer connections, packaging the small data packets for individual transmission and acknowledging receipt of each packet at each layer.

 

Transferring data in bulk reduces the relative cost of this overhead at the source and destination. With this scheme, multiple consecutive data packets are transferred from servers to clients (or vice versa) in a burst. At the source, multiple packets are formatted and transmitted with one context switch from client to kernel. At the destination, a single acknowledgement is used for the entire sequence of packets received.

 

Caching amortizes the cost of accessing remote servers over several local references to the same information; bulk transfer amortizes the cost of the fixed communication protocol overheads, and possibly disk seek time, over many consecutive blocks of a file. Bulk transfer protocols depend for their effectiveness on spatial locality of reference within files. Recall that there is substantial empirical evidence that files are usually read in their entirety. File server performance may therefore be enhanced by transmitting a number of consecutive file blocks in response to a client block request.
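
The server side of such a scheme can be sketched as a block-read handler that returns several consecutive blocks for a single request; the block size and read-ahead count below are arbitrary illustrative values.

    BLOCK_SIZE = 8192
    READ_AHEAD = 8          # consecutive blocks returned per request (illustrative)

    # Sketch: answer one block request with a burst of consecutive blocks,
    # amortising protocol overhead and disk seek time over the whole burst.

    def read_blocks(path, first_block, count=READ_AHEAD):
        blocks = []
        with open(path, "rb") as f:
            f.seek(first_block * BLOCK_SIZE)
            for _ in range(count):
                block = f.read(BLOCK_SIZE)
                if not block:
                    break                         # end of file
                blocks.append(block)
        return blocks                             # acknowledged as one unit by the client

    # A client reading a file sequentially that asks for block 0 receives blocks 0-7,
    # so its next seven reads are satisfied locally without further requests.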

 

Encryption

Encryption is used for enforcing security in distributed systems. A number of possible threats exist such as unauthorised release of information, unauthorised modification of information, or unauthorised denial of resources. Encryption is primarily of value in preventing unauthorised release and modification of information.

 

The Kerberos protocol is the mechanism most commonly used: it employs encryption to initially establish trusted communication between two parties and to generate a private session key for encrypting and decrypting subsequent messages between them. Private session keys tend to be shorter than those used for public key encryption, which makes the encryption cheaper to perform.
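
As a minimal sketch of the second step only, the snippet below uses the symmetric Fernet scheme from the third-party cryptography package to protect messages with a shared session key; how the key is established and distributed (the job Kerberos performs) is outside the sketch.

    from cryptography.fernet import Fernet    # third-party package: pip install cryptography

    # Once both parties hold the same session key, subsequent messages are
    # encrypted by the sender and decrypted by the receiver.
    session_key = Fernet.generate_key()       # in practice issued by the key server
    client_side = Fernet(session_key)
    server_side = Fernet(session_key)

    request = client_side.encrypt(b"READ /home/alice/notes.txt offset=0 count=8192")
    plaintext = server_side.decrypt(request)  # an eavesdropper sees only ciphertext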

 

For performance, encryption/decryption may be performed by special hardware at the client and server. It may be difficult to justify the cost or need for this hardware to users. The benefit is not as tangible as extra memory, processor speed or graphics capability and may be viewed as an expensive frill until the importance of security is perceived.


Stateful and Stateless File Servers

A client operates on files during a session with a file server. A session is an established connection for a sequence of requests and responses between the client and server. During this session a number of items of data may be maintained by the parties such as the set of open files and their clients, file descriptors and handles, current position pointers, mounting information, lock status, session keys, caches and buffers. This information is required partly to avoid repeated authentication protocols for each request and repeated directory lookup operations for satisfying each file access during the session. The information may reside partly in clients or servers or both.

 

How state information is divided affects the performance and management of the distributed file system. For example, file position pointers, mounting information (as hints) and caches are more naturally maintained by the client. The session key is shared between both client and server. Locks represent global information shared by all clients, so it seems most suitable for the server to manage this information.

 

A file server is said to be stateful if it maintains internally some of the state information relating to current sessions, and stateless if it maintains none at all. The more state information that is held about a connection, the more flexibility there is in exercising control over it. However, stateful servers are generally avoided in distributed systems. A stateless server makes it much easier to implement a fault-tolerant service than one that has to keep track of and recover state information. Client crashes have no effect on stateless servers, and crashes of stateless servers are easy to recover from: clients perceive them only as longer response delays (or temporary unresponsiveness) and their sessions are not disrupted.

 

Each file request to a stateless server must contain full information about the file handle, position pointer, session key and other necessary information.  In addition, the following issues must be addressed:-

 

Idempotency: How does the server deal with repeated client requests caused by previous failures (non-responsiveness) of the server? One solution is to make all requests idempotent, meaning that a request can be executed any number of times with the same effect. For example, NFS provides only idempotent services such as read-a-block or write-value-to-block, but not append-a-block. Not all services can be made idempotent, for example lock management. In such cases the client attaches a sequence number to each request so that the server can detect duplicate or out-of-sequence requests; these sequence numbers must be remembered by the server. The need for sequence numbers at the client and server level may be eliminated if RPC communication is used over a reliable, connection-oriented TCP transport service. The RPC call mechanism may offer at-least-once, at-most-once or maybe semantics.
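
The difference between idempotent and non-idempotent operations, and the use of sequence numbers for the latter, can be sketched as follows; the class and method names are illustrative, not NFS interfaces.

    # Sketch: write-at-offset can be retried safely, append cannot;
    # a lock service detects retries using per-client sequence numbers.

    class StatelessServer:
        def __init__(self):
            self.files = {}                       # name -> bytearray

        def write_at(self, name, offset, data):
            # Idempotent: repeating the request leaves the file in the same state.
            buf = self.files.setdefault(name, bytearray())
            if len(buf) < offset:
                buf.extend(b"\0" * (offset - len(buf)))
            buf[offset:offset + len(data)] = data

        def append(self, name, data):
            # Not idempotent: a retried request appends the data twice.
            self.files.setdefault(name, bytearray()).extend(data)

    class LockServer:
        def __init__(self):
            self.locks = {}                       # file name -> holder
            self.last_seq = {}                    # client id -> highest sequence seen

        def acquire(self, client_id, seq, name):
            if self.last_seq.get(client_id, -1) >= seq:
                return "duplicate or stale request ignored"
            self.last_seq[client_id] = seq
            if name in self.locks:
                return "already locked"
            self.locks[name] = client_id
            return "granted"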

 

File Locking: How can sharing clients implement a file locking mechanism on top of stateless file servers?

 

Session Key Management: How is the session key generated and maintained between the client and stateless file server? Can a one time session key be used for each file access?

 

Cache Consistency: Is the file server responsible for controlling cache consistency among clients? What sharing semantics are supported? Ideally, we would like updates to a file to complete instantaneously and the consistent results to be visible to other clients immediately, but in practice this is difficult to achieve and is often relaxed. Unix semantics propagate a write to a file and its copies immediately, so that reads return the latest value; subsequent accesses from the client that issued the write must wait for the write to complete. The primary objective is to maintain currency of the data. Transaction semantics store writes tentatively and commit them only when the consistency constraints are met at the end of the transaction; the primary objective is to maintain consistency of the data. Session semantics perform writes on a cached working copy of the file, which is made permanent at the end of the session; the primary objective is to maintain efficiency of data access.
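
The difference between Unix semantics and session semantics can be sketched with two small classes; the names are illustrative and the "network" is a single process.

    # Sketch contrasting immediate visibility (Unix semantics) with
    # visibility at session end (session semantics).

    class SharedFile:
        """Unix semantics: every write is immediately visible to all readers."""
        def __init__(self):
            self.data = b""
        def write(self, data):
            self.data = data                        # other clients see this at once
        def read(self):
            return self.data

    class SessionFile:
        """Session semantics: writes go to a private working copy until close."""
        def __init__(self, shared):
            self.shared = shared
            self.working_copy = shared.read()       # snapshot taken at open
        def write(self, data):
            self.working_copy = data                # invisible to other clients for now
        def close(self):
            self.shared.write(self.working_copy)    # changes become permanent at close

    master = SharedFile()
    session = SessionFile(master)
    session.write(b"draft")
    assert master.read() == b""                     # not yet visible under session semantics
    session.close()
    assert master.read() == b"draft"                # visible once the session ends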

 

For practical purposes, it seems some minimal state information may have to be maintained by the server.