CS402 Parallel
and Distributed Systems
Dermot Kelly
Introduction
Parallel computing is the
simultaneous execution of the same task (split up and specially adapted)
on multiple processors in order to obtain results faster. The idea is based on
the fact that the process of solving a problem can usually be divided into
smaller tasks, which may be carried out simultaneously with some coordination.
The terms parallel processor architecture
or multiprocessing architecture are sometimes used for a computer with
more than one processor, available for processing. Systems with thousands of
such processors are known as massively parallel. Recent multicore
processors (chips with more than one processor core), such as Intel’s Pentium D,
dual-core Xeon and Itanium, and AMD’s dual-core Opteron and Athlon 64 X2, are
commercial examples which bring parallel computing to the desktop
without having to install multiple processor chips. Current software, which is
largely written using sequential programming languages, needs to be adapted or
rewritten to make effective use of modern parallel hardware which can support
true multiprocessing within individual applications.
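The split-and-coordinate idea above can be sketched in Python: divide the input into chunks, let each worker process compute a partial result independently, then combine the partial results. The chunk size, worker count, and the choice of summation as the task are arbitrary choices for illustration.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Worker task: sum one sub-range independently of the others."""
    return sum(chunk)

def parallel_sum(data, workers=4):
    """Split data into chunks, sum each chunk in a separate process,
    then combine the partial results (the coordination step)."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))  # 499500, same as sum(range(1000))
```

The answer is identical to the sequential sum; only the distribution of the work changes.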
While a system of n parallel processors is
less efficient than one n-times-faster processor, the parallel system is
often cheaper to build. For tasks which require very large amounts of
computation, have time constraints on completion and especially for
those which can be divided into n execution threads, parallel
computation can be an excellent solution. In practice, however, it is often
difficult to achieve the anticipated linear speedup as more processors are
added, due to the interplay of cache coherence, interprocess synchronisation
and communication overheads.
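One reason linear speedup is elusive is captured by Amdahl's law: if a fraction s of a program is inherently serial, then n processors can give a speedup of at most 1 / (s + (1 - s)/n), and coherence and synchronisation costs push real programs below even that bound. A quick calculation:

```python
def amdahl_speedup(serial_fraction, n):
    """Upper bound on speedup with n processors when serial_fraction
    of the work cannot be parallelised (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

# Even with only 5% serial work, 64 processors give at most ~15.4x, not 64x,
# and no number of processors can ever push the speedup past 1/0.05 = 20x.
print(round(amdahl_speedup(0.05, 64), 1))  # 15.4
```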
Most high performance computing systems, also
known as supercomputers, have parallel architectures.
Multiprocessing
commonly refers to the use of multiple independent processors within a single
system. Tightly coupled multiprocessor systems contain multiple CPUs that
are connected at the bus level. These CPUs may have access to a central shared
memory, or may participate in a memory hierarchy with both local and shared
memory. Loosely coupled multiprocessor systems (often referred to as
clusters) are based on multiple standalone single or dual processor commodity
computers which are highly integrated via a high speed intercommunication
system (Gigabit Ethernet is common). A Linux Beowulf cluster is an example of a
loosely coupled system.
Tightly coupled systems perform better, due to
faster memory access and interprocessor communication, and are physically
smaller and use less power than loosely coupled systems, but historically they
have been more expensive up front and do not retain their value as well. Nodes
in a loosely coupled system can live a second life as desktops upon retirement
from the cluster.
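The shared-memory style of coordination that tightly coupled systems depend on can be illustrated (as a software analogy, not a hardware model) with threads, which share one address space much as bus-level CPUs share central memory; the lock below stands in for the synchronisation needed to keep concurrent updates consistent.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    """Each thread updates the shared counter; the lock prevents lost
    updates when two threads read-modify-write at the same time."""
    global counter
    for _ in range(increments):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 with the lock; without it, updates can be lost
```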
A Distributed System is composed of a
collection of independent physically (and geographically) separated computers
that do not share physical memory or a clock. Each processor has its own local
memory and the processors communicate using local and wide area networks. The
nodes of a distributed system may be of heterogeneous architectures.
A Distributed Operating System attempts
to make this architecture seamless and transparent to the user to facilitate
the sharing of heterogeneous resources in an efficient, flexible and robust
manner. Its aim is to shield the user from the complexities of the architecture
and make it appear to behave like a timeshared centralised environment.
Distributed
Operating Systems offer a number of potential benefits over centralised
systems. The availability of multiple processing nodes means that load can be
shared or balanced across all processing elements with the objective of increasing throughput and resource
efficiency. Data and processes can be replicated at a number of different
sites to compensate for failure of
some nodes or to eliminate bottlenecks.
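As a toy sketch of load sharing over replicated nodes (the node names and the representation of a "job" are invented for the example), a dispatcher can assign work round-robin across whichever nodes are still up, so the failure of one node redirects its share of the load rather than stopping the system:

```python
from itertools import cycle

class Dispatcher:
    """Toy round-robin load distributor: spreads incoming jobs over a set
    of replicated worker nodes, skipping nodes marked as failed."""
    def __init__(self, nodes):
        self.up = set(nodes)
        self._ring = cycle(nodes)

    def mark_down(self, node):
        self.up.discard(node)

    def assign(self, job):
        if not self.up:
            raise RuntimeError("no live nodes")
        # Walk the ring until a live node is found.
        for node in self._ring:
            if node in self.up:
                return (job, node)

d = Dispatcher(["n1", "n2", "n3"])
d.mark_down("n2")
print([d.assign(j)[1] for j in range(4)])  # ['n1', 'n3', 'n1', 'n3']
```

A real system would weight assignments by measured load rather than cycling blindly, but the principle is the same.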
A well designed system will be able to accommodate
growth in scale by providing mechanisms for distributing control and data.
Communication is the central issue for
distributed systems as all process interaction depends on it. Exchanging
messages between different components of the system incurs delays due to data
propagation, execution of communication protocols and scheduling. Communication
delays can lead to inconsistencies between different parts of the system at any
given instant, making it difficult to gather global information for decision
making and to distinguish between a delayed message and a failed component.
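The delay-versus-failure ambiguity can be made concrete with a toy heartbeat monitor (the timeout and timestamps below are illustrative): a node whose heartbeat is overdue can only be suspected of failure, because a slow link and a crashed node look identical to the observer.

```python
import time

class HeartbeatMonitor:
    """Toy failure detector: a node that has not sent a heartbeat within
    `timeout` seconds is *suspected* failed. The suspicion can be wrong,
    since a delayed heartbeat and a dead node are indistinguishable."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspected(self, node, now=None):
        now = time.monotonic() if now is None else now
        # A node never heard from is suspected immediately.
        return now - self.last_seen.get(node, float("-inf")) > self.timeout

m = HeartbeatMonitor(timeout=2.0)
m.heartbeat("n1", now=10.0)
print(m.suspected("n1", now=11.0))  # False: within the timeout
print(m.suspected("n1", now=13.0))  # True: failed, or merely slow?
```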
Fault tolerance is an important issue for
distributed systems. Faults are more likely to occur in distributed systems
than in centralised ones because of the presence of communication links and a
greater number of processing elements, any of which can fail. The system must
be capable of reinitialising itself to a state where the integrity of the data
and the state of ongoing computation are preserved, with at worst some
performance degradation.
A distributed
architecture consisting of heterogeneous hardware with the complexity problems
hinted at above is a potentially very unfriendly platform for a user. The principle of transparency is used at
many levels in the design of a distributed system to mask implementation
details and complexities from the user. This ranges from providing the ability
to access local and remote devices in a uniform way, independent of location,
to automated data and process replication, load balancing and recovery.
The
goal of a distributed operating system is to provide a high-performance and
robust computing environment in which the user needs the least possible
awareness of the management and control of distributed resources, but retains
the flexibility to manipulate the environment if required.
An Engineering Perspective – When is a distributed
architecture preferable over a centralised one?
You
will be familiar with the Software Development Process outlined below.
During
the requirements analysis phase of the development, a number of system
properties will be identified. These properties may be classified as functional
and non-functional components of the design.
The
functional requirements are further refined and grouped into related functions
which are then localised in particular components or modules of the system.
Non-functional
requirements are concerned with the quality of the system and are more
difficult to attribute to particular parts of the system. These requirements
are of a more global nature.
Examples
of such non-functional requirements include Scalability of the design; Openness,
facilitating understandable and easy-to-use interfaces; the ability to
accommodate Heterogeneity, perhaps in the hardware platform or software
language; Resource Sharing; and Fault Tolerance.
It
is the non-functional requirements of this kind which lead to the adoption of a
distributed architecture.
Remember
that centralised systems are simpler and cheaper to build, so preference should
be given to a centralised system if a distributed architecture can be avoided.
However, many of the non-functional requirements listed above cannot be
achieved properly in a centralised system.
This
course will address issues in the design of parallel and distributed systems
focusing on:

- Architectural Models, Software System Models, Models of Synchrony
- Processes and Threads, Synchronisation and Interprocess Communication
- Load Distribution and Scheduling Algorithms
- Event Ordering and Time Services
- Atomic Transactions, Concurrency, Recovery
- Distributed Shared Memory, File Systems, Database Systems
- Fault Tolerance
- Security
- Case Studies of Parallel Programming Environments & Distributed Operating Systems