Preface

Chapter 1: Parallel Machines and Computations
  1.1 The Evolution of Parallel Architectures
    1.1.1 Parallelism in Sequential Computers
    1.1.2 Vector or SIMD Computers
    1.1.3 Multiprocessors or MIMD Computers
  1.2 Interconnection Networks
  1.3 Application of Architectural Parallelism
  1.4 Getting Started in SIMD and MIMD Programming
  1.5 Parallelism in Algorithms
  1.6 Conclusion
  1.7 Bibliographic Notes

Chapter 2: Potential for Parallel Computations
  2.1 Parameters Characterizing Algorithm Parallelism
  2.2 Prefix Problem
  2.3 Parallel Prefix Algorithms
    2.3.1 Upper/Lower Parallel Prefix Algorithm
    2.3.2 Odd/Even Parallel Prefix Algorithm
    2.3.3 Ladner and Fischer's Parallel Prefix
  2.4 Characterizing Algorithm Behavior for Large Problem Size
  2.5 Programming Parallel Prefix
  2.6 Speedup and Efficiency of Parallel Algorithms
  2.7 The Performance Perspective
    2.7.1 Factors That Influence Performance
    2.7.2 A Simple Performance Model--Amdahl's Law
    2.7.3 Average Execution Rate
  2.8 Conclusion
  2.9 Bibliographic Notes

Chapter 3: Vector Algorithms and Architectures
  3.1 Vector and Matrix Algorithms
  3.2 A Vector Architecture--Single Instruction Multiple Data
  3.3 An SIMD Instruction Set
    3.3.1 Registers and Memories of an SIMD Computer
    3.3.2 Vector, Control Unit, and Cooperative Instructions
    3.3.3 Data-Dependent Conditional Operations
    3.3.4 Vector Length and Strip Mining
    3.3.5 Routing Data Among the PEs
  3.4 The Prime Memory System
  3.5 Use of the PE Index to Solve Storage Layout Problems
  3.6 SIMD Language Constructs--Fortran 90
    3.6.1 Arrays and Array Sections
    3.6.2 Array Assignment and Array Expressions
    3.6.3 Fortran 90 Array Intrinsic Functions
    3.6.4 Examples of SIMD Operations in Fortran 90
  3.7 Pipelined SIMD Vector Computers
    3.7.1 Pipelined SIMD Processor Structure
      Processor/Memory Interaction
      Number and Types of Pipelines
      Implementation of Arithmetic
    3.7.2 The Memory Interface of a Pipelined SIMD Computer
    3.7.3 Performance of Pipelined SIMD Computers
  3.8 Vector Architecture Summary
  3.9 Bibliographic Notes

Chapter 4: MIMD Computers or Multiprocessors
  4.1 Shared Memory and Message-Passing Architectures
    4.1.1 Mixed-Type Multiprocessor Architectures
    4.1.2 Characteristics of Shared Memory and Message Passing
    4.1.3 Switching Topologies for Message Passing Architectures
    4.1.4 Direct and Indirect Networks
    4.1.5 Classification of Real Systems
  4.2 Overview of Shared Memory Multiprocessor Programming
    4.2.1 Data Sharing and Process Management
    4.2.2 Synchronization
    4.2.3 Atomicity and Synchronization
    4.2.4 Work Distribution
    4.2.5 Many Processes Executing One Program
  4.3 Shared Memory Programming Alternatives and Scope
    4.3.1 Process Management--Starting, Stopping, and Hierarchy
    4.3.2 Data Access by Parallel Processes
    4.3.3 Work Distribution
    4.3.4 Multiprocessor Synchronization
      Atomicity
      Hardware and Software Synchronization Mechanisms
      Fairness and Mutual Exclusion
  4.4 A Shared Memory Multiprocessor Programming Language
    4.4.1 The OpenMP Language Extension
      Execution Model
      Process Control
      Parallel Scope of Variables
      Work Distribution
      Synchronization and Memory Consistency
    4.4.2 The OpenMP Fortran Applications Program Interface (API)
      Constructs of the OpenMP Fortran API
    4.4.3 OpenMP Fortran Examples and Discussion
  4.5 Pipelined MIMD--Multithreading
  4.6 Summary and Conclusions
  4.7 Bibliographic Notes

Chapter 5: Distributed Memory Multiprocessors
  5.1 Distributing Data and Operations Among Processor/Memory Pairs
  5.2 Programming with Message Passing
    5.2.1 The Communicating Sequential Processes (CSP) Language
    5.2.2 A Distributed Memory Programming Example: Matrix Multiply
  5.3 Characterization of Communication
    5.3.1 Point-to-Point Communications
    5.3.2 Variable Classes in a Distributed Memory Program
    5.3.3 High-Level Communication Operations
    5.3.4 Distributed Gauss Elimination with High-Level Communications
    5.3.5 Process Topology Versus Processor Topology
  5.4 The Message Passing Interface, MPI
    5.4.1 Basic Concepts in MPI
      Communicator Structure
      The Envelope
      The Data
      Point-to-Point Communication Concepts
      Collective Communications Concepts
    5.4.2 An Example MPI Program--Matrix Multiplication
  5.5 Hardware Managed Communication--Distributed Cache
    5.5.1 Cache Coherence
    5.5.2 Shared Memory Consistency
  5.6 Conclusion--Shared Versus Distributed Memory Multiprocessors
  5.7 Bibliographic Notes

Chapter 6: Interconnection Networks
  6.1 Network Characteristics
  6.2 Permutations
  6.3 Static Networks
    6.3.1 Mesh
    6.3.2 Ring
    6.3.3 Tree
    6.3.4 Cube Networks
    6.3.5 Performance
  6.4 Dynamic Networks
    6.4.1 Bus
    6.4.2 Crossbar
    6.4.3 Multistage Interconnection Networks (MINs)
      Benes Network
      Butterfly Network
      Omega Network
    6.4.4 Combining Networks--Mutual Exclusion Free Synchronization
    6.4.5 Performance
  6.5 Conclusion
  6.6 Bibliographic Notes

Chapter 7: Data Dependence and Parallelism
  7.1 Discovering Parallel Operations in Sequential Code
  7.2 Variables with Complex Names
    7.2.1 Nested Loops
    7.2.2 Variations on the Array Reference Disambiguation Problem
  7.3 Sample Compiler Techniques
    7.3.1 Loop Transformations
    7.3.2 Loop Restructuring
    7.3.3 Loop Replacement Transformations
    7.3.4 Anti- and Output Dependence Removal Transformations
  7.4 Data Flow Principles
    7.4.1 Data Flow Concepts
    7.4.2 Graphical Representation of Data Flow Computations
    7.4.3 Data Flow Conditionals
    7.4.4 Data Flow Iteration
    7.4.5 Data Flow Function Application and Recursion
    7.4.6 Structured Values in Data Flow--Arrays
  7.5 Data Flow Architectures
    7.5.1 The MIT Static Data Flow Architecture
    7.5.2 Dynamic Data Flow Computers
      Manchester Data Flow Computer
      The MIT Tagged-Token Data Flow Machine
    7.5.3 Issues to Be Addressed by Data Flow Machines
  7.6 Systolic Arrays
  7.7 Conclusion
  7.8 Bibliographic Notes

Chapter 8: Implementing Synchronization and Data Sharing
  8.1 The Character of Information Conveyed by Synchronization
  8.2 Synchronizing Different Kinds of Cooperative Computations
    8.2.1 One Producer with One or More Consumers
    8.2.2 Global Reduction
    8.2.3 Global Prefix
    8.2.4 Cooperative Update of a Partitioned Structure
    8.2.5 Managing a Shared Task Set
    8.2.6 Cooperative List Manipulation
    8.2.7 Parallel Access Queue Using Fetch&add
    8.2.8 Histogram--Fine Granularity, Data-Dependent Synchronization
  8.3 Waiting Mechanisms
    8.3.1 Hardware Waiting
    8.3.2 Software Waiting
    8.3.3 Multilevel Waiting
  8.4 Mutual Exclusion Using Atomic Read and Write
  8.5 Proving a Synchronization Implementation Correct
    8.5.1 Implementing Produce/Consume Using Locks
    8.5.2 Temporal Logic
    8.5.3 Proof of Correctness
  8.6 Alternative Implementations of Synchronization--Barrier
    8.6.1 Features of Barrier Synchronization
    8.6.2 Characterization of Barrier Implementations
  8.7 Conclusion
  8.8 Bibliographic Notes

Chapter 9: Parallel Processor Performance
  9.1 Amdahl's Law Revisited
    9.1.1 The Effect of Work Granularity on Amdahl's Law
    9.1.2 Least Squares Estimation of Amdahl's Law Parameters
  9.2 Parameterized Execution Time
    9.2.1 Pipelined Vector Machine Performance
    9.2.2 Pipelined Multiprocessor Performance
      Program Sections with Restricted Parallelism
      Programs Requiring Critical Sections for Parallel Execution
      Granularity Correction to the Critical Section Model
    9.2.3 Multiple Pipelined Multiprocessors
  9.3 Performance of Barrier Synchronization
    9.3.1 Accounting for Barrier Performance
    9.3.2 Instrumentation for Barrier Measurement
    9.3.3 Examples of Barrier Performance Measurement
  9.4 Statistical Models for Static and Dynamic Parallel Loops
    9.4.1 Dynamic Scheduling Model
      Dynamically Scheduled Iterations of Equal Length
      Dynamically Scheduled Iterations of Different Lengths
    9.4.2 Static Scheduling Model
      Prescheduled Iterations of Equal Length
      Prescheduled Iterations of Different Lengths
    9.4.3 Comparison with Experimental Results
  9.5 Conclusions
  9.6 Bibliographic Notes

Chapter 10: Temporal Behavior of Parallel Programs
  10.1 Temporal Characterization of Cache Behavior
    10.1.1 A Temporal Locality Metric for Cache Behavior
    10.1.2 Example Application of the Locality Metric to Bubble Sort
  10.2 Read Sharing in Multiprocessors with Distributed Caches
    10.2.1 A Simple Example of Read Sharing
    10.2.2 The KSR-1 Architecture
    10.2.3 Read Multiplicity Metric
    10.2.4 Experiments
    10.2.5 Programmed Poststore and Prefetch
  10.3 Message Waiting in Message Passing Multiprocessors
  10.4 Conclusion
  10.5 Bibliographic Notes

Chapter 11: Parallel I/O
  11.1 The Parallel I/O Problem
    11.1.1 Data Dependence and I/O
    11.1.2 I/O Format Conversion
    11.1.3 Numerical Examples of I/O Latency and Bandwidth Requirements
  11.2 Hardware for Parallel I/O
    11.2.1 Control of the Primary Memory Side of the Transfer
    11.2.2 I/O Channel Concurrency
    11.2.3 Peripheral Device Parallelism
  11.3 Parallel Access Disk Arrays--RAID
  11.4 Parallel Formatted I/O in Shared Memory Multiprocessors
    11.4.1 Parallel Input Using C I/O Routines fread() and sscanf()
      Application Independent Input Operations
      Application Dependent Input Operations
    11.4.2 Parallel Output Using C I/O Routines sprintf() and fwrite()
      Application Dependent Output Code
      Application Independent Part of the Output Code
  11.5 Collective I/O in Multiprocessors--MPI-IO
    11.5.1 MPI-2 I/O Concepts
    11.5.2 MPI-IO Examples
      Cyclically Mapped Read from a Striped Disk Array
      MPI-IO for Distributed Matrix Multiply
  11.6 Conclusions
  11.7 Bibliographic Notes

Appendix A: Routines of the MPI Message Passing Library
  A.1 Point-to-Point Communications Routines
  A.2 Collective Communications Routines
  A.3 MPI Data Types and Constructors
  A.4 Communicators, Process Groups, and Topologies
  A.5 MPI Environment and Error Handling
  A.6 Summary and MPI-2 Extensions

Appendix B: Synchronization Mechanisms
  B.1 Hardware Level Synchronization
  B.2 Language Level Synchronization
  B.3 Waiting Mechanism

Bibliography

Index