Introduction
When adapting scientific algorithms to high-performance computing devices, that is, parallel, shared-memory, multi-core systems, fast iteration cycles are hard to come by. The highest-priority task is therefore to set up an empirical model that predicts performance for a given hardware architecture and algorithm.
In this blog-like article, I intend to summarize, step by step, the knowledge required to build such performance models, drawing on a thesis (Anthony Joseph, 2009).
Background knowledge
This section intends to fill the knowledge gap for those who are not familiar with the technical details of the hardware. Readers who are already experts in the field can skip this section and move on to the next.
One should consult Hennessy and Patterson’s texts (e.g., *Computer Organization and Design*) for details.
Microprocessor
A microprocessor implements a Central Processing Unit (CPU) on an integrated circuit. The CPU operates on instructions defined by an Instruction Set Architecture (ISA), for example x86, ARM, or RISC-V. Its operation can be summarized as a loop through which every instruction passes (a toy sketch of this loop follows the list):
- fetch: the next instruction is fetched from main memory
- decode: the instruction is decoded and the data it requires are fetched
- execute: the CPU carries out the operation specified by the instruction
- write back: the result of the operation is stored back, and the instruction is retired
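To make the loop concrete, here is a minimal C sketch of a fetch-decode-execute interpreter for a hypothetical four-instruction toy ISA. The opcodes and memory layout are invented purely for illustration; real ISAs are far richer:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical toy ISA, for illustration only:
   opcode 0 = LOAD  (acc = mem[operand])
   opcode 1 = ADD   (acc += mem[operand])
   opcode 2 = STORE (mem[operand] = acc)
   opcode 3 = HALT */
typedef struct { uint8_t opcode; uint8_t operand; } Instr;

int main(void) {
    Instr program[] = { {0, 0}, {1, 1}, {2, 2}, {3, 0} };
    int mem[4] = { 2, 3, 0, 0 };
    int acc = 0, pc = 0, running = 1;

    while (running) {
        Instr ins = program[pc++];      /* fetch: read the instruction at PC */
        switch (ins.opcode) {           /* decode: select the operation */
        case 0: acc = mem[ins.operand]; break;  /* execute: load into accumulator */
        case 1: acc += mem[ins.operand]; break; /* execute: add */
        case 2: mem[ins.operand] = acc; break;  /* write back: store to memory */
        default: running = 0; break;            /* HALT */
        }
    }
    printf("mem[2] = %d\n", mem[2]);    /* prints 5 (= 2 + 3) */
    return 0;
}
```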
The von Neumann architecture means that both data and instructions are stored in main memory. Main memory is separate from the CPU, so data must be explicitly moved from main memory into the on-chip registers.
This explicit movement of data is normally the rate-determining step, i.e. performance largely depends on how fast we can move the data. To solve this problem to some extent, computer scientists proposed placing cache memory between the CPU and main memory.
In essence, cache memory exploits locality, of which there are two types: spatial and temporal. Spatial locality means that data near recently accessed data are likely to be needed soon; because memory is fetched in contiguous chunks (cache lines), sequential access needs fewer memory operations. Temporal locality means that recently used data and instructions are likely to be reused, so keeping them in the cache avoids repeated trips to main memory.
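The effect of spatial locality is easy to observe. The following minimal C sketch (array size and timing method are arbitrary choices for illustration) sums the same matrix twice: once in row-major order, which walks memory sequentially, and once in column-major order, which jumps N floats at a time. On typical hardware the row-major version is several times faster:

```c
#include <stdio.h>
#include <time.h>

#define N 4096
static float a[N][N];

/* Row-major traversal: consecutive accesses touch consecutive
   addresses, so each fetched cache line is fully used. */
static float sum_rows(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: consecutive accesses are N floats apart,
   so nearly every access misses the cache. */
static float sum_cols(void) {
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    clock_t t0 = clock();
    volatile float s1 = sum_rows();
    clock_t t1 = clock();
    volatile float s2 = sum_cols();
    clock_t t2 = clock();
    printf("row-major: %.3fs, column-major: %.3fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    (void)s1; (void)s2;
    return 0;
}
```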
Parallelism
- Data-level parallelism (DLP): a form of parallelism in which a single operation manipulates a whole set of data at once. An example is Single Instruction, Multiple Data (SIMD); see the SIMD sketch after this list.
- Instruction-level parallelism (ILP): a form of parallelism that maximizes the utilization of hardware resources by increasing the number of instructions in flight at any given time. Instructions may be executed out of order (while typically retiring in order) to maximize ILP; an ILP sketch also follows the list. Common techniques include:
  - Superscalar execution
  - Instruction pipelining
  - Out-of-order execution
  - Branch prediction and speculative execution
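As a concrete DLP example, here is a minimal C sketch using x86 AVX intrinsics (it assumes an AVX-capable CPU and a compiler flag such as `gcc -mavx`); a single vector instruction adds eight floats at a time:

```c
#include <immintrin.h>
#include <stdio.h>

/* Add two float arrays eight elements at a time with AVX intrinsics. */
void add_avx(const float *x, const float *y, float *out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* load 8 floats */
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(vx, vy)); /* 8 adds, 1 instruction */
    }
    for (; i < n; i++)                        /* scalar tail for leftovers */
        out[i] = x[i] + y[i];
}

int main(void) {
    float x[10], y[10], out[10];
    for (int i = 0; i < 10; i++) { x[i] = (float)i; y[i] = 2.0f * i; }
    add_avx(x, y, out, 10);
    printf("out[9] = %.1f\n", out[9]);        /* prints 27.0 (= 9 + 18) */
    return 0;
}
```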
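And as a concrete ILP illustration, the following C sketch sums an array with four independent accumulators (the factor of four is an arbitrary choice for illustration). Because the four additions in the loop body have no mutual dependencies, a pipelined, superscalar core can keep several of them in flight simultaneously, whereas a single-accumulator loop forms one serial dependency chain:

```c
#include <stdio.h>

/* Independent dependency chains expose ILP to the hardware. */
float sum_ilp(const float *x, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];       /* these four adds do not depend  */
        s1 += x[i + 1];   /* on one another, so the core    */
        s2 += x[i + 2];   /* can issue them in parallel     */
        s3 += x[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++) s += x[i];   /* scalar tail for leftovers */
    return s;
}

int main(void) {
    float x[10];
    for (int i = 0; i < 10; i++) x[i] = 1.0f;
    printf("%.1f\n", sum_ilp(x, 10));   /* prints 10.0 */
    return 0;
}
```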