CS61C Review Doc
Created by Yunhao Cao (Github@ToiletCommander) in Fall 2021 for UC Berkeley CS61C.
Reference Notice: Material largely derived from Prof. Wawrzynek's & Weaver's lecture slides; some ideas were borrowed from Wikipedia & CS-Illustrated Berkeley.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
C and Memory Representation
Number Representation
Integer
Convert Number Radices
We have the number 159 in decimal; how do we convert it into binary?
Thus 159 in binary is 0b010011111 (I left the leftmost 0 there to avoid confusion with negative numbers).
Then what is 159 in hexadecimal? I like to view it starting from the binary representation:
159 = 0b1001.1111
You see that I removed the leading 0 because each group of 4 bits corresponds to one hexadecimal character.
159 = 0x9F
since 0b1001 = 9 in decimal, and 9 in dec. corresponds to 9 in hex, and 0b1111 = 15 in decimal, and 15 corresponds to F in hex.
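A quick sanity check of these conversions in C (a small illustration I added; it is not from the lecture material):

#include <stdio.h>

int main(void) {
    unsigned int n = 159;
    printf("%u = 0x%X\n", n, n);       // prints 159 = 0x9F
    for (int i = 8; i >= 0; i--)       // print 9 bits, MSB first
        printf("%u", (n >> i) & 1u);   // prints 010011111
    printf("\n");
    return 0;
}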
Two's Complement
It's represented by:
MSB(bit) as a sign bit ⇒ 0 representing positive and 1 representing negative.
If MSB is 0, then just treat it as unsigned integer
If MSB is 1, then the number is -2^(n-1) + k, where n is the total number of bits and k is the rest of the bit positions interpreted as an unsigned integer.
This representation can take values between -2^(n-1) and 2^(n-1) - 1.
The -1 on the right bound accounts for the existence of 0.
For example, if I have a 16-bit integer, a value of 0b1111111111111111 (16 ones) would be interpreted as -2^15 + (2^15 - 1), so its value is -1.
Two easier ways to negate a two's complement value:
- flip the bits first and add 1 to the final result.
- minus the number by 1 first and then flip the bits.
The two methods both work!
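A tiny C check of both tricks (my own illustration; the int16_t casts just keep the value at 16 bits):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t x = 159;
    int16_t neg1 = (int16_t)(~x + 1);      // flip the bits, then add 1
    int16_t neg2 = (int16_t)(~(x - 1));    // subtract 1, then flip the bits
    printf("%d %d %d\n", -x, neg1, neg2);  // all three print -159
    return 0;
}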
Signed Magnitude
MSB(bit) is a sign bit ⇒ 0 representing positive and 1 representing negative
The rest bit positions can be interpreted as unsigned integer representing the magnitude(absolute value).
Bias
Actual value is the binary(unsigned) value plus a fixed bias
bias = -127 ⇒ actual number is bin value with -127 added to it
0b00000000 → -127
0b11111111 → 128
Integer Multiplication
m bits * n bits = m + n bit product
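For example, with m = n = 4: the largest possible product is 15 × 15 = 225 = 0b11100001, which needs the full 8 bits.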
Integer Division
Floating Point
A floating point number has the form a × r^e ⇒ a is called the mantissa, the dot is called the binary (or decimal, depending on the representation of the number) point, r is called the radix, or base, and e is called the exponent.
For example, 1 × 2^(-1) (= 0.5) has a mantissa of 1, a radix of 2 and an exponent of -1.
The IEEE floating point standard was developed by Prof. Kahan at UC Berkeley!
We have:
- 1 bit for sign (0 ⇒ +, 1 ⇒ -)
- e bits for exponent (E) with an exponent bias of -(2^(e-1) - 1), so the actual exponent ranges from -(2^(e-1) - 1) to 2^(e-1)
- (n-e-1) bits for fraction(F)
Single Precision Standard(32b) ⇒ 1 bit for sign, 8 bits for exponent, 23 bits for significand, exponent bias of -127
Double Precision Standard(64b) ⇒ 1 bit for sign, 11 bits for exponent, 52 bits for fraction, exponent bias of -1023
We get 1 extra bit of precision because the leading 1 is implicit (because if the leading mantissa bit were 0, we could simply decrement the exponent until it is 1...)
Common Formula (for single precision): value = (-1)^S × (1 + Significand × 2^(-23)) × 2^(Exponent - 127)
Here we see the Exponent is a biased integer representation, while the significand is an unsigned integer representation.
Special Cases
| S | Exponent | Significand | Description |
| --- | --- | --- | --- |
| 0 | 00000000 | 00000000000000000000000 | Positive zero - the number is too small to represent and is either zero or somewhere between 0 and our smallest positive number |
| 1 | 00000000 | 00000000000000000000000 | Negative zero - the number is too small to represent and is either zero or somewhere between 0 and our smallest-magnitude negative number |
| X | 11111111 | 00000000000000000000000 | +/- Infinity |
| X | 11111111 | Nonzero | NaN |
| X | 00000000 | Nonzero | Denorm |
NaN?
NaN (Not a Number) results from invalid operations such as 0/0 or ∞ − ∞; any pattern with an all-ones exponent and a nonzero significand is a NaN.
Denorm - Denormalized Numbers
All zero in exponent bit positions, nonzero in significand.
No implied leading 1, but implicit exponent = 1 + bias ⇒ -126 in 32b, -1022 in 64b ⇒ so the exponent behaves as if the exponent bits were 00000001, but the implicit leading 1 is deleted and replaced by an implicit leading 0.
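A small C sketch (my own illustration, assuming IEEE-754 single precision) that pulls apart the three fields of a float by reading its bits back through a union, the same trick shown in the union section later on:

#include <stdio.h>
#include <stdint.h>

union f2u { float f; uint32_t u; };

int main(void) {
    union f2u v = { .f = -0.15625f };            // -0.15625 = -1.01 (binary) x 2^-3
    unsigned sign = v.u >> 31;                   // 1 sign bit
    unsigned exp  = (v.u >> 23) & 0xFF;          // 8-bit biased exponent
    unsigned frac = v.u & 0x7FFFFF;              // 23-bit fraction
    printf("sign=%u exp=%u (actual %d) frac=0x%06X\n",
           sign, exp, (int)exp - 127, frac);     // sign=1 exp=124 (actual -3) frac=0x200000
    return 0;
}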
Pointer and Reference
int *p; //variable p is address of an int
p = &y; //assign address of y to p
z = *p; //assign value at address in p to z
addresses are unsigned integer values.
Managing Heap
- malloc(size) returns pointer to uninitialized memory
- calloc(number, size) returns pointer to zeroed memory
- free(pointer) frees allocated memory
- realloc(pointer, new_size) returns new pointer to resized memory
- Be careful that the new pointer may be different from the old pointer, so if there are other pointers pointing to the same block of memory, you're dead (they now dangle).
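A short sketch of the realloc pitfall described above (my own example):

#include <stdlib.h>

int main(void) {
    int *a = malloc(10 * sizeof(int));
    if (!a) return 1;
    int *alias = a;                           // another pointer into the same block
    int *b = realloc(a, 1000 * sizeof(int));  // the block may move to a new address
    if (!b) { free(a); return 1; }            // on failure the old block is still valid
    a = b;                                    // keep using only the returned pointer
    (void)alias;                              // 'alias' may now dangle - do not dereference it
    free(a);
    return 0;
}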
Memory Address
Each byte has an unsigned integer address
We commonly think in terms of words: 32b for a word in a 32b architecture, 64b for a word in a 64b architecture; a word is big enough to hold an address.
Types
typedef struct {
int x; //int memory aligned each 4 bytes
int y; //int memory aligned each 4 bytes
} Point;
Point p1;
Point *paddr = &p1;
//we can use p1.x or paddr->x
union foo{ //provides enough space for the largest element
int a;
char b;
union foo *c;
}
union foo f;
f.a = 0xDEADB33F; //treat f as an integer and store the value
f.c = &f; //treat f as a pointer and store the address of f itself.
Default alignment rules (32b architecture):
- char - a byte, no alignment needed
- short - 2 bytes, 1/2 word aligned,
- int - 4 bytes, word aligned
- pointers are same size as ints
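A quick way to see these rules in action (my own example; the exact numbers depend on the ABI, but this is the typical result):

#include <stdio.h>

struct padded {
    char c;   // 1 byte, then 3 bytes of padding so x can start on a 4-byte boundary
    int  x;   // 4 bytes, word aligned
    short s;  // 2 bytes, then 2 bytes of tail padding so the struct size is a multiple of 4
};

int main(void) {
    printf("%zu\n", sizeof(struct padded));   // typically prints 12, not 7
    return 0;
}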
Arrays
int ar[2];
int ar[] = {100, 200};
Array variable is simply a "pointer" to the first(0th) element
so ar[0] = *ar, ar[i] = *(ar + i)
Some weird thing:
#include <iostream>
int main(int argc, char** args){
int arr[2];
std::cout << arr << std::endl;
std::cout << &arr << std::endl;
}
prints the same address, so &arr and arr give the same value (though their types differ: int* vs. int(*)[2])
Computer Structures
Very Important Idea: Abstraction - make components into black boxes instead of knowing exactly how each of them work!
We will focus on synchronous digital systems this semester.
Synchronous means all operations and communications in the system are coordinated by a central clock. Compared to asynchronous system, which requires local coordination between communication of components, our synchronous systems will be much easier to design & debug.
Digital means we will represent all values using "discrete" values.
We use Binary in our system because it has good noise immunity, and Moore's law is only possible because of binary representation.
Binary Signals
Hardware signals are combined using primitive operators to implement the capabilities we need for executing ISA instructions
A TT(Truth Table) defines a function ⇒ enumerate each output value for each input combination
| a | b | a op b |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Bit Operations
a · b = a AND b
a + b = a OR b
a ⊕ b = a XOR b
brackets ⇒ determine calculation priority
!a or ~a or a with a line above = NOT a
Registers
CMOS register circuits in common use are "edge-triggered", meaning they'll take action based on the rising or falling edge of the clock. We'll assume rising edge for consistency.
1-bit register is called "flip-flop"
We know that registers "hold on" to d values until a change in clk causes the register to capture a new d value, hold on to that new d value, and output it on q.
We'll define the following properties of register:
- setup time (t_setup) - the time that d must be held stable before the clock edge
- clock-to-q time / register output delay (t_clk-to-q) - the delay from the clk edge until the output value changes
- hold time (t_hold) - the time that d must be held stable after the clock edge
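These three delays combine into the usual register-to-register timing constraints (standard relations, stated here for completeness): the clock period T must satisfy T ≥ t_clk-to-q + t_longest-combinational-path + t_setup, and hold is met when t_clk-to-q + t_shortest-combinational-path ≥ t_hold.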
Physical Limitations (Of Processors)
For CMOS....
- They leak when off
- They have finite resistance when on
- All circuit nodes have capacitance, so to change their voltage level we must displace charge first
- So for every logic gate, we'll have delay from input change to output change
Energy
To switch on/off every transistor, we see above that it takes energy to do so.
We know P = E / t, we know Q = CV for capacitors, and the energy stored in a capacitor is ½CV².
So switching power P_switching = α · C · V² · f
where:
α is the "activity factor", the average fraction of capacitance switching per cycle (~ number of nodes that switch)
C is the total chip capacitance to be switched
V is the operating voltage
f is the clock frequency
We can decrease the clock frequency to reduce power, but this doesn't improve energy efficiency because we would have to run longer to finish our computation.
We know energy is proportional to C and V².
But f_max ∝ V (we would charge capacitors more slowly if V is lowered).
We can improve energy efficiency by lowering the supply voltage and making up for the lost performance by using parallelism.
Capacitance C depends on technology, microarchitecture, and circuit details.
V is the supply voltage.
We want to both reduce capacitance and supply voltage to reduce energy per task
Energy efficiency is super important (key metric) in all computing devices because...
- for power-constrained systems (datacenter), need better energy efficiency to get more performance at same power.
- for energy-constrained systems (phone), need better energy efficiency to prolong battery life.
In recent years industry hasn't been able to reduce the supply voltage much, since reducing it further would mean increasing "leakage power", where transistor switches don't fully turn off.
Size of transistors and capacitance not shrinking as much as before - we are hitting the "power wall" (功耗墙)
Performance
- Latency(延迟) - execution time for each instruction
- Throughput(吞吐量) - total number of instructions executed per unit time
- Energy Efficiency(能效) - Energy per instruction
Iron Law of Processor Performance: Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
- Instructions / Program is determined by
- task specification
- algorithm (e.g. O(n^2) vs. O(n log n))
- Programming Language
- Compiler
- Instruction Set Architecture(ISA)
- Cycles / Instruction (CPI) is determined by
- ISA
- Processor Implemention (or microarchitecture)
- CPI (Clock Per Instruction)
- Pipelined Processors, CPI > 1
- Superscalar Processors, CPI < 1
- Time / Cycle (the clock period) is determined by
- processor microarchitecture
- technology (5nm vs. 14nm)
- supply voltage (lower voltage reduces transistor speed, but improves energy efficiency)
Pipelining 指令管线化
It increases throughput(by overlapping execution of multiple instructions) but can never improve latency
But we will eventually run into hazards
- Structural Hazard
- Two or more instructions in the pipeline compete for the same physical resource
- Solved by either
- Instructions take turns to use resource (which means some instructions have to stall)
- Add more hardware (Yeah, I have money!)
- e.g. Regfile Structural Hazard
- each instruction can read up to two operands in decode stage and write one value in writeback stage
- So avoid structural hazard by having two independent read ports and one independent write port
- So reads from one instruction and writes from another can happen simultaneously.
- e.g. Memory Access
- In DM and IM stage, Instruction and Data Memory are used simultaneously
- can be solved by using two separate memories, I$ and D$ (Indeed we would use two separate first-level caches)
- RISC ISAs are designed to avoid structural hazards
- at most one memory access per instruction
- limited operands per instruction
- Data Hazard
- Register Access
- We already have separate ports, but what happens if we write to the same register that we read in the same cycle?
- We can exploit high speed of register file (100ps), let WB update value first then ID reads new value ⇒ this is shown by the shading of the pipelining diagram
- However, if we're working in high-frequency designs, this might not be possible
- ALU Result
- Say we add 1 to t0 in first instruction, and add 2 to t0 in second instruction
- However, the value of t0 after first instruction would not have been written back by the first instruction until we reach the 5th cycle (when WB stage of the first instruction is finally active)
- So basically the second instruction is reading wrong register values when it reaches its register read stage
- We can stall the instruction but it reduces performance
- Or the compiler can try to arrange code to avoid hazards and stalls, but requires knowledge of the pipeline structure
- Or ⇒ We add Data Forwarding!
- ALU result from one pipeline to another
- This requires modifications in datapath as well as in control logic(See Below)
- Load
- There are cases when stalls are unavoidable
- Slot after a load ⇒ Load Delay Slot
- if we use the result of the load in the load delay slot, there's an unavoidable NOP ⇒ stall (repeat the dependent instruction) and forward
- but we can move an unrelated instruction into the load delay slot ⇒ no performance loss!
- Control Hazard
- When a branch statement is executed, the 3 lines of asm code after the branch statement are also loaded into the pipeline, and those statements could be executed regardless of the branch outcome if not turned into NOPs.
- Every taken branch in a simple pipeline costs 2 dead cycles
- So we use branch prediction(分支预测) to guess which way branch will go
- We will keep a branch prediction buffer / cache ⇒ small memory addressed by the lowest bits of PC
- During Instruction Decode ⇒ Look up whether branch was taken last time?
- If yes, compute PC + offset and fetch that
- If no, stick with PC + 4
- If branch hasn't been seen before
- assume forward branches are not taken, backward branches are taken
Superscalar Processor 超标量处理器
- We have multiple copies of the pipeline hardware per stage
- Multiple execution units for additional instruction level parallelism
- Performance benefit highly code dependent
- Start multiple instructions per clock cycle
- CPI < 1 (think about it: we have multiple datapaths running at the same time)
- since CPI < 1, we will use Instructions Per Cycle (IPC)
- Out-of-Order Execution 乱序执行
- Reorder instructions dynamically in hardware to reduce the impact of hazards, e.g. memory/cache misses.
Memory Cache
It is very important to notice the performance gap between CPU and DRAM, and it is usually impossible to buy RAM that is both BIG and FAST without spending a lot of money. The solution is to use a cache!
Big Idea is to use Locality 局部性 - The idea that memory access usually happens around the same place in a given period
Principle of Locality states that programs access small portion of address space at any instant of time (spatial locality) and repeatedly access that portion (temporal locality)
We have two types of locality
- Temporal Locality 时间局部性
- If memory location is referenced, then it will tend to be referenced again soon
- Spacial Locality 空间局部性
- If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon
So cache uses the fact that the program uses memory and exhibits those locality characteristics...
And it...
- Give illusion of speed of fastest memory with size of largest memory
- However, if you overwhelm the cache your performance may drop off a cliff.
- Now processor instead of going to the memory and ask for data it will go to the cache and ask for data
- Processor asks for data in 0x12F0
- cache checks if has copy of data at address 0x12F0
- If yes, return the data to processor
- If not, cache asks for 0x12F0 from memory and stores that value in the cache
- Then cache returns data to processor.
How Cache uses Processor address:
| Tag (MSB … n+o) | Set Index (n+o … o) | Block Offset (o … LSB) |
| --- | --- | --- |
| Remaining portion of the processor address | Bit length determined by the number of sets | Byte address within the block; bit length determined by the block size |
Organization of sets and blocks:
- Directly Mapped
- Associativity = 1
- # of Sets = # of Blocks
- Requires only 1 comparator
- Fully associative
- Associativity = # of blocks
- One set per cache ⇒ Fetched memory can go anywhere
- No index field, 1 comparator per block
- N-way Set Associative
- Associativity = N, N places for a block (every set contains N blocks)
- # of sets = number of blocks / N
- N comparators
Total Cache Capacity = Associativity × # of Sets × Block Size
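A worked example with made-up numbers: a 32 KiB, 4-way set-associative cache with 64 B blocks has 32 KiB / 64 B = 512 blocks and 512 / 4 = 128 sets, so with 32-bit addresses the block offset is 6 bits, the set index is 7 bits, and the tag is 32 − 7 − 6 = 19 bits.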
Replacement Policy:
When miss occurs, which way is a block selected for replacement?
- Least Recently Used (LRU): one that has been unused the longest
- Must track when each way's block was used relative to other blocks in the set
- Example simple "pseudo-LRU" implementation:
- Hardware replacement pointer points to one cache entry
- Whenever an access is made to the entry the pointer points to, move the pointer to the next entry
- Otherwise, don't move the pointer
- It's actually a "not-most-recently-used" policy
- Random Replacement
- Choose a random block and evict it.
Types of Cache Miss
- Compulsory (强制性失误), aka cold start / first reference miss
- First access to a block
- Capacity (空间性失误)
- Cache cannot contain all blocks accessed by the program
- Misses that would not occur with infinite cache
- Conflict (冲突性失误) aka collision miss
- Multiple memory locations mapped to same cache set
- Misses that would not occur with ideal fully associative cache
- Coherency (连贯性失误) ⇒ Only if sharing data between two processor cores
- Share a cache line between two processor cores
- every time one does a write the other will take a cache miss
- Even if writing to different parts of the cache line
- Everyone's reading is fine
Write policy:
- Cache Hit
- Write-through
- Write cache and write the memory
- Very slow, so include a "write buffer" to allow processor to continue once data is stored in the write buffer.
- Buffer will update the data in parallel with the processor.
- Write-back
- Write only to cache (dirty bit = 1) and only back to memory when the block has to be evicted from cache.
- Cache miss:
- No-write-allocate: only write to main memory
- Write-allocate (fetch on write): fetch into cache
Some extra information stored in cache
- Valid bit
- When a program starts, the cache does not have valid information for this program
- Need an indicator whether this tag entry is valid
- Dirty-bit (Write-back policy)
- If data in this cache has changed
- Shared-bit (MOESI policy)
- If this data is official / only / shared copy
Cache Coherency Policy - MOESI
Performance Measures
- Hit rate: fraction of accesses that hit in the cache
- Miss rate: 1 - Hit rate
- Global Miss Rate - the fraction of references that miss in some level of a multilevel cache
- misses in this cache divided by the total number of memory accesses generated by the CPU
- Local Miss Rate - the fraction of references to one level of a cache that miss
- Miss penalty: time to replace a block from lower level in memory hierarchy to cache
- Hit time: time to access cache memory (including tag comparison)
- Average Memory Access Time (AMAT) = Hit Time + Miss Rate × Miss Penalty
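A worked example with made-up numbers: hit time = 1 cycle, miss rate = 5%, miss penalty = 100 cycles ⇒ AMAT = 1 + 0.05 × 100 = 6 cycles.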
Cache Design Space
Several Interacting Dimensions including cache size, block size, associativity, replacement policy, write policy, etc.
Optimal choice is a compromise and depends on access characteristics ⇒ simplicity often wins
Design Choices
- Increasing Associativity
- Hit time increases with large step from DM to ≥ 2 ways
- Since we need to mux correct way to processor
- Hit time slightly increases for further increase in associativity
- Miss rate goes down due to reduced conflict misses
- But most gain is from 1→2→4 way with limited benefit from higher associativities
- Miss penalty mostly unchanged, since replacement policy runs in parallel with fetching missing line from memory
- Increasing # of entries
- Hit time increases since reading tags and data from larger memory structures
- Miss rate goes down due to reduced capacity and conflict misses
- Miss rate drops ~2x for every ~4x increase in capacity
- Miss penalty unchanged
- but at some point, increase in hit time may overcome the improvement in hit rate, yielding a decrease in performance
- Increasing block size
- Hit time unchanged but might be slightly reduced as number of tags is reduced
- Miss rate goes down at first due to spatial locality, then increases due to increased conflict misses (fewer blocks in cache)
- Miss penalty rises with larger block size, but the fixed initial latency is amortized over the whole block.
- Add a "victim cache"
- Small fully associative cache that holds the last few evicted cache lines
Operating System
- Runs before any user programs (after BIOS and boot loader) when computer is first turned on, and intermittently runs as long as computer is on.
- Finds and controls all I/O devices in the machine in a general way
- Relying on hardware-specific "device drivers"
- Starts Services(100+)
- File System,
- Network Stack (Ethernet, WIFI, Bluetooth...)
- Loads, runs and manages programs
- Multiple programs at the same time (time-sharing)
- Isolate programs from each other (isolation)
- Multiplex resources between applications (e.g. devices)
Sharing Of Resources
- OS gives each process isolation even when multiple processes share the same hardware resources
- Each process has the view that it "owns" the whole machine when it is running
- Share time on the CPU: Context Switch
- Change from one process to another on the CPU Core
- Save and restore the state of current process to pick up where it is left off (running status ⇒ runnable status)
- Share space in memory: Virtual Memory
- Each process has the "illusion" of access to the full address space
- One process cannot see what another process has stored in memory
- Requires following from hardware
- Memory translation
- Each running process has a mapping from "virtual" to "physical" addresses that is different for each process
- When doing load/store, the program issues a virtual address, but actual memory stored is a physical address
- Protection and privilege
- Split the processor into at least two running modes: "User" and "Supervisor"
- Lesser privilege cannot change its memory mapping
- But the Supervisor can change the mapping for any given program, and also has its own set of virtual ⇒ physical mappings
- Traps & Interrupts
- A way of going into Supervisor mode on demand
- CSR Registers ⇒ "Control and Status Registers"
CSRRW rd rs csr
means: read the old value of the specified control and status register and put it into rd
- If rs ≠ x0, place the value of rs into the CSR
- CSRs are used to communicate requests with the hardware
- The hardware enforces privileges, so program running at User level cannot change Supervisor-level CSRs.
Traps / Interrupts / Exceptions
- Interrupt
- Caused by an event external to current running program
- e.g. Key press, disk I/O
- Asynchronous to the current program; we can handle the interrupt at any convenient instruction
- "Whenever it's convenient, just don't wait too long"
- Exception
- Caused by some event during execution of one instruction of current running program
- e.g. Memory Error, Bus Error, Illegal Instruction, Raised Exception
- Synchronous
- Must handle exception precisely on instruction that caused the exception
- "Drop whaever you are doing ad act now"
- Trap
- Action of servicing interrupt or exception by hardware jump to "interrupt or trap handler" code
Trap Handler's View
- View of machine state is that every instruction prior to the trapped one has completed, and no instruction after the trap has executed.
- Implies that handler can return from an interrupt by restoring user registers and jumping back to interrupted instruction
- Interrupt handler software doesn't need to understand the pipeline of the machine, or what program was doing
- More complex to handle exception by interrupt
- Providing precise traps is tricky in a pipelined superscalar out-of-order processor!
Hardware Action
- Let's say we're running program A and all of a sudden, in the MA (memory access) stage, program A does an invalid memory read!
- Hardware first flushes the instructions currently in the pipeline (converts them to nops or "bubbles")
- Then it adjusts the privilege level
- Then it disables interrupts
- We don't want to get interrupted when handling an interrupt
- Write the old program counter into the sepc CSR
- It's the PC that triggered the exception (or the first instruction that hasn't yet executed, if an interrupt)
- Write the reason into the scause CSR
- Set the PC to the value in the stvec CSR
- This is the address of the "trap handler" ⇒ a single function that handles ALL exceptions and interrupts
Software Action
- Save all the registers
- Intent is to make the previous program think that nothing whatsoever actually happened!
- Steps
- Supervisor mode has an sscratch CSR
- Use it to point to a piece of memory to store things for the trap handler
- Swap x1 for sscratch
csrrw x1 x1 sscratch
- Now save all the other registers into that location
sw x2 4(x1)
sw x3 8(x1)
- ...
- Store the PC from the sepc CSR
csrrw x2 x0 sepc
sw x2 124(x1)
- Finally save x1 and restore sscratch
csrrw x2 x1 sscratch
sw x2 0(x1)
- Figure out what the exception or interrupt is
- Read the appropriate CSRs and other pieces to do what is necessary
- Restore all the registers
- Restore the value for sepc
- If an ECALL, increment it by 4 to make it look like a function call
- otherwise just redo the instruction that triggered the exception
- Swap x1 back temporarily using sscratch
- If an ECALL, set a0 to the returned value
- Return to the right point in execution
- execute the SRET instruction
- back to the hardware
Hardware Action Again
- Re-enable interrupts
- Now we're done with trap handler we can get interrupted again
- Reset back down to user level
- Restore the PC to the value of sepc
Now the program continues on like nothing ever happened.
The caches probably got trashed (flushed) ⇒ for security reasons
Context Switch
- Hardware provides the OS an interrupt - "timer interrupt"
- At a regular interval
- When triggered, the trap handler can execute a context switch
- Take those saved registers that were stored in the area pointed to by sscratch
- copy them to a bookkeeping data structure for current process(Process Control Block)
- copy the satp (table pointer) value to that data structure so we know its memory mapping
- Pick some other process's data structure
- Determined by the "scheduler" (调度器) ⇒ See CS162
- Load the process's registers, satp, sepc, etc.
- Tell the caches to flush themselves
- Needed for proper isolation
- We'd be taking a ton of misses anyway since the new process has no temporal locality with the old process
- return with sret
I/O
Options
- Special input/output instructions & hardware
- Memory mapped I/O
- portion of address space dedicated to I/O
- I/O device registers there (no memory)
- Use load/store instructions
I/O Models
Common I/O devices neither deliver nor accept data matching processor speed
So we have two models(Programmed IO)
- Polling
- Processor checks status before acting
- Device registers generally serve two functions
- Control Register - says it's OK to read/write (IO Ready)
- Data Register - contains data
- Processor reads from the Control Register in a loop (a short polling sketch in C appears after this list)
- Waiting for device to set Ready bit in Control Register (0 → 1)
- Processor then loads from (input) or writes to (output) data register
- I/O Interrupt
- Interrupt when IO is ready or needs attention
- Interrupt current program
- Transfers control to the trap handler in the operating system
- If there's a lot of IO,
- We spend a lot on context switches, flushing caches, pipeline flushes, etc.
- Both not ideal because
- Device speeds don't align well with CPU speeds
- Energy cost of using beefy general-purpose CPU where simpler hardware would suffice
- So comes Direct-Memory-Access (DMA)
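A minimal polling sketch in C (memory-mapped I/O). The register addresses and the meaning of bit 0 are made up for illustration; a real device defines its own layout:

#include <stdint.h>

#define CTRL_REG ((volatile uint32_t *)0x10000000)  // hypothetical control register
#define DATA_REG ((volatile uint32_t *)0x10000004)  // hypothetical data register

uint32_t read_one_word(void) {
    while ((*CTRL_REG & 0x1) == 0)   // spin until the device sets the Ready bit
        ;                            // this is the polling loop
    return *DATA_REG;                // then load the data register
}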
Real World(without DMA):
- Low data rate
- we should use interrupts because overhead of interrupts ends up being low
- but in practice, USB hardware only supports polling
- High data rate
- Start with interrupts
- If there's no data, we don't do anything
- Once start getting data
- We start polling
- Or we use Direct Memory Access (DMA) ⇒ The device just writes the data into memory directly.
DMA:
- Contains CSR registers written by CPU
- Memory address to write/read data
- # of bytes
- I/O device #, direction of transfer
- unit of transfer, amount to transfer per burst
DMA: Incoming Data
- Receive Interrupt from device
- CPU takes interrupt, initiates transfer
- Instructs DMA engine to place data at certain address
- DMA engine handles the transfer
- CPU executes other things in the meantime
- Upon completion, Device/DMA engine interrupts the CPU
DMA: Outgoing Data
- CPU decides to initiate transfer, confirms that external device is ready
- CPU initiates transfer
- Instructs DMA engine that data is available at certain address
- DMA engine handles the transfer
- CPU is free to execute other things
- DMA engine interrupts the CPU again to signal completion
Where to place DMA engine?
Since DMA messes around with memory, where in the memory hierarchy do we plug in the DMA engine?
- Between L1 and CPU?
- Free coherency ⇒ means our memory and cache will stay consistent
- But it trashes the CPU's working set with transferred data
- Between last-level cache and main memory
- Don't mess with caches
- But need to explicitly manage coherency
- Or just treat like another node in a multiprocessor
- what modern computers do
- DMA engine just acts like another processor for the cache coherence mechanisms
Virtual Memory
Supervisor mode alone isn't enough; we need applications to be able to access only their own memory.
Virtual Mem is good because...
- Protection & Privacy
- Each user has their own private address space and one or more shared address spaces
- Demand Paging
- provides the ability to run programs larger than primary memory
- But the system might start thrashing (repeatedly copy data to and from disk) when your working set exceeds physical memory.
A Processor-generated address can be split into:
Page Number | Offset |
Space required by the page tables (PT) is proportional to the address space, number of users, ... It is too large to keep in CPU registers
So we'll keep PTs in main memory ⇒ But this requires two references for each memory access: one to retrieve the page base address and another to access the data word
If an instruction references a memory page that isn't in DRAM
- We get an exception of type "page fault"
- Page fault handler
- If no unused page is available, a page currently in DRAM is selected to be replaced
- The replaced page is written to disk; the PTE that maps this VPN ⇒ PPN is marked with the DPN (disk page number)
- If the virtual page doesn't yet exist, assign it an unused page in DRAM
- If the page exists but was on disk
- Initiate transfer of the page contents we're requesting from disk to DRAM, assigning to an unused DRAM page
Size of Linear Page Table
With 32-bit memory addresses and 4KB (2^12 byte) pages ⇒ 2^20 virtual pages per user (process); if we assume 4-byte PTEs, that is 2^20 PTEs and 2^22 bytes required: a 4MB page table per process!
You may think that we can make each virtual page larger?
However, larger pages means:
- Internal fragmentation (Not all memory in page gets used)
- Larger page fault penalty (more time to read from disk)
Thinking about a 64-bit virtual address space, even 1MB pages with 8-byte PTEs would require an enormous linear table (35TB)
However, most processes only use a set of high addresses (stack) and a set of low addresses (instructions, heap)
So we will use a Hierarchical Page Table ⇒ this exploits the sparsity of virtual address space use
Cost of Virtual Memory
There's a cost to virtual memory, namely the price of Address Translation & Memory Protection
Address translation is very expensive: with a two-level page table, each reference becomes several memory accesses.
To account for the inefficiencies of address translation, we introduce Translation Lookaside Buffers (TLB) that caches some translations
- For a TLB hit, address translation now costs only one cycle
- For a TLB miss, we will need a Page-Table walk to get the physical address as well as to refill the TLB table.
TLB Reach = Size of largest virtual address space that can be simultaneously mapped by TLB
TLB Designs
- Typically 32-128 entries
- Each entry maps a large page
- So less spatial locality across pages
- Sometimes fully associative; larger TLBs (256-512 entries) are 4-8 way set-associative
- Larger systems sometimes have multi-level (L1, L2) TLBs
- Random or FIFO(First-In-First-Out) replacement policy
- Two styles of refill
- MIPS style
- The TLB is the only translation in the hardware
- Whenever you get a TLB miss you jump to the page fault handler
- x86
- The page table has a defined structure
- In the event of a TLB miss the hardware walks the page table
- Only if the page is unavailable do you jump to the page fault handler
- RISC-V prefers the x86 style, but is compliant with the MIPS style
Finished TLB workflow
Some Virtual Memory Tricks
- Copy-On-Write Duplication
- Split a process and now have two processes (fork)
- Copy the page table and registers
- and mark both the original and copy's memory as read-only
- Every time either process wants to write a page...
- Traps to the protection fault handler
- The fault handler copies the page, and updates both page-tables to allow writing
- And now we only copy memory when we need to first write it.
- Shared Dynamically Linked Libraries
- Two virtual PTEs point to the same physical memory space
- Memory Mapped File
- "Load the entire file into a contiguous block of memory"
- The system just points the PTE to disk ⇒ when the program actually reads, generates a page fault to OS and OS loads the page into memory
- Dirty bit in PTE
- Need to swap out the page, then if it is dirty write it out
RISC-V
Simple RISC-V RV32I Datapath Diagram (from CS61C Fall 21 slides)
Some Good Material...
Assembly Language
Immediates are "sign-extended"
Calling Convention
Two types of registers in calling convention
- Callee saved ⇒ The function that gets called saves it at the beginning and restores before returning
- Caller saved ⇒ The function that gets called may do whatever it wants with this register; the caller is responsible for saving it before the call and restoring it after the call.
CALL(Compiler ⇒ Assembler ⇒ Linker ⇒ Loader)
Compiler(CS164) ⇒ Most Computationally Heavy in the CALL chain
Higher-Level Language Code (foo.c) ⇒ Assembly Language Code (foo.s)
Code Matches Calling Convention for the architecture.
Output may contain pseudo-instructions
- Lexer: Transforms input ⇒ tokens
- Parser: Tokens ⇒ Abstract Syntax Tree
- Semantic Analysis and Optimization: Checks for semantic errors (语义错误), may reorganize code to make it better.
- Code Generation: Outputs the assembly code
Assembler: dumb compiler for assembly language
Assembly Language Code (foo.s) ⇒ Object Code, Information Tables (foo.o)
| Assembler Directive | Description |
| --- | --- |
| .text | Subsequent items put in user text segment (machine code) |
| .data | Subsequent items put in user data segment (binary rep of data in source file) |
| .globl sym | Declares sym global so it can be referenced from other files |
| .string str | Store the string str in memory and null-terminate it |
| .word w1.....wn | Store the n 32-bit quantities in successive memory words |
- Reads and uses Directives
- Replaces pseudo-instructions
- Produces Machine Language rather than just Assembly Language
- Outputs Object File
Tail call optimization
int doSth(){
....//lots of code
return foo(y);
}
- For efficiency, evaluate the arguments for foo() and place them in a0-a7
- Restore ra, all callee saved registers, and sp
- call foo() with j or tail
- foo() will return directly to where doSth needs to return to
Forward Reference Problem
Branch instructions can refer to labels that are "forward" in the program
L1:
slt t0, x0, a1
beq t0, x0, L2
addi a1, a1, -1
jal x0, L1
L2:
add t1, a0, a1
Solved by taking 2 passes over the program
- First pass remembers positions of labels
- Second pass uses label positions to generate code
We solved the forward reference problem (branches), but we might be referencing libraries outside of the current file (jumps, static data addresses), so how do we account for these?
Symbol Table
Contains a list of "items" in this file that may be used by other files
items include:
- Labels for function calling
- Data: anything in the .data sections and variables which may be accessed across files
Relocation Table
List of "items" this file needs the address of later
- External label jumped to
- Any piece of data referenced by static loading, such as la
Object File Format (Standard Format is ELF, except for Microsoft)
- Object file header: size and position of the other pieces of the object file
- Text segment (.text): the machine code
- Data segment (.data): binary representation of the static data
- Relocation Information: identifies lines of code that need to be fixed up later
- Symbol Table
- Debugging Information
Linker
Object code files with information tables (foo.o, libc.o) ⇒ Executable code (a.out)
- Combines several object files into a single executable
- Enables separate compilation of files so that changes to one file do not require recompilation of the whole program
Steps:
- Take text segment from each .o file and put them together
- Take data segment from each .o file and put them together, concatenate this to end of text segments
- Resolve References
- Linker assumes first word of first text segment is at 0x04000000 (virtual memory)
- It knows:
- length of each text and data segment
- ordering of text and data segments
- It calculates:
- absolute address of each label and each piece of data being referenced
- To resolve,
- It searches for reference in all "user" symbol tables
- If not found, search library files
- Once absolute address is determined, fill in the machine code appropriately
Loader
Load program into memory and kickstart the program. (usually implemented by OS)
- Reads executable header and determines size of text and data
- Creates new address space for program large enough to hold text and data segments, as well as stack segment
- Copy instructions and data from file into the new address space
- Copies arguments passed to the program onto the stack
- Initializes machine registers
- Most registers cleared, but stack pointer assigned address of 1st free stack location
- Jumps to the start-up routine that copies the program's arguments from the stack to registers & sets the PC
- If main routine returns, start-up routine terminates program with the exit system call
- Also responsible for linking dynamically linked libraries(DLL)
Dynamically Linked Libraries
- Storing a program requires less disk space
- Executing two programs requires less memory (if they share a library)
- At runtime, there's time overhead to do link
Parallelism
Calculating Speedup
F = fraction of the task that got affected
S = Speedup factor for the fraction of the task that is affected
Amdahl's Law: Speedup = 1 / ((1 - F) + F / S)
If the portion of the program that can be parallelized is small, then the speedup is limited
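A worked example: if F = 0.9 of the runtime can be sped up by S = 10, the overall speedup is 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3, far below 10, because the remaining serial 10% dominates.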
Strong and Weak Scaling
Getting good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
It is always important to get things right and then optimize, because the real cost of a program is the time to code it + the time to run it on the machine.
Data Level Parallelism
Increases Throughput
Single-Instruction/Multiple-Data Stream ⇒ SIMD
- Processes multiple data streams using a single instruction stream
- Intel SIMD instruction extensions / GPU
- Higher throughput per $
- Much simpler control logic
- Easy to map to MIMD
- Requires less memory since there are fewer instructions
- Less cost per unit
- 1 Instruction Decoder
- Less Complexity
- Latent/Tacit Synchronization
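A minimal SIMD illustration using Intel SSE intrinsics (my own sketch; the function name and the assumption that n is a multiple of 4 are mine):

#include <immintrin.h>

void add_arrays(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);             // load 4 floats
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions with one instruction
    }
}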
Multiple-Instruction/Multiple-Data Streams ⇒ MIMD
- Multiple autonomous processors simultaneously executing different instructions on different data
- multicore / warehouse-scale computers
- Lower throughput per $
- VERY hard to map to SIMD
- Requires more memory since there are more instructions
- More cost per unit
- 2+ Instruction Decoder
- More complexity
- Accurate or Explicit Synchronization
Loop Unrolling
Optimizing compilers usually perform this job.
- Expose data-level parallelism for vector(SIMD) instructions or super-scalar multiple instruction issue
- Mix the pipeline with unrelated operations to help reduce hazards
- Reduce loop "overhead"
- But also... makes code size larger
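A hand-unrolled version of a simple sum loop as a sketch of the idea (my own example; it assumes n is a multiple of 4, and a real compiler would also emit cleanup code for the remainder):

float sum4(const float *a, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // independent accumulators expose data-level parallelism
    for (int i = 0; i < n; i += 4) {        // loop overhead paid once per 4 elements
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}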
Thread Level Parallelism
Thread ⇒ A sequential flow of instructions that performs some task
Each core provides one or more(superscalar) hardware threads that actively execute instructions
The operating system multiplexes multiple software threads onto the available hardware threads
HyperThreading
Processor resources are expensive and should not be left idle
Hardware switches threads to bring in other useful work while waiting for a cache miss
So we can put in redundant hardware ⇒ don't have to save context on every thread switch
This is attractive for apps with abundant TLP (Thread-level Parallelism)
Data Race
Two memory accesses of the same location from different threads, at least one of which is a write, form a data race
So we'll use "Lock" to grant access to a region (critical section)
Test-and-Set
In a single atomic operation:
- Test to see if a memory location is set (contains 1)
- Set it (to 1) if it isn't (it contained a zero when tested)
- Otherwise indicate that the Set failed, so the program can try again
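A minimal spinlock sketch built from test-and-set. Here I use C11's atomic_flag as one way to get the atomic operation described above; this is an illustration, not how the lecture necessarily implements it (RISC-V would use an atomic instruction such as amoswap):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    while (atomic_flag_test_and_set(&lock))  // returns the old value: 1 means someone holds the lock
        ;                                    // spin (try again) until we observe 0
}

void release(void) {
    atomic_flag_clear(&lock);                // set it back to 0
}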
OpenMP
#pragma omp parallel
{
/* blablabla */
}
#pragma omp parallel for
for(int i=0; i<10; i++){
/* assigns each i to different threads */
}
int sum = 0;
#pragma omp parallel reduction(+: sum)
{
/* blablabla */
}
//Possible operators include: + / * / - / min / max / & / && / | / || / ^
Private Variables
#pragma omp parallel private (var1)
{
/* blablabla */
}
Methods
omp_set_num_threads(x);
num_th = omp_get_num_threads();
th_id = omp_get_thread_num();
double omp_get_wtime(void);
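A small end-to-end sketch combining the pragmas and library calls above (my own example values):

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;
    double start = omp_get_wtime();
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1);   // iterations are split across threads, partial sums combined at the end
    }
    double elapsed = omp_get_wtime() - start;
    printf("sum=%f threads=%d time=%fs\n", sum, omp_get_max_threads(), elapsed);
    return 0;
}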
Request-Level Parallelism
- Hundreds of thousands of requests per second
- Computation partitioned across different requests
- "Load Balancing" on DNS and also request level
- Redundant copies of data
- To break up hot spots
- Makes the system more tolerant of failures
Common IO Devices
Disk
Non-volatile storage that is cheap, large, slow
Two Types:
- Hard Disk Drives (HDD) 硬盘 - faster, more dense, non-removable
- Floppy Disks 软盘 - slower, less dense, removable (replaced by USB "flash drive")
Disk Device Terminology
- several platters, with information recorded magnetically on both surfaces
- Bits recorded in tracks, which in turn are divided into sectors (e.g. every 512B can be a sector)
- Actuator moves head over track ⇒ seek, wait for sector rotate under head, then read or write
The closer the head to the disk, the smaller the "spot size" and thus the denser the recording
Spot size
- Measured in
- ~900 is state of the art
Disks are sealed to keep the dust out so heads can fly at around 3-20nm above the surface of the disk; 99.999% of the head/arm weight is supported by the air bearing force (air cushion) between the disk and the head
Disk Access Time = Seek Time + Rotation Time + Transfer Time + Controller Overhead
- Seek Time = time to position the head assembly at the proper cylinder
- Average # of tracks to move arm = Number of tracks / 3 ⇒ Seek time = # of tracks moved * time to move across one track
- Rotation time = time for the disk to rotate until the first sector of the block to access reaches the head
- Use average distance of sector from head = 1/2 time of a rotation
- Transfer time = time taken by the sectors of the block and any gaps between them to rotate past the head
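A worked example with made-up numbers: a 7200 RPM disk rotates once every 60/7200 s ≈ 8.33 ms, so the average rotation time is ≈ 4.17 ms; with a 6 ms average seek, 0.1 ms transfer, and 0.5 ms controller overhead, one access costs roughly 6 + 4.17 + 0.1 + 0.5 ≈ 10.8 ms.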
Many disks have on-disk caches, which are completely hidden from the outside world, so estimates are different in practice.
Solid State Drive (SSD) / Flash Memory
NMOS transistor with an additional conductor between gate and source/drain which traps electrons.
Memory cells can only withstand a limited number of program-erase cycles. Controllers use a technique called wear leveling to distribute writes as evenly as possible across all the flash blocks.
Bandwidth is similar to a spinning disk
But there's no seek time, so there is no additional latency for random access vs. sequential access of a block
Networking
OSI 7 Layer Network Model
Political - You shall not encrypt...
Application: TLS + HTTP...
Transport: TCP/UDP
Network: IPv4 / IPv6
Data Link: Ethernet
Considerations in building systems
Dependability
Types of Faults in Digital Designs
- Design Bugs (function, timing, power draw)
- Detected and corrected at design time through testing and verification
- Manufacturing Defects (violation of design rules, impurities in processing, statistical variations)
- Post production testing for sorting
- spare on-chip resources for repair
- Dealing with this in ICs:
- Designers provide "test vectors"
- Tools help with ATPG(Automatic Test Pattern Generation)
- Special on-chip circuits help speed the testing process
- BIST (built in self test), Scan-chains
- Runtime Failures (physical effects and environmental conditions)
- "Hard faults": aging
- "Soft (transient) faults": electro-magnetic interference, cosmic particles
- Dealt with:
- Redundancy
- Measures of dependability
- Codes for Error Detection/Correction
- Protecting Disk-Drives Against Errors
Dependability(可靠性) Via Redundancy
- Spatial Redundancy
- Replicated data or extra information / hardware to handle hard and soft failures
- Temporal Redundancy
- Redundancy in time(retry) to handle soft failures
Dependability Measures
Reliability: Mean Time To Failure (MTTF)
Service Interruption: Mean Time To Repair (MTTR)
Mean Time Between Failures (MTBF) = MTTF + MTTR
Availability(可用性) = MTTF / (MTTF + MTTR)
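A worked example with made-up numbers: MTTF = 999 hours and MTTR = 1 hour gives availability = 999 / 1000 = 99.9%.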
Improving Availability
- Increase MTTF: More reliable hardware/software + Fault Tolerance
- Reduce MTTR: improved tools and processes for diagnosis and repair
Annualized Failure Rate (AFR) = Average Number Of Failures Per Year
Dependability Design Principle
No single points of failure
It follows the "barrel effect" (weakest link) ⇒ the dependability of the entire system is limited by the part with the lowest dependability
Error Correction/Detection Codes(ECC / EDC)
Error Detection Coding - Parity Bit
Each data value is tagged with an extra bit to force the stored word to have even parity (even number of "1"s)
Of course, we can also use an odd parity bit to protect against errors
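A small C sketch of computing the parity bit by XOR-folding a 32-bit word (my own illustration; the function name is made up):

#include <stdint.h>

uint32_t parity_bit(uint32_t w) {
    w ^= w >> 16;     // fold the upper half onto the lower half
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;    // 1 iff the word has an odd number of 1s; store it so word+parity has even parity
}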
Hamming ECC
Hamming Distance = # of bit positions where words differ
Of course, you will see that 3b1b has a much better explanation than I do...
RAID ⇒ Redundant Arrays of Inexpensive Disks
But reliability...
- If 1 disk has an MTTF of 50k hours
- 70 disks will have an MTTF of ~700 hours (50,000 / 70), assuming failures are independent
- Intuition: think of MTTF as the expected time for a die to roll a 6; with 70 dice rolling at once, the time until the first 6 appears is much shorter than with a single die
- But we know when failures occur because disks use a lot of CRC coding
So we introduce RAID!
- Files are "striped" across multiple disks
- Redundancy yields high data availability
- Disks will still fail
- But we can reconstruct contents from data redundantly stored in the array
- 6 RAID levels; 0, 1, 5, 6 are the most common today
Raid 0: Striping
- "Split this block of storage in half, the first half to the first disk, the second to the second disk"
- Not actually RAID (since no redundancy)
- Improves bandwidth linearly
- Doesn't really help latency
- And FAILURES WILL HAPPEN
Raid 1: Disk Mirroring / Shadowing (online sparing)
- Each disk fully duplicated onto its "mirror"
- Very high availability achieved
- Writes go to disk and mirror
- Reads from original disk, unless failure
Raid 2: Hamming Code for Error Correction
- Bits of a data block are distributed across all disks
- Check disks store parity groups of Hamming code
- All disks involved in read/write operations
Raid 3: Single Parity Disk (bitwise)
- Disk drives themselves code data and detect failures
- Reconstruction of data can be done with single parity disk if we know which disk failed
- Writes change data disk and P disk
RAID 4: High I/O Rate Parity (blockwise)
- Interleave data at the sector level (data block) rather than the bit level
- Permits more parallelism
- Independent small reads, parallel large reads
- Reconstruction of data can be done with single parity disk
- Reading without fault involve only data disk
- Writing involves data + parity disk
RAID 5: High I/O Rate Interleaved Parity
- Independent writes possible because of interleaved parity
- No longer the gold standard
- can experience 1 disk failure and continue operation
- But disk failures are not independent!
RAID 6: Add another parity block per stripe
- Now 2 blocks per stripe rather than 1
- Sacrifice capacity for increased redundancy
- Array can tolerate 2 disk failures and continue operating
Warehouse Computing
Measurements
Power Usage Effectiveness
Energy Efficiency is the primary concern in the design of WSC(Warehouse-Scale Computing)
Power Usage Effectiveness (PUE) = Total Building Power / IT Equipment Power
Cloud Services
- SaaS
- deliver apps over Internet, eliminating need to install/run on customer's computers, simplifying maintenance and support
- Google Docs, Win Apps in the Cloud
- PaaS
- Deliver computing "stack" as a service, using cloud infrastructure to implement apps
- Hadoop on EC2, Apache Spark on GCP
- IaaS
- Rather than purchasing servers, software, data center space, clients buy resources as an outsourced service
- Amazon Elastic Compute Cloud, Google Compute Platform
MapReduce
Specify the computation in terms of
- A map function
- and a reduce function
Fine granularity tasks 细粒度任务: Many more map tasks than machines
MapReduce Process: