ARM CPU Project
From ScienceZero
Contents
- 1 Goals
- 2 Plan
- 3 Instruction set for ARM3
- 3.1 Programming model
- 3.2 The instruction set
- 3.3 Condition codes
- 3.4 Arithmetic and bitwise instructions
- 3.5 Comparisons
- 3.6 Multiply instructions
- 3.7 Branching instructions
- 3.8 Single register/memory swap instruction (ARM3 or higher)
- 3.9 Single register load/store instructions
- 3.10 Multiple load/store instructions
- 3.11 SWI instruction
- 4 Instruction set binary representation
- 5 Documentation
- 6 Hardware realisation
Goals
- Implement an ARM v2a (ARM3) on an FPGA (Artix 7)
- Useful completeness - Full data processing instruction set, maybe simplified in other areas
- Useful Performance - 25 MHz ARM3 or higher
Plan
- Writing software to learn the detailed workings of the CPU
- Emulators written in C++ and C
- Assembler written in F#
Instruction set for ARM3
Programming model
- Little endian
- 32 bit data bus, 32 bit address
- 26 bit Program counter (add a 6 bit register for bits 26-31)
After system reset, the ARM begins processing at address 0x0, with interrupts disabled and in SVC mode. This address is the location of the Reset Vector, which should be a branch to the reset code.
Modes
Mode                    Short  MM    Description                                Shadow registers
User mode               usr    0b00  Normal program execution, no privileges    None
Fast interrupt request  fiq    0b01  Fast interrupt handling                    R8 - R14
Interrupt request       irq    0b10  Normal interrupt handling                  R13, R14
Supervisor              svc    0b11  Privileged mode for the operating system   R13, R14
Hardware vectors
Address  Name                   Content
0x00     Reset                  B branchThru0error
0x04     Undefined instruction  LDR PC,UndHandler
0x08     SWI                    B decodeSWI
0x0C     Prefetch abort         LDR PC,PabHandler
0x10     Data abort             LDR PC,DabHandler
0x14     Address exception      LDR PC,AexHandler
0x18     IRQ                    B handleIRQ
0x1C     FIQ                    FIQ code --> 0xFB
Having the FIQ code at the vector itself avoids jumping to the FIQ handler and saves 3 cycles.
Registers
R0-R12    General purpose
R13 (SP)  General purpose, commonly used as stack pointer
R14 (LR)  Link register, PC is copied to R14 by Branch-link instructions. Can be used as general purpose if correctly saved and restored
R15 (PC)  Program counter combined with the status flags, laid out as follows:
NZCVIFAAAAAAAAAAAAAAAAAAAAAAAAMM
31                             0
N  Negative flag
Z  Zero flag
C  Carry flag
V  Overflow flag
I  Interrupt request disable
F  Fast interrupt request disable
A  Address bits, the 2 LSBs are always zero.
M  Mode
The program counter is always 2 instructions beyond the currently executing instruction because of the pipeline.
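The R15 layout above can be expressed as a handful of masks. The following is only a sketch of how an emulator might split the register; the macro and function names are illustrative assumptions, not part of the project.

```c
#include <stdint.h>

/* Field masks for the combined PC/PSR register R15 (names are illustrative). */
#define R15_N     0x80000000u  /* bit 31: Negative flag */
#define R15_Z     0x40000000u  /* bit 30: Zero flag */
#define R15_C     0x20000000u  /* bit 29: Carry flag */
#define R15_V     0x10000000u  /* bit 28: Overflow flag */
#define R15_I     0x08000000u  /* bit 27: IRQ disable */
#define R15_F     0x04000000u  /* bit 26: FIQ disable */
#define R15_PC    0x03FFFFFCu  /* bits 25..2: word-aligned address */
#define R15_MODE  0x00000003u  /* bits 1..0: processor mode */

/* Address part of R15. */
static inline uint32_t r15_pc(uint32_t r15)   { return r15 & R15_PC; }

/* Mode bits of R15 (0 = usr, 1 = fiq, 2 = irq, 3 = svc). */
static inline uint32_t r15_mode(uint32_t r15) { return r15 & R15_MODE; }

/* Everything that is not an address bit: flags, I, F and mode. */
static inline uint32_t r15_psr(uint32_t r15)  { return r15 & ~R15_PC; }

/* Replace only the address bits, leaving flags and mode untouched. */
static inline uint32_t r15_set_pc(uint32_t r15, uint32_t addr) {
    return (r15 & ~R15_PC) | (addr & R15_PC);
}
```

Under these definitions the reset state described earlier (address 0x0, interrupts disabled, SVC mode) would correspond to R15 = 0x0C000003.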
The instruction set
<>     Optional.
(x|y)  Either x or y but not both.
#exp   Expression (0-31).
Rn     Register number (0-15).
shift  One of the following:
ASL (Rn AND 255 | #exp)  Arithmetic shift left by contents of Rn or #exp.
LSL (Rn AND 255 | #exp)  Logical shift left.
ASR (Rn AND 255 | #exp)  Arithmetic shift right.
LSR (Rn AND 255 | #exp)  Logical shift right.
ROR (Rn AND 255 | #exp)  Rotate right.
RRX                      Rotate right one bit with extend. LSB->C, C->MSB.
ASL and LSL are the same, but LSL is preferred.
Condition codes
AL  Always                    This is the default
CC  Carry clear               C=0
CS  Carry set                 C=1
EQ  Equal                     Z=1
GE  Greater than or equal     N=V
GT  Greater than              N=V and Z=0
HI  Higher (unsigned)         C=1 and Z=0
LE  Less than or equal        N<>V or Z=1
LS  Lower or same (unsigned)  C=0 or Z=1
LT  Less than                 N<>V
MI  Negative                  N=1
NE  Not equal                 Z=0
NV  Never                     Do not use; use MOV R0,R0 as a NOP instead
PL  Positive                  N=0
VC  Overflow clear            V=0
VS  Overflow set              V=1
LO  Lower (unsigned)          Same as CC
HS  Higher/same (unsigned)    Same as CS
Arithmetic and bitwise instructions
opcode<cond><S> Rd,<Rn>,(#exp|Rm<,shift>)
#exp has a range of X ROR N*2, where X=0-255 and N=0-15.
ADC  Add with carry               Rd=Rn+Rm+C
ADD  Add                          Rd=Rn+Rm
SBC  Subtract with carry          Rd=Rn-Rm-(1-C)
SUB  Subtract                     Rd=Rn-Rm
RSC  Reverse subtract with carry  Rd=Rm-Rn-(1-C)
RSB  Reverse subtract             Rd=Rm-Rn
AND  Bitwise AND                  Rd=Rn AND Rm
BIC  Bitwise AND NOT              Rd=Rn AND (NOT Rm)
ORR  Bitwise OR                   Rd=Rn OR Rm
EOR  Bitwise EOR                  Rd=Rn EOR Rm
MOV  Move                         Rd=Rm
MVN  Move NOT                     Rd=NOT Rm
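The "X ROR N*2" rule for #exp means that only some 32 bit constants fit in an instruction. A minimal sketch of decoding and encoding such an immediate, assuming the 12 bit field layout rrrr bbbbbbbb given in the binary representation section; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is taken modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Decode the 12-bit field rrrr bbbbbbbb: value = bbbbbbbb ROR (2 * rrrr). */
uint32_t decode_imm(uint32_t field) {
    return ror32(field & 0xFFu, 2u * ((field >> 8) & 0xFu));
}

/* Try to encode value as X ROR 2N. On success the 12-bit field is stored
   in *field and true is returned; otherwise *field is left untouched. */
bool encode_imm(uint32_t value, uint32_t *field) {
    for (unsigned n = 0; n < 16; n++) {
        /* Rotating right by 32 - 2n undoes a rotate right by 2n. */
        uint32_t x = ror32(value, (32u - 2u * n) & 31u);
        if (x <= 0xFFu) {
            *field = (n << 8) | x;
            return true;
        }
    }
    return false;
}
```

For example, 0xFF000000 is encodable (0xFF rotated right by 8), while 0x101 is not, because its two set bits are 9 bit positions apart and cannot fit in one rotated byte.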
Comparisons
opcode<cond><S|P> Rn,(#exp|Rm<,shift>)
CMN  Compare     Rn+Rm
CMP  Compare     Rn-Rm
TEQ  Test equal  Rn EOR Rm
TST  Test        Rn AND Rm
"P" can set the PSR to a given value if in a privileged mode. "S" is the default behaviour.
Multiply instructions
MUL<cond><S> Rd,Rm,Rs     Multiply             Rd=Rm*Rs
MLA<cond><S> Rd,Rm,Rs,Rn  Multiply-accumulate  Rd=Rm*Rs+Rn
Integer multiplication returns the 32 least significant bits of the product of two 32 bit operands.
Rd must not be R15 or the same as Rm.
Timing depends on Rs.
If "S" is given, N and Z are set on the result, C is undefined and V is unaffected.
Branching instructions
B<cond> expression   Branch           PC+=expression
BL<cond> expression  Branch and link  R14 = address of the next instruction together with the PSR, then PC+=expression
Single register/memory swap instruction (ARM3 or higher)
SWP<cond><B> Rdest,Rsrc,[Rbase]
Single register load/store instructions
LDR<cond><B><T> Rd,(address|#exp)
STR<cond><B><T> Rd,(address|#exp)
#exp has a range of +-4095 bytes.
"B"  Byte transfer.
"T"  Force address translation from a privileged mode. (Not available with pre-indexing!)
Address syntax
"!" update Rn after use
pre-index               post-index
[Rn]
[Rn,#exp]<!>            [Rn],#exp
[Rn,<->Rm]<!>           [Rn],<->Rm
[Rn,<->Rm,shift #s]<!>  [Rn],<->Rm,shift #s
The PSR is never modified. The PSR flags are not used if Rn=R15. (PC is 8 bytes ahead, pipelining!) The PSR flags are used when the PC is used as Rm.
Multiple load/store instructions
LDM<cond>type Rn<!>,{Rlist}<^>
STM<cond>type Rn<!>,{Rlist}<^>
"!" updates Rn after use.
"^" forces update of the PSR for a load with R15 in the list. Otherwise "^" forces the load/store to access the User mode registers. Rn is still taken from the current bank, but an update of Rn goes to the User bank.
Rlist is a list of registers to transfer, in low to high order.
type (addressing)
DA  Decrement Rn After
DB  Decrement Rn Before
IA  Increment Rn After
IB  Increment Rn Before
type (stack)
EA  Empty Ascending stack
ED  Empty Descending stack
FA  Full Ascending stack
FD  Full Descending stack
In an empty stack the stack pointer points to the first free slot.
In a full stack the SP points to the last data item written to it.
An ascending stack grows from low to high memory addresses.
A descending stack grows from high to low memory addresses.
You can always load the base register (Rn). When storing, the original Rn is stored only if Rn is the lowest register in the list; this only matters if you use "!".
If R15 is in the Rlist: The PSR is saved with the PC, and the PC is 12 bytes ahead. The PSR is only loaded if you use "^"; the mode decides which bits are updated.
If R15 is used as Rn: The PSR is used as a part of the address! Write back is switched off.
SWI instruction
SWI<cond> <expression>
Software interrupt, used for system calls. Sets the processor to SVC mode and then jumps to the SWI vector at address 0x08. R14_svc will be corrupted if you execute a SWI in SVC mode.
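A minimal sketch of what SWI entry could look like in an emulator, to make the R14_svc corruption concrete. The `cpu_state` struct and all names are hypothetical, and the sketch assumes (as on real 26 bit ARMs, but not stated above) that IRQs are disabled on entry.

```c
#include <stdint.h>

#define PC_MASK    0x03FFFFFCu  /* address bits of R15 */
#define I_BIT      0x08000000u  /* IRQ disable bit */
#define MODE_MASK  0x00000003u
#define MODE_SVC   0x00000003u
#define SWI_VECTOR 0x00000008u

/* Hypothetical minimal CPU state: only what SWI entry touches. */
typedef struct {
    uint32_t r15;      /* combined PC + PSR */
    uint32_t r14_svc;  /* banked link register for SVC mode */
} cpu_state;

/* Take the SWI exception for an instruction located at swi_addr. */
void enter_swi(cpu_state *cpu, uint32_t swi_addr) {
    uint32_t psr = cpu->r15 & ~PC_MASK;
    /* The return address and the old PSR overwrite R14_svc. If the caller
       was already in SVC mode, its own link register is lost here. */
    cpu->r14_svc = ((swi_addr + 4u) & PC_MASK) | psr;
    /* Assumption: flags are kept, IRQs are disabled, mode becomes SVC. */
    cpu->r15 = (psr & ~MODE_MASK) | I_BIT | MODE_SVC | SWI_VECTOR;
}
```

A handler written for SVC mode callers would therefore need to save R14 (for example on the stack) before executing a SWI of its own.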
Instruction set binary representation
Conditional execution
Instruction bitmap                     No  Cond code               Executes if
0000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   0   EQ (Equal)              Z
0001 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   1   NE (Not Equal)          ~Z
0010 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   2   CS (Carry Set)          C
0011 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   3   CC (Carry Clear)        ~C
0100 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   4   MI (MInus)              N
0101 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   5   PL (PLus)               ~N
0110 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   6   VS (oVerflow Set)       V
0111 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   7   VC (oVerflow Clear)     ~V
1000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   8   HI (HIgher)             C and ~Z
1001 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   9   LS (Lower or Same)      ~C or Z
1010 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   A   GE (Greater or equal)   N = V
1011 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   B   LT (Less Than)          N = ~V
1100 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   C   GT (Greater Than)       (N = V) and ~Z
1101 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   D   LE (Less or equal)      (N = ~V) or Z
1110 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   E   AL (Always)             True
1111 xxxx xxxxxxxx xxxxxxxx xxxxxxxx   F   NV (Never)              False
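The condition field maps directly onto a 16-way switch. The sketch below is one possible emulator helper, with a hypothetical name and signature; LS is written as "carry clear or zero set", the logical complement of HI.

```c
#include <stdbool.h>

/* Returns true if an instruction whose condition field (bits 31..28) is
   cond should execute, given the current N, Z, C and V flags. */
bool cond_passes(unsigned cond, bool n, bool z, bool c, bool v) {
    switch (cond & 0xFu) {
    case 0x0: return z;                /* EQ */
    case 0x1: return !z;               /* NE */
    case 0x2: return c;                /* CS */
    case 0x3: return !c;               /* CC */
    case 0x4: return n;                /* MI */
    case 0x5: return !n;               /* PL */
    case 0x6: return v;                /* VS */
    case 0x7: return !v;               /* VC */
    case 0x8: return c && !z;          /* HI */
    case 0x9: return !c || z;          /* LS */
    case 0xA: return n == v;           /* GE */
    case 0xB: return n != v;           /* LT */
    case 0xC: return (n == v) && !z;   /* GT */
    case 0xD: return (n != v) || z;    /* LE */
    case 0xE: return true;             /* AL */
    default:  return false;            /* NV */
    }
}
```

Each odd code is the negation of the even code before it, so a hardware version could evaluate eight conditions and XOR the result with bit 28.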
Data Processing Instructions
cccc 000 oooo s nnnn dddd aaaaa ttk mmmm   Register form   ADD Rd, Rn, Rm, LSL Ra
cccc 000 1101 s 0000 dddd aaaaa ttk mmmm   Register form   MOV Rd, Rm, LSL Ra ( Rn = 0 for MOV/MVN instructions )
cccc 001 oooo s nnnn dddd rrrr bbbbbbbb    Immediate form  ADD Rd, Rn, #bbbbbbbb ROR #rrrr0
cccc 001 1010 s nnnn 0000 rrrr bbbbbbbb    Immediate form  CMP Rn, #0 ( Rd = 0 for CMP/CMN/TST/TEQ instructions )
oooo  Name  Meaning              Operation
0000  AND   Boolean And          Rd = Rn AND Op2
0001  EOR   Boolean Eor          Rd = Rn EOR Op2
0010  SUB   Subtract             Rd = Rn - Op2
0011  RSB   Reverse Subtract     Rd = Op2 - Rn
0100  ADD   Addition             Rd = Rn + Op2
0101  ADC   Add with Carry       Rd = Rn + Op2 + C
0110  SBC   Subtract with carry  Rd = Rn - Op2 - (1 - C)
0111  RSC   Reverse sub w/carry  Rd = Op2 - Rn - (1 - C)
1000  TST   Test bit             Rn AND Op2
1001  TEQ   Test equality        Rn EOR Op2
1010  CMP   Compare              Rn - Op2
1011  CMN   Compare Negative     Rn + Op2
1100  ORR   Bitwise Or Register  Rd = Rn OR Op2
1101  MOV   Move value           Rd = Op2
1110  BIC   Bit clear            Rd = Rn AND NOT Op2
1111  MVN   Move Not             Rd = NOT Op2
Condition codes:
- Arithmetic operations (SUB, RSB, ADD, ADC, SBC, RSC, CMP, CMN): N and Z from the result, C and V from the ALU.
- Logical operations (AND, EOR, TST, TEQ, ORR, MOV, BIC, MVN): N and Z from the result; if the shifter is used, C is set to be the last bit shifted out.
ttk  Shift   Meaning
000  LSL #a  Logical Shift Left
001  LSL Ra  Logical Shift Left
010  LSR #a  Logical Shift Right
011  LSR Ra  Logical Shift Right
100  ASR #a  Arithmetic Shift Right
101  ASR Ra  Arithmetic Shift Right
110  ROR #a  Rotate Right. ROR #0 -> RRX, Rotate Right one bit with extend ( Carry -> Value -> Carry )
111  ROR Ra  Rotate Right
LSL by 32 has result zero, carry out equal to bit 0 of Rm.
LSL by more than 32 has result zero, carry out zero.
LSR by 32 has result zero, carry out equal to bit 31 of Rm.
LSR by more than 32 has result zero, carry out zero.
ASR by 32 or more has result filled with and carry out equal to bit 31 of Rm.
ROR by 32 has result equal to Rm, carry out equal to bit 31 of Rm.
ROR by n where n is greater than 32 gives the same result and carry out as ROR by n-32; repeat until the amount is in the range 1 to 32.
If Rn = R15 then the value used is R15 with all the PSR bits masked out. If Op2 involves R15, then all 32 bits are used.
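The rules for shifts by a register amount (including the by-32 and above-32 special cases) can be sketched as one function that returns both the shifted value and the shifter carry out. The type and function names are hypothetical, and the sketch assumes that a shift amount of 0 leaves both the value and the carry unchanged, which is not stated above.

```c
#include <stdint.h>

typedef enum { SH_LSL = 0, SH_LSR = 1, SH_ASR = 2, SH_ROR = 3 } shift_type;

typedef struct {
    uint32_t value;  /* shifted operand */
    unsigned carry;  /* shifter carry out, 0 or 1 */
} shift_result;

/* Shift rm by a register-specified amount (only the low byte is used).
   carry_in is the current C flag. */
shift_result shift_by_reg(shift_type type, uint32_t rm, unsigned amount, unsigned carry_in) {
    shift_result r = { rm, carry_in & 1u };
    amount &= 0xFFu;
    if (amount == 0) return r;  /* assumption: no shift, carry unchanged */
    switch (type) {
    case SH_LSL:
        if (amount < 32)       { r.value = rm << amount; r.carry = (rm >> (32 - amount)) & 1u; }
        else if (amount == 32) { r.value = 0; r.carry = rm & 1u; }
        else                   { r.value = 0; r.carry = 0; }
        break;
    case SH_LSR:
        if (amount < 32)       { r.value = rm >> amount; r.carry = (rm >> (amount - 1)) & 1u; }
        else if (amount == 32) { r.value = 0; r.carry = rm >> 31; }
        else                   { r.value = 0; r.carry = 0; }
        break;
    case SH_ASR:
        if (amount < 32) {
            /* Fill the vacated top bits with copies of bit 31. */
            uint32_t fill = (rm & 0x80000000u) ? ~(0xFFFFFFFFu >> amount) : 0u;
            r.value = (rm >> amount) | fill;
            r.carry = (rm >> (amount - 1)) & 1u;
        } else {
            r.value = (rm & 0x80000000u) ? 0xFFFFFFFFu : 0u;
            r.carry = rm >> 31;
        }
        break;
    case SH_ROR: {
        unsigned n = amount & 31u;  /* multiples of 32 leave the value unchanged */
        r.value = n ? (rm >> n) | (rm << (32 - n)) : rm;
        r.carry = r.value >> 31;    /* last bit rotated out ends up in bit 31 */
        break;
    }
    }
    return r;
}
```

Immediate shift amounts are not covered here; ROR #0 in particular encodes RRX and would need its own path.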
Branch Instructions
cccc 101L oooooooo oooooooo oooooooo
Destination address = current address + 8 + (4 * sign extended offset)
The top 6 bits of the destination address are cleared.
If L = 1, then the address of the next instruction is copied into R14 before the branch is taken. The called function can return with MOV PC,R14, or MOVS PC,R14 to return with the original condition codes.
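As a sketch, the destination address formula above could be computed like this in an emulator; `branch_target` is a hypothetical helper name and `pc` is taken to be the address of the branch instruction itself.

```c
#include <stdint.h>

/* Compute the destination of a B/BL instruction located at address pc. */
uint32_t branch_target(uint32_t instr, uint32_t pc) {
    uint32_t offset = instr & 0x00FFFFFFu;             /* 24-bit word offset */
    if (offset & 0x00800000u) offset |= 0xFF000000u;   /* sign extend to 32 bits */
    return (pc + 8u + (offset << 2)) & 0x03FFFFFFu;    /* top 6 bits cleared */
}
```

With this definition the encoding 0xEAFFFFFE (offset -2) branches to itself, which is a convenient sanity check for the +8 pipeline adjustment.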
Multiplication
cccc 0000 00ASdddd nnnnssss 1001mmmm
If the S bit is set, the N and Z flags are set on the result, C is undefined, and V is unaffected.
If the A bit is set, then the effect of the operation is Rd = Rm * Rs + Rn, otherwise Rd = Rm * Rs.
The destination register shall not be the same as the operand register Rm.
R15 shall not be used as an operand or as the destination register.
Single Data Transfer
cccc 010P UBWLnnnn ddddoooo oooooooo   Immediate form
cccc 011P UBWLnnnn ddddaaaa att0mmmm   Register form (aaaaa = shift amount, tt = shift type)
If L = 1, then a load is performed, otherwise a store.
If P = 1, then pre-indexed addressing is used, otherwise post-indexed addressing is used.
If U = 1, then the offset given is added to the base register, otherwise it is subtracted.
If B = 1, then a byte of memory is transferred, otherwise a word is transferred.
The interpretation of the W bit depends on the addressing mode used:
- For pre-indexed addressing, W being set forces the final address used for the transfer to be written back into the base register (i.e. a side effect of the transfer is Rn := Rn +/- offset). This is signified to assemblers by postfixing the instruction with "!".
- For post-indexed addressing, the address is always written back, and the bit being set indicates that an address translation should be forced before the transfer takes place. This is signified to assemblers by postfixing the mnemonic stub with "T".
An address translation causes the chip to tell the memory system that this is a user mode transfer, regardless of whether the chip is in a user mode or a privileged mode at the time. This is useful e.g. when writing emulators: suppose that a user mode program executes an STF instruction to an area of memory that may not be written by user mode code. If this is executed by an FPA, it will abort. If it is executed by the FPE, it should also abort. But the FPE runs in a privileged mode, so if it were to use normal stores, they wouldn't abort. To make aborts work properly, it instead uses normal stores if it was called from a privileged mode, but STRTs if it was called from a user mode.
If the immediate form of the instruction is used, the o field gives a 12-bit offset. If the register form is used, then it is decoded as for the data processing instructions, with the restriction that shifts by register amounts are not allowed.
If R15 is used as Rd, the PSR is not modified.
The PC should not be used in Op2. Other restrictions: Don't use writeback or post-indexing when the base register is the PC. Don't use the PC as Rd for an LDRB or STRB. When using post-indexing with a register offset, don't make Rn and Rm the same register (doing so makes recovery from aborts impossible). Unaligned reads rotate the data so that the byte at the address read is in the least significant position in the destination register. What happens with unaligned writes?
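The rotation applied to unaligned word reads can be sketched as follows. The memory model (a little-endian array of words indexed by address / 4) and the function name are assumptions made for illustration only.

```c
#include <stdint.h>

/* Word load from a possibly unaligned address. mem is a hypothetical
   little-endian word array indexed by (addr / 4). */
uint32_t load_word_rotated(const uint32_t *mem, uint32_t addr) {
    uint32_t word = mem[addr >> 2];    /* fetch the aligned word */
    unsigned rot = (addr & 3u) * 8u;   /* 0, 8, 16 or 24 bits */
    /* Rotate right so the addressed byte lands in bits 7..0. */
    return rot ? (word >> rot) | (word << (32 - rot)) : word;
}
```

For a word 0x44332211 stored at address 0, a read from address 1 would return 0x11443322: the byte at address 1 (0x22) ends up in the least significant position.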
Block Data Transfer
cccc 100P USWLnnnn llllllll llllllll
The U bit indicates whether the address will be modified by +4 (set), or -4 (clear) for each register.
The W bit always indicates writeback.
If set, the L bit indicates that a load operation should be performed. If clear, a store.
The P bit is used to indicate whether to increment/decrement the base before or after each load/store (see the table below).
Bit l is set if Rl is to be loaded/stored by this operation.
Stub  Meaning                               P  U
DA    Decrement Rn After each store/load    0  0
DB    Decrement Rn Before each store/load   1  0
IA    Increment Rn After each store/load    0  1
IB    Increment Rn Before each store/load   1  1
Synonyms for these exist which are clearer when implementing stacks:
Stub  Meaning
EA    Empty Ascending stack
ED    Empty Descending stack
FA    Full Ascending stack
FD    Full Descending stack
The S bit controls two special functions, both of which are indicated to the assembler by putting "^" at the end of the instruction:
- If the S bit is set, the instruction is LDM and R15 is in the register list, then:
  - In 26-bit privileged modes, all 32 bits of R15 will be loaded.
  - In 26-bit user mode, the 4 flags and 24 PC bits of R15 will be loaded. Bits 27, 26, 1 and 0 of the loaded value will be ignored.
  - In 32-bit modes, all 32 bits of R15 will be loaded, though note that the two bottom bits are always zero, so any ones loaded to them will be ignored. In addition, the SPSR of the current mode will be transferred to the CPSR; since user mode does not have an SPSR, this type of instruction should not be used in 32-bit user mode.
- If the S bit is set and either the instruction is STM or R15 is not in the register list, then the user mode registers will be transferred rather than those for the current mode. This type of instruction should not be used in user mode.
Special cases occur when the base register is used in the list of registers to be transferred. The base register can always be loaded without any problems. However, don't specify writeback if the base register is being loaded - you can't end up with both a written-back value and a loaded value in the base register! The base register can be stored with no complications as long as writeback is not used. Storing a list of registers including the base register using writeback will write the value of the base register before writeback to memory only if the base register is the first in the list. Otherwise, the value which is used is not defined. Further special cases occur if the program counter is present in the list of registers to load and save. The PSR is always saved with the PC (in 26 bit modes) (and the PC will always be 12 bytes further on, rather than the usual 8 (in all modes)). On a load, only the bits of the PSR that are alterable in the current mode can be affected, and then only if the S bit is set. The PC should not be used as the base register.
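A sketch of how the P and U bits and the register list determine the memory addresses of a block transfer. It assumes that the lowest-numbered register always goes to the lowest address regardless of direction; the struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t first;     /* address used for the lowest-numbered register */
    uint32_t new_base;  /* value written back to Rn when W = 1 */
} block_addr;

/* Count the registers selected in the 16-bit register list. */
static unsigned count_regs(uint32_t list) {
    unsigned n = 0;
    for (list &= 0xFFFFu; list; list >>= 1) n += list & 1u;
    return n;
}

/* Work out where an LDM/STM starts and what the written-back base is.
   Successive registers are transferred at first, first + 4, first + 8, ... */
block_addr block_transfer_addr(uint32_t base, unsigned p, unsigned u, uint32_t list) {
    uint32_t bytes = 4u * count_regs(list);
    block_addr a;
    if (u) {   /* IA (p = 0) or IB (p = 1) */
        a.first = p ? base + 4u : base;
        a.new_base = base + bytes;
    } else {   /* DA (p = 0) or DB (p = 1) */
        a.first = p ? base - bytes : base - bytes + 4u;
        a.new_base = base - bytes;
    }
    return a;
}
```

For example, a DB store of R0-R3 with Rn = 0x1000 would start at 0x0FF0 and write back 0x0FF0, which is a push onto a full descending stack.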
Software interrupt
cccc 1111 yyyyyyyy yyyyyyyy yyyyyyyy SWI #y
Single Data Swap
cccc 0001 0B00nnnn dddd0000 1001mmmm SWP Rd, Rm, [Rn]
Co-processor instructions
The exact meaning of these instructions depends on the particular co-processor in use. The only part which is obligatory is that pppp must be the coprocessor number. Defined co-processors:
- 1 Floating Point unit
- 2 Floating Point unit
- 15 Cache Controller
Co-processor data operations
cccc 1110 oooonnnn ddddpppp qqq0mmmm
CDP p, o, CRd, CRn, CRm, q
CDP p, o, CRd, CRn, CRm
Co-processor data transfer and register transfers
cccc 110P UNWLnnnn DDDDpppp oooooooo   LDC/STC
cccc 1110 oooLNNNN ddddpppp qqq1MMMM   MRC/MCR
Floating-point
Programmer's model
32 bit (S) - sign(1) exponent(8) fraction(23)
64 bit (D) - sign(1) exponent(11) fraction(52)
80 bit (E) - sign(1) zeros(16) exponent(15) J(1) fraction(63)
Packed decimal - 0xseeeeddddddddddddddddddd, value = +/- d * 10 ^ (+/- e)
Floating-point status register
This register (FPSR) contains the IEEE flags; the result flags are only available after a comparison. The flags can only be cleared by the WFS instruction.
bit  Cumulative flags
0    IVO - Invalid operation
1    DVZ - Division by zero
2    OFL - Overflow
3    UFL - Underflow
4    INX - Inexact
bit  Interrupt masks
16   IVO
17   DVZ
18   OFL
19   UFL
20   INX
Rounding
- Nearest (default)
- +infinity (P)
- -infinity (M)
- Zero (Z)
Instructions
Documentation
Hardware realisation
Pipeline
Optimization
Ideas for speeding up the CPU
Read buffer
- The last 64 bit word read from memory can be held in a register to speed up consecutive 32 bit and 8 bit reads.
- Must be synced with or invalidated by writes.
- A simple bitmask may be used to enable the buffer for different memory regions.
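A behavioural sketch of the read buffer idea in C, intended as a model for the emulator rather than as the hardware design. All names are hypothetical; the per-region enable mask is left out, and a write to the buffered word simply invalidates the buffer.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint32_t tag;    /* address of the buffered 64-bit word, low 3 bits cleared */
    uint64_t data;   /* little-endian contents of that word */
} read_buffer;

/* Remember the result of a 64-bit memory read covering addr. */
void rb_fill(read_buffer *rb, uint32_t addr, uint64_t data) {
    rb->valid = true;
    rb->tag = addr & ~7u;
    rb->data = data;
}

/* Try to serve an aligned 32-bit read from the buffer. Returns false on a
   miss, in which case the caller has to go to memory. */
bool rb_read32(const read_buffer *rb, uint32_t addr, uint32_t *out) {
    if (!rb->valid || (addr & ~7u) != rb->tag) return false;
    *out = (uint32_t)(rb->data >> ((addr & 4u) * 8u));  /* low or high half */
    return true;
}

/* Invalidate the buffer when a write touches the buffered word. */
void rb_on_write(read_buffer *rb, uint32_t addr) {
    if ((addr & ~7u) == rb->tag) rb->valid = false;
}
```

Invalidation was chosen here because it is the simplest way to stay correct; updating the buffered data on a write would keep more hits at the cost of extra logic.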
Write buffer
- A write buffer stops memory store instructions from stalling the CPU until the store is complete.
- Several consecutive 8 or 32 bit writes can be merged into 64 bit writes to maximize bandwidth to main memory.
- Memory reads must check the write buffer.
- The write buffer may need a timeout so the last writes don't stay in the buffer for a long time.
Two instructions per clock
- May be able to run two data instructions per clock without significant increase in complexity or decrease in frequency.
- We start the two instructions at the same time.
- While they're running, check rules and disable the write gate for one of them if they're not eligible to run concurrently.
- Use CCs to find instructions that will not run.
- CCs can be used even when there's a dependency between the two concurrent instructions.
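One way the condition codes could help, sketched under stated assumptions: in the condition encoding each odd code is the negation of the even code before it (EQ/NE, CS/CC, and so on), so two instructions whose codes differ only in the lowest bit can never both execute, provided the first one does not change the flags. The function name and the `first_sets_flags` parameter are hypothetical, and this is only one possible pairing rule.

```c
#include <stdbool.h>

/* Returns true if two back-to-back instructions with condition codes cond1
   and cond2 can never both execute, so a register dependency between them
   does not prevent issuing them together. */
bool conds_exclusive(unsigned cond1, unsigned cond2, bool first_sets_flags) {
    /* If the first instruction updates the flags, the second one is
       evaluated against different flags and nothing can be concluded. */
    if (first_sets_flags) return false;
    /* Complementary pairs differ only in bit 0 of the condition code. */
    return ((cond1 ^ cond2) & 0xFu) == 1u;
}
```

A typical case would be MOVEQ R0,#1 followed by MOVNE R0,#0: both write R0, but only one of them ever takes effect.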
Hiding latency
- Memory instructions may have significant latency
- Continue executing instructions until there is a blocking dependency