Goals

Implement an ARMv2a (ARM3) on an FPGA - Artix 7
Useful completeness - Full data processing instruction set, maybe simplified in other areas
Useful Performance - 25 MHz ARM3 or higher

Plan

Writing software to learn the detailed workings of the CPU
- Emulators written in C++ and C
- Assembler written in F#

Instruction set for ARM3

Programming model

Little endian
32 bit data bus, 32 bit address
26 bit Program counter

After system reset, the ARM begins processing at address 0x0, with interrupts disabled and in SVC mode. This address is the location of the Reset Vector, which should be a branch to the reset code.

Modes

Mode                   Short  MM    Description                               Shadow registers
User mode              usr    0b00  Normal program execution, no privileges   None
Fast interrupt Request fiq    0b01  Fast interrupt handling                   R8 - R14
Interrupt request      irq    0b10  Normal interrupt handling                 R13, R14
Supervisor             svc    0b11  Privileged mode for the operating system  R13, R14

Hardware vectors

Address Name                   Content
0x00    Reset                  B     branchThru0error
0x04    Undefined instruction  LDR   PC,UndHandler
0x08    SWI                    B     decodeSWI
0x0C    Prefetch abort         LDR   PC,PabHandler
0x10    Data abort             LDR   PC,DabHandler
0x14    Address exception      LDR   PC,AexHandler
0x18    IRQ                    B     handleIRQ
0x1C    FIQ                    FIQ code --> 0xFB. Having the code here avoids jumping to the FIQ handler and saves 3 cycles.

Registers

R0-R12
 General purpose
R13 (SP)
 General purpose, commonly used as stack pointer
R14 (LR)
 Link register, PC is copied to R14 by Branch-link instructions
 Can be used as general purpose if correctly saved and restored

R15 (PC)
 NZCVIFAAAAAAAAAAAAAAAAAAAAAAAAMM
31                              0

N Negative flag
Z Zero flag
C Carry flag
V Overflow flag
I Interrupt request disable
F Fast interrupt request disable
A Address bits, the 2 LSBs are always zero.
M Mode

The program counter is always 2 instructions beyond the currently executing instruction because of the pipeline.
The program counter is 26 bits long and the two LSBs are always 0 and replaced by MM in R15.
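
The R15 layout above can be sketched in C as a set of masks and helpers. This is only an illustration for an emulator; the macro and function names are made up here and are not taken from any existing source.

```c
#include <stdint.h>

/* Field masks for the combined PC/PSR word held in R15 (26-bit ARM).
 * All names are hypothetical, chosen for this sketch. */
#define R15_N_FLAG    0x80000000u  /* bit 31: Negative            */
#define R15_Z_FLAG    0x40000000u  /* bit 30: Zero                */
#define R15_C_FLAG    0x20000000u  /* bit 29: Carry               */
#define R15_V_FLAG    0x10000000u  /* bit 28: Overflow            */
#define R15_I_FLAG    0x08000000u  /* bit 27: IRQ disable         */
#define R15_F_FLAG    0x04000000u  /* bit 26: FIQ disable         */
#define R15_PC_MASK   0x03FFFFFCu  /* bits 25..2: word address    */
#define R15_MODE_MASK 0x00000003u  /* bits 1..0: mode (MM)        */
#define R15_PSR_MASK  0xFC000003u  /* flags, I, F and mode bits   */

/* Program counter part of R15 (always word aligned). */
uint32_t r15_pc(uint32_t r15)   { return r15 & R15_PC_MASK; }

/* Mode bits: 0 = usr, 1 = fiq, 2 = irq, 3 = svc. */
uint32_t r15_mode(uint32_t r15) { return r15 & R15_MODE_MASK; }

/* PSR part of R15 (everything that is not an address bit). */
uint32_t r15_psr(uint32_t r15)  { return r15 & R15_PSR_MASK; }

/* Combine an address and a PSR value into one R15 word. */
uint32_t r15_make(uint32_t pc, uint32_t psr)
{
    return (pc & R15_PC_MASK) | (psr & R15_PSR_MASK);
}
```

With these helpers the reset state (address 0x0, interrupts disabled, SVC mode) would be r15_make(0, R15_I_FLAG | R15_F_FLAG | 3).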

PSR = Program Status Register

The instruction set

<>     Optional.
(x|y)  Either x or y but not both.
#exp   Expression (0-31).
Rn     Register number(0-15).
shift  indicates one of the following:
       ASL (Rn AND 255 | #exp)  Arithmetic shift left by contents of Rn or #exp
       LSL (Rn AND 255 | #exp)  Logical shift left.
       ASR (Rn AND 255 | #exp)  Arithmetic shift right.
       LSR (Rn AND 255 | #exp)  Logical shift right.
       ROR (Rn AND 255 | #exp)  Rotate right.
       RRX                      Rotate right one bit with extend. LSB->C C->MSB
       ASL and LSL are the same, but LSL is preferred.

Condition codes

AL    Always                   This is the default
CC    Carry clear              C=0
CS    Carry set                C=1
EQ    Equal                    Z=1
GE    Greater than or equal    N=V
GT    Greater than             N=V and Z=0
HI    Higher (unsigned)        C=1 and Z=0
LE    Less than or equal       N<>V or Z=1
LS    Lower or same (unsigned) C=0  or Z=1
LT    Less than                N<>V
MI    Negative                 N=1
NE    Not equal                Z=0
NV    Never                    Do not use; NOP = MOV R0,R0
PL    Positive                 N=0
VC    Overflow clear           V=0
VS    Overflow set             V=1
LO    Lower (unsigned)         same as CC
HS    Higher/same (unsigned)   same as CS

Arithmetic and bitwise instructions

opcode<cond><S> Rd,<Rn>,(#exp|Rm<,shift>)
#exp has a range of X ROR N*2  X=0-255 N=0-15

ADC  Add with carry                  Rd = Rn + Rm + C
ADD  Add                             Rd = Rn + Rm
SBC  Subtract with carry             Rd = Rn - Rm - (1 - C)
SUB  Subtract                        Rd = Rn - Rm
RSC  Reverse subtract with carry     Rd = Rm - Rn - (1 - C)
RSB  Reverse subtract                Rd = Rm - Rn
AND  Bitwise AND                     Rd = Rn AND Rm
BIC  Bitwise AND NOT                 Rd = Rn AND (NOT Rm)
ORR  Bitwise OR                      Rd = Rn OR Rm
EOR  Bitwise EOR                     Rd = Rn EOR Rm
MOV  Move                            Rd = Rm
MVN  Move NOT                        Rd = NOT Rm
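
The #exp range given above (X ROR N*2, X=0-255, N=0-15) means that not every 32-bit constant can be used as an immediate. A minimal C sketch of how an assembler or emulator might encode and decode it follows; the function names and the 12-bit field layout (N in bits 11-8, X in bits 7-0) are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate right by n bits (n is reduced modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Decode a 12-bit immediate field: value = X ROR (2 * N). */
uint32_t decode_imm(uint32_t field12)
{
    uint32_t x = field12 & 0xFFu;
    unsigned n = (field12 >> 8) & 0xFu;
    return ror32(x, 2 * n);
}

/* Try to encode a constant; returns false if no X/N pair exists.
 * The first (smallest) N that works is used. */
bool encode_imm(uint32_t value, uint32_t *field12)
{
    for (unsigned n = 0; n < 16; n++) {
        /* Rotating left by 2n undoes a rotate right by 2n. */
        uint32_t x = ror32(value, (32 - 2 * n) & 31);
        if (x <= 0xFFu) {
            *field12 = (n << 8) | x;
            return true;
        }
    }
    return false;
}
```

For example 0xFF000000 encodes as X=0xFF, N=4, while 0x101 cannot be encoded because its two set bits never fit into one 8-bit window.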

Comparisons

opcode<cond><S|P> Rn,(#exp|Rm<,shift>)

CMN  Compare negative  Rn + Rm
CMP  Compare    Rn - Rm
TEQ  Test equal Rn EOR Rm
TST  Test       Rn AND Rm

"P" can set the PSR to a given value if in a privileged mode.
"S" is default behaviour

Multiply instructions

MUL<cond><S> Rd,Rm,Rs        Multiply             Rd=Rm*Rs
MLA<cond><S> Rd,Rm,Rs,Rn     Multiply-accumulate  Rd=Rm*Rs+Rn 

Integer multiplication returns the 32 LSBs of the product of two 32-bit operands.
Rd must not be R15 or the same as Rm. Timing depends on Rs.
If "S" is given, N and Z are set on the result, C is undefined and V is unaffected.

Branching instructions

B<cond>  expression   Branch, PC+=expression
BL<cond> expression   Branch and link, R14=address of next instruction plus PSR, then PC+=expression

Single register/memory swap instruction (ARM3 or higher)

SWP<cond><B> Rdest,Rsrc,[Rbase]

Single register load/store instructions

LDR<cond><B><T> Rd,(address|#exp)    #exp has a range of +-4095 bytes.
STR<cond><B><T> Rd,(address|#exp)
"B" Byte transfer.
"T" Force address translation from a privileged mode. (Not pre-index!)

Address syntax

"!" update Rn after use  
pre-index                post-index
[Rn]
[Rn,#exp]<!>                [Rn],#exp
[Rn,<->Rm]<!>               [Rn],<->Rm
[Rn,<->Rm,shift #s]<!>      [Rn],<->Rm,shift #s

The PSR is never modified.
The PSR flags are not used if Rn=R15. (PC is 8 bytes ahead, pipelining!)
The PSR flags are used when the PC is used as Rm.

Multiple load/store instructions

LDM<cond>type Rn<!>,{Rlist}<^>
STM<cond>type Rn<!>,{Rlist}<^>
"!" update Rn after use 
For a load with R15 in the list "^" forces update of the PSR.
Otherwise "^" forces the load/store to access the User mode registers.
Rn is taken from the current bank, so update of Rn goes to the User bank.

Rlist is a list of registers to transfer, always in low-to-high register order.

type
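DA  Decrement Rn After    EA  Empty Ascending stack
DB  Decrement Rn Before   ED  Empty Descending stack
IA  Increment Rn After    FA  Full Ascending stack
IB  Increment Rn Before   FD  Full Descending stack
The two columns are separate lists, not row-by-row equivalents: which stack name
maps to which addressing form depends on load or store (e.g. STMFD = STMDB, LDMFD = LDMIA).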

In an empty stack the stack pointer points to first free slot.
In a full stack the SP points to the last data item written to it.
An ascending stack grows from low to high memory addresses.
A descending stack grows from high to low memory addresses.

The base register (Rn) can always be loaded.
On a store, the original Rn is written only if Rn is the lowest register in the list.
This only matters if you use "!".

If R15 is in the Rlist:
 The PSR is saved with the PC, the PC is 12 bytes ahead.
 The PSR is only loaded if you use "^", the mode decides what to update.

If R15 is used as Rn:
 The PSR is used as part of the address!
 Write back is switched off.

SWI instruction

SWI<cond> <expression>
Software interrupt used for system calls
Sets the processor to SVC mode, then the processor jumps to the SWI vector at address 0x8.
The R14_svc will be corrupted if you execute a SWI in SVC mode.

Instruction set binary representation

Conditional execution

Instruction Bitmap                  No   Cond Code             Executes if:
0000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 0    EQ(Equal)             Z
0001 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 1    NE(Not Equal)        ~Z
0010 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 2    CS(Carry Set)         C
0011 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 3    CC(Carry Clear)      ~C

0100 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 4    MI(MInus)             N
0101 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 5    PL(PLus)             ~N
0110 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 6    VS(oVerflow Set)      V
0111 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 7    VC(oVerflow Clear)   ~V

1000 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 8    HI(HIgher)            C and ~Z
1001 xxxx xxxxxxxx xxxxxxxx xxxxxxxx 9    LS(Lower or Same)    ~C or   Z
1010 xxxx xxxxxxxx xxxxxxxx xxxxxxxx A    GE(Greater or equal)  N =  V
1011 xxxx xxxxxxxx xxxxxxxx xxxxxxxx B    LT(Less Than)         N = ~V

1100 xxxx xxxxxxxx xxxxxxxx xxxxxxxx C    GT(Greater Than)     (N =  V) and ~Z
1101 xxxx xxxxxxxx xxxxxxxx xxxxxxxx D    LE(Less or equal)    (N = ~V) or   Z
1110 xxxx xxxxxxxx xxxxxxxx xxxxxxxx E    AL(Always)            True
1111 xxxx xxxxxxxx xxxxxxxx xxxxxxxx F    NV(Never)             False
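
The condition table above maps directly onto a switch statement. The sketch below assumes the flags are read from bits 31-28 of R15 as described under Registers; the function name is chosen for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the instruction's condition field (bits 31..28)
 * passes for the N, Z, C, V flags held in bits 31..28 of r15. */
bool condition_passed(uint32_t instr, uint32_t r15)
{
    bool n = (r15 >> 31) & 1;
    bool z = (r15 >> 30) & 1;
    bool c = (r15 >> 29) & 1;
    bool v = (r15 >> 28) & 1;

    switch (instr >> 28) {
    case 0x0: return z;              /* EQ */
    case 0x1: return !z;             /* NE */
    case 0x2: return c;              /* CS / HS */
    case 0x3: return !c;             /* CC / LO */
    case 0x4: return n;              /* MI */
    case 0x5: return !n;             /* PL */
    case 0x6: return v;              /* VS */
    case 0x7: return !v;             /* VC */
    case 0x8: return c && !z;        /* HI */
    case 0x9: return !c || z;        /* LS */
    case 0xA: return n == v;         /* GE */
    case 0xB: return n != v;         /* LT */
    case 0xC: return n == v && !z;   /* GT */
    case 0xD: return n != v || z;    /* LE */
    case 0xE: return true;           /* AL */
    default:  return false;          /* NV */
    }
}
```

In hardware the same table is a small block of combinational logic, so the check can run in parallel with instruction decode.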

(adds|adcs|subs|sbcs|rsbs|rscs) rd,rn,op2
C = (rn AND op2).[31] OR (if (rn OR op2).[31] > rd.[31] then 1 else 0) // additions; for subtractions use NOT of the subtracted operand
Vadd = if (rn.[31] = op2.[31]) && (rd.[31] <> op2.[31]) then 1 else 0 // pos + pos = neg | neg + neg = pos
Vsub = if (rn.[31] <> op2.[31]) && (rd.[31] = op2.[31]) then 1 else 0 // neg - pos = pos | pos - neg = neg
N = rd.[31]
Z = rd = 0
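
The flag rules above can be written as a small C model. This is a sketch, not the emulator's actual code: the struct and function names are invented here, and subtraction is modelled as rn + NOT op2 + C, which gives the ARM convention of C = 1 meaning "no borrow".

```c
#include <stdint.h>

/* Result of one ALU operation together with the four flags. */
typedef struct {
    uint32_t result;
    unsigned n, z, c, v;
} alu_result;

/* ADD/ADC: pass carry_in = 0 for ADD, the C flag for ADC. */
alu_result alu_add(uint32_t a, uint32_t b, unsigned carry_in)
{
    alu_result r;
    r.result = a + b + carry_in;
    r.n = r.result >> 31;
    r.z = (r.result == 0);
    /* Carry out of bit 31: both top bits set, or one set and result bit clear. */
    r.c = (((a & b) | ((a | b) & ~r.result)) >> 31) & 1;
    /* Overflow: operands have the same sign and the result sign differs. */
    r.v = ((~(a ^ b) & (a ^ r.result)) >> 31) & 1;
    return r;
}

/* SUB/SBC: a - b - (1 - C) equals a + ~b + C.
 * Pass carry_in = 1 for SUB, the C flag for SBC.
 * For RSB/RSC swap the two operands. */
alu_result alu_sub(uint32_t a, uint32_t b, unsigned carry_in)
{
    return alu_add(a, ~b, carry_in);
}
```

Reusing the adder for subtraction mirrors how the ALU is usually built in hardware, where the second operand is inverted and the carry input supplies the "+1".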

(ands|eors|tsts|teqs|orrs|movs|bics|mvns)
  C = last bit shifted out of the barrel shifter or preserved if LSL #0
  V = unaffected
  N = rd.[31]
  Z = rd = 0
 
Some unknown masking if R15 is used as source or destination

Data Processing Instructions

cccc 000 oooo s nnnn dddd aaaaa ttk mmmm  Register form 1  ADD Rd, Rn, Rm, LSL #a
cccc 000 oooo s nnnn dddd aaaa0 ttk mmmm  Register form 2  ADD Rd, Rn, Rm, LSL Ra
cccc 000 1101 s 0000 dddd aaaaa ttk mmmm  Register form    MOV Rd, Rm, LSL #a ( Rn = 0 for MOV/MVN instructions )
cccc 001 oooo s nnnn dddd rrrr bbbbbbbb   Immediate form   ADD Rd, Rn, #bbbbbbbb ROR #rrrr0
cccc 001 1010 1 nnnn 0000 rrrr bbbbbbbb   Immediate form   CMP Rn, #0 ( Rd = 0 and s = 1 for CMP/CMN/TST/TEQ instructions )
cccc 001 1010 1 nnnn 1111 rrrr bbbbbbbb   Immediate form   CMPP Rn, #0 ( Update condition codes directly - TSTP/TEQP/CMNP/CMPP )
 
oooo  Name Meaning              Operation                Condition codes
0000  AND  Boolean And          Rd = Rn AND Op2
0001  EOR  Boolean Eor          Rd = Rn EOR Op2
0010  SUB  Subtract             Rd = Rn - Op2
0011  RSB  Reverse Subtract     Rd = Op2 - Rn
0100  ADD  Addition             Rd = Rn + Op2            N and Z from Rd, C and V from the ALU.
0101  ADC  Add with Carry       Rd = Rn + Op2 + C
0110  SBC  Subtract with carry  Rd = Rn - Op2 - (1 - C)
0111  RSC  Reverse sub w/carry  Rd = Op2 - Rn - (1 - C)
1000  TST  Test bit                  Rn AND Op2
1001  TEQ  Test equality             Rn EOR Op2
1010  CMP  Compare                   Rn - Op2
1011  CMN  Compare Negative          Rn + Op2
1100  ORR  Bitwise Or Register  Rd = Rn OR Op2
1101  MOV  Move value           Rd = Op2                 N and Z from Rd, if the shifter is used, C is set to be the last bit shifted out. 
1110  BIC  Bit clear            Rd = Rn AND NOT Op2
1111  MVN  Move Not             Rd = NOT Op2

ttk
000  LSL #a         Logical Shift Left
001  LSL Ra         Logical Shift Left
010  LSR #a         Logical Shift Right
011  LSR Ra         Logical Shift Right
100  ASR #a         Arithmetic Shift Right 
101  ASR Ra         Arithmetic Shift Right
110  ROR #a         Rotate Right
     ROR #0 -> RRX  Rotate Right one bit with extend ( Carry -> Value -> Carry )
111  ROR Ra         Rotate Right

Only the lower 8 bits of Ra will be used.
LSL by 32 has result zero, carry out equal to bit 0 of Rm.
LSL by more than 32 has result zero, carry out zero.
LSR by 32 has result zero, carry out equal to bit 31 of Rm.
LSR by more than 32 has result zero, carry out zero.
ASR by 32 or more has every result bit, and the carry out, equal to bit 31 of Rm.
ROR by 32 has result equal to Rm, carry out equal to bit 31 of Rm.
ROR by n where n is greater than 32 gives the same result and carry out as ROR by (n AND 31), with a remainder of 0 treated as ROR by 32.
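
The special cases above for shifts by a register amount can be collected in one C function. This is a sketch under two assumptions that the text does not spell out: a shift amount of 0 leaves both the value and the carry untouched, and a ROR amount that is a non-zero multiple of 32 behaves like ROR by 32. The names are made up for this example.

```c
#include <stdint.h>

enum shift_type { SH_LSL = 0, SH_LSR = 1, SH_ASR = 2, SH_ROR = 3 };

/* Shift rm by the low 8 bits of ra. *carry holds the current C flag on
 * entry and the shifter carry out on return. */
uint32_t shift_by_register(enum shift_type type, uint32_t rm,
                           uint32_t ra, unsigned *carry)
{
    unsigned n = ra & 0xFFu;          /* only the lower 8 bits are used */

    if (n == 0)
        return rm;                    /* assumption: value and carry unchanged */

    switch (type) {
    case SH_LSL:
        if (n < 32) { *carry = (rm >> (32 - n)) & 1; return rm << n; }
        *carry = (n == 32) ? (rm & 1) : 0;
        return 0;
    case SH_LSR:
        if (n < 32) { *carry = (rm >> (n - 1)) & 1; return rm >> n; }
        *carry = (n == 32) ? (rm >> 31) : 0;
        return 0;
    case SH_ASR:
        if (n < 32) {
            /* Build the sign fill by hand to avoid shifting a negative int. */
            uint32_t fill = (rm >> 31) ? ~(0xFFFFFFFFu >> n) : 0;
            *carry = (rm >> (n - 1)) & 1;
            return (rm >> n) | fill;
        }
        *carry = rm >> 31;
        return (rm >> 31) ? 0xFFFFFFFFu : 0;
    default: /* SH_ROR */
        n &= 31;
        if (n == 0) { *carry = rm >> 31; return rm; }  /* 32, 64, ... */
        *carry = (rm >> (n - 1)) & 1;
        return (rm >> n) | (rm << (32 - n));
    }
}
```

The immediate forms (LSL #a and friends) need separate handling, because an encoded amount of 0 means LSR #32, ASR #32 or RRX there.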

If Rn = R15 then the value used is R15 with all the PSR bits masked out. 
If Op2 involves R15, then all 32 bits are used.

Branch Instructions

cccc 101L oooooooo oooooooo oooooooo
Destination address = current address + 8 + (4 * sign extended offset)
The top 6 bits of the destination address are cleared. 
If L = 1, then the address of the next instruction is copied into R14 before the branch is taken.
The called function can return with MOV PC,R14, or MOVS PC,R14 to return with the original condition codes.
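
The destination address calculation above can be sketched in C as follows. The function name is hypothetical, and instr_addr is assumed to be the word-aligned address of the branch instruction itself.

```c
#include <stdint.h>

/* Compute the target of a B/BL instruction located at instr_addr. */
uint32_t branch_target(uint32_t instr_addr, uint32_t instr)
{
    uint32_t offset = instr & 0x00FFFFFFu;     /* 24-bit offset field */

    if (offset & 0x00800000u)
        offset |= 0xFF000000u;                 /* sign extend to 32 bits */

    /* current address + 8 + 4 * offset, top 6 bits cleared (26-bit PC) */
    return (instr_addr + 8 + (offset << 2)) & 0x03FFFFFFu;
}
```

For example the instruction 0xEAFFFFFE (offset -2) branches to itself, which is the usual encoding of an endless loop.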

Multiplication

cccc 0000 00ASdddd nnnnssss 1001mmmm
If the S bit is set, the N and Z flags are set on the result, C is undefined, and V is unaffected. 
If the A bit is set, then the effect of the operation is Rd = Rm * Rs + Rn otherwise, Rd = Rm * Rs. 
The destination register shall not be the same as the operand register Rm.
R15 shall not be used as an operand or as the destination register.

Single Data Transfer

cccc 010P UBWLnnnn ddddoooo oooooooo Immediate form
cccc 011P UBWLnnnn ddddaaaa att0mmmm Register form

If L = 1, then a load is performed.
If P = 1, then Pre-indexed addressing is used, otherwise post-indexed addressing is used.
If U = 1, then the offset given is added to the base register - otherwise it is subtracted.
If B = 1, then a byte of memory is transferred, otherwise a word is transferred.

The interpretation of the W bit depends on the addressing mode used:
For pre-indexed addressing, W being set forces the writing back of the final address used for the address translation into the base register.
(i.e. A side effect of the transfer is Rn := Rn +/- offset. This is signified to assemblers by postfixing the instruction with !.)
For post-indexed addressing, the address is always written back, and the bit being set indicates that an address translation should be forced
before the transfer takes place. This is signified to assemblers by postfixing the mnemonic stub with "T".
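
The P, U and W rules above can be summarised in a few lines of C. This sketch only computes the address used for the transfer and the new value of the base register; the names are invented here and the forced translation (T) case is not modelled.

```c
#include <stdint.h>

/* Address used for the transfer and the base register value afterwards. */
typedef struct {
    uint32_t address;
    uint32_t new_base;
} sdt_address;

sdt_address single_transfer_address(uint32_t base, uint32_t offset,
                                    unsigned p, unsigned u, unsigned w)
{
    /* U selects whether the offset is added or subtracted. */
    uint32_t indexed = u ? base + offset : base - offset;
    sdt_address r;

    if (p) {
        /* Pre-indexed: transfer at base +/- offset, write back only if W. */
        r.address  = indexed;
        r.new_base = w ? indexed : base;
    } else {
        /* Post-indexed: transfer at base, write back always. */
        r.address  = base;
        r.new_base = indexed;
    }
    return r;
}
```

So LDR R0,[R1,#4]! uses P=1, U=1, W=1, while LDR R0,[R1],#4 uses P=0, U=1.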

An address translation causes the chip to tell the memory system that this is a user mode transfer, regardless of whether the chip is in
a user mode or a privileged mode at the time. This is useful e.g. when writing emulators: suppose for instance that a user mode program
executes an STF instruction to an area of memory that may not be written by user mode code. If this is executed by an FPA, it will abort.
If it is executed by the FPE, it should also abort. But the FPE runs in a privileged mode, so if it were to use normal stores, they wouldn't abort.
To make aborts work properly, it instead uses normal stores if it was called from a privileged mode, but STRTs if it was called from a user mode.

If the immediate form of the instruction is used, the o field gives a 12-bit offset. If the register form is used, then it is decoded as for the
data processing instructions, with the restriction that shifts by register amounts are not allowed.

If R15 is used as Rd, the PSR is not modified. The PC should not be used in Op2.

Other restrictions:
Don't use writeback or post-indexing when the base register is the PC.
Don't use the PC as Rd for an LDRB or STRB.
When using post-indexing with a register offset, don't make Rn and Rm the same register (doing so makes recovery from aborts impossible).

Unaligned reads rotate the data so that the byte at the address read is in the least significant position in the destination register.
What happens with unaligned writes?
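
The rotation on unaligned reads described above can be modelled like this. It is a sketch for a little-endian memory held as an array of 32-bit words; the function name and the memory model are assumptions made for this example.

```c
#include <stdint.h>

/* Word load from a possibly unaligned address. The aligned word is
 * fetched and rotated right so that the addressed byte ends up in
 * bits 7..0 of the result. */
uint32_t load_word_rotated(const uint32_t *mem_words, uint32_t addr)
{
    uint32_t word = mem_words[addr >> 2];   /* aligned word containing addr */
    unsigned rot  = (addr & 3u) * 8u;       /* 0, 8, 16 or 24 bits */

    return rot ? (word >> rot) | (word << (32 - rot)) : word;
}
```

A byte load can then be built on top of it by masking the result with 0xFF.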

Block Data Transfer

cccc 100P USWLnnnn llllllll llllllll
The U bit indicates whether the address will be modified by +4 (set), or -4 (clear) for each register.
The W bit always indicates writeback.
If set, the L bit indicates a load operation should be performed. If clear, a save.
The P bit is used to indicate whether to increment/decrement the base before or after each load/store (see the table below).
Bit l is set if Rl is to be loaded/stored by this operation.

Stub  Meaning                              P  U
DA    Decrement Rn After each store/load   0  0
DB    Decrement Rn Before each store/load  1  0
IA    Increment Rn After each store/load   0  1
IB    Increment Rn Before each store/load  1  1
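
Because registers are always transferred in low-to-high order, with the lowest register at the lowest address, an emulator can work out the whole address range up front instead of stepping the base register. A sketch in C, with hypothetical names:

```c
#include <stdint.h>

/* Lowest address touched by the transfer and the value written back
 * to the base register when W is set. */
typedef struct {
    uint32_t first;
    uint32_t writeback;
} block_plan;

/* Count the registers in the 16-bit register list. */
static unsigned count_regs(uint32_t list)
{
    unsigned n = 0;
    for (list &= 0xFFFFu; list; list &= list - 1)
        n++;
    return n;
}

block_plan block_transfer_plan(uint32_t base, unsigned p, unsigned u,
                               uint32_t reg_list)
{
    uint32_t bytes = 4u * count_regs(reg_list);
    block_plan plan;

    if (u) {                                   /* IA / IB */
        plan.first     = p ? base + 4 : base;
        plan.writeback = base + bytes;
    } else {                                   /* DA / DB */
        plan.first     = p ? base - bytes : base - bytes + 4;
        plan.writeback = base - bytes;
    }
    return plan;
}
```

The transfer then runs upwards from plan.first in steps of 4, one word per set bit in the list, starting with the lowest-numbered register.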

Synonyms for these exist which are clearer when implementing stacks:
Stub  Meaning
EA    Empty Ascending stack
ED    Empty Descending stack
FA    Full Ascending stack
FD    Full Descending stack

The S bit controls two special functions, both of which are indicated to the assembler by putting "^" at the end of the instruction:
If the S bit is set, the instruction is LDM and R15 is in the register list, then:
- In 26-bit privileged modes, all 32 bits of R15 will be loaded.
- In 26-bit user mode, the 4 flags and 24 PC bits of R15 will be loaded. Bits 27, 26, 1 and 0 of the loaded value will be ignored.
- In 32-bit modes, all 32 bits of R15 will be loaded, though note that the two bottom bits are always zero, so any ones loaded to them will be ignored.
  In addition, the SPSR of the current mode will be transferred to the CPSR; since user mode does not have an SPSR, this type of instruction
  should not be used in 32-bit user mode.

If the S bit is set and either the instruction is STM or R15 is not in the register list, then the user mode registers will be transferred
rather than those for the current mode. This type of instruction should not be used in user mode.

Special cases occur when the base register is used in the list of registers to be transferred.
The base register can always be loaded without any problems. However, don't specify writeback if the base register is being loaded
- you can't end up with both a written-back value and a loaded value in the base register!
The base register can be stored with no complications as long as writeback is not used.
Storing a list of registers including the base register using writeback will write the value of the base register before writeback to memory
only if the base register is the first in the list. Otherwise, the value which is used is not defined.

Further special cases occur if the program counter is present in the list of registers to load and save.
The PSR is always saved with the PC (in 26 bit modes) (and the PC will always be 12 bytes further on, rather than the usual 8 (in all modes)).
On a load, only the bits of the PSR that are alterable in the current mode can be affected, and then only if the S bit is set.

The PC should not be used as the base register.

Software interrupt

cccc 1111 yyyyyyyy yyyyyyyy yyyyyyyy
SWI #y

Single Data Swap

cccc 0001 0B00nnnn dddd0000 1001mmmm
SWP Rd, Rm, [Rn]

Co-processor instructions

The exact meaning of these instructions depends on the particular co-processor in use. The only part which is obligatory is that pppp must be the coprocessor number. Defined co-processors:

1 Floating Point unit
2 Floating Point unit
15 Cache Controller

Co-processor data operations

cccc 1110 oooonnnn ddddpppp qqq0mmmm
CDP p, o, CRd, CRn, CRm, q
CDP p, o, CRd, CRn, CRm

Co-processor data transfer and register transfers

 cccc 110P UNWLnnnn DDDDpppp oooooooo LDC/STC
 cccc 1110 oooLNNNN ddddpppp qqq1MMMM MRC/MCR

Floating-point

Programmer's model

32 bit (S)     -  sign(1) exponent(8) fraction(23)
64 bit (D)     -  sign(1) exponent(11) fraction(52)
80 bit (E)     -  sign(1) zeros(16) exponent(15) J(1) fraction(63)
Packed decimal - 0xseeeeddddddddddddddddddd
                 value = +/- d * 10 ^ (+/- e)
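
As an illustration of the formats above, the 32 bit (S) format can be split into its fields like this. The sketch assumes the standard IEEE 754 single layout of 1 sign bit, 8 exponent bits and 23 fraction bits; the type and function names are made up for this example.

```c
#include <stdint.h>

/* The three fields of a single precision value. */
typedef struct {
    unsigned sign;       /* 1 bit                      */
    unsigned exponent;   /* 8 bits, biased by 127      */
    uint32_t fraction;   /* 23 bits, implicit leading 1 not stored */
} fp_single_fields;

fp_single_fields split_single(uint32_t bits)
{
    fp_single_fields f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x007FFFFFu;
    return f;
}
```

For 1.0 (0x3F800000) this gives sign 0, exponent 127 and fraction 0.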

Floating-point status register

This register (FPSR) contains the IEEE flags; the result flags are only available after a comparison.
The flags can only be cleared by the WFS instruction.

bit Cumulative flags
0   IVO - Invalid operation
1   DVZ - Division by zero
2   OFL - Overflow
3   UFL - Underflow
4   INX - Inexact

bit Interrupt masks
16  IVO
17  DVZ
18  OFL
19  UFL
20  INX

Rounding

Nearest (default)
+infinity (P)
-infinity (M)
Zero (Z)

Instructions

Documentation

Hardware realisation

Add a 6 bit register for bits 26-31 of the program counter to be able to place code at any address.

Pipeline

Optimization

Ideas for speeding up the CPU

Read buffer

The last 64 bit word read from memory can be held in a register to speed up consecutive 32 bit and 8 bit reads.
- Must be synced with or invalidated by writes.
- A simple bitmask may be used to enable the buffer for different memory regions.
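
The read buffer idea above can be tried out in the emulator before it goes into hardware. The following C model is only a sketch: main memory is assumed to be an array of little-endian 64-bit words, all names are invented here, and the per-region enable bitmask is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-entry read buffer holding the last 64-bit word fetched. */
typedef struct {
    bool     valid;
    uint32_t tag;       /* address of the buffered word divided by 8 */
    uint64_t data;
    unsigned misses;    /* number of real memory fetches, for statistics */
} read_buffer;

/* 32-bit read through the buffer; addr is assumed word aligned. */
uint32_t rb_read32(read_buffer *rb, const uint64_t *mem, uint32_t addr)
{
    uint32_t tag = addr >> 3;

    if (!rb->valid || rb->tag != tag) {   /* miss: fetch from memory */
        rb->data  = mem[tag];
        rb->tag   = tag;
        rb->valid = true;
        rb->misses++;
    }
    /* Select the low or high half of the 64-bit word. */
    return (uint32_t)(rb->data >> ((addr & 4u) * 8u));
}

/* Call on every store: invalidate the buffer if it holds that word. */
void rb_write_notify(read_buffer *rb, uint32_t addr)
{
    if (rb->valid && rb->tag == (addr >> 3))
        rb->valid = false;
}
```

Counting misses while running real code gives a rough idea of how many memory accesses the buffer would save.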

Write buffer

A write buffer stops memory store instructions from stalling the CPU until the store is complete.
- Several consecutive 8 or 32 bit writes can be merged into 64 bit writes to maximize bandwidth to main memory.
- Memory reads must check the write buffer.
- The write buffer may need a timeout so the last writes don't stay in the buffer for a long time.

Two instructions per clock

It may be possible to run two data instructions per clock without a significant increase in complexity or decrease in frequency.
- We start the two instructions at the same time.
- While they're running, check rules and disable the write gate for one of them if they're not eligible to run concurrently.
- Use CCs to find instructions that will not run.
- CCs can be used even when there's a dependency between the two concurrent instructions.

Hiding latency

Memory instructions may have significant latency
- Continue executing instructions until there is a blocking dependency
- The registers may have an extra bit to signal if the content is ready
- If the content is not ready the ready bit will stretch the clock for the ALU

Software

Visual emulator (32 bit, not 26 bit) - http://salmanarif.bitbucket.org/visual/index.html
Disassembler - https://onlinedisassembler.com/odaweb/
Assembler - http://sciencezero.org/upload1/armassembler.zip

ARM CPU Project
