Introduction

Most assembly tutorials focus on how to write assembly programs from scratch, often detailing how to call into the C runtime or Windows APIs. We have long since passed the time when it made sense to write software completely in assembly. Many programs spend more than 99% of their time running less than 1% of the code. It is that 1% of the code that should be a candidate for assembly. Your time is too valuable to write the other 99% in assembly as well.

With that background in mind, this brief tutorial focuses on how to call out to x64 assembly routines from Visual C++. It assumes that you already have an x64 project and are looking to replace one or more functions with assembly versions. It is geared towards someone who may already have written x86 inline assembly in Visual C++, since Microsoft does not support inline assembly in x64.

x64 Assembly overview

This section is a quick overview of the differences between x86 and x64.

x64 registers

x64 provides new registers:

8 general purpose registers: r8 - r15
8 128-bit XMM registers: xmm8 - xmm15

In addition, the existing registers from the x86 architecture, rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp and rip, have been extended from 32 to 64 bits. The 64-bit form has an "r" prefix. Old registers can still be accessed in their smaller bit ranges, for instance:

Bits	64	32	16	8 high	8 low
Name	rax	eax	ax	ah	al

The new registers can be accessed like this:

Bits	64	32	16	8 low
Name	r8	r8d	r8w	r8b
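As a rough illustration of how the narrower names relate to the full register, the following C++ sketch models the views with plain integer truncation. The names RegisterViews and split_register are made up for this example; they are not part of any toolchain.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the nested views of one 64-bit register.
struct RegisterViews {
    std::uint64_t r64;    // rax / r8
    std::uint32_t r32;    // eax / r8d  (low 32 bits)
    std::uint16_t r16;    // ax  / r8w  (low 16 bits)
    std::uint8_t  high8;  // ah         (bits 8-15; r8 - r15 have no such name)
    std::uint8_t  low8;   // al  / r8b  (low 8 bits)
};

// Split a 64-bit value the way the register names slice it.
inline RegisterViews split_register(std::uint64_t value) {
    RegisterViews v;
    v.r64   = value;
    v.r32   = static_cast<std::uint32_t>(value);
    v.r16   = static_cast<std::uint16_t>(value);
    v.high8 = static_cast<std::uint8_t>(value >> 8);
    v.low8  = static_cast<std::uint8_t>(value);
    return v;
}
```

For the value 0x1122334455667788, the 32-bit view is 0x55667788, the 16-bit view is 0x7788, the high byte is 0x77 and the low byte is 0x88.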

Applications can still use segment registers as a base for addressing, but 64-bit mode only recognizes 3 of them: cs, fs and gs. Of these, only fs and gs can be used for base address calculations.

The 32-bit eip register becomes the rip register.

Instructions that are no longer available

In order to make room for the new instruction encodings, some instructions have been removed:

Binary-coded decimal arithmetic instructions: AAA, AAD, AAM, AAS, DAA, DAS
BOUND
PUSHAD and POPAD
Most operations that dealt with segment registers, such as PUSH DS and POP DS. (Operations that use the FS or GS segment registers are still valid.)

New instructions

There are new instructions to support various 64 bit operations.

Data Transfer

Variants of the MOV instruction handle 64-bit immediate constants or memory addresses.

MOV	r,#n	r = #n
MOV	rax, m	Move contents at 64-bit address to rax
MOV	m, rax	Move contents of rax to 64-bit address

A new instruction sign-extends 32-bit operands to 64 bits.

MOVSXD	r1, r/m	Move DWORD with sign extension to QWORD

Ordinary MOV operations into 32-bit subregisters automatically zero extend to 64 bits, so there is no MOVZXD instruction.
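The difference between the two behaviours is easy to lose track of, so here is a small C++ model of it. The function names movsxd_model and mov32_model are invented for this sketch; they only imitate what the instructions do to the destination register.

```cpp
#include <cassert>
#include <cstdint>

// Models MOVSXD r64, r/m32: the sign bit of the 32-bit source is
// replicated into the upper 32 bits of the destination.
inline std::uint64_t movsxd_model(std::uint32_t src) {
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(static_cast<std::int32_t>(src)));
}

// Models a MOV into a 32-bit subregister (for example mov eax, src):
// whatever was in the upper 32 bits of the full register is cleared.
inline std::uint64_t mov32_model(std::uint64_t old_register_value,
                                 std::uint32_t src) {
    (void)old_register_value;  // the previous contents do not survive
    return static_cast<std::uint64_t>(src);
}
```

With a source of 0xFFFFFFFF, the sign-extending model yields 0xFFFFFFFFFFFFFFFF, while the plain 32-bit move yields 0x00000000FFFFFFFF regardless of what the register held before.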

Two SSE2 instructions move 128-bit values (such as GUIDs) from memory to an XMM register and vice versa.

MOVDQA	r1/m, r2/m	Move 128-bit aligned value to xmmn register, or vice versa
MOVDQU	r1/m, r2/m	Move 128-bit value (not necessarily aligned) to register, or vice versa

Data Conversion

CDQE	Convert dword (eax) to qword (rax).
CQO	Convert qword (rax) to oword (rdx:rax).

String Manipulation

MOVSQ	Move qword from the address in rsi to the address in rdi.
CMPSQ	Compare qword at rsi with qword at rdi.
SCASQ	Scan qword at rdi. Compares qword at rdi to rax.
LODSQ	Load qword from rsi into rax.
STOSQ	Store qword to rdi from rax.

Other differences

Absolute 32-bit addresses in x86 become 32-bit offsets in x64. This allows instructions to remain the same size but limits jumps to at most 2GB away from the current instruction.

In 64-bit mode, a new form of effective addressing is available to make it easier to write position independent code. Any memory reference may be made rip relative (rip is the instruction pointer register, which contains the address of the location immediately following the current instruction).
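To make the arithmetic concrete, the sketch below computes the address that a rip-relative operand refers to. The function name and its parameters are assumptions made for this example; the rule it encodes is the one described above: the signed 32-bit displacement is added to the address of the next instruction.

```cpp
#include <cassert>
#include <cstdint>

// target = (address of the current instruction + its length) + disp32
inline std::uint64_t rip_relative_target(std::uint64_t instruction_address,
                                         std::uint64_t instruction_length,
                                         std::int32_t displacement) {
    // rip already points past the current instruction when it is used.
    const std::uint64_t next_rip = instruction_address + instruction_length;
    // Sign-extend the displacement to 64 bits; unsigned wraparound then
    // gives the correct result for negative displacements too.
    return next_rip +
           static_cast<std::uint64_t>(static_cast<std::int64_t>(displacement));
}
```

For a 7-byte instruction at 0x140001000 with a displacement of 0x100, the operand lives at 0x140001107. Because the displacement is a signed 32-bit value, the reachable window is roughly 2GB on either side of the instruction.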

The x64 calling convention

The x64 calling convention is also referred to as the x64 ABI (Application Binary Interface). A calling convention describes the interface between a caller and a function:

The order in which parameters are allocated
Where parameters are placed (pushed on the stack or placed in registers)
Which registers may be used by the function
How the stack gets unwound on return

The x64 calling convention is very similar to x86 fastcall. It uses a combination of registers and stack to pass parameters to the function.

Integer, pointer and reference parameters

All arguments are right justified in registers. This is done so the callee can ignore the upper bits of the register if need be and can access only the portion of the register necessary.
All stack parameters are 8 byte aligned.
Any parameter that's not 1, 2, 4, or 8 bytes (including structs) is passed by reference.
Structs and unions of 8, 16, 32, or 64 bits are passed as if they were integers of the same size.

The first 4 integer parameters are passed (in left to right order) in rcx, rdx, r8, r9

Further integer parameters are passed on the stack by pushing them in right to left order (parameters to the left at lower addresses).

Though not detailed here, it is also possible to write assembly functions that are members of a class. In that case, the this pointer is passed in rcx.

Floating point (FP) parameters

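The first 4 FP parameters are passed (in left to right order) in xmm0 through xmm3. Register slots are assigned by position and are shared with the integer registers: an FP parameter in the third position goes in xmm2, and at most 4 parameters in total are passed in registers.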
Further FP parameters are passed on the stack by pushing them in right to left order (parameters to the left at lower addresses)
The x87 register stack is unused.
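The rules for integer and FP parameters can be sketched as a small lookup. This is an illustrative model only: ParamKind and parameter_locations are hypothetical names, every parameter is assumed to fit in one 8-byte slot, and stack offsets are given as the caller sees them just before the call instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

enum class ParamKind { Integer, Float };  // hypothetical classification

// Returns where each parameter travels, by position.
inline std::vector<std::string> parameter_locations(
        const std::vector<ParamKind>& params) {
    static const char* const int_regs[4] = {"rcx", "rdx", "r8", "r9"};
    static const char* const fp_regs[4]  = {"xmm0", "xmm1", "xmm2", "xmm3"};
    std::vector<std::string> locations;
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (i < 4) {
            // The slot number picks the register; the kind picks the bank.
            locations.push_back(params[i] == ParamKind::Float ? fp_regs[i]
                                                              : int_regs[i]);
        } else {
            // Fifth and later parameters sit above the 32-byte shadow area.
            char buffer[32];
            std::snprintf(buffer, sizeof buffer, "[rsp+0x%zx]", i * 8);
            locations.push_back(buffer);
        }
    }
    return locations;
}
```

For a signature with five integer or pointer parameters followed by two floats, the model places the first four in rcx, rdx, r8 and r9 and the remaining three at [rsp+0x20], [rsp+0x28] and [rsp+0x30], so the two floats travel on the stack rather than in XMM registers.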

Return value

Integer, pointer or reference type is returned in rax
FP type is returned in xmm0

Volatile and non-volatile

Function must preserve: rbx, rbp, rdi, rsi, r12, r13, r14, r15, xmm6 - xmm15 and the x87 register stack.
Function may destroy: rax, rcx, rdx, r8, r9, r10, and r11 and xmm0 - xmm5

Stack

At a minimum, a caller must reserve 32 bytes (4 64-bit values) on the stack. This space allows registers passed into the function to be easily copied ("spilled") to a well-known stack location. The function isn't required to spill the register parameters to the stack, but the stack space reservation ensures that it can if needed.

Stack cleanup

The caller is responsible for cleaning up the stack. Typically, the caller will reserve enough stack space for the function that requires the most stack space and just adjust positioning within that stack space to fit all functions that it's calling.
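As a back-of-the-envelope sketch of that reservation, the function below sizes the outgoing argument area for a caller from the parameter counts of the functions it calls. The name outgoing_area_bytes is invented here, every parameter is assumed to occupy one 8-byte slot, the caller is assumed to make at least one call, and any extra padding needed for 16-byte alignment is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes a caller reserves once, up front, to cover every call it makes.
inline std::size_t outgoing_area_bytes(
        const std::vector<std::size_t>& callee_param_counts) {
    // Even a callee with fewer than four parameters gets the full
    // 32-byte (4 slot) shadow area.
    std::size_t max_slots = 4;
    for (std::size_t count : callee_param_counts) {
        max_slots = std::max(max_slots, count);
    }
    return max_slots * 8;
}
```

A caller that calls functions taking 1, 5 and 7 parameters would reserve 56 bytes under this model, and a caller whose callees all take four or fewer parameters would reserve 32.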

Leaf or frame function

An assembly function can be a leaf or a frame function. Leaf functions don't need to support the stack unwinding process that is part of exception handling and are easier to write than frame functions. But leaf functions have limitations:

Cannot call out to other functions
Cannot change any non-volatile registers
Cannot change the stack pointer

Frame functions, on the other hand, allow non-volatile registers to be changed and having access to more registers can help you write faster code. However, there is a cost in complexity as a frame function must handle the following tasks:

Use a defined prolog to establish an area on the stack called a "stack frame".
Have one or more defined epilogs that free any allocated stack space and restore non-volatile registers before returning to the calling function.
Save register parameters in their shadow locations
Save any non-volatile registers that they use
Allocate stack space for local variables
Establish a register as a stack frame pointer

A frame function can have a fixed amount of stack space or it can allocate stack space dynamically.

If a frame function allocates a fixed amount of stack space, it must maintain 16-byte alignment of the stack pointer in the body of the function (outside the prolog and epilog).

A frame function that dynamically allocates stack space must first allocate any fixed stack space that it needs and then allocate and set up a register for indexed access to this area. The lower base address of this area must be 16-byte aligned and the register must be provided irrespective of whether the function itself makes explicit use of it. The function is then free to leave the stack unaligned during execution although it must re-establish the 16-byte alignment if or when it calls other functions.
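The alignment requirement comes down to a couple of bit operations. The helpers below are a minimal sketch with made-up names; they show the usual way to test for 16-byte alignment, to round a stack pointer down to an aligned address, and to round an allocation size up to a multiple of 16.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kStackAlignment = 16;

// True when the stack pointer value is a multiple of 16.
constexpr bool is_stack_aligned(std::uint64_t rsp) {
    return (rsp % kStackAlignment) == 0;
}

// The stack grows towards lower addresses, so aligning a stack pointer
// means rounding it down (the assembly equivalent is: and rsp, -16).
constexpr std::uint64_t align_stack_down(std::uint64_t rsp) {
    return rsp & ~(kStackAlignment - 1);
}

// Rounds a requested allocation size up to the next multiple of 16.
constexpr std::uint64_t round_up_allocation(std::uint64_t bytes) {
    return (bytes + kStackAlignment - 1) & ~(kStackAlignment - 1);
}
```

For example, a request for 0x28 bytes rounds up to 0x30, and a stack pointer of 0x100F rounds down to 0x1000.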

Leaf function example

Let's briefly look at a leaf function example (we do a more complete frame function example below). Consider the call

int r = calc (1, 2, 3, 4, 5);

To call this function, the compiler will generate something like

mov         dword ptr [rsp+20h],5
mov         r9d,4
mov         r8d,3
mov         edx,2
mov         ecx,1
call        calc

Arguments 1, 2, 3 and 4 are stored in ecx, edx, r8d and r9d. Argument 5 is stored on the stack at rsp + 20h. 20h is added because the calling convention specifies that space must be reserved on the stack for storing ("spilling") the arguments passed in registers. 4 64 bit registers take up 20h (32) bytes.

To return the sum of all the arguments, we simply do

... code goes here
ret
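The assembly body is left open above. One way to check whatever you write there is to keep a plain C++ reference next to it and compare results. The sketch below is such a reference, not the assembly routine itself; the name calc_reference is hypothetical, and the comments note where each argument is found on entry to the assembly version.

```cpp
#include <cassert>

// C++ reference for the five-argument calc: returns the sum of its arguments.
// On entry to the assembly version:
//   a is in ecx, b in edx, c in r8d, d in r9d
//   e is at [rsp+28h]: 20h at the call site, plus 8 bytes for the
//   return address pushed by the call instruction
inline int calc_reference(int a, int b, int c, int d, int e) {
    return a + b + c + d + e;
}
```

For the call shown above, both the reference and a correct assembly routine should return 15.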

Frame function example

Let's go through a complete example on setting up a frame function, including setting up Visual Studio for assembly.

Setting up Visual Studio for assembly

We will be using the Yasm assembler.

Adding .asm

Start by adding an .asm file to your project:

Right click your project in Solution Explorer
Add | New Item
Type a filename that ends with .asm
Enter the properties of the .asm file and select build with Yasm as a Custom Build Step

Prepare the function declaration

Say you have this function that you want to convert to assembly:

int calc (int a, int b, int c, char d, char* e, float fa, float fb);

To support function overloading, the C++ compiler by default does name mangling. Instead of writing our assembly function with a mangled name, we turn off name mangling by telling the C++ compiler that we're calling out to C by adding extern "C" to the function declaration:

extern "C" {
        int calc (int a, int b, int c, char d, char* e, float fa, float fb);
}
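Before the assembly version exists, it can help to keep a C++ stand-in behind the same extern "C" declaration so the rest of the project keeps building. The body below is only a sketch of that idea: it assumes calc should return the sum of its arguments (with the floats truncated to int and e dereferenced), and it has to be removed or commented out once the assembly routine is linked in.

```cpp
#include <cassert>

extern "C" {
        int calc (int a, int b, int c, char d, char* e, float fa, float fb);
}

// Stand-in C++ body with C linkage, so the symbol is exported as plain
// "calc" with no C++ name mangling. Assumed behaviour: sum of all arguments.
extern "C" int calc (int a, int b, int c, char d, char* e, float fa, float fb) {
    return a + b + c + d + *e + static_cast<int>(fa) + static_cast<int>(fb);
}
```

Called as calc (1, 2, 3, 4, &e, 1.0f, 2.0f) with e holding 5, this stand-in returns 18.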

Get the generated call

It is easier to write your assembly function when you have the calling assembly code in front of you. So as a first step, set up a call to the function that you will be coding in assembly, using easily identifiable values. For our function, that might be

char e (5);
int r = calc (1, 2, 3, 4, &e, 1.0, 2.0);

Compile your old code in debug mode, set a breakpoint on the call and start your program. When the breakpoint is hit, turn the assembly debug window on. For the call above, you will find something like

mov         byte ptr [rsp+44h],5
movss       xmm0,dword ptr [__real@40000000]
movss       dword ptr [rsp+30h],xmm0
movss       xmm0,dword ptr [__real@3f800000]
movss       dword ptr [rsp+28h],xmm0
lea         rax,[rsp+44h]
mov         qword ptr [rsp+20h],rax
mov         r9b,4
mov         r8d,3
mov         edx,2
mov         ecx,1
call        calc

You can now examine the code to see where all the arguments end up.

Write the function in assembly

We're now ready to write the function in assembly. First comment out the C++ version of the function (or you will get "multiple defined symbols" errors when linking).

The object file format we want to create is called PE32+. This is handled automatically in the Yasm Custom Build Rule.

Open the .asm file and add your function. Below is a complete example .asm file for Yasm that returns a sum of all the arguments in our example call. (For easier cut and paste, the routine is given again without breaks below).

A frame function needs a block of structured exception handling data separate from the function itself. PROC_FRAME generates a function table entry in .pdata and unwind information in .xdata for a function’s data.

PROC_FRAME      calc

Functions can be hot-patchable. Hot-patching means patching a running process by injecting substitute function(s) at run-time. All that is needed to enable hot-patching for a function is to leave room for a 2-byte jump instruction at the top of it. Because our function starts with "push rbp", a 1-byte instruction, we prefix it with 1 byte that does nothing, to make room for a 2-byte jump instruction. The byte we use is a "REX prefix". The REX prefix 0x48 selects 64-bit operand size for the instruction that follows; push rbp already operates on the full 64-bit register, so the prefix changes nothing here.

db          0x48            ; emit a REX prefix to enable hot-patching

Now we start the standardized prolog section of our function. The prolog needs to have an exactly matching block that contains unwind data for the function. So for each actual instruction in the prolog, we include an assembler directive that automatically generates the corresponding unwind data in the block.

rbp is a non-volatile register, so we must preserve it if we intend to use it. We will use rbp as our frame pointer, so we save it by pushing it on the stack. Establishing a frame pointer in this way enables us to change the stack pointer as much as we like in the body of the function.

push        rbp             ; save prospective frame pointer
[pushreg    rbp]            ; create unwind data for this rbp register push

We want to have some stack space available for ourselves, so we allocate some room on the stack by adjusting the stack pointer. We then generate unwind data for the allocation.

sub         rsp,0x40        ; allocate stack space
[allocstack 0x40]           ; create unwind data for this stack allocation

We then establish rbp as our frame pointer with a bias of 32. Using a bias enables us to address as many locations as possible with a 1-byte offset.

lea         rbp,[rsp+0x20]  ; assign the frame pointer with a bias of 32
[setframe   rbp,0x20]       ; create unwind data for a frame register in rbp
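The bias makes it easy to mix up rbp-relative and rsp-relative offsets, which matters because the unwind directives take offsets from the base of the frame (rsp at the end of the prolog). The helpers below are a small sketch of that conversion, using invented names, together with a check for whether a displacement fits in a signed byte.

```cpp
#include <cassert>
#include <cstdint>

// Bias chosen by: lea rbp,[rsp+0x20]
constexpr std::int64_t kFrameBias = 0x20;

// Converts a displacement from rbp into an offset from the frame base.
constexpr std::int64_t rsp_offset_from_rbp(std::int64_t rbp_displacement) {
    return rbp_displacement + kFrameBias;
}

// A 1-byte displacement covers -128 through +127.
constexpr bool fits_in_disp8(std::int64_t displacement) {
    return displacement >= -128 && displacement <= 127;
}
```

With this bias, [rbp] is frame offset 0x20, [rbp+0x18] is frame offset 0x38, and [rbp-0x10] is frame offset 0x10. A location at frame offset 0x98 is out of reach of a 1-byte displacement from rsp, but it is [rbp+0x78] from the biased frame pointer, which fits.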

The final step in the prolog is to save all non-volatile registers that we intend to use.

movdqa      [rbp],xmm7      ; save a non-volatile XMM register
[savexmm128 xmm7, 0x20]     ; create unwind data for an XMM register save

mov         [rbp+0x18],rsi  ; save rsi
[savereg    rsi,0x38]       ; create unwind data for a save of rsi

mov         [rsp+0x10],rdi  ; save rdi
[savereg    rdi, 0x10]      ; create unwind data for a save of rdi

Signal the end of the prolog.

[endprolog]

We can now write the body of our function. Because we established a frame pointer, we are free to change the stack pointer as we like.

... function goes here.

We're done with the body of the function and need to clean up by restoring all non-volatile registers that we changed. First, we restore the registers that weren't saved with a push.

movdqa      xmm7,[rbp]      ; restore the registers that weren't saved
mov         rsi,[rbp+0x18]  ; with a push; this is not part of the
mov         rdi,[rbp-0x10]  ; official epilog

Now we restore the registers that we saved by pushing them on the stack. This is the official epilog.

lea         rsp,[rbp-0x20]  ; This is the official epilog
add         rsp,0x40        ; Correction 7/11/2015  
                            ; This instruction is missing from the code here: https://msdn.microsoft.com/en-us/library/ms235231.aspx
                            ; but is present in what looks like the correct epilog: https://msdn.microsoft.com/en-us/library/tawsa7cb.aspx
pop         rbp
ret
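A quick way to convince yourself that the prolog and epilog balance is to replay the stack pointer arithmetic. The C++ sketch below does just that for the instructions used in this example; StackModel and run_prolog_and_epilog are hypothetical names, and only rsp and rbp are modelled, not memory contents.

```cpp
#include <cassert>
#include <cstdint>

struct StackModel {
    std::uint64_t rsp;
    std::uint64_t rbp;
};

// Replays the prolog, an optional dynamic allocation in the body, and the
// epilog. Returns rsp as it stands just before ret.
inline std::uint64_t run_prolog_and_epilog(std::uint64_t entry_rsp,
                                           std::uint64_t dynamic_bytes) {
    StackModel s{entry_rsp, 0};
    s.rsp -= 8;               // push rbp
    s.rsp -= 0x40;            // sub  rsp,0x40
    s.rbp = s.rsp + 0x20;     // lea  rbp,[rsp+0x20]
    s.rsp -= dynamic_bytes;   // body: free to move rsp, rbp stays fixed
    s.rsp = s.rbp - 0x20;     // lea  rsp,[rbp-0x20]  (drops dynamic space)
    s.rsp += 0x40;            // add  rsp,0x40        (drops fixed space)
    s.rsp += 8;               // pop  rbp
    return s.rsp;             // must equal entry_rsp for ret to work
}
```

Whatever amount the body subtracts from rsp, the model ends with rsp back at its entry value, because the epilog recomputes rsp from the frame pointer rather than from the current stack pointer.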

Mark the end of the frame function (declared with PROC_FRAME).

ENDPROC_FRAME

Debugging

You can step into your assembly function during debugging by turning on the disassembly window, setting a breakpoint on the call to your assembly function in C++ and stepping from there.

Detailed reference

Many details have been omitted in this tutorial. Take a look at these references for more information.

Microsoft MSDN: x64 Software Conventions

win64 Structured Exception Handling

x64 ABI vs. x86 ABI

What does "Hot Patchability" mean and what is it for?
