So how does a syscall actually happen? If you look up the declaration of write, you will see the following function signature:
|
/* Write N bytes of BUF to FD. Return the number written, or -1.
This function is a cancellation point and therefore not marked with
__THROW. */
extern ssize_t write (int __fd, const void *__buf, size_t __n) __wur
__attr_access ((__read_only__, 2, 3));
|
But we still don’t know exactly what happens when we call it. All we know is that it’s like a normal C function: we call it, and everything gets done.
Since the source code of libc is open to everyone, we can look up how write is implemented. But before doing that, let’s use gdb to find out what actually happens.
|
// write.c
#include <unistd.h>
int main() {
    write(1, "Hello, World!\n", 14);
    return 0;
}
|
After compiling this simple code with gcc -O1 -g -o write write.c, we get an executable named write. All it does is print the string “Hello, World!”.
|
$ ./write
Hello, World!
|
Now we’ll run the executable again, this time under gdb.
|
$ gdb write
...(Messages printed at start)
(gdb)
|
First, set a breakpoint at the main symbol so that execution stops when main is entered. Then enable the disassemble-next-line option and run the program.
|
(gdb) b main
Breakpoint 1 at 0x1139: file write.c, line 3.
(gdb) set disassemble-next-line on
(gdb) show disassemble-next-line
Debugger's willingness to use disassemble-next-line is on.
(gdb) r
Starting program: ./write
Breakpoint 1, main () at write.c:3
3 int main() {
=> 0x0000555555555139 <main+0>: 48 83 ec 08 sub $0x8,%rsp
|
The first instruction, sub $0x8,%rsp, moves the stack pointer down by 0x8 bytes, so main reserves 0x8 bytes of stack. However, that’s not our focus. Let’s execute the program instruction by instruction using si (step instruction).
|
(gdb) si
4 write(1, "Hello, World!\n", 14);
=> 0x000055555555513d <main+4>: ba 0e 00 00 00 mov $0xe,%edx
0x0000555555555142 <main+9>: 48 8d 35 bb 0e 00 00 lea 0xebb(%rip),%rsi # 0x555555556004
0x0000555555555149 <main+16>: bf 01 00 00 00 mov $0x1,%edi
0x000055555555514e <main+21>: e8 dd fe ff ff call 0x555555555030 <write@plt>
|
From the previous post, we know that the first six integer or pointer arguments are passed in registers, in the order rdi, rsi, rdx, rcx, r8, and r9. Here edi, rsi, and edx are set to 0x1 (1), %rip + 0xebb (annotated as 0x555555556004), and 0xe (14) respectively; writing a 32-bit register such as edi zeroes the upper half of the corresponding 64-bit register, so this fills rdi and rdx as well. That’s exactly what we passed to the write function. The second argument ("Hello, World!\n") is passed as a pointer, which we can verify by checking what resides at that address:
|
(gdb) x/s 0x555555556004
0x555555556004: "Hello, World!\n"
|
OK, now that we understand argument passing, let’s dive into the write function.
|
(gdb) s
__GI___libc_write (fd=fd@entry=1, buf=buf@entry=0x555555556004, nbytes=nbytes@entry=14) at ../sysdeps/unix/sysv/linux/write.c:25
|
Thanks to GDB’s debuginfod support, we can even see which value is bound to which parameter. We already know that, though, so let’s print the assembly code of __GI___libc_write:
|
(gdb) x/16i $pc
=> 0x7ffff7e9b4f0 <__GI___libc_write>: endbr64
0x7ffff7e9b4f4 <__GI___libc_write+4>: cmpb $0x0,0xe0b45(%rip) # 0x7ffff7f7c040 <__libc_single_threaded>
0x7ffff7e9b4fb <__GI___libc_write+11>: je 0x7ffff7e9b510 <__GI___libc_write+32>
0x7ffff7e9b4fd <__GI___libc_write+13>: mov $0x1,%eax
0x7ffff7e9b502 <__GI___libc_write+18>: syscall
0x7ffff7e9b504 <__GI___libc_write+20>: cmp $0xfffffffffffff000,%rax
0x7ffff7e9b50a <__GI___libc_write+26>: ja 0x7ffff7e9b560 <__GI___libc_write+112>
0x7ffff7e9b50c <__GI___libc_write+28>: ret
0x7ffff7e9b50d <__GI___libc_write+29>: nopl (%rax)
0x7ffff7e9b510 <__GI___libc_write+32>: push %rbp
0x7ffff7e9b511 <__GI___libc_write+33>: mov %rsp,%rbp
0x7ffff7e9b514 <__GI___libc_write+36>: sub $0x20,%rsp
0x7ffff7e9b518 <__GI___libc_write+40>: mov %rdx,-0x18(%rbp)
0x7ffff7e9b51c <__GI___libc_write+44>: mov %rsi,-0x10(%rbp)
0x7ffff7e9b520 <__GI___libc_write+48>: mov %edi,-0x8(%rbp)
0x7ffff7e9b523 <__GI___libc_write+51>: call 0x7ffff7e20d90 <__GI___pthread_enable_asynccancel>
|
Well, it’s quite noisy, but we only need to focus on two lines:
|
0x7ffff7e9b4fd <__GI___libc_write+13>: mov $0x1,%eax
0x7ffff7e9b502 <__GI___libc_write+18>: syscall
|
These two instructions simply set the eax register to 0x1, the syscall number of write, and then execute the syscall instruction.
|
(gdb) x/i $pc
=> 0x7ffff7e9b502 <__GI___libc_write+18>: syscall
(gdb) si
Hello, World!
(gdb)
|
After the syscall instruction returns, we can see “Hello, World!” in the GDB console. The entire process can be summarized in three steps:
- Pass the arguments following the C calling convention.
- Set the rax register (or eax in 32-bit mode) to the syscall number listed in the Linux documentation.
- Execute the syscall instruction.
This is surprisingly simple: we just set one extra register, rax, to the required syscall number, replace the call instruction with syscall, and a normal C function call becomes a syscall. We can even create our own write function with just the two lines of assembly code above.
However, there is one more small difference in the calling convention: for syscalls, the fourth argument is passed in register r10 instead of rcx. This is because the syscall instruction needs rcx to store the address of the next userspace instruction, so that execution can continue at that saved address after returning from the kernel.
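To see the r10 rule in practice, here is a sketch of a four-argument syscall, assuming x86-64 Linux and GCC or Clang. It uses pwrite64, whose fourth argument is the file offset; my_pwrite is a hypothetical name, and SYS_pwrite64 comes from sys/syscall.h.

```c
#include <stddef.h>
#include <sys/syscall.h> /* SYS_pwrite64 */
#include <sys/types.h>

/* Hypothetical helper: pwrite via a raw syscall on x86-64 Linux. */
ssize_t my_pwrite(int fd, const void *buf, size_t n, off_t offset)
{
    long ret;
    /* There is no constraint letter for r10, so pin a variable to it. */
    register long r10 __asm__("r10") = (long)offset;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)
        : "a"((long)SYS_pwrite64),        /* syscall number in rax */
          "D"((long)fd), "S"(buf), "d"(n),/* rdi, rsi, rdx as before */
          "r"(r10)                        /* fourth argument: r10, not rcx */
        : "rcx", "r11", "memory");
    return ret; /* raw kernel value: a negated errno on failure */
}
```

The register-variable trick is the same one glibc’s internal_syscall macros rely on to place arguments in specific registers.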
Previously, we used GDB to understand what happens with write under the hood. Now, let’s examine the code to see how write is implemented. You can skip this chapter if you’re not familiar with or interested in C macro magic.
The following code is based on glibc-2.38, which you can download from the GNU FTP server. The definition of write isn’t straightforward; it is built from several layers of complex macros. After manually expanding these magic-like macros, the simplified write.c looks like this:
|
#define __glibc_unlikely(cond) __builtin_expect ((cond), 0)
/* NB: This also works when X is an array. For an array X, type of
(X) - (X) is ptrdiff_t, which is signed, since size of ptrdiff_t
== size of pointer, cast is a NOP. */
#define TYPEFY1(X) __typeof__ ((X) - (X))
/* Explicit cast the argument. */
#define ARGIFY(X) ((TYPEFY1 (X)) (X))
/* Create a variable 'name' based on type of variable 'X' to avoid
explicit types. */
#define TYPEFY(X, name) __typeof__ (ARGIFY (X)) name
typedef long int ssize_t;
typedef unsigned long int size_t;
extern char __libc_single_threaded;
extern int __libc_errno;
int __pthread_enable_asynccancel();
void __pthread_disable_asynccancel(int);
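#define __NR_write 1
/* Registers overwritten by the syscall instruction (from glibc's x86_64 sysdep.h). */
#define REGISTERS_CLOBBERED_BY_SYSCALL "cc", "r11", "cx"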
#define internal_syscall3(number, arg1, arg2, arg3) \
({ \
unsigned long int resultvar; \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
/* Write NBYTES of BUF to FD. Return the number written, or -1. */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
long int ret;
if (__libc_single_threaded != 0)
{
long int sc_ret = internal_syscall3 (__NR_write, fd, buf, nbytes);
if (__glibc_unlikely ((unsigned long int) (sc_ret) > -4096UL)) {
__libc_errno = -sc_ret;
ret = -1L;
} else {
ret = sc_ret;
}
}
else
{
int sc_cancel_oldtype = __pthread_enable_asynccancel();
long int sc_ret = internal_syscall3 (__NR_write, fd, buf, nbytes);
if (__glibc_unlikely ((unsigned long int) (sc_ret) > -4096UL)) {
__libc_errno = -sc_ret;
ret = -1L;
} else {
ret = sc_ret;
}
__pthread_disable_asynccancel(sc_cancel_oldtype);
}
return ret;
}
libc_hidden_def (__libc_write)
// weak_alias (__libc_write, __write)
// libc_hidden_weak (__write)
// weak_alias (__libc_write, write)
// libc_hidden_weak (write)
int main(void) {
write(1, "Hello, world!\n", 14);
return 0;
}
|
The main part of write is in __libc_write, where we can see the pthread cancellation guard (__pthread_enable_asynccancel) that we saw earlier in the GDB disassembly. The syscall number for write is defined by the macro __NR_write, which is 1. The actual arguments are already passed in by the caller of write, so the internal syscall stub only depends on the number of arguments. Here, we have internal_syscall3, which uses inline assembly to execute the syscall instruction. However, if we look at internal_syscall4, we will notice a difference:
|
#define internal_syscall4(number, arg1, arg2, arg3, arg4) \
({ \
unsigned long int resultvar; \
TYPEFY (arg4, __arg4) = ARGIFY (arg4); \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg4, _a4) asm ("r10") = __arg4; \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
|
As we mentioned, the fourth argument is assigned to register r10 instead of rcx; besides that, nothing is different.
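One detail of __libc_write worth isolating is the check (unsigned long int) sc_ret > -4096UL, which matches the cmp $0xfffffffffffff000,%rax in the disassembly. The kernel reports failure as a value between -4095 and -1, that is, a negated error code. A small sketch of that conversion, where decode_syscall_result is a hypothetical name and the standard errno stands in for glibc’s internal __libc_errno:

```c
#include <errno.h>

/* Hypothetical helper mirroring the check in __libc_write: turn a raw
   kernel return value into the libc convention (-1 plus errno). */
long decode_syscall_result(long raw)
{
    /* As unsigned, -4095..-1 are the only values above -4096UL. */
    if ((unsigned long)raw > -4096UL) {
        errno = (int)-raw; /* the kernel returned -errno */
        return -1;
    }
    return raw; /* success: pass the value through unchanged */
}
```

For example, a raw result of -9 becomes a return value of -1 with errno set to EBADF, while a raw result of 14 is returned unchanged.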
From the userspace perspective, we only see one instruction being executed, and the job is done. But what happens behind the scenes? The online x86 reference gives a detailed operation specification of the syscall instruction:
|
IF (CS.L ≠ 1 ) or (IA32_EFER.LMA ≠ 1) or (IA32_EFER.SCE ≠ 1)
(* Not in 64-Bit Mode or SYSCALL/SYSRET not enabled in IA32_EFER *)
THEN #UD;
FI;
RCX := RIP; (* Will contain address of next instruction *)
RIP := IA32_LSTAR;
R11 := RFLAGS;
RFLAGS := RFLAGS AND NOT(IA32_FMASK);
CS.Selector := IA32_STAR[47:32] AND FFFCH (* Operating system provides CS; RPL forced to 0 *)
(* Set rest of CS to a fixed value *)
CS.Base := 0;
(* Flat segment *)
CS.Limit := FFFFFH;
(* With 4-KByte granularity, implies a 4-GByte limit *)
CS.Type := 11;
(* Execute/read code, accessed *)
CS.S := 1;
CS.DPL := 0;
CS.P := 1;
CS.L := 1;
(* Entry is to 64-bit mode *)
CS.D := 0;
(* Required if CS.L = 1 *)
CS.G := 1;
(* 4-KByte granularity *)
IF ShadowStackEnabled(CPL)
THEN (* adjust so bits 63:N get the value of bit N–1, where N is the CPU’s maximum linear-address width *)
IA32_PL3_SSP := LA_adjust(SSP);
(* With shadow stacks enabled the system call is supported from Ring 3 to Ring 0 *)
(* OS supporting Ring 0 to Ring 0 system calls or Ring 1/2 to ring 0 system call *)
(* Must preserve the contents of IA32_PL3_SSP to avoid losing ring 3 state *)
FI;
CPL := 0;
IF ShadowStackEnabled(CPL)
SSP := 0;
FI;
IF EndbranchEnabled(CPL)
IA32_S_CET.TRACKER = WAIT_FOR_ENDBRANCH
IA32_S_CET.SUPPRESS = 0
FI;
SS.Selector := IA32_STAR[47:32] + 8;
(* SS just above CS *)
(* Set rest of SS to a fixed value *)
SS.Base := 0;
(* Flat segment *)
SS.Limit := FFFFFH;
(* With 4-KByte granularity, implies a 4-GByte limit *)
SS.Type := 3;
(* Read/write data, accessed *)
SS.S := 1;
SS.DPL := 0;
SS.P := 1;
SS.B := 1;
(* 32-bit stack segment *)
SS.G := 1;
(* 4-KByte granularity *)
|
The pseudocode above contains many variable-like identifiers that are actually registers, including general-purpose registers (RCX, R11), the instruction pointer (RIP), segment registers (CS, SS), the flags register (RFLAGS), and Model-Specific Registers, or MSRs (IA32_EFER, IA32_STAR, IA32_LSTAR, IA32_FMASK, IA32_PL3_SSP). The . operator accesses specific fields within these registers. Despite the numerous operations occurring, we only need to focus on a few:
|
RCX := RIP; (* Save the address of the next userspace instruction in RCX *)
RIP := IA32_LSTAR; (* Load the kernel entry address from IA32_LSTAR into RIP *)
R11 := RFLAGS; (* Save the old RFLAGS in R11 *)
RFLAGS := RFLAGS AND NOT(IA32_FMASK); (* Clear the flags selected by IA32_FMASK *)
CPL := 0; (* Set the current privilege level, also known as the protection ring, to 0 *)
|
Basically, the syscall instruction backs up some userspace context (the return address and the flags), sets the instruction pointer (RIP) to an address in kernel space (the value of IA32_LSTAR), and switches the protection ring to 0, which is kernel mode. For security reasons, only in this mode can the CPU execute kernel code.
It’s important to note that all of this occurs within a single syscall instruction, so setting RIP to a kernel address makes sense: the jump and the privilege switch cannot be separated. After executing this instruction, we are finally in kernel space.
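The five highlighted operations can be restated as a toy model in C. This is only a sketch for reasoning about the register shuffle, not real CPU or kernel code: struct cpu_state and model_syscall are hypothetical names, and the two msr_ parameters stand in for IA32_LSTAR and IA32_FMASK.

```c
#include <stdint.h>

/* Toy model: only the state touched by the five highlighted operations. */
struct cpu_state {
    uint64_t rip;    /* assumed to already point past the syscall instruction */
    uint64_t rcx;
    uint64_t r11;
    uint64_t rflags;
    unsigned cpl;    /* current privilege level: 3 = user, 0 = kernel */
};

void model_syscall(struct cpu_state *cpu, uint64_t msr_lstar, uint64_t msr_fmask)
{
    cpu->rcx = cpu->rip;       /* save the userspace return address */
    cpu->rip = msr_lstar;      /* continue at the kernel entry point */
    cpu->r11 = cpu->rflags;    /* save the old flags */
    cpu->rflags &= ~msr_fmask; /* clear the flags selected by the mask */
    cpu->cpl = 0;              /* now running in ring 0 */
}
```

After the call, rcx and r11 hold exactly what is needed to resume userspace later, which is why both registers are listed as clobbered in the syscall stubs.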
TODO: syscall_irl - Part III: Dive into the kernel