Format String Vulnerabilities

This article shows how to discover format string vulnerabilities in C source code, and why this kind of vulnerability is more dangerous than the common buffer overflow.

A format function is a special kind of ANSI C function that takes a variable number of arguments, one of which is the so-called format string. While the function evaluates the format string, it accesses the extra parameters given to the function. It is a conversion function, used to represent primitive C data types in a human-readable string representation. Format functions are used in nearly every C program to output information, print error messages, or process strings.
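As a small illustration of this conversion role, the sketch below uses the standard snprintf to render an int, an unsigned value, and a string into one buffer. The helper name render_status and its parameters are hypothetical and only serve the example.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: render three primitive values into text.
 * The format string is a constant; the values arrive as extra arguments. */
int render_status(char *out, size_t outsz, int code, unsigned flags, const char *name)
{
    /* %d -> signed decimal, %x -> hexadecimal, %s -> string passed by reference */
    return snprintf(out, outsz, "code=%d flags=0x%x name=%s", code, flags, name);
}
```

Calling render_status(buf, sizeof buf, -3, 255u, "demo") fills buf with the text code=-3 flags=0xff name=demo.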

To understand where this vulnerability is common in C code, we have to examine the purpose of format functions.

Figure: How the format function works

Figure: The calling function

This type of vulnerability can appear when two different types of information channels are merged into one, and special escape characters or sequences are used to distinguish which channel is currently active. Most of the time one channel is a data channel, which is not parsed actively but just copied, while the other is a control channel.

While this is not a bad thing in itself, it can quickly become a horrible security problem if the attacker is able to supply input that is used in one channel. Often the escape or de-escape routines are faulty, or they overlook a level, as is the case in format string vulnerabilities.
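To make the two channels concrete, here is a minimal sketch that counts how many bytes of a string a format function would treat as control rather than data. The function name count_directives is hypothetical, and the parser is deliberately simplified: it only recognizes the leading ‘%’ and the ‘%%’ escape.

```c
#include <stddef.h>

/* Count the '%' directives a format function would act on in s.
 * "%%" is the escape for a literal percent sign and is not counted.
 * Simplified sketch: the rest of each directive is not validated. */
size_t count_directives(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        if (s[0] == '%' && s[1] == '%') {
            s += 2;              /* escaped percent: pure data */
        } else if (s[0] == '%') {
            n++;                 /* start of a control sequence */
            s++;
        } else {
            s++;                 /* ordinary data byte, copied as-is */
        }
    }
    return n;
}
```

A string for which this returns a nonzero value changes the behavior of the format function if it is passed as the format string instead of as an argument.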

We now have to examine what exactly we are able to control, and how to use this control to extend this partial control over the process to full control of the execution flow.

printf("%08x.%08x.%08x.%08x.\n");
ea62f5d0.ea62f5e0.ea62f6e8.00000000

This is a partial dump of the stack memory, starting from the current bottom and moving upward toward the top of the stack, assuming the stack grows towards the low addresses. Depending on the size of the format string buffer and the size of the output buffer, you can reconstruct larger or smaller parts of the stack memory using this technique.

It is also possible to peek at memory locations different from the stack memory. To do this we have to get the format function to display memory from an address we can supply.

This poses two problems. First, we have to find a format parameter that takes an address (by reference) as its stack parameter and displays memory from there. Second, we have to supply that address. We are lucky with the first: the ‘%s’ parameter does just that, displaying memory (usually an ASCIIZ string) from a stack-supplied address. The remaining problem is how to get that address onto the stack, in the right place.

The format function internally maintains a pointer to the stack location of the current format parameter. If we can get this pointer to point into memory we control, we can supply an address to the ‘%s’ parameter.

printf("AAA0AAA1_%08x.%08x.%08x.%08x.");
AAA0AAA1_ea62f5d0.ea62f5e0.ea62f6e8.00000000.

The ‘%08x’ parameters advance the internal stack pointer of the format function towards the top of the stack. After enough of these parameters, the stack pointer points into our memory: the format string itself. The format function always occupies the lowest stack frame, so if our buffer lies on the stack at all, it certainly lies above the current stack pointer. If we choose the number of ‘%08x’ parameters correctly, we can display memory from an arbitrary address by appending ‘%s’ to our string.
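The walk of the internal pointer can be modeled without touching a real stack. The sketch below is a toy, not how any C library implements printf: the array stack stands in for the 32-bit words above the format function's frame, toy_format is a hypothetical name, and only the ‘%08x’ directive is understood.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a format function on a 32-bit stack. Like the real
 * function, it has no idea how many of the words in `stack` were meant
 * as arguments; it simply takes the next word for every "%08x".
 * (Unlike the real function, it stops at nwords so the model stays safe.)
 * Returns the number of words consumed. */
size_t toy_format(char *out, size_t outsz, const char *fmt,
                  const uint32_t *stack, size_t nwords)
{
    size_t ap = 0;       /* the "internal stack pointer", in words */
    size_t len = 0;
    if (outsz == 0)
        return 0;
    while (*fmt != '\0' && len + 9 < outsz) {
        if (strncmp(fmt, "%08x", 4) == 0 && ap < nwords) {
            len += (size_t)snprintf(out + len, outsz - len, "%08x",
                                    (unsigned)stack[ap++]);
            fmt += 4;
        } else {
            out[len++] = *fmt++;   /* data byte: copied unchanged */
        }
    }
    out[len] = '\0';
    return ap;
}
```

With a model stack of { 0xea62f5d0, 0xea62f5e0, 0x30414141 }, where the last word stands for the bytes “AAA0” at the start of the format string, the format "%08x.%08x.%08x" consumes all three words and the third value printed is the attacker's own data.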

/* target address 0x08480110, stored little-endian as \x10\x01\x48\x08 */
printf("\x10\x01\x48\x08_%08x.%08x.%08x.%08x.%08x|%s|");
H_e01c25d0.e01c25e0.e01c26e8.00000000.00000000|(null)|

If we cannot reach the exact format string boundary using 4-byte pops (‘%08x’), we have to pad the format string by prepending one, two, or three junk characters.
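The padding arithmetic can be sketched as follows. The function plan_alignment and the notion of a known byte distance between the first argument slot and the buffer are assumptions made for illustration; in practice that distance has to be found by experiment.

```c
#include <stddef.h>

/* Given the distance in bytes from the first argument slot to the start
 * of the attacker-controlled buffer, compute how many junk bytes to
 * prepend and how many 4-byte pops are then needed to land exactly on
 * the data that follows the junk. Assumes 4-byte stack slots. */
void plan_alignment(size_t distance, size_t *pad, size_t *pops)
{
    *pad  = (4 - distance % 4) % 4;   /* 0..3 junk characters */
    *pops = (distance + *pad) / 4;    /* whole words to skip */
}
```

For a distance of 17 bytes this gives three junk characters and five pops; for 16 bytes no padding is needed and four pops suffice.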

We cannot modify the instruction pointer directly. Instead, we have to find instructions that modify the instruction pointer and influence how they modify it. This sounds complicated, but in most cases it is pretty easy, since there are instructions that take an instruction pointer from memory and jump to it.

This is how most buffer overflows work: In a two-stage process, first a saved instruction pointer is overwritten and then the program executes a legitimate instruction that transfers control to the attacker-supplied address.

In normal buffer overflows we overwrite the return address of a function frame on the stack. When the function that owns this frame returns, it returns to our supplied address.

There is the ‘%n’ parameter, which writes the number of bytes already printed into a variable of our choice. The address of the variable is given to the format function by placing an integer pointer as a parameter onto the stack.

printf("\x10\x01\x48\x08_%08x.%08x.%08x.%08x.%08x.%n");

With the ‘%08x’ parameters we advance the internal stack pointer of the format function by four bytes each. We do this until the pointer points to the beginning of our format string, where the target address is stored. This works because our format string is usually located on the stack, above the format function's own stack frame. The ‘%n’ then writes to the address found there. If the format string started with “AAA0” instead, ‘%n’ would write to the address 0x30414141, which those four bytes represent on a little-endian machine. Normally this would crash the program, since that address is not mapped. But if we supply a mapped and writable address, the write succeeds and we overwrite four bytes (sizeof (int)) at that address.

By using a dummy parameter ‘%nu’, where n is a field width, we can control the counter written by ‘%n’, at least to some degree. But this is not sufficient for writing large numbers such as addresses, so we have to find a way to write arbitrary data.

unsigned char canary[5];
unsigned char foo[4];

memset(foo, 0x00, sizeof(foo));
strcpy((char *) canary, "AAAA");
// memory (foo, then canary, assuming they are adjacent): 00 00 00 00 41 41 41 41 00
printf("%16u%n", 7350, (int *) &foo[0]);
// memory state right now: 10 00 00 00 41 41 41 41 00
printf("%32u%n", 7350, (int *) &foo[1]);
// memory state right now: 10 20 00 00 00 41 41 41 00
printf("%64u%n", 7350, (int *) &foo[2]);
// memory state right now: 10 20 40 00 00 00 41 41 00
printf("%128u%n\n", 7350, (int *) &foo[3]);
// memory state right now: 10 20 40 80 00 00 00 41 00
printf("after foo: %02x%02x%02x%02x\n", foo[0], foo[1], foo[2], foo[3]);
// after foo: 10204080
printf("canary: %02x%02x%02x%02x\n", canary[0], canary[1], canary[2], canary[3]);
// canary: 00000041

This prints “after foo: 10204080” and “canary: 00000041”. Four times we overwrite the least significant byte of an integer we point to. By increasing the pointer each time, the least significant byte moves through the memory we want to write to, and allows us to store completely arbitrary data. As the canary shows, the three bytes following the target are overwritten as a side effect.
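The effect of the four overlapping stores can be reproduced without undefined behavior by modeling memory as a plain byte array. This is only a sketch: overlapping_stores is a hypothetical name, and the byte order is written out by hand on the assumption of 32-bit little-endian integers, as on x86.

```c
#include <stddef.h>

/* Model four overlapping 32-bit little-endian stores at offsets 0..3 of
 * `mem`, which must hold at least 7 bytes. counts[i] is the value the
 * i-th %n would store. */
void overlapping_stores(unsigned char *mem, const unsigned counts[4])
{
    for (size_t i = 0; i < 4; i++) {
        unsigned v = counts[i];
        mem[i + 0] = (unsigned char)(v & 0xff);          /* least significant byte */
        mem[i + 1] = (unsigned char)((v >> 8) & 0xff);
        mem[i + 2] = (unsigned char)((v >> 16) & 0xff);
        mem[i + 3] = (unsigned char)((v >> 24) & 0xff);  /* most significant byte */
    }
}
```

Starting from seven bytes of 0x41 and the counts 16, 32, 64, 128, the array ends up as 10 20 40 80 00 00 00: the first four bytes hold the chosen data and the next three are zeroed by the upper bytes of the last stores.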

strcpy((char *) canary, "AAAA");
// one call: the counter accumulates, so the widths 16+16+32+64 store 16, 32, 64, 128
printf("%16u%n%16u%n%32u%n%64u%n\n", 0, (int *) &foo[0], 0, (int *) &foo[1], 0, (int *) &foo[2], 0, (int *) &foo[3]);

printf("after foo: %02x%02x%02x%02x\n", foo[0], foo[1], foo[2], foo[3]);
// after foo: 10204080
printf("canary: %02x%02x%02x%02x\n", canary[0], canary[1], canary[2], canary[3]);
// canary: 00000041

The part of the format string that actually writes to memory consists of ‘%nu%n’ pairs, where n is greater than 10. The ‘%nu’ part increases or overflows the least significant byte of the format function's internal bytes-written counter, and the ‘%n’ writes this counter to the addresses held in the dummy-addr-pair part of the string.
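The width for each ‘%nu’ can be computed from the byte we want and the number of characters printed so far. The sketch below is one way to do that arithmetic; next_width is a hypothetical name. The lower bound of 10 exists because ‘%u’ may print up to 10 digits for a 32-bit value regardless of a smaller width, which would make the counter unpredictable.

```c
/* Width to use in the next "%<width>u" so that the low byte of the
 * output counter equals `target` when the following %n fires.
 * `written` is the number of characters printed so far. */
unsigned next_width(unsigned written, unsigned char target)
{
    unsigned width = (0x100u + target - (written & 0xffu)) & 0xffu;
    if (width < 10)          /* %u can print up to 10 digits on its own, */
        width += 0x100u;     /* so wrap the low byte around instead      */
    return width;
}
```

Starting from a counter of 0, the target bytes 0x10, 0x20, 0x40, 0x80 yield the widths 16, 16, 32, 64, the same sequence used in the single-call example above.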

Variations of Exploitation

A ‘%u’ sequence is two bytes long and pops four bytes, which gives a 1:2 byte ratio (we invest 1 byte to get 2 bytes ahead).

By using the ‘%f’ parameter we even get 8 bytes ahead in the stack while investing only two bytes. But this has a huge drawback: if garbage from the stack is printed as a floating point number, there may be a division by zero, which will crash the process. To avoid this we can use a special format qualifier that prints only the integer part of the floating point number: ‘%.f’ walks the stack upwards by eight bytes, using only three bytes in our buffer.
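The ratios quoted here are simple divisions, sketched below for the three directives mentioned. The function pop_ratio is hypothetical, and the sizes assume a 32-bit ABI in which an int occupies 4 bytes and a double 8 bytes of the argument area.

```c
#include <stddef.h>
#include <string.h>

/* Stack bytes skipped per byte of format string for the stack-popping
 * directives discussed in the text. Returns 0.0 for anything else. */
double pop_ratio(const char *directive)
{
    size_t popped;
    if (strcmp(directive, "%u") == 0)
        popped = 4;
    else if (strcmp(directive, "%f") == 0)
        popped = 8;
    else if (strcmp(directive, "%.f") == 0)
        popped = 8;
    else
        return 0.0;                      /* not covered by this sketch */
    return (double)popped / (double)strlen(directive);
}
```

Under these assumptions ‘%u’ yields 2 stack bytes per format byte, ‘%f’ yields 4, and ‘%.f’ yields about 2.67.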

Direct Parameter Access

printf("%6$d\n",6,5,4,3,2,1);
1

This prints “1”, because the ‘6$’ explicitly addresses the 6th parameter on the stack. Using this method, the whole stack pop sequence can be left out.
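Direct parameter access is a POSIX extension that also works in perfectly ordinary code. As a harmless sketch, the hypothetical helper swap_words below prints its two string arguments in reverse order by addressing them by position.

```c
#include <stdio.h>
#include <stddef.h>

/* Format two strings in swapped order using positional arguments.
 * "%2$s" selects the second argument after the format string,
 * "%1$s" the first. */
int swap_words(char *out, size_t outsz, const char *first, const char *second)
{
    return snprintf(out, outsz, "%2$s %1$s", first, second);
}
```

swap_words(buf, sizeof buf, "world", "hello") leaves “hello world” in buf.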

char foo[4];
printf("%1$16u%2$n"
       "%1$16u%3$n"
       "%1$32u%4$n"
       "%1$64u%5$n",
       1,
       (int *) &foo[0], (int *) &foo[1],
       (int *) &foo[2], (int *) &foo[3]);

This creates “\x10\x20\x40\x80” in foo. This direct access is limited to only the first eight parameters on BSD derivatives, except IRIX. The Solaris C library limits it to the first 30 parameters, as shown in portal's paper [3].
If you choose negative or huge values, intending to access stack parameters below your current position, it will not produce the expected result but crash.

Brute Forcing

Exploiting a vulnerability such as a buffer overflow or a format string vulnerability often fails because the last hurdle was not taken care of: getting all the offsets right. Basically, finding the right offsets means knowing ‘what to write where’. For simple vulnerabilities you can reliably guess the correct offsets, or just brute force them by trying one after another. But as soon as you need multiple offsets, the problem grows exponentially and brute forcing becomes practically impossible.

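Format Strings

A format string looks like this:

printf("The magic number is: %d\n", 1911);

This code prints “The magic number is: 1911”, with the format parameter %d replaced by 1911. Besides %d there are other format parameters, each with a different meaning: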

Parameter  Meaning                                  Passed as
%d         Decimal (int)                            Value
%u         Unsigned decimal (unsigned int)          Value
%x         Hexadecimal (unsigned int)               Value
%s         String ((const) (unsigned) char *)       Reference
%n         Number of bytes written so far (int *)   Reference

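Format String Placeholder Syntax in Detail

%[parameter][flags][width][.precision][length]type
  1. parameter: direct parameter access; selects which argument is used as input
  2. flags: for example, 0 pads the output with leading zeros when used together with a width
  3. width: width modifier; the minimum number of characters to output
  4. .precision: for strings, the maximum number of characters to output
  5. length: length modifier; sets the size of the argument, such as char, short, or int
  6. type: how the argument is formatted. For example, %d expects an integer argument and prints a number.

The fields in [] are optional.

The important format conversion types are: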

Type  Input                        Output
%x    unsigned integer             Hexadecimal value
%s    pointer to an array of char  String
%n    pointer to integer           Number of bytes written so far
%p    pointer (void *)             The value of the pointer (not dereferenced)

The important modifiers are:

Modifier  Description                                                        Example
i$        Direct parameter access; specifies the parameter to use for input  %2$x: hex value of the second parameter
%ix       Width modifier; specifies the minimum width of the output          %8x: hex value taking up 8 columns
%hh       Length modifier; specifies that the length is sizeof(char)         %hhn: writes 1 byte to the target pointer
%h        Length modifier; specifies that the length is sizeof(short)        %hn: writes 2 bytes (on a 32-bit system) to the target pointer
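The effect of a length modifier on %n can be checked safely with a constant format string. In this sketch the helper name store_count_byte is hypothetical; it lets ‘%hhn’ store the count into the first of two adjacent bytes so that the neighbor can be inspected afterwards.

```c
#include <stdio.h>

/* Show that %hhn stores only one byte. `cell` points at two adjacent
 * bytes; the first receives the count, the second must stay untouched. */
void store_count_byte(unsigned char cell[2])
{
    char buf[16];
    /* five characters are printed before %hhn, so the byte stored is 5 */
    snprintf(buf, sizeof buf, "12345%hhn", (signed char *)&cell[0]);
}
```

With cell initialized to { 0xAA, 0xBB }, the call leaves cell[0] equal to 5 and cell[1] still 0xBB, whereas a plain %n aimed at the same address would have written sizeof(int) bytes.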

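The behavior of a format function is controlled by the format string. The function fetches the arguments requested by the format string from the stack.

printf("a has value %d, b has value %d, c is at address: %08x\n", a,b, &c);

Figure: Stack layout of the format function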

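What happens if the format string does not match the arguments actually supplied?

printf("a has value %d, b has value %d, c is at address: %08x\n", a,b);

In this example the format string asks for three arguments, but only two (a and b) are supplied.

This does not stop the program from compiling. For one thing, printf() is defined as a function with a variable number of arguments; for another, compilers usually do not analyze in depth how printf() actually works.

Can printf() itself detect the mismatch?

printf() fetches its arguments from the stack. If the format string asks for three arguments, it fetches three values from the stack. Unless the stack were marked with a boundary, the function cannot know that it has run past the arguments supplied to it.

Because the stack carries no such boundary marker, printf() keeps fetching data from the stack. In the mismatched case it reads from a part of the stack that does not belong to this function call, and that causes problems.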
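Although the function cannot detect the mismatch at run time, GCC and Clang can warn about it at compile time when the format string is a literal (the -Wformat warning, enabled by -Wall). The sketch below shows how a user-written variadic wrapper can opt in to the same check through the compilers' format attribute. The wrapper name log_to_buffer is hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical logging wrapper. The GCC/Clang `format` attribute says
 * that argument 3 is a printf-style format string whose values start at
 * argument 4, so the compiler can check calls to it. */
int log_to_buffer(char *out, size_t outsz, const char *fmt, ...)
#if defined(__GNUC__)
    __attribute__((format(printf, 3, 4)))
#endif
    ;

int log_to_buffer(char *out, size_t outsz, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = vsnprintf(out, outsz, fmt, ap);   /* forwards the argument list */
    va_end(ap);
    return n;
}
```

A call such as log_to_buffer(buf, sizeof buf, "%d %d %08x", a, b) with one argument missing would then draw a compile-time warning. This is only a diagnostic: it does not help when the format string comes from user input.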

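Format String Attacks

Crashing the Program

printf("%s%s%s%s%s%s%s%s%s%s");

For each %s, printf() fetches a number from the stack, treats it as an address, and prints the memory at that address as a string until it meets a NUL character (the value 0, not the character ‘0’).

The number fetched may not be an address at all, and the memory it points to may not exist (the address has no physical memory or has not been allocated), in which case the program crashes.

It is also possible that the address is valid but lies in protected space (for example, reserved for kernel memory); the program crashes in this case too.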

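Viewing the Stack

printf("%08x %08x %08x %08x\n");

This call fetches four values from the stack and prints each as a hexadecimal number padded to 8 digits. The output looks something like this:

c54f4180 94ee08d0 00000001 01fae274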

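Viewing Memory at an Arbitrary Location

We need to supply a memory address. However, we cannot change the code; we can only supply the format string.

If we use printf("%s") without supplying an address, printf() still takes the target address from the stack. The function maintains an internal stack pointer, so it knows where its arguments are on the stack.

The format string itself is usually located on the stack. If we encode the target address in the format string, the target address ends up on the stack too. In the following example the format string is stored in a buffer on the stack: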

int main(int argc, char *argv[])
{
    char user_input[100];
    /* ... other variable definitions and statements ... */
    scanf("%99s", user_input);  /* getting a string from user (bounded to the buffer size) */
    printf(user_input);         /* Vulnerable place */
    return 0;
}
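For contrast with the vulnerable call above, the sketch below passes the user's text as an argument to a constant format string, so that it stays in the data channel. The helper name echo_as_data is hypothetical, and snprintf is used instead of printf only so that the result can be inspected.

```c
#include <stdio.h>
#include <stddef.h>

/* Copy user input into `out` as pure data. Because the format string is
 * the constant "%s", any '%' characters in user_input are never parsed. */
int echo_as_data(char *out, size_t outsz, const char *user_input)
{
    return snprintf(out, outsz, "%s", user_input);
}
```

Given the input %08x.%08x|%s|%n, the output buffer contains exactly those characters and no stack values are read or written.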

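If we can force printf() to take an address that is also on the stack, we control that address.

printf("\x10\x01\x48\x08 %x %x %x %x %s");

\x10\x01\x48\x08 are the four bytes of the target address. In C, \x10 tells the compiler to place the hexadecimal value 0x10 at the current position; this value occupies one byte. Without the \x, typing the string 10 directly would store the ASCII codes of the characters ‘1’ and ‘0’, which are 49 and 48.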
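How four escaped bytes become an address can be shown with a small decoder. This is a sketch under the assumption of a little-endian machine such as x86; decode_le32 is a hypothetical helper name.

```c
#include <stdint.h>

/* Interpret four consecutive bytes as a 32-bit little-endian value,
 * which is how x86 reads an address stored in memory. */
uint32_t decode_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Applied to the bytes \x10\x01\x48\x08 this yields 0x08480110, and applied to the text “AAA0” it yields 0x30414141.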

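%x moves the stack pointer toward the format string.

The figure below shows how the attack proceeds if user_input[] contains the following format string:

\x10\x01\x48\x08 %x %x %x %x %s

Figure: Illustration of the format function attack

Basically, the format function always occupies the lowest stack frame, so if our buffer lies on the stack at all, it is certainly above the current stack pointer (the buffer's address is higher than the pointer's). We use four %x (each %x advances the pointer by 4 bytes) to move printf()'s pointer toward higher addresses, assuming the stack grows toward lower addresses, until it points at the address we stored in the format string.

Once we reach the target, we give printf() a %s, which makes it print the contents of memory at address 0x08480110. printf() treats the contents as a string and prints it until the string ends (a NUL byte).

If 4-byte pops (‘%08x’) do not land exactly on the format string boundary, we need to pad the format string with one, two, or three junk characters.

The stack space between printf()'s own arguments and user_input[], where the address 0x08480110 is stored, does not belong to printf() (in practice we do not know how large this space is; four words is just the example here). Because of the format string vulnerability, however, printf() treats it as arguments matching the %x directives in the format string.

The key, then, is to find the exact size of this gap. It determines how many %x you need to insert before the %s.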

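Writing an Integer to Almost Any Address in Process Memory

%n stores the number of characters written so far into the integer pointed to by the corresponding argument.

int i;
printf("12345%n", &i);

After this call, i = 5.

Using the technique for viewing arbitrary memory described above, we can write a number to an arbitrary location: simply replace %s with %n, and the contents of memory at 0x08480110 are overwritten.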

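With this technique an attacker can:

  1. Overwrite flags that control access privileges in important programs
  2. Overwrite return addresses on the stack, function pointers, and so on

However, the value written depends on the number of characters printed before %n. Is it really possible to write an arbitrary integer?

  1. Pad with junk characters: to write 1000, simply print 1000 junk characters
  2. To keep the format string short, use a width specifier in a format parameter (%nu). For example, %150u%n sets the integer written to 150 when nothing else is printed before it.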
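The second option can be verified with a constant format string and a legitimate pointer. In the sketch below, count_after_padding is a hypothetical helper that pads one number to a chosen width and returns what %n stored. Note that some C libraries restrict or disable %n, so this assumes a library such as glibc where it is available for literal format strings.

```c
#include <stdio.h>

/* Return the value %n stores after printing one unsigned number padded
 * to `width` columns. Returns -1 if the width does not fit the buffer. */
int count_after_padding(int width)
{
    char buf[512];
    int written = -1;
    if (width < 0 || width >= (int)sizeof buf)
        return -1;
    /* "%*u" takes the field width from the argument list */
    snprintf(buf, sizeof buf, "%*u%n", width, 7350u, &written);
    return written;
}
```

A width of 150 makes %n store 150, while a width of 2 stores 4, because the four digits of 7350 are printed in full when the width is too small.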

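Direct Parameter Access

Direct parameter access is done with the $ modifier:

printf("%6$d\n",6,5,4,3,2,1);

This prints ‘1’, because 6$ selects the sixth argument on the stack.

Countermeasures

Address space randomization: just as it hinders buffer overflow attacks, address randomization makes it hard for attackers to find the addresses they want to read or write.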
