The last two posts talked about patches to sysdig and ltrace. This week wouldn't be complete without patching strace as well. My patch series to make sysdig work on ARM turned out to have a bug: the offset argument of preadv and pwritev was not being reported properly. Both syscalls have exactly the same issue, so I'll just talk about preadv. The userspace prototype of this syscall looks like this:

ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);

With -D_FILE_OFFSET_BITS=64, off_t is a 64-bit value, so on 32-bit architectures it must be split across two registers when making the syscall. Some architectures also have alignment requirements. In my case, the Linux ARM EABI requires that such values be passed in a consecutive even/odd register pair, with a register of padding if needed. Thus in the case of preadv, one would expect the values to be passed as follows:

argument  register
fd        r0
iov       r1
iovcnt    r2
padding   r3
offset    r4/r5

The sysdig ARM code assumed this layout, and it worked fine for other syscalls, but not for preadv and pwritev. To my surprise, I discovered that even strace was misreporting the value of the offset argument. I wrote a small test program:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

int main(void)
{
    const off_t offset = 1234567890123456789LL;
    char buf[4];

    int fd_zero = open("/dev/zero", O_RDONLY);
    pread(fd_zero, buf, sizeof(buf), offset);
    preadv(fd_zero,
           &(struct iovec){.iov_base = buf, .iov_len = sizeof(buf)},
           1, offset );

    int fd_null = open("/dev/null", O_WRONLY);
    pwrite(fd_null, buf, sizeof(buf), offset);
    pwritev(fd_null,
            &(struct iovec){.iov_base = buf, .iov_len = sizeof(buf)},
            1, offset );

    return 0;
}

Then I built it with gcc -std=gnu99 -D_FILE_OFFSET_BITS=64, and ran it under strace on ARM. The relevant parts of strace output:

open("/dev/zero", O_RDONLY|O_LARGEFILE) = 3
pread(3, "\0\0\0\0", 4, 1234567890123456789) = 4
preadv(3, [{"\0\0\0\0", 4}], 1, 4582412532) = 4
open("/dev/null", O_WRONLY|O_LARGEFILE) = 4
pwrite(4, "\0\0\0\0", 4, 1234567890123456789) = 4
pwritev(4, [{"\0\0\0\0", 4}], 1, 4582412532) = 4

Note that the offset parameter in preadv and pwritev is reported as 4582412532. As you can see in the source, the offset is actually the same for all the calls: 1234567890123456789. So something fishy is going on. Digging through the kernel source revealed the answer. Here's how the pread and preadv system calls are defined (I'm looking at fs/read_write.c in Linux 3.14):

SYSCALL_DEFINE4(pread64, unsigned int, fd, char __user *, buf,
                        size_t, count, loff_t, pos)
SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
                unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)

Note that pread defines its pos argument as a 64-bit value of type loff_t. This is what you'd expect, and it matches the userspace pread prototype. Now look at preadv. It does not have a 64-bit pos argument. Instead it has two separate unsigned long arguments, pos_l and pos_h, each 32 bits wide on 32-bit ARM. This is different from the userspace prototype! As far as the kernel is concerned, there are no 64-bit arguments here, so no alignment requirements apply. The actual register map in the preadv syscall thus looks like this:

argument  register
fd        r0
iov       r1
iovcnt    r2
offset    r3/r4

So libc must know to do this translation when invoking the syscall, to connect the two different prototypes. Neither sysdig nor strace knew this, so both were interpreting the syscall inputs incorrectly.

There's even an LWN article about the discussion that took place when this was originally implemented. There are various compatibility issues, and this was the best method, apparently.