No matter what side of the argument you are on, you always find people on your side that you wish were on the other.
Jascha Heifetz

Writing a new memory device - /dev/clipboard

April 29th, 2021

This article contains code for how to implement a simple device driver in the kernel. The quality of this code is... proof of concept quality. If you're planning to do serious kernel development please don't use this code as a starting point.

Today we're going to use what we learned last time about /dev/null and take it to the next level. We're going to write a new memory device in the kernel. We will call it /dev/clipboard. No, it's not an existing device. Go ahead and check on your system!

Why the name /dev/clipboard? Because our new device will work just like a clipboard does. First you write bytes to it, and those bytes are saved. Later you come back and read from the device and it gives you back the same bytes. If you write to it again you overwrite what was there before. (This is not one of those fancy clipboards with multiple slots where you can choose which slot to retrieve.)

Here's a demo of how it works:

# save to clipboard
$ echo "abc 123" > /dev/clipboard 

# retrieve from clipboard
$ cat /dev/clipboard
abc 123

For convenience we will also print a message to the kernel log whenever someone writes to or reads from the clipboard:

$ dmesg  # just after the 'echo' command
Saved 8 bytes to /dev/clipboard

$ dmesg  # just after the 'cat' command
Returned 8 bytes from /dev/clipboard

Pretty neat, huh? Being able to just add a feature to the kernel like that.

We'll do all our work in drivers/char/mem.c, right next to where /dev/null and the other /dev/xxx memory device drivers live.

First, let's think about where the user's bytes will be kept.. they obviously have to be kept somewhere by the kernel, in the kernel's own memory. We'll keep it simple here and just use a static array of a fixed size. That means we're placing this array in one of the data segments of the kernel's binary, a segment that will be loaded and available at runtime. We'll use 4096 bytes as the size of the array, and that means we're adding 4kb to the kernel's memory footprint (so we don't want to make it too big):

#define CLIPBOARD_BUFSIZE 4096
static char clipboard_buffer[CLIPBOARD_BUFSIZE];
static size_t clipboard_size = 0;

CLIPBOARD_BUFSIZE is a macro that decides how big the fixed array is - a constant. clipboard_buffer is the array itself.

clipboard_size is a variable that reflects how many bytes of user data the array currently holds. This will change every time we write to the clipboard.

Okay, that's it for storage. Now let's turn to how we tell the kernel that this new device exists. We'll add it to the bottom of the list of devices implemented in this file:

static const struct memdev {
	const char *name;
	umode_t mode;
	const struct file_operations *fops;
	fmode_t fmode;
} devlist[] = {
	 [DEVMEM_MINOR] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
	 [2] = { "kmem", 0, &kmem_fops, FMODE_UNSIGNED_OFFSET },
	 [3] = { "null", 0666, &null_fops, 0 },
	 [4] = { "port", 0, &port_fops, 0 },
	 [5] = { "zero", 0666, &zero_fops, 0 },
	 [7] = { "full", 0666, &full_fops, 0 },
	 [8] = { "random", 0666, &random_fops, 0 },
	 [9] = { "urandom", 0666, &urandom_fops, 0 },
	[11] = { "kmsg", 0644, &kmsg_fops, 0 },
	[12] = { "clipboard", 0666, &clipboard_fops, 0 },
};

And we need to populate a file_operations struct. This is how we tell the kernel what a user can actually do with this new device. In our case, we want the user to be able to open it, read from it, or write to it, then close it. To achieve that we just need to implement a read and a write function. We can populate the other fields with stub functions that do nothing:

static const struct file_operations clipboard_fops = {
	.llseek		= null_lseek,
	.read		= read_clipboard,
	.write		= write_clipboard,
	.read_iter	= read_iter_null,
	.write_iter	= write_iter_null,
	.splice_write = splice_write_null,
};

Alright, we need a read function and a write function. Let's start with write first, because it's what the user will use first. What does the write function need to do?

Well, when the user writes to the clipboard that write should overwrite whatever was in the clipboard before (as we said before). So let's zero the memory in the array.

Next, we have an array of a fixed size. The user might write a byte string that fits or a byte string that exceeds our capacity. If the input is too big to store, we will just truncate the remainder that doesn't fit. (I haven't checked how other clipboards do this, but presumably they too have an upper limit.) Once we know how many bytes to keep, we'll copy those bytes from the user's buffer (which is transient) into our fixed array.

Then we'll print a message to the kernel log to explain what we did. And finally return the number of bytes the user sent us, in order to confirm that the write succeeded. Put all of that together and we get more or less the following:

static ssize_t write_clipboard(struct file *file, const char __user *buf,
			  size_t count, loff_t *ppos)
{
    // erase the clipboard to an empty state
    memset(clipboard_buffer, 0, CLIPBOARD_BUFSIZE * sizeof(char));

    // decide how many bytes of input to keep
    clipboard_size = count;
    if (clipboard_size > CLIPBOARD_BUFSIZE) {
        clipboard_size = CLIPBOARD_BUFSIZE;
    }

    // populate the clipboard
    if (copy_from_user(clipboard_buffer, buf, clipboard_size)) {
        return -EFAULT;
    }

    printk("Saved %lu bytes to /dev/clipboard\n", clipboard_size);

    // acknowledge all the bytes we received
    return count;
}

What's the EFAULT thing all about? copy_from_user is a function provided by the kernel. If it fails to copy all the bytes we requested it will return a non-zero value. This makes the boolean predicate true and we enter the if block. In the kernel the convention is to return a pre-defined error code prefixed with a minus sign to signal an error. We'll gloss over that here.

That does it for the write function. Finally, we need the read function.

The read function is a bit more tricky, because it's intended to be used to read incrementally. So a user can issue read(1) to read a single byte and then call read(1) again to read the next byte. Each time read will return the bytes themselves and an integer which is a count of the number of bytes returned. Once there are no more bytes to return read will return the integer 0.

How does the read function know how many bytes have been returned thus far? It needs to store this somewhere between calls. It turns out this is already provided for us - it's what the argument ppos is for. ppos is used as a cursor from the beginning of the array. We'll update ppos each time to reflect where the cursor should be, and the function calling our read function will store it for us until the next call.

Other than that, the read function is pretty analogous to the write function:

static ssize_t read_clipboard(struct file *file, char __user *buf,
			 size_t count, loff_t *ppos)
{
    size_t how_many;

    // we've already read the whole buffer, nothing more to do here
    if (*ppos >= clipboard_size) {
        return 0;
    }

    // how many more bytes are there in the clipboard left to read?
    how_many = clipboard_size - *ppos;

    // see if we have space for the whole clipboard in the user buffer
    // if not we'll only return the first part of the clipboard
    if (how_many > count) {
        how_many = count;
    }

    // populate the user buffer using the clipboard buffer
    if (copy_to_user(buf, clipboard_buffer + *ppos, how_many)) {
        return -EFAULT;
    }

    // advance the cursor to the end position in the clipboard that we are
    // returning
    *ppos += how_many;

    printk("Returned %lu bytes from /dev/clipboard\n", how_many);

    // return the number of bytes we put into the user buffer
    return how_many;
}

And that's it! That's all the code needed to implement a simple memory device in the kernel!

What happens when you pipe stuff to /dev/null?

April 28th, 2021

This question is one of those old chestnuts that have been around for ages. What happens when you send output to /dev/null? It's singular because the bytes just seem to disappear, like into a black hole. No other file on the computer works like that. The question has taken on something of a philosophical dimension in the debates among programmers. It's almost like asking: how do you take a piece of matter and compress it down into nothing - how is that possible?

Well, as of about a week ago I know the answer. And soon you will, too.

/dev/null appears as a file on the filesystem, but it's not really a file. Instead, it's something more like a border crossing. On one side - our side - /dev/null is a file that we can open and write bytes into. But from the kernel's point of view, /dev/null is a device. A device just like a physical piece of hardware: a mouse, or a network card. Of course, /dev/null is not actually a physical device, it's a pseudo device if you will. But the point is that the kernel treats it as a device - it uses the same vocabulary of concepts when dealing with /dev/null as it does dealing with hardware devices.

So if we have a device - and this is not a trick question - what do we need to have to able to use it? A device driver, exactly. So for the kernel /dev/null is a device and its behavior is defined by its device driver. That driver lives in drivers/char/mem.c. Let's have a look!

Near the bottom of the file we find an array definition:

static const struct memdev {
	const char *name;
	umode_t mode;
	const struct file_operations *fops;
	fmode_t fmode;
} devlist[] = {
	 [DEVMEM_MINOR] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
	 [2] = { "kmem", 0, &kmem_fops, FMODE_UNSIGNED_OFFSET },
	 [3] = { "null", 0666, &null_fops, 0 },
	 [4] = { "port", 0, &port_fops, 0 },
	 [5] = { "zero", 0666, &zero_fops, 0 },
	 [7] = { "full", 0666, &full_fops, 0 },
	 [8] = { "random", 0666, &random_fops, 0 },
	 [9] = { "urandom", 0666, &urandom_fops, 0 },
	[11] = { "kmsg", 0644, &kmsg_fops, 0 },
};

We don't need to understand all the details here, but just look at the names being defined: mem, null, zero, random, urandom. Where have we seen these names before? We've seen them endless times as /dev/mem, /dev/null, /dev/zero, /dev/random, /dev/urandom. And it's in this file that they are actually defined.

If we focus on the line that defines null we can see that it mentions something called null_fops. This is a struct that defines the behavior of /dev/null. And the struct looks like this:

static const struct file_operations null_fops = {
	.llseek		= null_lseek,
	.read		= read_null,
	.write		= write_null,
	.read_iter	= read_iter_null,
	.write_iter	= write_iter_null,
	.splice_write	= splice_write_null,
};

The values being used to populate this struct are function pointers. So when /dev/null is being written to the function that is responsible for this is write_null. And when /dev/null is being read from (it's not often we read from /dev/null) the function responsible for that is read_null.

Alright, we are just about to find out what happens when you write to /dev/null. The moment you have been waiting for. Are you ready for this? Here we go:

static ssize_t write_null(struct file *file, const char __user *buf,
                          size_t count, loff_t *ppos)
{
	return count;
}

write_null gets two arguments that are of interest to us: the bytes that we sent to /dev/null - represented by buf, and the size of that buffer - represented by count. And what does write_null do with these bytes? Nothing, nothing at all! All it does is return count to confirm that "we've received your bytes, sir". And that's it.

Writing to /dev/null is literally calling a function that ignores its argument. As simple as that. It doesn't look like that because /dev/null appears to us as a file, and the write causes a switch from user mode into kernel mode, and the bytes we send pass through a whole bunch of functions inside the kernel, but at the very end they are sent to the device driver that implements /dev/null and that driver does... nothing! How do you implement doing nothing? You write a function that takes an argument and doesn't use it. Genius!

And considering how often we use /dev/null to discard bytes we don't want to write to a (real) file and we don't want to appear in the terminal, it's actually an incredibly valuable implementation of nothing! It's the most useful nothing on the computer!

htop: cpu and memory usage stats

April 13th, 2021

htop is an enhancement over regular top and it's a very popular tool. But did you ever ask yourself how it actually works? In this article we'll be looking at where htop gets cpu and memory utilization information from. Given that htop runs on many different platforms, we'll be discussing just the Linux version of the story.

Cpu utilization per cpu

htop displays cpu utilization for each cpu. This is one the most key things we use htop for, in order to gauge the current load on the system.

The information comes from /proc/stat (documented here). This file contains a few different counters, but what's of interest to us right here are just the cpu lines which look like this:

cpu  5183484 9992 1575742 162186539 903310 0 27048 0 0 0
cpu0 1355329 2304 389040 40426679 299055 0 6431 0 0 0
cpu1 1234845 2602 423662 40594393 187209 0 16487 0 0 0
cpu2 1347723 2837 413561 40442958 239035 0 4085 0 0 0
cpu3 1245586 2246 349478 40722507 178009 0 44 0 0 0

So what are these numbers? Well, each number represents the amount of time spent in a particular state, by that cpu. This is true for each line that begins with cpuN. So cpu0 (which in htop is displayed as cpu #1) spent:

  1. 1355329 units of time in user mode (ie. running user processes)
  2. 2304 units of time in nice mode (ie. running user processes with a nice setting)
  3. 389040 units of time in system mode (ie. running kernel processes)
  4. 40426679 units of time in idle mode (ie. not doing anything)
  5. 299055 units of time in iowait (ie. waiting for io to become ready)
  6. 0 units of time servicing interrupts
  7. 6431 units of time servicing soft interrupts
  8. 0 units of time where the VM guest was waiting for the host CPU (if we're running in a VM)
  9. 0 units of time where we're running a VM guest
  10. 0 units of time where the VM guest is running with a nice setting

The first line in the file is simply an aggregate of all the per-cpu lines.

This is effectively the granularity that the kernel gives us about what the cpu spent time doing. The unit is something called USER_HZ, which is 100 on this system. So if we spent 1,355,329 units in user mode, that means 1355329 / 100 = 13,553 seconds (3.76 hours) spent running user processes since the system booted. By contrast, we spent 4.67 days in idle time, so this is clearly not a system under sustained load.

So how does htop use this? Each time it updates the ui it reads the contents of /proc/stat. Here are two subsequent readings one second apart, which show just the values for cpu0:

# time=0s
cpu0 1366294 2305 392684 40566185 300222 0 6590 0 0 0
# time=1s
cpu0 1366296 2305 392684 40566283 300222 0 6590 0 0 0

# compute the delta between the readings
cpu0 2 0 0 98 0 0 0 0 0 0

We can see that between the first and the second reading we spent 2 units in user mode and 98 units in idle mode. If we add up all of the numbers (2 + 98 = 100) we can see that the cpu spent 2 / 100 = 2% of its time running user processes, which means cpu utilization for cpu0 would be displayed as 2%.

Cpu utilization per process

Cpu utilization per process is actually measured in a very similar way. This time the file being read is /proc/<pid>/stat (documented here) which contains a whole bunch of counters about the process in question. Here it is for the X server:

939 (Xorg) S 904 939 939 1025 939 4194560 233020 15663 1847 225 297398 280532 25 14 20 0 10 0 4719 789872640 10552 18446744073709551615 93843303677952 93843305320677 140727447720048 0 0 0 0 4096 1098933999 0 0 0 17 3 0 0 3235 0 0 93843305821872 93843305878768 93843309584384 140727447727751 140727447727899 140727447727899 140727447728101 0

Fields 14 and 15 are the ones we are looking for here, because they represent respectively:

  • utime, or the amount of time this process has been scheduled in user mode
  • stime, or the amount of time this process has been scheduled in kernel mode

These numbers are measured in the same unit we've seen before, namely USER_HZ. htop will thus calculate the cpu utilization per process as: ((utime + stime) - (utime_prev + stime_prev)) / USER_HZ.

htop will calculate this for every running process each time it updates.

Memory utilization on the system

htop displays memory utilization in terms of physical memory and swap space.

This information is read from /proc/meminfo (documented here) which looks like this (here showing just the lines that htop cares about):

MemTotal:        3723752 kB
MemFree:          180308 kB
MemAvailable:     558240 kB
Buffers:           66816 kB
Cached:           782608 kB
SReclaimable:      87904 kB
SwapTotal:       1003516 kB
SwapCached:        13348 kB
SwapFree:         317256 kB
Shmem:            313436 kB

Unlike the cpu stats, these are not counters that accumulate over time, they are point in time snapshots. Calculating the current memory utilization comes down to MemTotal - MemFree. Likewise, calculating swap usage means SwapTotal - SwapFree - SwapCached.

htop uses the other components of memory use (buffers, cached, shared mem) to color parts of the progress bar accordingly.

Memory utilization per process

Memory utilization per process is shown as four numbers:

  • virtual memory, ie. how much memory the process has allocated (but not necessarily used yet)
  • resident memory, ie. how much memory the process currently uses
  • shared memory, ie. how much of its resident memory is composed of shared libraries that other processes are using
  • memory utilization %, ie. how much of physical memory this process is using. This is based on the resident memory number.

This information is read from the file /proc/<pid>/statm (documented here). The file looks like this:

196790 9938 4488 402 0 28226 0

This is the X server process once again, and the numbers mean:

  1. virtual memory size
  2. resident memory size
  3. shared memory size
  4. size of the program code (binary code)
  5. unused
  6. size of the data + stack of the program
  7. unused

These numbers are in terms of the page size, which on this system is 4096. So to calculate the resident memory for Xorg htop does 9938 * 4096 = 38mb. To calculate the percentage of system memory this process uses htop does (9938 * 4096) / (3723752 * 1024) = 1.1% using the MemTotal number from before.

Conclusion

As we have seen in this practical example the kernel provides super useful information through a file based interface. These are not really files per se, because it's just in-memory state inside the kernel made available through the file system. So there is minimal overhead associated with opening/reading/closing these files. And this interface makes it very accessible to both sophisticated programs like htop as well as simple scripts to access, because there is no need to link against system libraries. Arguably, this API makes the information more discoverable because any user on the system can cat files on the /proc file system to see what they contain.

The downside is that these are files in text format which have to be parsed. If the format of the file changes over time the parsing logic may break, and a program like htop has to account for the fact that newer kernel versions may add additional fields. In Linux there is also an evolution in progress where the /proc file system remains, but more and more information is exposed through the /sys file system.

Un poème pour Didier

July 22nd, 2018

Une fois, Aimé, avec son capitaine, Didier, a tout gagné.

Ensuite Didier, vingt ans après, son objectif à bien fixé.

Parmi tous les joueurs, une équipe il a dû sélectionné.

Il a pas tardé, que tout le monde lui est venu la contesté.

Patiemment, tous les adversaires il les a éliminé.

Même les sublimes croates, ils ont le mieux essayé.

À la fin, quand même, c'est Didier qui a gagné.

Backup/reinstall checkist

March 14th, 2018

Backup checklist

* Top level dirs in ~

* Dot files/dirs in ~

* dpkg -l

Prepare install media

dd if=image.iso of=/dev/sdX bs=4M status=progress oflag=sync && sync

Boot from install media

Switch BIOS to use SATA in AHCI mode, not RAID mode. Otherwise the drive may not be detected.

To see block devices: lsblk

To see details of hardware: lspci -vvv or lshw

Reinstall checklist

* /etc/sudoers

* apt: curl emacs-gtk evince gimp git gitk htop iotop ipython3 kdiff3 mpv net-tools network-manager-openconnect openssh-server python3-pip ttf-bitstream-vera ttf-mscorefonts-installer vim-gtk vlc yakuake

* pip: nametrans reps

* rustup: ripgrep

* manual: chrome dropbox skype vscode zoom

* ~/.xmodmaprc (use xev to discover mouse button numbers)

* common_local.sh

* re pull

Config checklist

* focus follows mouse, 0ms delay

* kde: desktop search off

* kde: startup with empty session

* kde: taskbar do not group items

* konsole: set font size & scrollback buffer for profile, set profile as default

* browsers: smooth scrolling off (chrome: chrome://flags/#smooth-scrolling)

Test checklist

* office vpn

ttf-bitstream-vera