KVM, QEMU and Big Iron: March 2017

Thursday, March 30, 2017

Oldies but Goldies: Channel I/O KVM Forum 2012 talk

Some of the information has since been superseded, but the slides from my talk at the 2012 KVM Forum contain material that may still be interesting. (Sadly, no video of the talk was recorded.)

Tuesday, March 28, 2017

Channel I/O: Talking to devices

Having a nice set of channel devices available to your OS is all fine and good; but how do you actually talk to them? This post attempts to give a high-level overview, while also explaining some more acronyms.

Let's look again at the example configuration from the last post:

Device   Subchan. DevType CU Type Use PIM PAM POM CHPIDs
----------------------------------------------------------------------
0.0.0000 0.0.0000 0000/00 3832/01 yes 80 80 ff   00000000 00000000
0.0.0042 0.0.0001 0000/00 3832/02 yes 80 80 ff   00000000 00000000

The second device is the virtio-blk device 0.0.0042 on subchannel 0.0.0001, using channel path 0. Being virtio, this is a very simplified variation of what you'd see on real hardware (although this simplicity can also be a benefit). Think of it as the following:

Device 0.0.0042 is accessed via channel path 0, and subchannel 0.0.0001 is used as a means to address it.

The access (channel path) is configured in the hypervisor (or in the hardware definitions). The subchannel is what the OS will use as a target for I/O instructions and how it can associate I/O interrupts with the device they are for.

I/O instructions? There's a whole zoo of them, but they share some characteristics:

They take a subchannel identifier as parameter.
They are privileged, i.e., on a Linux system they can only be issued by the kernel and not from user space.

Check the Principles of Operation (SA22-7832), chapter 14, for the whole story. Here, I'll concentrate just on two instructions:

START SUBCHANNEL (SSCH) - start a channel program
TEST SUBCHANNEL (TSCH) - retrieve subchannel status

So, what's a channel program? It's basically a set of instructions sent to the control unit and executed. You can even branch in them. The basic unit of those instructions is the channel command word (CCW). The ccw is such a basic characteristic of channel I/O that it is used throughout Linux and QEMU for channel devices: For example, Linux has ccw_devices on the ccw bus, and QEMU has CcwDevices, most notably VirtIOCcwDevices (like the devices in the example).

ccws consist of three parts:

The command. This falls into the categories of read (read data from the device), write (write data to the device) or control (for example, rewinding a tape). An 8-bit value.
The flags, which control error handling or program flow. I'll ignore them for simplicity here.
The data address. This is an address in memory where data is written to (read) or read from (write).
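To make the layout a bit more tangible, here is a minimal Python sketch that packs a single ccw. It assumes the format-1 layout (command, flags, a 16-bit byte count, and a 31-bit data address in 8 bytes); the byte count is a field I did not list above, and the function and constant names are made up for this illustration rather than taken from Linux or QEMU.

```python
import struct

# Two of the flag bits of a format-1 ccw (constant names are illustrative).
CCW_FLAG_CC = 0x40   # command chaining: continue with the next ccw
CCW_FLAG_SLI = 0x20  # suppress incorrect-length indication

CMD_SENSE_ID = 0xE4


def pack_ccw1(cmd: int, flags: int, count: int, data_addr: int) -> bytes:
    """Pack a format-1 ccw: command, flags, byte count, data address."""
    if not 0 <= cmd <= 0xFF:
        raise ValueError("command code must fit into 8 bits")
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit into 8 bits")
    if not 0 <= count <= 0xFFFF:
        raise ValueError("count must fit into 16 bits")
    if not 0 <= data_addr < 2**31:
        raise ValueError("format-1 data addresses are 31 bit")
    # Big-endian: 1 byte command, 1 byte flags, 2 bytes count, 4 bytes address
    return struct.pack(">BBHI", cmd, flags, count, data_addr)


# A SENSE ID ccw asking for up to 64 bytes at (hypothetical) address 0x10000.
ccw = pack_ccw1(CMD_SENSE_ID, CCW_FLAG_SLI, 64, 0x10000)
print(ccw.hex())
```

A real channel program would place one or more of these 8-byte words in guest memory, chained via the flags.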

Let's take an example. The SENSE ID command is a basic operation supported by both virtual devices like the virtio devices and real hardware devices. It is used to obtain configuration information from the device, like the CU type information in the output above. It is usually the first ccw an operating system issues to the device.

The operating system will assemble a ccw: The command code will be 0xe4 for SENSE ID, and the data address will point to a location where the OS wants to have the obtained information. The OS will also assemble a so-called ORB (operation request block), which, amongst other things, points to the assembled ccw (or to the first one in a chain). This ORB and the subchannel id are the two parameters for the SSCH instruction. If all goes well, the OS will receive a condition code 0 and knows that it will be signalled asynchronously once the channel program has been processed (successfully or with errors)¹.

Processing of the actual channel command is done asynchronously by real hardware (QEMU does it synchronously for simplicity). The result is that the wanted data is put into the memory area referred to by the ccw². Subsequently, the subchannel is made status pending: Information is ready for retrieval by the OS.
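For the SENSE ID example, the data that ends up in that memory area can be picked apart as in the following sketch. The field layout (a 0xff byte, then CU type, CU model, device type, and device model) is my recollection of the basic SENSE ID response; anything following those seven bytes is ignored here, and the function name is made up for this illustration.

```python
import struct


def parse_sense_id(buf: bytes) -> dict:
    """Decode the first seven bytes of a SENSE ID response."""
    if len(buf) < 7:
        raise ValueError("SENSE ID data too short")
    # Big-endian: marker byte, 16-bit CU type, 8-bit CU model,
    # 16-bit device type, 8-bit device model
    _marker, cu_type, cu_model, dev_type, dev_model = struct.unpack_from(
        ">BHBHB", buf
    )
    return {
        "cu_type": cu_type,
        "cu_model": cu_model,
        "dev_type": dev_type,
        "dev_model": dev_model,
    }


# What the virtio-blk device from the example would plausibly report.
info = parse_sense_id(bytes([0xFF, 0x38, 0x32, 0x02, 0x00, 0x00, 0x00]))
print(f"{info['dev_type']:04x}/{info['dev_model']:02x} "
      f"{info['cu_type']:04x}/{info['cu_model']:02x}")
```

The printed values correspond to the DevType and CU Type columns of the lscss output.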

Usually, the OS wants to have a notification that the subchannel became status pending; this is done via an I/O interrupt. I/O interrupts on s390 carry extra status which is written to the low memory area of the cpu receiving the interrupt; amongst other things, this status contains the subchannel id.

Next, the OS needs to actually retrieve the status information: This is done via the TSCH instruction, which in turn makes the subchannel no longer status pending and ready for the next I/O request via SSCH. The status contains enough information for the OS to determine whether the request was successful (and the sense id information has been stored), or whether there was an error.³
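The SSCH / status pending / TSCH cycle described above can be modelled with a toy state machine. This is only a sketch under heavy simplification: the class and its methods are invented for illustration, the channel program is just a Python callable that runs synchronously (as QEMU does), and only condition codes 0 and 1 are modelled.

```python
class ToySubchannel:
    """Very reduced model of one subchannel; not how QEMU structures it."""

    def __init__(self, schid: int):
        self.schid = schid
        self.status_pending = False
        self.status = None

    def ssch(self, channel_program) -> int:
        """START SUBCHANNEL: returns a condition code."""
        if self.status_pending:
            return 1  # cc 1: status must be collected via TSCH first
        try:
            channel_program()  # processed synchronously in this toy model
            self.status = "ok"
        except Exception as err:
            self.status = f"error: {err}"
        self.status_pending = True  # an I/O interrupt would be queued here
        return 0  # cc 0: request accepted

    def tsch(self):
        """TEST SUBCHANNEL: returns (condition code, status)."""
        if not self.status_pending:
            return 1, None  # cc 1: nothing was pending
        status, self.status = self.status, None
        self.status_pending = False  # ready for the next SSCH
        return 0, status


memory = {}
sch = ToySubchannel(0x00010001)
print(sch.ssch(lambda: memory.update(cu_type=0x3832)))  # accepted
print(sch.ssch(lambda: None))  # refused while status is pending
print(sch.tsch())              # collects the status, clears pending
```

The second SSCH is refused because the status of the first request has not been collected yet; after TSCH, the subchannel accepts new requests again.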

Of course, this is all only scratching the surface of channel programs; interested readers can peek at the Linux kernel and QEMU to get a feel for both parts or at the Principles of Operation for the whole story.⁴

¹ In the Linux source code, you'll find this under drivers/s390/cio/
² In the QEMU source code, you'll find channel command interpretation under hw/s390x/css.c
³ Again, you'll find this under drivers/s390/cio/ in the Linux source code
⁴ Command chaining, channel path management, and I/O instructions to terminate a channel program are just some of the interesting topics.

Tuesday, March 21, 2017

Channel I/O: What's in a channel subsystem?

When you start trying to get familiar with channel I/O and its concepts, one thing you notice is usually a host of very similar-sounding acronyms that are easily confused. The easiest way to get a hold of this is probably to look at a small machine started by QEMU and to examine what a Linux guest sees.

So, let's start with the following command line:

s390x-softmmu/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -m 1024 -nographic -drive file=/dev/dasdb,if=none,id=drive-virtio-disk0,format=raw,serial=ccwdasd1,cache=none -device virtio-blk-ccw,devno=fe.0.0042,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,scsi=off

(Note that this assumes you're running an s390x system and have a bootable system on /dev/dasdb.)

This will start up a machine with two channel devices: one virtio-blk (as specified on the command line) and one virtio-net (always autogenerated unless explicitly turned off).

Let's log into the guest via the console¹ and examine what channel devices Linux sees:

[root@localhost ~]# lscss
Device   Subchan. DevType CU Type Use PIM PAM POM CHPIDs
----------------------------------------------------------------------
0.0.0000 0.0.0000 0000/00 3832/01 yes 80 80 ff   00000000 00000000
0.0.0042 0.0.0001 0000/00 3832/02 yes 80 80 ff   00000000 00000000

Let's go through this information column-by-column.

Device is the identifier for the device, which is unique guest-wide. The xx.y.zzzz format (often called bus id) is specific to Linux (and has leaked over into QEMU) and is made up of the following elements:

The channel subsystem id (cssid) xx (here: 0, as on all current Linux systems)
The subchannel set id (ssid) y (here: 0, can be any value from 0-3 on current Linux systems)
The device number (devno) zzzz (here: 0000 and 0042, respectively; can be any value from 0-0xffff)
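As a small illustration of the xx.y.zzzz format, the following Python sketch splits a bus id into its three elements. The names are made up for this example; the accepted ranges follow the list above (all fields are hexadecimal).

```python
import re
from typing import NamedTuple


class BusId(NamedTuple):
    cssid: int
    ssid: int
    devno: int


# xx = one or two hex digits, y = 0-3, zzzz = exactly four hex digits
_BUS_ID_RE = re.compile(r"^([0-9a-f]{1,2})\.([0-3])\.([0-9a-f]{4})$", re.IGNORECASE)


def parse_bus_id(text: str) -> BusId:
    """Split a bus id like '0.0.0042' into cssid, ssid and devno."""
    match = _BUS_ID_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid bus id: {text!r}")
    cssid, ssid, devno = (int(part, 16) for part in match.groups())
    return BusId(cssid, ssid, devno)


for bus_id in ("0.0.0000", "0.0.0042"):
    print(bus_id, parse_bus_id(bus_id))
```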

The two values in this example have different origins:

0.0.0000 (the virtio-net device) has been autogenerated.
0.0.0042 (the virtio-blk device) has been specified on the command line.

But wait: The value on the command line was fe.0.0042, wasn't it? I will explain this in a later post; just remember for now that you specify the cssid fe for a virtio device on the QEMU command line and it will show up as cssid 0 in the Linux guest.

The devno basically belongs to the device; cssid and ssid indicate the addressing within the channel subsystem, which is why we encounter them again in the next id, Subchan.

This is the identifier for the subchannel, which is basically the means to actually address the device. It again uses the xx.y.zzzz format and is made up of the following elements:

The cssid xx (same as for the device)
The ssid y (again, the same as for the device)
The subchannel number zzzz (here: 0000 and 0001, respectively; generally not the same as the devno, although it can likewise be any value from 0-0xffff)

These values are always autogenerated by QEMU (i.e., you can't specify them on the command line). They basically depend on the order in which devices are initialized (either from the initial command line, autogenerated or via device hotplug) - the only restriction is that the cssid and ssid are set by the device's bus id, if specified. The reasoning behind this is that a subchannel is only a means to access the device and as such needs only to be unique, but not pre-defined.

In contrast to the bus id for a device (which is a Linux and QEMU construct), the bus id for a subchannel actually has an equivalent in the architecture: the subchannel-identification word (often referred to as schid in Linux and QEMU), which is basically a 32 bit value composed of the cssid, the ssid, and the subchannel number. This is used to address a device via a certain subchannel by the various channel I/O related instructions.
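How such a 32-bit value could be put together is sketched below. The bit positions (cssid in the topmost byte together with a flag marking it as valid, the ssid next to an always-one bit, and the subchannel number in the lower half) reflect my reading of how Linux and QEMU lay out the schid; treat them as an assumption and check the Principles of Operation for the authoritative definition. The function name and the multi_css parameter are made up for this example.

```python
def build_schid(cssid: int, ssid: int, schno: int, multi_css: bool = False) -> int:
    """Compose a 32-bit subchannel-identification word (assumed layout)."""
    if not (0 <= cssid <= 0xFF and 0 <= ssid <= 3 and 0 <= schno <= 0xFFFF):
        raise ValueError("cssid, ssid or subchannel number out of range")
    upper = (ssid << 1) | 1  # ssid plus the bit that is always one
    if multi_css:
        # The cssid is only filled in together with its validity flag.
        upper |= (cssid << 8) | (1 << 3)
    return (upper << 16) | schno


# Subchannel 0.0.0001 from the lscss output above.
print(f"{build_schid(0, 0, 1):#010x}")
```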

The next two columns, DevType and CU Type, are part of the self description element of channel devices: The concept is that the operating system asks the device nicely to identify itself and the device responds with information about its type and what it can do. The device and the control unit are, in principle, two separate, cascaded entities; for virtio purposes, you can think of the device as the virtio backend (like the virtio-blk device) and of the control unit as the virtio proxy device (like the pci device used to access virtio devices on other platforms). That's also the reason why the device type is always zero for virtio devices. The control unit type is of the form aaaa/bb and consists of the following elements:

The type aaaa (a value from 0-0xffff; 0x3832 denotes a virtio-ccw control unit)
The model bb (a value from 0-0xff; for virtio devices, this is the device id as specified by the virtio standard)

In our example, we can therefore see that device 0.0.0000 is a virtio-net device (CU model 1) and device 0.0.0042 is a virtio-blk device (CU model 2).
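The same lookup can be written down as a short sketch. The table contains only a few device ids from the virtio standard, and the function name is invented for this example.

```python
VIRTIO_CU_TYPE = 0x3832

# A few device ids from the virtio standard (not a complete list).
VIRTIO_DEVICE_IDS = {1: "net", 2: "blk", 3: "console", 4: "rng", 5: "balloon"}


def describe_cu(cu_field: str) -> str:
    """Interpret the 'CU Type' column of lscss, e.g. '3832/02'."""
    type_text, model_text = cu_field.split("/")
    cu_type, cu_model = int(type_text, 16), int(model_text, 16)
    if cu_type != VIRTIO_CU_TYPE:
        return f"non-virtio control unit {cu_type:04x}/{cu_model:02x}"
    name = VIRTIO_DEVICE_IDS.get(cu_model)
    if name is None:
        return f"virtio device with id {cu_model}"
    return f"virtio-{name}"


for field in ("3832/01", "3832/02"):
    print(field, "->", describe_cu(field))
```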

The next column, Use, points to a big difference from other I/O architectures: In order to be able to use a subchannel to talk to a device, the operating system first needs to enable it. For virtio devices, this is done by the Linux driver by default (see the 'yes' for all devices); for other device types, this needs to be triggered by Linux user space (which implies that you can't simply go ahead and use a device, you always need to do some kind of setup).

The last four columns, PIM, PAM, POM and CHPIDs, deal with channel paths, a topic that is completely irrelevant for QEMU guests but very interesting on real hardware. Just a quick overview:

PIM (path installed mask), PAM (path available mask) and POM (path operational mask) are all 8-bit values in which each bit corresponds to one of eight channel paths. If the corresponding bit is set in all three masks, the channel path can be used for I/O.
CHPIDs are channel path identifiers: Each channel path has an id from 0-0xff, which is unique in combination with the cssid. For virtio devices, there's only one valid channel path with the id 0².
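Put into code, determining the usable channel paths from one line of lscss output might look like the following sketch. It assumes that the leftmost bit of each mask belongs to the first of the eight CHPID bytes; the function name is made up for this example.

```python
def usable_chpids(pim: int, pam: int, pom: int, chpids: str) -> list:
    """Return the chpids whose bit is set in all three path masks."""
    usable = pim & pam & pom
    ids = bytes.fromhex(chpids.replace(" ", ""))
    if len(ids) != 8:
        raise ValueError("expected eight chpid bytes")
    # Bit 0x80 belongs to the first chpid, bit 0x01 to the last one.
    return [ids[i] for i in range(8) if usable & (0x80 >> i)]


# Values from the lscss output above: only the first path (chpid 0) is usable.
print(usable_chpids(0x80, 0x80, 0xFF, "00000000 00000000"))
```

This also shows where the confusion comes from: the seven unused CHPID slots read as 00 as well and can only be told apart from the one valid path by looking at the masks.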

Channel paths on real hardware correspond (put simply) to the connections between the actual mainframe and e.g. the storage server containing the disk devices. The setup is usually redundant, and load balancing and failover are possible between the paths. The channel paths are not per-device; usually, a set of devices shares a set of channel paths. For a virtual setup like a QEMU guest with only virtio devices, there is no real equivalent for this. Therefore, there's only a virtual channel path which does nothing but satisfy the architecture. This means that the output of the following command is not very interesting for our example guest:

[root@localhost ~]# lschp
CHPID Vary Cfg. Type Cmg Shared PCHID
============================================
0.00 1 - 32 - - -

CHPID is the channel-path identifier of the form xx.nn, where xx is the cssid and nn the chpid. This is always 0.00 on virtio-only guests.
Vary means that the channel path is online to the guest. You don't want to change this for the only path.
Type is the channel-path type. 0x32 is a reserved type for the virtio virtual channel path.

All of this does not explain how Linux actually talks to those devices (and how QEMU emulates this). I'll get to that in a future post.

¹ A VT220 compatible console via SCLP is automatically generated.
² Which, in hindsight, turned out to not be the cleverest choice - see the confusing output of lscss.