Thursday, November 9, 2017

Notes from KVM Forum 2017

KVM Forum 2017 took place in Prague Oct 25 - 27 and I had the pleasure of attending. Let me share some of my notes and observations (not exhaustive in any way).

General notes

KVM Forum this year was quite large, but with enough space to sit down and talk to others (or do some hacking). As always, the hallway track was great to meet some people (both old acquaintances and folks you never met in real life before) and to discuss things face-to-face that would take more time done via the mailing lists or IRC.

The first day featured a single track (shared with OSS Europe), and also the invitation-only QEMU summit in the morning (minutes will be posted to qemu-devel). Days two and three were dual-track except for the first sessions. Obviously, this means I was only able to see a subset of sessions (and one also needs a break sometimes...); fortunately, videos are slowly making their way unto youtube (check here for updates).

Keynotes

Christian Bornträger presented the KVM status, Paolo Bonzini the QEMU status and Peter Krempa the libvirt status. We seem to have a healthy development community, and interesting new topics still come up.

Architectures

I listened with interest to Christoffer Dall's talks about KVM on ARM: Reducing hypervisor overhead, and nested virtualization. Seeing the virtualization architecture on ARM makes me really glad to work on s390, and it makes the efforts of the ARM folks all the more impressive (writing code for an architecture revision for which no hardware yet exists... oh dear).

Devices

Virtio is currently moving towards the 1.1 revision of the standard, with one of the biggest changes a new ring layout. Jens Freimann (who took over last minute from Michael S. Tsirkin, who unfortunately could not make it) presented about this ongoing work, and also gave some tips on how to get changes included in the standard (let me point to the OASIS virtio TC here). There was also a presentation about virtio-crypto, which I unfortunately was not able to attend.

VFIO (and the mediated device framework) continues to be a topic of interest. There were talks about enabling migration, buffer sharing, and adding support for a new platform bus. On a related note, we (the s390 maintainers) were able to sit down with Alex Williamson and Daniel Berrange to discuss the vfio-ap proposal for s390 crypto cards; this approach has a good chance of being workable.

Hannes Reinecke and Paolo Bonzini presented about the challenges virtualizing SCSI Fibre Channel (NPIV): A good overview of what the challenges are and how we might be able to solve them.

Also of note: Improving virtio-blk performance with polling, introducing a paravirtualized RDMA device, and vhost-user-scsi and SPDK.

TCG

I'm currently trying to learn more about tcg and used the opportunity to attend two talks.

Alessandro Di Federico started his talk with a good general introduction to tcg. His work to split out a libtcg is interesting, but it might be done at the wrong level for usage with QEMU (or so I understood from the discussion.)

I also enjoyed Alex BennĂ©e's talk about handling vectors in tcg (complete with an historical overview of vector instructions).

Infrastructure

David Gilbert presented on the various reasons a migration might fail. Most migrations don't fail, so people tend to do it more and more in an automated way. If you have the misfortune of having a migration actually do fail on you, his talk gives a lot of pointers on how to find out why it happened (and hopefully, avoiding the problem in the future.)

Markus Armbruster gave an overview of the current QEMU command line infrastructure and how to improve it via QAPIfication (we probably need to sacrifice some rubber chickens backward compatibility).

Also: KubeVirt, running large deployments via libvirt, and the effects of changed defaults on users.

Conclusion

KVM Forum 2017 was really interesting (if somewhat exhausting) for learning about new developments and discussing with people; if you are working in the area, I can only recommend trying to attend KVM Forum.

Tuesday, August 8, 2017

Channel I/O: More about channel paths

recent discussion on qemu-devel touched upon some aspects of channel paths and their handling (or not-handling) in QEMU. I will try to summarize and give some further information here.

I previously published some information on channel paths here. This post will concentrate a bit more on aspects that are not yet relevant in QEMU, but may become so in the future.

To recap: Channel paths represent the means by which the mainframe talks to the device - it (somewhat) corresponds to the actual cabling. Let's take a look at the output of lscss on a z/VM guest as an actual example:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0150 0.0.0000  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.0151 0.0.0001  0000/00 3088/08      80  80  ff   08000000 00000000
0.0.8000 0.0.0002  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8001 0.0.0003  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8002 0.0.0004  1732/01 1731/01 yes  80  80  ff   00000000 00000000
0.0.8003 0.0.0005  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8004 0.0.0006  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.8005 0.0.0007  1732/01 1731/01      80  80  ff   01000000 00000000
0.0.0191 0.0.0008  3390/0a 3990/e9      e0  e0  ff   2a3a0900 00000000
0.0.208f 0.0.0009  3390/0c 3990/e9 yes  e0  e0  ff   3a2a1a00 00000000
0.0.218f 0.0.000a  3390/0c 3990/e9 yes  e0  e0  ff   2a3a0900 00000000
0.0.228f 0.0.000b  3390/0c 3990/e9 yes  e0  e0  ff   2a3a1a00 00000000
0.0.238f 0.0.000c  3390/0c 3990/e9 yes  e0  e0  ff   093a2a00 00000000
0.0.000c 0.0.000d  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000d 0.0.000e  0000/00 2540/00      80  80  ff   08000000 00000000
0.0.000e 0.0.000f  0000/00 1403/00      80  80  ff   08000000 00000000
0.0.0009 0.0.0010  0000/00 3215/00 yes  80  80  ff   08000000 00000000
0.0.0190 0.0.0011  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.019d 0.0.0012  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.019e 0.0.0013  3390/0a 3990/e9      e0  e0  ff   093a2a00 00000000
0.0.0592 0.0.0014  3390/0a 3990/e9      e0  e0  ff   3a2a1a00 00000000
0.0.ffff 0.0.0015  9336/10 6310/80      80  80  ff   08000000 00000000

A couple of interesting observations with regard to channel paths can be made here:
  • Devices 0.0.0150/0.0.0151, 0.0.000c/0.0.000d, 0.0.000e, 0.0.0009, and 0.0.ffff all share the same channel path, 08, despite being of different types (virtual CTC, virtual card punch/card reader/printer, virtual console, and virtual FBA DASD). This is because they are all emulated devices, and z/VM chooses to use the same virtual channel path for them.
  • Devices 0.0.8000 - 0.0.8002 uses channel path 0 as their only channel path, as can be seen by the PIM being 80.
  • Devices 0.0.8000 - 0.0.8002 and 0.0.8003-0.0.8005 use the same channel path, respectively; that is because they make up the device triplet for an OSA device.
  • The remaining devices (all ECKD DASD) use several channel paths (09, 1a, 2a, 3a), but only three at a time (as evidenced by the PIM of 0e), and also in different combination. This is probably a quirk of the individual setup for this guest.
The output of lschp of the same guest looks like this:

CHPID  Vary  Cfg.  Type  Cmg  Shared  PCHID
============================================
0.00   1     1     11    -    -      (ff00)
0.01   1     1     11    -    -      (ff01)
0.08   1     1     1a    -    -       0598 
0.09   1     1     1a    -    -       0599 
0.0a   1     1     1a    -    -       059c 
0.0b   1     1     25    -    -       059d 
0.0c   1     1     1a    -    -       05ac 
0.17   1     1     11    -    -       05b4 
0.18   1     1     1a    -    -       05a0 
0.19   1     1     1a    -    -       05a1 
0.1a   1     1     1a    -    -       05a4 
0.1b   1     1     25    -    -       05a5 
0.1c   1     1     1a    -    -       05ad 
0.28   1     1     1a    -    -       05d8 
0.29   1     1     1a    -    -       05d9 
0.2a   1     1     1a    -    -       05dc 
0.2b   1     1     25    -    -       05dd 
0.2c   1     1     1a    -    -       05d0 
0.34   1     1     11    -    -       05ec 
0.35   1     1     11    -    -       05a8 
0.38   1     1     1a    -    -       05e0 
0.39   1     1     1a    -    -       05e1 
0.3a   1     1     1a    -    -       05e4 
0.3b   1     1     25    -    -       05e5 
0.3c   1     1     1a    -    -       05d1 
0.60   1     1     24    -    -      (070c)
0.61   1     1     24    -    -      (070d)
0.62   1     1     24    -    -      (070e)
0.63   1     1     24    -    -      (070f)

Here we find the various channel paths again, together with more:
  • There are several channel paths that are available to the guest, but not in use by any device currently available to the guest (and therefore not turning up in the output of lscss).
  • Channel paths 00 and 01 (used by the OSA cards) use an internal channel (the number in the last column are in brackets) - we can therefore conclude that the cards are virtualized by z/VM.
  • The channel path 08 (which is referenced by all virtual devices) is actually backed by a physical path (0598). I frankly have no idea why z/VM is doing that.
  • The channel paths used by the ECKD DASD (09, 1a, 2a, 3a) all are of the same type (1a - FICON, IIRC) and are backed by different physical paths (last column).
Various modifications can be done to the channel paths; under Linux, the chchp tool is useful for that. Let's try to vary off a path:

chchp -v 0 0.3a
Vary offline 0.3a... done.

lschp shows the changed state for the path:

0.3a   0     1     1a    -    -       05e4

The lscss output remains unchanged - which isn't surprising as doing a vary off only affects the state of the channel path within Linux: Linux will no longer use the path for I/O, but the path masks as managed by the hardware and z/VM are not changed.

Let's try to configure off another path:

chchp -c 0 0.2a
Configure standby 0.2a... failed - attribute value not as expected

That did not work as expected. Why? This is supposed to issue a SCLP command to set the channel path to standby - but the my guest apparently does not have the rights or ability to do so. Which is a pity, as I would have liked to show the effects of configuring a channel path to standby:
  • It (unsurprisingly) changes the state in lschp.
  • It also changes the path masks, as shown in lscss.
  • It may generate a machine check with a channel report word (CRW) that informs the OS that something has happened to the channel path - this is dependent upon the environment, however.
So let's stop here. I'll continue with another setup, once I have it.

Thursday, June 1, 2017

Linux 4.12 and QEMU 2.10 will have basic support for vfio-ccw

If you want to passthrough some channel devices to your guest, you will be able to do so with a host kernel >= 4.12 and a QEMU >= 2.10.

For some hints about configuration and restrictions, see this entry in the QEMU wiki.

Wednesday, May 10, 2017

Channel I/O: Types of devices

The last posts in this series tried to examine some basic principles of channel I/O. But what kinds of devices are actually available?

I'll focus on the device types that are available to a guest running under QEMU.

The most important channel devices for QEMU guests (and often, the only ones present in a guest) are virtio-ccw devices. These have been used in the previous examples. Think of them as the channel I/O equivalent of virtio-pci devices: that is, a device that is discoverable in the guest and acts as a means to access the virtio device.
All virtio-ccw devices share the following characteristics:
  • Fully virtual (i.e., fully emulated). There is no "real hardware" virtio-ccw device.
  • A control unit type of 0x3832.
  • One virtual channel path, type 0x32.

New (well, to QEMU) and just recently added (will be in 2.10) are 3270 devices (the channel-attached variety). The classic green-screen console; some details about what works (and what yet doesn't) and how to set this up may be found in the QEMU wiki.
3270 devices have the following characteristics:
  • Fully virtual (emulated). While you could passthrough 3270 devices while running under z/VM, this depends on the non-yet-merged vfio-ccw infrastructure (see below) and does not really make much sense.
  • A control unit type of 0x3270.
  • One virtual channel path, type 0x1a.

Still being worked on, but on a good track, is the vfio-ccw infrastructure. The kernel part has been merged for 4.12, the QEMU part will hopefully be merged soon. vfio-ccw brings the same functionality to channel devices that vfio-pci brought to pci devices: Give hardware devices to the guest to use. This is still quite experimental, and has only really been tested with one device type yet: ECKD DASD.
'DASD' basically refers to disks; this wikipedia article explains more. 'ECKD' refers to the data recording format; this wikipedia article probably explains more than you ever wanted to know. Linux accesses ECKD DASD as block devices with some minor oddities.
If you pass through an ECKD DASD, you can expect the following to show up in your guest:
  • A device that corresponds one-to-one to a device on the host, although it might have a different device number (depending on how it was configured).
  • A control unit and device type corresponding to ECKD DASD.  A control unit type of 0x3990 and a device type of 0x3390 are the most likely.
  • One to eight channel paths, corresponding to real channel paths.
vfio-ccw opens the way to expose all kinds of channel devices to QEMU/KVM guests: FBA DASD, channel-attached tapes - basically everything that is supported by Linux on the host.

Wednesday, April 12, 2017

SELinux vs. QEMU/KVM

Trying to run QEMU/KVM under an older version of z/VM? Make sure to read Thomas' hint.

Thursday, March 30, 2017

Oldies but Goldies: Channel I/O KVM Forum 2012 talk

Some of the information has been superseded in the meanwhile, but the slides from my talk at the 2012 KVM Forum contain some information that may still be interesting. (Sadly, no video of the talk was recorded.)

Tuesday, March 28, 2017

Channel I/O: Talking to devices

Having a nice set of channel devices available to your OS is all fine and good; but how do you actually talk to them? This post attempts to give a high-level overview, while also explaining some more acronyms.

Let's look again at the example configuration from the last post:

Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0000 0.0.0000  0000/00 3832/01 yes  80  80  ff   00000000 00000000
0.0.0042 0.0.0001  0000/00 3832/02 yes  80  80  ff   00000000 00000000
The second device is the virtio-blk device 0.0.0042 on subchannel 0.0.0001, having channel path 0. Being virtio, this is a very simplified variation of what you'd see on real hardware (although this also can be a benefit in some way). Think of it as the following:

Device 0.0.0042 is accessed via channel path 0, and subchannel 0.0.0001 is used as a means to address it.

The access (channel path) is configured in the hypervisor (or in the hardware definitions). The subchannel is what the OS will use as a target for I/O instructions and how it can associate I/O interrupts with the device they are for.

I/O instructions? There's a whole zoo of them, but they share some characteristics:
  • They take a subchannel identifier as parameter.
  • They are privileged: I.e., on a Linux system, they can only be issued by the kernel and not from user space.
Check the Principles of Operation (SA22-7832), chapter 14, for the whole story. Here, I'll concentrate just on two instructions:
  • START SUBCHANNEL (SSCH) - start a channel program
  • TEST SUBCHANNEL (TSCH) - retrieve subchannel status
So, what's a channel program? It's basically a set of instructions sent to the control unit and executed. You can even branch in them. The basic unit of those instructions is the channel command word (CCW). The ccw is such a basic characteristic of channel I/O that it is used throughout Linux and QEMU for channel devices: For example, Linux has ccw_devices on the ccw bus, and QEMU has CcwDevices, most notably VirtIOCcwDevices (like the devices in the example).

ccws consist of three parts:
  • The command. This falls into the categories of read (read data from the device), write (write data to the device) or control (for example, rewinding a tape). An 8-bit value.
  • The flags, which control error handling or program flow. I'll ignore them for simplicity here.
  • The data address. This is an address in memory where data is written to (read) or read from (write).
Let's take an example. The SENSE ID command is a basic operation supported by both virtual devices like the virtio devices and real hardware devices. It is used to obtain configuration information from the device, like the CU type information in the output above. It is usually the first ccw an operating system issues to the device.

The operating system will assemble a ccw: The command code will be 0xe4 for SENSE ID, and the data address will point to a location wherethe OS wants to have the obtained information. The OS will also assemble a so-called ORB (operation request block), which, amongst other things, points to the assembled ccw (respectively the first one in a chain). This ORB and the subchannel id are the two parameters for the SSCH instruction. If all goes well, the OS will receive a condition code 0 and knows that it will be signalled asynchronously once the channel program has been processed (successfully or with errors)1.

Processing of the actual channel command is done asynchronously by real hardware (QEMU does it synchronously for simplicity reasons). The result is that the wanted data is put into the memory area refered to by the ccw2. Subsequently, the subchannel is made status pending: Information is ready for retrieval by the OS.

Usually, the OS wants to have a notification that the subchannel became status pending; this is done via an I/O interrupt. I/O interrupts on s390 carry extra status which is written to the low memory area of the cpu receiving the interrupt; amongst other things, this status contains the subchannel id.

Next, the OS needs to actually retreive the status information: This is done via the TSCH instruction, which in turn makes the subchannel no longer status pending and ready for the next I/O request via SSCH. The status contains enough information for the OS to determine whether the request was successful (and the sense id information has been stored), or whether there was an error.3

Of course, this is all only scratching at the surface of channel programs; interested readers can peek at the Linux kernel and QEMU to get a feel for both parts or at the Principles of Operation for the whole story.4

1. In the Linux source code, you'll find this under drivers/s390/cio/
2. In the QEMU source code, you'll find channel command interpretation under hw/s390x/css.c
3. Again, you'll find this under drivers/s390/cio/ in the Linux source code
4. Command chaining, channel path management, I/O instructions to terminate a channel program are just some of the interesting topics.

Tuesday, March 21, 2017

Channel I/O: What's in a channel subsystem?

When you start trying to get familiar with channel I/O and its concepts, one thing you notice is usually a host of very similar-sounding acronyms that are easily confused. The easiest way to get a hold of this is probably to look at a small machine started by QEMU and to examine what a Linux guest sees.

So, let's start with the following command line: 

s390x-softmmu/qemu-system-s390x -machine s390-ccw-virtio,accel=kvm -m 1024 -nographic -drive file=/dev/dasdb,if=none,id=drive-virtio-disk0,format=raw,serial=ccwdasd1,cache=none -device virtio-blk-ccw,devno=fe.0.0042,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,scsi=off

(Note that this assumes you're running an s390x system and have a bootable system on /dev/dasdb.)

This will start up a machine with two channel devices: one virtio-blk (as specified on the command line) and one virtio-net (always autogenerated unless explicitly turned off).

Let's log into the guest via the console1 and examine what channel devices Linux sees:
[root@localhost ~]# lscss
Device   Subchan.  DevType CU Type Use  PIM PAM POM  CHPIDs
----------------------------------------------------------------------
0.0.0000 0.0.0000  0000/00 3832/01 yes  80  80  ff   00000000 00000000
0.0.0042 0.0.0001  0000/00 3832/02 yes  80  80  ff   00000000 00000000

Let's go through this information column-by-column.

Device is the identifier for the device, which is unique guest-wide. The xx.y.zzzz format (often called bus id) is specific to Linux (and has leaked over into QEMU) and is made up of the following elements:
  • The channel subsystem id (cssid) xx (here: 0, as on all current Linux systems)
  • The subchannel set id (ssid) y (here: 0, can be any value from 0-3 on current Linux systems)
  • The device number (devno) zzzz (here: 0000 respectively 0042, can be any value from 0-0xffff)
The two values in this example have different origins:
  • 0.0.0000 (the virtio-net device) has been autogenerated.
  • 0.0.0042 (the virtio-blk device) has been specified on the command line.
But wait: The value on the command line was fe.0.0042, wasn't it? I will explain this in a later post; just remember for now that you specify the cssid fe for a virtio device on the QEMU command line and it will show up as cssid 0 in the Linux guest.

The devno basically belongs to the device; cssid and ssid indicate the addressing within the channel subsystem, which is why we encounter them again in the next id, Subchan.

This is the identifier for the subchannel, which is basically the means to actually address the device. It again uses the xx.y.zzzz format and is made up of the following elements:
  • The cssid xx (same as for the device)
  • The ssid y (again, the same as for the device)
  • The subchannel number zzzz (here: 0000 respectively 0001, generally not the same as the devno, although it can be any value from 0-0xffff as well)
These values are always autogenerated by QEMU (i.e., you can't specify them on the command line). They basically depend on the order in which devices are initialized (either from the initial command line, autogenerated or via device hotplug) - the only restriction is that the cssid and ssid are set by the device's bus id, if specified. The reasoning behind this is that a subchannel is only a means to access the device and as such needs only to be unique, but not pre-defined.

In contrast to the bus id for a device (which is a Linux and QEMU construct), the bus id for a subchannel actually has an equivalent in the architecture: the subchannel-identification word (often referred to as schid in Linux and QEMU), which is basically a 32 bit value composed of the cssid, the ssid, and the subchannel number. This is used to address a device via a certain subchannel by the various channel I/O related instructions.

The next two columns, DevType and CU Type, are part of the self description element of channel devices: The concept is that the operating system asks the device nicely to identify itself and the device responds with information about its type and what it can do. The device and the control unit are, in principle, two separate, cascaded entities; for virtio purposes, you can think of the device as the virtio backend (like the virtio-blk device) and of the control unit as the virtio proxy device (like the pci device used to access virtio devices on other platforms). That's also the reason why the device type is always zero for virtio devices. The control unit type is of the form aaaa/bb and consists of the following elements:
  • The type aaaa (a value from 0-0xffff; 0x3832 denotes a virtio-ccw control unit)
  • The model bb (a value from 0-0xff; for virtio devices, this is the device id as specified by the virtio standard)
In our example, we can therefore see that device 0.0.0000 is a virtio-net device (CU model 1) and device 0.0.0042 is a virtio-blk device (CU model 2).

The next column, Use, points to a big difference from other I/O architectures: In order to be able to use a subchannel to talk to a device, the operating system first needs to enable it. For virtio devices, this is done by the Linux driver by default (see the 'yes' for all devices); for other device types, this needs to be triggered by Linux user space (which implies that you can't simply go ahead and use a device, you always need to do some kind of setup).

The last four columns, PIM, PAM, POM and CHPIDs, deal with channel paths: An issue which is completely irrelevant for QEMU guests, but very interesting on real hardware. Just a quick overview:
  • PIM (path installed mask), PAM (path available mask) and POM (path operational mask) are all 8 bit values corresponding bit-by-bit to one of eight channel paths. If the corresponding bit is set in all of the three masks, the channel path can be used for I/O.
  • CHPIDs are channel path identifiers: Each channel path has an id from 0-0xff, which is unique combined with the relevant cssid. For virtio devices, there's only one valid channel path with the id 02.
Channel paths on real hardware correspond (simplistically spoken) to the connections between the actual mainframe and e.g. the storage server containing the disk devices. The setup is usually redundant, and load balancing and failover is possible between the paths. The channel paths are not per-device; usually, a set of devices shares a set of channel paths. For a virtual setup like a QEMU guest with only virtio devices, there is no real equivalent for this. Therefore, there's only a virtual channel path which does nothing but satisfy the architecture. This means that the output of the following command is not very interesting for our example guest:
[root@localhost ~]# lschp
CHPID  Vary  Cfg.  Type  Cmg  Shared  PCHID
============================================
0.00   1     -     32    -    -       -   

  • CHPID is the channel-path identifier of the form xx.nn, where xx is the cssid and nn the chpid. This is always 0.00 on virtio-only guests.
  • Vary means that the channel path is online to the guest. You don't want to change this for the only path.
  • Type is the channel-path type. 0x32 is a reserved type for the virtio virtual channel path.
All of this does not explain how Linux actually talks to those devices (and how QEMU emulates this). I'll get to that in a future post.
1. A VT220 compatible console via SCLP is automatically generated.
2. Which, in hindsight, turned out to not be the cleverest choice - see the confusing output of lscss.

Friday, February 3, 2017

Channel I/O, demystified

As promised, I will write some articles about one of the areas where s390x looks especially alien to people familiar with the usual suspects like x86: the native way of addressing I/O devices. 1

Channel I/O is a rather large topic, so I will concentrate on explaining it from a Linux (guest) and from a qemu/KVM (virtualization) point of view. This also means that I will prefer terminology that will make sense to somebody familiar with Linux (on x86) rather than the one used by e.g. a z/OS system programmer.

Links to the individual articles:
Channel I/O: What's in a channel subsystem?
Channel I/O: Talking to devices
Channel I/O: Types of devices
Channel I/O: More about channel paths
Channel Measurements: A Quick Overview

1. There is PCI on s390x, but it is a recent addition and its idiosyncracies are better understood if you know how channel I/O works.