Discussion:
[RFC 0/3] virtio-iommu: a paravirtualized IOMMU
Jean-Philippe Brucker
2017-04-07 19:17:44 UTC
This is the initial proposal for a paravirtualized IOMMU device using the
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio) or passed-through
(VFIO) devices.

In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.

A paravirtualized IOMMU has a number of advantages over full emulation.
It is portable and could be reused on different architectures. It is
easier to implement than a full emulation, with less state tracking. It
might be more efficient in some cases, with fewer context switches to
the host and the possibility of in-kernel emulation.

When designing it and writing the kvmtool device, I considered two main
scenarios, illustrated below.

Scenario 1: a hardware device passed through twice via VFIO

MEM____pIOMMU________PCI device________________________        HARDWARE
          |    (2b)                                    \
----------|-------------+-------------+----------------\---------------
          |             :     KVM     :                  \
          |             :             :                   \
     pIOMMU drv         :    ________virtio-iommu drv      \     KERNEL
          |             :   |         :        |            \
        VFIO            :   |         :      VFIO            \
          |             :   |         :        |             /
----------|-------------+---|---------+--------|------------/----------
          |                 |         :        |            /
          |  (1c)      (1b) |         :   (1a) |           / (2a)
          |                 |         :        |          /
          |                 |         :        |         /     USERSPACE
          |_virtio-iommu dev|         :     net drv_____/
                                      :
--------------------------------------+--------------------------------
                HOST                  :             GUEST

(1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
       buffer with mmap, obtaining virtual address VA. It then sends a
       VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
    b. The mapping request is relayed to the host through virtio
       (VIRTIO_IOMMU_T_MAP).
    c. The mapping request is relayed to the physical IOMMU through VFIO.

(2) a. The guest userspace driver can now instruct the device to directly
       access the buffer at IOVA.
    b. IOVA accesses from the device are translated into physical
       addresses by the IOMMU.
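Step (1a) uses the VFIO_IOMMU_MAP_DMA ioctl. As a sketch of how guest
userspace would prepare such a request (the struct below mirrors the
layout of struct vfio_iommu_type1_dma_map from <linux/vfio.h>; the
container file-descriptor setup is elided):

```c
#include <stdint.h>
#include <string.h>

/* Mirrors struct vfio_iommu_type1_dma_map from the <linux/vfio.h> UAPI */
struct vfio_iommu_type1_dma_map {
    uint32_t argsz;
    uint32_t flags;
#define VFIO_DMA_MAP_FLAG_READ  (1 << 0)
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)
    uint64_t vaddr; /* process virtual address (VA) */
    uint64_t iova;  /* I/O virtual address seen by the device */
    uint64_t size;  /* bytes, multiple of the IOMMU page size */
};

/* Fill a map request for a VA=IOVA identity mapping of the buffer */
static struct vfio_iommu_type1_dma_map
prepare_dma_map(void *va, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uintptr_t)va;
    map.iova  = (uintptr_t)va; /* VA=IOVA, as mentioned in (1a) */
    map.size  = size;
    /* The driver then issues: ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) */
    return map;
}
```

In the guest, this ioctl lands in the guest VFIO driver, which triggers
step (1b), the VIRTIO_IOMMU_T_MAP request.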

Scenario 2: a virtual net device behind a virtual IOMMU.

MEM__pIOMMU___PCI device                                       HARDWARE
       |          |
-------|----------|-----+-------------+--------------------------------
       |          |     :     KVM     :
       |          |     :             :
  pIOMMU drv      |     :             :
       \          |     :    _____________virtio-net drv         KERNEL
        \_net drv       :    |         :            /  (1a)
             |          :    |         :           /
            tap         :    |    _____________virtio-iommu drv
             |          :    |    |   :            (1b)
-------------|----------+----|----|---+--------------------------------
             |               |    |   :
             |_virtio-net____|    |   :
                        /  (2)    |   :
                       /          |   :                       USERSPACE
     virtio-iommu dev_____________|   :
                                      :
--------------------------------------+--------------------------------
                 HOST                 :            GUEST

(1) a. The guest virtio-net driver maps the virtio ring and a buffer.
    b. The mapping requests are relayed to the host through virtio.
(2) The virtio-net device now needs to access any guest memory via the
    IOMMU.

Physical and virtual IOMMUs are completely dissociated. The net driver
maps its own buffers via the DMA/IOMMU API, and buffers are copied
between virtio-net and tap.


The description itself seemed too long for a single email, so I split it
into three documents, and will attach Linux and kvmtool patches to this
email.

1. Firmware note,
2. Device operations (draft for the virtio specification),
3. Future work/possible improvements.

Just to be clear on the terms I'm using:

pIOMMU  physical IOMMU, controlling DMA accesses from physical devices
vIOMMU  virtual IOMMU (virtio-iommu), controlling DMA accesses from
        physical and virtual devices to guest memory.
GVA, GPA, HVA, HPA
        Guest/Host Virtual/Physical Address
IOVA    I/O Virtual Address, the address accessed by a device doing DMA
        through an IOMMU. In the context of a guest OS, IOVA is GVA.

Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
virtio-iommu.h header, which is BSD 3-clause. For the time being, the
specification draft in RFC 2/3 is also BSD 3-clause.


This proposal may be unintentionally centered on ARM architectures at
times. Any feedback would be appreciated, especially regarding other
IOMMU architectures.

Thanks,
Jean-Philippe
Jean-Philippe Brucker
2017-04-07 19:17:45 UTC
Unlike other virtio devices, the virtio-iommu doesn't work independently,
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.

The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
virtual device with a 32-bit ID, which we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they must
not overlap within a single virtual IOMMU. Device IDs of passed-through
devices do not need to match the IDs seen by the physical IOMMU.

The virtual IOMMU uses the virtio-mmio transport exclusively, not
virtio-pci, because with PCI the IOMMU interface would itself be an
endpoint, and existing firmware interfaces don't allow describing
IOMMU<->master relations between PCI endpoints.

The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bit requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
domains and a collection of platform devices.

         Device ID                    Requester ID
           /  0x0                0x0  \
          /    |                  |    | PCI domain 1
         /  0xffff             0xffff /
vIOMMU 1
         \ 0x10000               0x0  \
          \    |                  |    | PCI domain 2
           \ 0x1ffff           0xffff /

           /  0x0                     \
          /    |                       | platform devices
         /  0x1fff                    /
vIOMMU 2
         \  0x2000               0x0  \
          \    |                  |    | PCI domain 3
           \ 0x11fff           0xffff /
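The non-overlap constraint on Device ID ranges behind one vIOMMU amounts
to a simple interval check. As a sketch (the helper names are mine, not
part of the proposal):

```c
#include <stdbool.h>
#include <stdint.h>

/* An inclusive range of Device IDs assigned to one bus or device group */
struct devid_range {
    uint32_t first, last;
};

/* Two ranges behind the same vIOMMU must never satisfy this predicate */
static bool devid_ranges_overlap(struct devid_range a, struct devid_range b)
{
    return a.first <= b.last && b.first <= a.last;
}
```

For vIOMMU 1 above, {0x0, 0xffff} and {0x10000, 0x1ffff} pass the check,
so a guest request carrying a Device ID always designates exactly one
endpoint.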

Device-tree already offers a way to describe the topology. Here's an
example description of vIOMMU 2 with its devices:

	/* The virtual IOMMU is described with a virtio-mmio node */
	viommu2: virtio@10000 {
		compatible = "virtio,mmio";
		reg = <0x10000 0x200>;
		dma-coherent;
		interrupts = <0x0 0x5 0x1>;

		#iommu-cells = <1>;
	};

	/* Some platform device has Device ID 0x5 */
	device@20000 {
		...

		iommus = <&viommu2 0x5>;
	};

	/*
	 * PCI domain 3 is described by its host controller node, along
	 * with the complete relation to the IOMMU
	 */
	pci {
		...
		/* Linear map between RIDs and Device IDs for the whole bus */
		iommu-map = <0x0 &viommu2 0x10000 0x10000>;
	};
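The iommu-map entry follows the generic pci-iommu binding, where each
entry reads <rid-base iommu-phandle iommu-base length>: a Requester ID
inside the window translates linearly to a Device ID. As a sketch of that
translation (the struct and function names are mine, for illustration):

```c
#include <stdint.h>

/* One <rid-base &iommu iommu-base length> entry of an iommu-map property */
struct iommu_map_entry {
    uint16_t rid_base;   /* first PCI Requester ID covered */
    uint32_t iommu_base; /* Device ID corresponding to rid_base */
    uint32_t length;     /* number of IDs in the window */
};

/* Return the Device ID for rid, or -1 if the entry doesn't cover it */
static int64_t rid_to_devid(const struct iommu_map_entry *e, uint16_t rid)
{
    if (rid < e->rid_base || rid >= (uint32_t)e->rid_base + e->length)
        return -1;
    return (int64_t)e->iommu_base + (rid - e->rid_base);
}
```

With the example entry <0x0 &viommu2 0x10000 0x10000>, RID 0x0 of domain 3
maps to Device ID 0x10000 and RID 0xffff to 0x1ffff, matching the diagram.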

For more details, please refer to [DT-IOMMU].

For ACPI, we expect to add a new node type to the IO Remapping Table
specification [IORT], providing a similar mechanism for describing
translations via ACPI tables. The following is *not* a specification,
simply an example of what the node could be.

Field           | Len. | Off. | Description
----------------|------|------|----------------------------------
Type            |   1  |   0  | 5: paravirtualized IOMMU
Length          |   2  |   1  | The length of the node.
Revision        |   1  |   3  | 0
Reserved        |   4  |   4  | Must be zero.
Number of ID    |   4  |   8  |
mappings        |      |      |
Reference to    |   4  |  12  | Offset from the start of the
ID Array        |      |      | IORT node to the start of its
                |      |      | Array ID mappings.
                |      |      |
Model           |   4  |  16  | 0: virtio-iommu
Device object   |  --  |  20  | ASCII null-terminated string
name            |      |      | with the full path to the entry
                |      |      | in the namespace for this IOMMU.
Padding         |  --  |  --  | To keep 32-bit alignment and
                |      |      | leave space for future models.
                |      |      |
Array of ID     | 20xN |  --  | ID Array.
mappings        |      |      |

The OS parses the IORT table to build a map of ID relations between IOMMU
and devices. The ID Array is used to find the correspondence between IOMMU
IDs and PCI or platform devices. Later on, the virtio-iommu driver finds
the associated LNRO0005 descriptor via the "Device object name" field, and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.
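To make the offsets in the table concrete, the example node could be
declared as a packed structure. Again, this is an illustration of the
draft layout above, not a specification:

```c
#include <stdint.h>

/* Hypothetical IORT node for a paravirtualized IOMMU, laid out per the
 * field/offset table above. The variable-length object name (and the ID
 * array that follows it) is represented by a flexible array member. */
struct iort_pviommu_node {
    uint8_t  type;            /* offset 0: 5, paravirtualized IOMMU */
    uint16_t length;          /* offset 1: length of the node */
    uint8_t  revision;        /* offset 3: 0 */
    uint32_t reserved;        /* offset 4: must be zero */
    uint32_t nr_id_mappings;  /* offset 8: number of ID mappings */
    uint32_t id_array_offset; /* offset 12: offset to the ID array */
    uint32_t model;           /* offset 16: 0, virtio-iommu */
    char     object_name[];   /* offset 20: null-terminated namespace path */
} __attribute__((packed));
```

The packed attribute is required because the "Length" field sits at an
odd offset, so the natural alignment rules of C would otherwise insert
padding.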

[DT-IOMMU] https://www.kernel.org/doc/Documentation/devicetree/bindings/iommu/iommu.txt
https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt

[IORT] IO Remapping Table, DEN0049B
http://infocenter.arm.com/help/topic/com.arm.doc.den0049b/DEN0049B_IO_Remapping_Table.pdf
Tian, Kevin
2017-04-18 09:51:23 UTC
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
Unlike other virtio devices, the virtio-iommu doesn't work independently,
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.
The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
Do you plan to support both device tree and ACPI?
virtual device with a 32-bit ID, that we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they may
not overlap within a single virtual IOMMU. Device ID of passed-through
devices do not need to match IDs seen by the physical IOMMU.
The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
because with PCI the IOMMU interface would itself be an endpoint, and
existing firmware interfaces don't allow to describe IOMMU<->master
relations between PCI endpoints.
I'm not familiar with virtio-mmio mechanism. Curious how devices in
virtio-mmio are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU?

Asking this is because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
current proposal looks to follow ARM definitions, which I'm not sure
are extensible enough to cover features defined only in other vendors'
structures.

Since the purpose of this series is to go para-virtualized, why not also
para-virtualize and simplify the enumeration method? For example,
we may define a query interface through vIOMMU registers to allow the
guest to query whether a device belongs to that vIOMMU. Then we
can even remove the use of any enumeration structure completely...
Just a quick example for which I may not have thought through all the
pros and cons. :-)
The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bits requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
domains and a collection of platform devices.
Device ID Requester ID
/ 0x0 0x0 \
/ | | PCI domain 1
/ 0xffff 0xffff /
vIOMMU 1
\ 0x10000 0x0 \
\ | | PCI domain 2
\ 0x1ffff 0xffff /
/ 0x0 \
/ | platform devices
/ 0x1fff /
vIOMMU 2
\ 0x2000 0x0 \
\ | | PCI domain 3
\ 0x11fff 0xffff /
Shouldn't the above be (0x30000, 0x3ffff) for PCI domain 3, given that Device IDs are 16-bit?

Thanks
Kevin
Jean-Philippe Brucker
2017-04-18 18:41:19 UTC
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
Unlike other virtio devices, the virtio-iommu doesn't work independently,
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.
The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
Do you plan to support both device tree and ACPI?
Yes, with ACPI the topology would be described using IORT nodes. I didn't
include an example in my driver because DT is sufficient for a prototype
and is readily available (both in Linux and kvmtool), whereas IORT would
be quite easy to reuse in Linux, but isn't present in kvmtool at the
moment. However, both interfaces have to be supported for the virtio-iommu
to be portable.
Post by Tian, Kevin
virtual device with a 32-bit ID, that we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they may
not overlap within a single virtual IOMMU. Device ID of passed-through
devices do not need to match IDs seen by the physical IOMMU.
The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
because with PCI the IOMMU interface would itself be an endpoint, and
existing firmware interfaces don't allow to describe IOMMU<->master
relations between PCI endpoints.
I'm not familiar with virtio-mmio mechanism. Curious how devices in
virtio-mmio are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU?
Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
node, and with ACPI they use _HID LNRO0005. Since the host already
describes available devices to a guest using a firmware interface, I think
we should reuse the tools provided by that interface for describing
relations between DMA masters and IOMMU.
Post by Tian, Kevin
Asking this is because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device
tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your
current proposal looks following ARM definitions which I'm not sure
extensible enough to cover features defined only in other vendors'
structures.
ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
regardless of the underlying architecture. It isn't defined solely for the
ARM SMMU, but serves a more general purpose of describing a map of device
identifiers communicated from one component to another. Both DMAR and
IVRS have such a description (DRHD and IVHD respectively), but those are
designed for a specific IOMMU, whereas IORT could host other kinds.

It seems that all we really need is an interface that says "there is a
virtio-iommu at address X, here are the devices it translates and their
corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
Post by Tian, Kevin
Since the purpose of this series is to go para-virtualize, why not also
para-virtualize and simplify the enumeration method? For example,
we may define a query interface through vIOMMU registers to allow
guest query whether a device belonging to that vIOMMU. Then we
can even remove use of any enumeration structure completely...
Just a quick example which I may not think through all the pros and
cons. :-)
I don't think adding a brand new topology description mechanism is worth
the effort; we're better off reusing what already exists and is
implemented by operating systems. Adding a query interface inside the
vIOMMU may work (though it might be very painful to integrate with fwspec
in Linux), but it would be redundant since the host has to provide a
firmware description of the system anyway.
Post by Tian, Kevin
The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bits requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
domains and a collection of platform devices.
Device ID Requester ID
/ 0x0 0x0 \
/ | | PCI domain 1
/ 0xffff 0xffff /
vIOMMU 1
\ 0x10000 0x0 \
\ | | PCI domain 2
\ 0x1ffff 0xffff /
/ 0x0 \
/ | platform devices
/ 0x1fff /
vIOMMU 2
\ 0x2000 0x0 \
\ | | PCI domain 3
\ 0x11fff 0xffff /
isn't above be (0x30000, 3ffff) for PCI domain 3 giving device ID is 16bit?
Unlike Requester IDs in PCI, there is no architected rule for IDs of
platform devices; it's an integration choice. The ID of a platform device
is used exclusively for interfacing with an IOMMU (or MSI controller), and
doesn't mean anything outside this context. Here the host allocates 13
bits to platform device IDs, which is legal.

Thanks,
Jean-Philippe
Tian, Kevin
2017-04-21 08:43:43 UTC
Sent: Wednesday, April 19, 2017 2:41 AM
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
Unlike other virtio devices, the virtio-iommu doesn't work independently,
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.
The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
Do you plan to support both device tree and ACPI?
Yes, with ACPI the topology would be described using IORT nodes. I didn't
include an example in my driver because DT is sufficient for a prototype
and is readily available (both in Linux and kvmtool), whereas IORT would
be quite easy to reuse in Linux, but isn't present in kvmtool at the
moment. However, both interfaces have to be supported for the virtio-
iommu
to be portable.
Does 'portable' mean whether the guest enables ACPI?
Post by Tian, Kevin
virtual device with a 32-bit ID, that we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they
may
Post by Tian, Kevin
not overlap within a single virtual IOMMU. Device ID of passed-through
devices do not need to match IDs seen by the physical IOMMU.
The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
because with PCI the IOMMU interface would itself be an endpoint, and
existing firmware interfaces don't allow to describe IOMMU<->master
relations between PCI endpoints.
I'm not familiar with virtio-mmio mechanism. Curious how devices in
virtio-mmio are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU?
Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
node, and with ACPI they use _HID LNRO0005. Since the host already
describes available devices to a guest using a firmware interface, I think
we should reuse the tools provided by that interface for describing
relations between DMA masters and IOMMU.
OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.
Post by Tian, Kevin
Asking this is because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device
tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your
current proposal looks following ARM definitions which I'm not sure
extensible enough to cover features defined only in other vendors'
structures.
ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
regardless of the underlying architecture. It isn't defined solely for the
ARM SMMU, but serves a more general purpose of describing a map of device
identifiers communicated from one components to another. Both DMAR and
IVRS have such description (respectively DRHD and IVHD), but they are
designed for a specific IOMMU, whereas IORT could host other kinds.
I'll take a look at the IORT definition. DRHD includes more information
than just device mappings.
It seems that all we really need is an interface that says "there is a
virtio-iommu at address X, here are the devices it translates and their
corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
Post by Tian, Kevin
Since the purpose of this series is to go para-virtualize, why not also
para-virtualize and simplify the enumeration method? For example,
we may define a query interface through vIOMMU registers to allow
guest query whether a device belonging to that vIOMMU. Then we
can even remove use of any enumeration structure completely...
Just a quick example which I may not think through all the pros and
cons. :-)
I don't think adding a brand new topology description mechanism is worth
the effort, we're better off reusing what already exists and is
implemented by operating systems. Adding a query interface inside the
vIOMMU may work (though might be very painful to integrate with fwspec in
Linux), but would be redundant since the host has to provide a firmware
description of the system anyway.
Post by Tian, Kevin
The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bits requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two
PCI
Post by Tian, Kevin
domains and a collection of platform devices.
Device ID Requester ID
/ 0x0 0x0 \
/ | | PCI domain 1
/ 0xffff 0xffff /
vIOMMU 1
\ 0x10000 0x0 \
\ | | PCI domain 2
\ 0x1ffff 0xffff /
/ 0x0 \
/ | platform devices
/ 0x1fff /
vIOMMU 2
\ 0x2000 0x0 \
\ | | PCI domain 3
\ 0x11fff 0xffff /
isn't above be (0x30000, 3ffff) for PCI domain 3 giving device ID is 16bit?
Unlike Requester IDs in PCI, there is no architected rule for IDs of
platform devices, it's an integration choice. The ID of platform device is
used exclusively for interfacing with an IOMMU (or MSI controller), it
doesn't mean anything outside this context. Here the host allocates 13
bits to platform device IDs, which is legal.
Please add such an explanation to your next version. The earlier text
mentions a "16-bit requester ID" for vIOMMU 1, which gave me the
illusion that the same 16 bits applied to vIOMMU 2 too.

Thanks
Kevin
Jean-Philippe Brucker
2017-04-24 15:05:36 UTC
Post by Tian, Kevin
Sent: Wednesday, April 19, 2017 2:41 AM
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
Unlike other virtio devices, the virtio-iommu doesn't work independently,
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.
The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
Do you plan to support both device tree and ACPI?
Yes, with ACPI the topology would be described using IORT nodes. I didn't
include an example in my driver because DT is sufficient for a prototype
and is readily available (both in Linux and kvmtool), whereas IORT would
be quite easy to reuse in Linux, but isn't present in kvmtool at the
moment. However, both interfaces have to be supported for the virtio-
iommu
to be portable.
'portable' means whether guest enables ACPI?
Sorry, "supported" isn't the right term for what I meant. It is for the
firmware interface to accommodate devices, not the other way around, so
firmware considerations are outside the scope of the virtio-iommu
specification and virtio-iommu itself doesn't need to "support" any
interface.

For the purpose of this particular document however, both popular firmware
interfaces (ACPI and DT) must be taken into account. Those are the two
interfaces I know about; there might be others. But I figure that a VMM
implementing a virtual IOMMU is complex enough to also implement one of
these two interfaces, so talking about DT and ACPI should fit all use
cases. It also provides two examples for other firmware interfaces that
wish to describe the IOMMU topology.
Post by Tian, Kevin
Post by Tian, Kevin
virtual device with a 32-bit ID, that we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they
may
Post by Tian, Kevin
not overlap within a single virtual IOMMU. Device ID of passed-through
devices do not need to match IDs seen by the physical IOMMU.
The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
because with PCI the IOMMU interface would itself be an endpoint, and
existing firmware interfaces don't allow to describe IOMMU<->master
relations between PCI endpoints.
I'm not familiar with virtio-mmio mechanism. Curious how devices in
virtio-mmio are enumerated today? Could we use that mechanism to
identify vIOMMUs and then invent a purely para-virtualized method to
enumerate devices behind each vIOMMU?
Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
node, and with ACPI they use _HID LNRO0005. Since the host already
describes available devices to a guest using a firmware interface, I think
we should reuse the tools provided by that interface for describing
relations between DMA masters and IOMMU.
OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.
Not necessarily DT, you can have virtio-mmio devices in the ACPI namespace
as well. QEMU has an example of LNRO0005 with ACPI.
Post by Tian, Kevin
Post by Tian, Kevin
Asking this is because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMR has ACPI IVRS and device
tree (same format as ARM?). Intel has APCI DMAR and sub-tables. Your
current proposal looks following ARM definitions which I'm not sure
extensible enough to cover features defined only in other vendors'
structures.
ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
regardless of the underlying architecture. It isn't defined solely for the
ARM SMMU, but serves a more general purpose of describing a map of device
identifiers communicated from one components to another. Both DMAR and
IVRS have such description (respectively DRHD and IVHD), but they are
designed for a specific IOMMU, whereas IORT could host other kinds.
I'll take a look at IORT definition. DRHD includes information more
than device mapping.
I guess that most information provided by DMAR and others is
IOMMU-specific, and the equivalent for virtio-iommu would fit in virtio
config space. But describing device mappings relative to IOMMUs is the
same problem for all systems. Doing it with a virtio-iommu probing
mechanism would require reinventing a way to identify devices every time a
host wants to add support for a new bus (RID for PCI, base address for
MMIO, others in the future), when firmware would have to provide this
information anyway for bare metal.
Post by Tian, Kevin
It seems that all we really need is an interface that says "there is a
virtio-iommu at address X, here are the devices it translates and their
corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
Post by Tian, Kevin
Since the purpose of this series is to go para-virtualize, why not also
para-virtualize and simplify the enumeration method? For example,
we may define a query interface through vIOMMU registers to allow
guest query whether a device belonging to that vIOMMU. Then we
can even remove use of any enumeration structure completely...
Just a quick example which I may not think through all the pros and
cons. :-)
I don't think adding a brand new topology description mechanism is worth
the effort, we're better off reusing what already exists and is
implemented by operating systems. Adding a query interface inside the
vIOMMU may work (though might be very painful to integrate with fwspec in
Linux), but would be redundant since the host has to provide a firmware
description of the system anyway.
Post by Tian, Kevin
The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bits requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two
PCI
Post by Tian, Kevin
domains and a collection of platform devices.
Device ID Requester ID
/ 0x0 0x0 \
/ | | PCI domain 1
/ 0xffff 0xffff /
vIOMMU 1
\ 0x10000 0x0 \
\ | | PCI domain 2
\ 0x1ffff 0xffff /
/ 0x0 \
/ | platform devices
/ 0x1fff /
vIOMMU 2
\ 0x2000 0x0 \
\ | | PCI domain 3
\ 0x11fff 0xffff /
isn't above be (0x30000, 3ffff) for PCI domain 3 giving device ID is 16bit?
Unlike Requester IDs in PCI, there is no architected rule for IDs of
platform devices, it's an integration choice. The ID of platform device is
used exclusively for interfacing with an IOMMU (or MSI controller), it
doesn't mean anything outside this context. Here the host allocates 13
bits to platform device IDs, which is legal.
Please add such explanation to your next version. In earlier text
"16-bits request ID" is mentioned for vIOMMU1, which gave me
the illusion that same 16bit applies to vIOMMU2 too.
Sure, I will clarify this.

Thanks,
Jean-Philippe
Jean-Philippe Brucker
2017-04-07 19:17:46 UTC
After the virtio-iommu device has been probed and the driver is aware of
the devices translated by the IOMMU, it can start sending requests to the
virtio-iommu device. The operations described here are deliberately
minimal, so that vIOMMU devices can be as simple as possible to implement,
and can be extended with feature bits.

I. Overview
II. Feature bits
III. Device configuration layout
IV. Device initialization
V. Device operations
1. Attach device
2. Detach device
3. Map region
4. Unmap region


I. Overview
===========

Requests are small buffers added by the guest to the request virtqueue.
The guest can add a batch of them to the queue and send a notification
(kick) to the device to have all of them handled.

Here is an example flow:

* attach(address space, device), kick: create a new address space and
  attach a device to it
* map(address space, virt, phys, size, flags): create a mapping between a
  guest-virtual and a guest-physical address
* map, map, map, kick

* ... here the guest device can perform DMA to the freshly mapped memory

* unmap(address space, virt, size), unmap, kick
* detach(address space, device), kick

The following description attempts to use the same format as other virtio
devices. We won't go into details of the virtio transport, please refer to
[VIRTIO-v1.0] for more information.

As a quick reminder, the virtio (1.0) transport can be described with the
following flow:

       HOST             :          GUEST
                   (3)  :
     .------ [available ring] <-------.  (2)
    /                   :              \
   v               (4)  :          (1)  \
[device] <---- [descriptor table] <---- [driver]
   \                    :                ^
    \                   :               /
 (5) '-------> [used ring] ------------'
                        :   (6)
                        :

(1) The driver has a buffer with a payload to send via virtio. It writes
    the address and size of the buffer in a descriptor. It can chain N
    sub-buffers by writing N descriptors and linking them together. The
    first descriptor of the chain is referred to as the head.
(2) The driver queues the head index into the 'available' ring.
(3) The driver notifies the device. Since virtio-iommu uses MMIO,
    notification is done by writing to a doorbell address. KVM traps it
    and forwards the notification to the virtio device. The device
    dequeues the head index from the 'available' ring.
(4) The device reads all descriptors in the chain and handles the payload.
(5) The device writes the head index into the 'used' ring and sends a
    notification to the guest, by injecting an interrupt.
(6) The driver pops the head from the used ring, and optionally reads the
    buffers that were updated by the device.
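The six steps can be modelled with a toy descriptor table and rings. This
is a simulation of the flow above for illustration, not the actual virtio
ring layout (which also has flags, wrap counters, etc.):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of steps (1)-(6): a descriptor chain ends when next < 0 */
struct toy_desc { void *addr; size_t len; int next; };

static struct toy_desc desc_table[8];
static int avail_ring[8], used_ring[8];
static int avail_idx, used_idx;

/* (1)+(2): the driver writes a two-descriptor chain (request payload,
 * then a one-byte status buffer) and queues the head index */
static void driver_submit(void *req, size_t req_len, uint8_t *status)
{
    desc_table[0] = (struct toy_desc){ req, req_len, 1 };
    desc_table[1] = (struct toy_desc){ status, 1, -1 };
    avail_ring[avail_idx++] = 0; /* head index */
    /* (3) would be the doorbell write, trapped by KVM */
}

/* (4)+(5): the device dequeues the head, walks the chain, and writes the
 * status into the last (device-writable) descriptor */
static void device_process(void)
{
    int head = avail_ring[--avail_idx];
    int i = head;
    while (desc_table[i].next >= 0)      /* (4) read the chain */
        i = desc_table[i].next;
    *(uint8_t *)desc_table[i].addr = 0;  /* write status: 0 = OK */
    used_ring[used_idx++] = head;        /* (5) plus an interrupt */
    /* (6): the driver pops `head` from used_ring and reads the status */
}
```

The point of the model is the ownership hand-off: the driver only touches
the status buffer again after the head index reappears in the used ring.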


II. Feature bits
================

VIRTIO_IOMMU_F_INPUT_RANGE (0)
Available range of virtual addresses is described in input_range

VIRTIO_IOMMU_F_IOASID_BITS (1)
The number of address spaces supported is described in ioasid_bits

VIRTIO_IOMMU_F_MAP_UNMAP (2)
Map and unmap requests are available. This bit exists so that a device or
driver can implement only page-table sharing, once we introduce that
feature: a device will then be able to select only one of F_MAP_UNMAP or
F_PT_SHARING. For the moment, this bit must always be set.

VIRTIO_IOMMU_F_BYPASS (3)
When not attached to an address space, devices behind the IOMMU can
access the physical address space.

III. Device configuration layout
================================

struct virtio_iommu_config {
	u64 page_size_mask;
	struct virtio_iommu_range {
		u64 start;
		u64 end;
	} input_range;
	u8 ioasid_bits;
};

IV. Device initialization
=========================

1. page_size_mask contains the bitmask of all page sizes that can be
mapped. The least significant bit set defines the page granularity of
IOMMU mappings. Other bits in the mask are hints describing page sizes
that the IOMMU can merge into a single mapping (page blocks).

There is no lower limit for the smallest page granularity supported by
the IOMMU. It is legal for the driver to map one byte at a time if the
device advertises it.

page_size_mask must have at least one bit set.

2. If the VIRTIO_IOMMU_F_IOASID_BITS feature is negotiated, ioasid_bits
contains the number of bits supported in an I/O Address Space ID, the
identifier used in map/unmap requests. A value of 0 is valid, and means
that a single address space is supported.

If the feature is not negotiated, address space identifiers can use up
to 32 bits.

3. If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, input_range
contains the virtual address range that the IOMMU is able to translate.
Any mapping request to virtual addresses outside of this range will
fail.

If the feature is not negotiated, virtual mappings span over the whole
64-bit address space (start = 0, end = 0xffffffffffffffff)

4. If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, devices behind the
IOMMU not attached to an address space are allowed to access
guest-physical addresses. Otherwise, accesses to guest-physical
addresses may fault.


V. Device operations
====================

Driver sends requests on the request virtqueue (0), notifies the device and
waits for the device to return the request with a status in the used ring.
All requests are split in two parts: one device-readable, one device-
writable. Each request must therefore be described with at least two
descriptors, as illustrated below.

31 7 0
+--------------------------------+ <------- RO descriptor
| 0 (reserved) | type |
+--------------------------------+
| |
| payload |
| | <------- WO descriptor
+--------------------------------+
| 0 (reserved) | status |
+--------------------------------+

struct virtio_iommu_req_head {
u8 type;
u8 reserved[3];
};

struct virtio_iommu_req_tail {
u8 status;
u8 reserved[3];
};

(Note on the format choice: this format forces the payload to be split in
two - one read-only buffer, one write-only. It is necessary and sufficient
for our purpose, and does not close the door to future extensions with
more complex requests, such as a WO field sandwiched between two RO ones.
With virtio 1.0 ring requirements, such a request would need to be
described by two chains of descriptors, which might be more complex to
implement efficiently, but still possible. Both devices and drivers must
assume that requests are segmented anyway.)

Type may be one of:

VIRTIO_IOMMU_T_ATTACH 1
VIRTIO_IOMMU_T_DETACH 2
VIRTIO_IOMMU_T_MAP 3
VIRTIO_IOMMU_T_UNMAP 4

A few general-purpose status codes are defined here. Driver must not
assume that a specific status will be returned for an invalid request.
Except for 0, which always means "success", these values are hints to make
troubleshooting easier.

VIRTIO_IOMMU_S_OK 0
All good! Carry on.

VIRTIO_IOMMU_S_IOERR 1
Virtio communication error

VIRTIO_IOMMU_S_UNSUPP 2
Unsupported request

VIRTIO_IOMMU_S_DEVERR 3
Internal device error

VIRTIO_IOMMU_S_INVAL 4
Invalid parameters

VIRTIO_IOMMU_S_RANGE 5
Out-of-range parameters

VIRTIO_IOMMU_S_NOENT 6
Entry not found

VIRTIO_IOMMU_S_FAULT 7
Bad address


1. Attach device
----------------

struct virtio_iommu_req_attach {
le32 address_space;
le32 device;
le32 flags/reserved;
};

Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
device, it is created. 'device' is an identifier unique to the IOMMU. The
host communicates a unique device ID to the guest during boot. The method
used to communicate this ID is outside the scope of this specification,
but the following rules must apply:

* The device ID is unique from the IOMMU point of view. Multiple devices
whose DMA transactions are not translated by the same IOMMU may have the
same device ID. Devices whose DMA transactions may be translated by the
same IOMMU must have different device IDs.

* Sometimes the host cannot completely isolate two devices from each
other. For example on a legacy PCI bus, devices can snoop DMA
transactions from their neighbours. In this case, the host must
communicate to the guest that it cannot isolate these devices from each
other. The method used to communicate this is outside the scope of this
specification. The IOMMU device must ensure that devices that cannot be
isolated by the host have the same address spaces.

Multiple devices may be added to the same address space. A device cannot
be attached to multiple address spaces (with the map/unmap interface, that
is; for SVM, see the page table and context table sharing proposal).

If the device is already attached to another address space 'old', it is
detached from the old one and attached to the new one. The device cannot
access mappings from the old address space after this request completes.

The device either returns VIRTIO_IOMMU_S_OK, or an error status. We
suggest the following error statuses, which would help debug the driver.

NOENT: device not found.
RANGE: address space is outside the range allowed by ioasid_bits.


2. Detach device
----------------

struct virtio_iommu_req_detach {
le32 device;
le32 flags/reserved;
};

Detach a device from its address space. When this request completes, the
device cannot access any mapping from that address space anymore. If the
device isn't attached to any address space, the request returns
successfully.

After all devices have been successfully detached from an address space,
its ID can be reused by the driver for another address space.

NOENT: device not found.
INVAL: device wasn't attached to any address space.


3. Map region
-------------

struct virtio_iommu_req_map {
le32 address_space;
le64 phys_addr;
le64 virt_addr;
le64 size;
le32 flags;
};

VIRTIO_IOMMU_MAP_F_READ 0x1
VIRTIO_IOMMU_MAP_F_WRITE 0x2
VIRTIO_IOMMU_MAP_F_EXEC 0x4

Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses. Size must always be a multiple of the
page granularity negotiated during initialization. Both phys_addr and
virt_addr must be aligned on the page granularity. The address space must
have been created with VIRTIO_IOMMU_T_ATTACH.

The range defined by (virt_addr, size) must be within the limits specified
by input_range. The range defined by (phys_addr, size) must be within the
guest-physical address space. This includes upper and lower limits, as
well as any carving of guest-physical addresses for use by the host (for
instance MSI doorbells). Guest physical boundaries are set by the host
using a firmware mechanism outside the scope of this specification.

(Note that this format prevents creating the identity mapping
(0x0 - 0xfff...fff) -> (0x0 - 0xfff...fff) in a single request, since it
would result in a size of zero. Hopefully allowing VIRTIO_IOMMU_F_BYPASS
eliminates the need for issuing such a request. It would also be unlikely
to conform to the physical range restrictions from the previous paragraph.)

(Another note, on flags: it is unlikely that all possible combinations of
flags will be supported by the physical IOMMU. For instance, (W & !R) or
(E & W) might be invalid. I haven't taken time to devise a clever way to
advertise supported and implicit (for instance "W implies R") flags or
combination thereof for the moment, but I could at least try to research
common models. Keeping in mind that we might soon want to add more flags,
such as privileged, device, transient, shared, etc. whatever these would
mean)

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

INVAL: invalid flags
RANGE: virt_addr, phys_addr or range are not in the limits specified
during negotiation. For instance, not aligned to page granularity.
NOENT: address space not found.


4. Unmap region
---------------

struct virtio_iommu_req_unmap {
le32 address_space;
le64 virt_addr;
le64 size;
le32 reserved;
};

Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. The range,
defined by virt_addr and size, must exactly cover one or more contiguous
mappings created with MAP requests. All mappings covered by the range are
removed. Driver should not send a request covering unmapped areas.

We define a mapping as a virtual region created with a single MAP request.
virt_addr should exactly match the start of an existing mapping. The end
of the range, (virt_addr + size - 1), should exactly match the end of an
existing mapping. Device must reject any request that would affect only
part of a mapping. If the requested range spills outside of mapped
regions, the device's behaviour is undefined.

These rules are illustrated with the following requests (with arguments
(va, size)), assuming each example sequence starts with a blank address
space:

map(0, 10)
unmap(0, 10) -> allowed

map(0, 5)
map(5, 5)
unmap(0, 10) -> allowed

map(0, 10)
unmap(0, 5) -> forbidden

map(0, 10)
unmap(0, 15) -> undefined

map(0, 5)
map(10, 5)
unmap(0, 15) -> undefined

(Note: the semantics of unmap are chosen to be compatible with VFIO's
type1 v2 IOMMU API. This way a device serving as intermediary between
guest and VFIO doesn't have to keep an internal tree of mappings. They are
a bit tighter than VFIO's, in that they don't allow an unmap to spill outside
mapped regions. Spilling is 'undefined' at the moment, because it should
work in most cases but I don't know if it's worth the added complexity in
devices that are not simply transmitting requests to VFIO. Splitting
mappings won't ever be allowed, but see the relaxed proposal in 3/3 for
more lenient semantics)

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

NOENT: address space not found.
FAULT: mapping not found.
RANGE: request would split a mapping.


[VIRTIO-v1.0] Virtual I/O Device (VIRTIO) Version 1.0. 03 December 2013.
Committee Specification Draft 01 / Public Review Draft 01.
http://docs.oasis-open.org/virtio/virtio/v1.0/csprd01/virtio-v1.0-csprd01.html
Tian, Kevin
2017-04-18 10:26:41 UTC
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
[...]
II. Feature bits
================
VIRTIO_IOMMU_F_INPUT_RANGE (0)
Available range of virtual addresses is described in input_range
Usually only the maximum supported address bits are important.
Curious do you see such situation where low end of the address
space is not usable (since you have both start/end defined later)?

[...]
1. Attach device
----------------
struct virtio_iommu_req_attach {
le32 address_space;
le32 device;
le32 flags/reserved;
};
Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
Based on your description this address space ID is per operation right?
MAP/UNMAP and page-table sharing should have different ID spaces...
device, it is created. 'device' is an identifier unique to the IOMMU. The
host communicates unique device ID to the guest during boot. The method
used to communicate this ID is outside the scope of this specification,
* The device ID is unique from the IOMMU point of view. Multiple devices
whose DMA transactions are not translated by the same IOMMU may have the
same device ID. Devices whose DMA transactions may be translated by the
same IOMMU must have different device IDs.
* Sometimes the host cannot completely isolate two devices from each
others. For example on a legacy PCI bus, devices can snoop DMA
transactions from their neighbours. In this case, the host must
communicate to the guest that it cannot isolate these devices from each
others. The method used to communicate this is outside the scope of this
specification. The IOMMU device must ensure that devices that cannot be
"IOMMU device" -> "IOMMU driver"
isolated by the host have the same address spaces.
Thanks
Kevin
Jean-Philippe Brucker
2017-04-18 18:45:54 UTC
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
[...]
II. Feature bits
================
VIRTIO_IOMMU_F_INPUT_RANGE (0)
Available range of virtual addresses is described in input_range
Usually only the maximum supported address bits are important.
Curious do you see such situation where low end of the address
space is not usable (since you have both start/end defined later)?
A start address would allow providing something resembling a GART to the
guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
aperture. I'm not sure how useful that would be in practice.

On a related note, the virtio-iommu itself doesn't provide a
per-address-space aperture as it stands. For example, attaching a device
to an address space might restrict the available IOVA range for the whole
AS if that device cannot write to high memory (above 32-bit). If the guest
attempts to map an IOVA outside this window into the device's address
space, it should expect the MAP request to fail. And when attaching, if
the address space already has mappings outside this window, then ATTACH
should fail.

This too seems to be something that ought to be communicated by firmware,
but bits are missing (I can't find anything equivalent to DT's dma-ranges
for PCI root bridges in ACPI tables, for example). In addition VFIO
doesn't communicate any DMA mask for devices, and doesn't check them
itself. I guess that the host could find out the DMA mask of devices one
way or another, but it is tricky to enforce, so I didn't make this a hard
requirement. Although I should probably add a few words about it.
Post by Tian, Kevin
[...]
1. Attach device
----------------
struct virtio_iommu_req_attach {
le32 address_space;
le32 device;
le32 flags/reserved;
};
Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
Based on your description this address space ID is per operation right?
MAP/UNMAP and page-table sharing should have different ID spaces...
I think it's simpler if we keep a single IOASID space per virtio-iommu
device, because the maximum number of address spaces (described by
ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
you still need to define which devices will share a page directory using
ATTACH requests, though that interface is not set in stone.
Post by Tian, Kevin
device, it is created. 'device' is an identifier unique to the IOMMU. The
host communicates unique device ID to the guest during boot. The method
used to communicate this ID is outside the scope of this specification,
* The device ID is unique from the IOMMU point of view. Multiple devices
whose DMA transactions are not translated by the same IOMMU may have the
same device ID. Devices whose DMA transactions may be translated by the
same IOMMU must have different device IDs.
* Sometimes the host cannot completely isolate two devices from each
others. For example on a legacy PCI bus, devices can snoop DMA
transactions from their neighbours. In this case, the host must
communicate to the guest that it cannot isolate these devices from each
others. The method used to communicate this is outside the scope of this
specification. The IOMMU device must ensure that devices that cannot be
"IOMMU device" -> "IOMMU driver"
Indeed

Thanks!
Jean-Philippe
Post by Tian, Kevin
isolated by the host have the same address spaces.
Tian, Kevin
2017-04-21 09:02:35 UTC
Sent: Wednesday, April 19, 2017 2:46 AM
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
[...]
II. Feature bits
================
VIRTIO_IOMMU_F_INPUT_RANGE (0)
Available range of virtual addresses is described in input_range
Usually only the maximum supported address bits are important.
Curious do you see such situation where low end of the address
space is not usable (since you have both start/end defined later)?
A start address would allow to provide something resembling a GART to the
guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
aperture. I'm not sure how useful that would be in practice.
Intel VT-d has no such limitation, which I can tell. :-)
On a related note, the virtio-iommu itself doesn't provide a
per-address-space aperture as it stands. For example, attaching a device
to an address space might restrict the available IOVA range for the whole
AS if that device cannot write to high memory (above 32-bit). If the guest
attempts to map an IOVA outside this window into the device's address
space, it should expect the MAP request to fail. And when attaching, if
the address space already has mappings outside this window, then ATTACH
should fail.
This too seems to be something that ought to be communicated by firmware,
but bits are missing (I can't find anything equivalent to DT's dma-ranges
for PCI root bridges in ACPI tables, for example). In addition VFIO
doesn't communicate any DMA mask for devices, and doesn't check them
itself. I guess that the host could find out the DMA mask of devices one
way or another, but it is tricky to enforce, so I didn't make this a hard
requirement. Although I should probably add a few words about it.
If there is no such communication on bare metal, then same for pvIOMMU.
Post by Tian, Kevin
[...]
1. Attach device
----------------
struct virtio_iommu_req_attach {
le32 address_space;
le32 device;
le32 flags/reserved;
};
Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
Based on your description this address space ID is per operation right?
MAP/UNMAP and page-table sharing should have different ID spaces...
I think it's simpler if we keep a single IOASID space per virtio-iommu
device, because the maximum number of address spaces (described by
ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
you still need to define which devices will share a page directory using
ATTACH requests, though that interface is not set in stone.
got you. yes VM is supposed to consume less IOASIDs than physically
available. It doesn’t hurt to have one IOASID space for both IOVA
map/unmap usages (one IOASID per device) and SVM usages (multiple
IOASIDs per device). The former is digested by software and the latter
will be bound to hardware.

Thanks
Kevin
Jean-Philippe Brucker
2017-04-24 15:05:47 UTC
Post by Tian, Kevin
Sent: Wednesday, April 19, 2017 2:46 AM
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
[...]
II. Feature bits
================
VIRTIO_IOMMU_F_INPUT_RANGE (0)
Available range of virtual addresses is described in input_range
Usually only the maximum supported address bits are important.
Curious do you see such situation where low end of the address
space is not usable (since you have both start/end defined later)?
A start address would allow to provide something resembling a GART to the
guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
aperture. I'm not sure how useful that would be in practice.
Intel VT-d has no such limitation, which I can tell. :-)
On a related note, the virtio-iommu itself doesn't provide a
per-address-space aperture as it stands. For example, attaching a device
to an address space might restrict the available IOVA range for the whole
AS if that device cannot write to high memory (above 32-bit). If the guest
attempts to map an IOVA outside this window into the device's address
space, it should expect the MAP request to fail. And when attaching, if
the address space already has mappings outside this window, then ATTACH
should fail.
This too seems to be something that ought to be communicated by firmware,
but bits are missing (I can't find anything equivalent to DT's dma-ranges
for PCI root bridges in ACPI tables, for example). In addition VFIO
doesn't communicate any DMA mask for devices, and doesn't check them
itself. I guess that the host could find out the DMA mask of devices one
way or another, but it is tricky to enforce, so I didn't make this a hard
requirement. Although I should probably add a few words about it.
If there is no such communication on bare metal, then same for pvIOMMU.
Post by Tian, Kevin
[...]
1. Attach device
----------------
struct virtio_iommu_req_attach {
le32 address_space;
le32 device;
le32 flags/reserved;
};
Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
Based on your description this address space ID is per operation right?
MAP/UNMAP and page-table sharing should have different ID spaces...
I think it's simpler if we keep a single IOASID space per virtio-iommu
device, because the maximum number of address spaces (described by
ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
you still need to define which devices will share a page directory using
ATTACH requests, though that interface is not set in stone.
got you. yes VM is supposed to consume less IOASIDs than physically
available. It doesn’t hurt to have one IOASID space for both IOVA
map/unmap usages (one IOASID per device) and SVM usages (multiple
IOASIDs per device). The former is digested by software and the latter
will be bound to hardware.
Hmm, I'm using address space indexed by IOASID for "classic" IOMMU, and
then contexts indexed by PASID when talking about SVM. So in my mind an
address space can have multiple sub-address-spaces (contexts). Number of
IOASIDs is a limitation of the pIOMMU, and number of PASIDs is a
limitation of the device. Therefore attaching devices to address spaces
would update the number of available contexts in that address space. The
terminology is not ideal, and I'd be happy to change it for something more
clear.

Thanks,
Jean-Philippe
Jean-Philippe Brucker
2017-04-07 19:17:47 UTC
Here I propose a few ideas for extensions and optimizations. This is all
very exploratory, feel free to correct mistakes and suggest more things.

I. Linux host
1. vhost-iommu
2. VFIO nested translation
II. Page table sharing
1. Sharing IOMMU page tables
2. Sharing MMU page tables (SVM)
3. Fault reporting
4. Host implementation with VFIO
III. Relaxed operations
IV. Misc


I. Linux host
=============

1. vhost-iommu
--------------

An advantage of virtualizing an IOMMU using virtio is that it allows
hoisting a lot of the emulation code into the kernel using vhost, avoiding
a return to userspace for each request. The mainline kernel already
implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
could be reused.

Introducing vhost in a simplified scenario 1 (with the guest userspace
pass-through removed, as it is irrelevant to this example) gives us the
following:

MEM____pIOMMU________PCI device____________ HARDWARE
| \
----------|-------------+-------------+-----\--------------------------
| : KVM : \
pIOMMU drv : : \ KERNEL
| : : net drv
VFIO : : /
| : : /
vhost-iommu_________________________virtio-iommu-drv
: :
--------------------------------------+-------------------------------
HOST : GUEST


With vhost introduced in scenario 2, userspace now only handles the device
initialisation part, and most runtime communication is handled in the kernel:

MEM__pIOMMU___PCI device HARDWARE
| |
-------|---------|------+-------------+-------------------------------
| | : KVM :
pIOMMU drv | : : KERNEL
\__net drv : :
| : :
tap : :
| : :
_vhost-net________________________virtio-net drv
(2) / : : / (1a)
/ : : /
vhost-iommu________________________________virtio-iommu drv
: : (1b)
------------------------+-------------+-------------------------------
HOST : GUEST

(1) a. Guest virtio driver maps ring and buffers
b. Map requests are relayed to the host the same way.
(2) To access any guest memory, vhost-net must query the IOMMU. We can
reuse the existing TLB protocol for this. TLB commands are written to
and read from the vhost-net fd.

As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
has everything needed for map/unmap operations:

struct vhost_iotlb_msg {
__u64 iova;
__u64 size;
__u64 uaddr;
__u8 perm; /* R/W */
__u8 type;
#define VHOST_IOTLB_MISS 1
#define VHOST_IOTLB_UPDATE 2 /* MAP */
#define VHOST_IOTLB_INVALIDATE 3 /* UNMAP */
#define VHOST_IOTLB_ACCESS_FAIL 4
};

struct vhost_msg {
int type;
union {
struct vhost_iotlb_msg iotlb;
__u8 padding[64];
};
};

The vhost-iommu device associates a virtual device ID to a TLB fd. We
should be able to use the same commands for [vhost-net <-> virtio-iommu]
and [virtio-net <-> vhost-iommu] communication. A virtio-net device
would open a socketpair and hand one side to vhost-iommu.

If vhost_msg is ever used for a purpose other than TLB, we'll have some
trouble, as there will be multiple clients that want to read/write the
vhost fd. A multicast transport method will be needed. Until then, this
can work.

Details of operations would be:

(1) Userspace sets up vhost-iommu as with other vhost devices, by using
standard vhost ioctls. Userspace starts by describing the system topology
via ioctl:

ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
vhost_iommu_add_device)

#define VHOST_IOMMU_DEVICE_TYPE_VFIO
#define VHOST_IOMMU_DEVICE_TYPE_TLB

struct vhost_iommu_add_device {
__u8 type;
__u32 devid;
union {
struct vhost_iommu_device_vfio {
int vfio_group_fd;
};
struct vhost_iommu_device_tlb {
int fd;
};
};
};

(2) VIRTIO_IOMMU_T_ATTACH(address space, devid)

vhost-iommu creates an address space if necessary, finds the device along
with the relevant operations. If type is VFIO, operations are done on a
container, otherwise they are done on single devices.

(3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)

Turn phys into an HVA using the vhost mem table.

- If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
mapping locally and wait for the TLB to ask for it with a
VHOST_IOTLB_MISS.
- If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
introduce a shortcut in the external user API of VFIO).

(4) VIRTIO_IOMMU_T_UNMAP(address space, virt, size)

- If type is TLB, send a VHOST_IOTLB_INVALIDATE.
- If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.

(5) VIRTIO_IOMMU_T_DETACH(address space, devid)

Undo whatever was done in (2).


2. VFIO nested translation
--------------------------

For my current kvmtool implementation, I am putting each VFIO group in a
different container during initialization. We cannot detach a group from a
container at runtime without first resetting all devices in that group. So
the best way to provide dynamic address spaces right now is one container
per group. The drawback is that we need to maintain multiple sets of page
tables even if the guest wants to put all devices in the same address
space. Another disadvantage is that, when implementing bypass mode, we need
to map the whole address space at the beginning, then unmap everything on
attach. Adding nested support would be a nice way to provide dynamic
address spaces while keeping groups tied to a container at all times.

A physical IOMMU may offer nested translation. In this case, address
spaces are managed by two page directories instead of one. A guest-
virtual address is translated into a guest-physical one using what we'll
call here "stage-1" (s1) page tables, and the guest-physical address is
translated into a host-physical one using "stage-2" (s2) page tables.

s1 s2
GVA --> GPA --> HPA

There isn't a lot of support in Linux for nesting IOMMU page directories
at the moment (though SVM support is coming, see II). VFIO does have a
"nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
code uses this to decide whether to manage the container with s2 page
tables instead of s1, but even then we still only have a single stage and
it is assumed that IOVA=GPA.

Another model that would help with dynamically changing address spaces is
nesting VFIO containers:

Parent <---------- map/unmap
container
/ | \
/ group \
Child Child <--- map/unmap
container container
| | |
group group group

At the beginning all groups are attached to the parent container, and
there is no child container. Doing map/unmap on the parent container maps
stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should
be able to choose whether they want all devices attached to this container
to be able to access GPAs (bypass mode, as it currently is) or simply
block all DMA (in which case there is no need to pin pages here).

At some point the guest wants to create an address space and attaches
children to it. Using an ioctl (to be defined), we can derive a child
container from the parent container, and move groups from parent to child.

This returns a child fd. When the guest maps something in this new address
space, we can do a map ioctl on the child container, which maps stage-1
page tables (map GVA -> GPA).

A page table walk may access multiple levels of tables (pgd, p4d, pud,
pmd, pt). With nested translation, each access to a table during the
stage-1 walk requires a stage-2 walk. This makes a full translation costly
so it is preferable to use a single stage of translation when possible.
Folding two stages into one is simple with a single container, as shown in
the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
fold the full GVA->HVA mapping before sending the VFIO request. With
nested containers however, the IOMMU driver would have to do the folding
work itself. Keeping a copy of stage-2 mapping created on the parent
container, it would fold them into the actual stage-2 page tables when
receiving a map request on the child container (note that software folding
is not possible when stage-1 pgd is managed by the guest, as described in
next section).

I don't know if nested VFIO containers are a desirable feature at all. I
find the concept cute on paper, and it would make it easier for userspace
to juggle address spaces, but it might require some invasive changes
in VFIO, and people have been able to use the current API for IOMMU
virtualization so far.


II. Page table sharing
======================

1. Sharing IOMMU page tables
----------------------------

VIRTIO_IOMMU_F_PT_SHARING

This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.

When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
use the MAP/UNMAP interface and PT_SHARING at the same time.

Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.

(1) Driver attaches devices to address spaces as usual, but a flag
VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
create page tables for use with the MAP/UNMAP API. The driver intends
to manage the address space itself.

(2) Driver sends a PROBE_TABLE request. It sets len to the size of the
pg_format array.

VIRTIO_IOMMU_T_PROBE_TABLE

struct virtio_iommu_req_probe_table {
le32 address_space;
le32 flags;
le32 len;

le32 nr_contexts;
struct {
le32 model;
u8 format[64];
} pg_format[len];
};

Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.

(3) The device responds with success, returning in pg_format all page
table formats implemented by the physical IOMMU. 'model' 0 is invalid, so
the driver can initialize the array to 0 and deduce from there which
entries have been filled in by the device.

Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implemented by io-pgtable, for example.)

(4) If the driver is able to use this format, it sends the ATTACH_TABLE
request.

VIRTIO_IOMMU_T_ATTACH_TABLE

struct virtio_iommu_req_attach_table {
        le32    address_space;
        le32    flags;
        le64    table;

        le32    nr_contexts;
        /* Page-table format description */
        le32    model;
        u8      config[64];
};


'table' is a pointer to the page directory. 'nr_contexts' isn't used
here.

For both ATTACH and PROBE, 'flags' are the following (and will be
explained later):

VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT (1 << 0)
VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE (1 << 1)
VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT (1 << 2)

Now 'model' is a bit tricky. We need to specify all possible page table
formats and their parameters. I'm not well-versed in x86, s390 or other
IOMMUs, so I'll just focus on the ARM world for this example. We basically
have two page table models, with a multitude of configuration bits:

* ARM LPAE
* ARM short descriptor

We could define a high-level identifier per page-table model, such as:

#define PG_TABLE_ARM 0x1
#define PG_TABLE_X86 0x2
...

And each model would define its own structure. On ARM 'format' could be a
simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
also contain additional capabilities. Then depending on the variant,
'config' would be:

struct pg_config_v7s {
        le32    tcr;
        le32    prrr;
        le32    nmrr;
        le32    asid;
};

struct pg_config_lpae {
        le64    tcr;
        le64    mair;
        le32    asid;

        /* And maybe TTB1? */
};

struct pg_config_arm {
        le32    variant;
        union   ...;
};

I am really uneasy with describing all those nasty architectural details
in the virtio-iommu specification. We certainly won't start describing the
content bit-by-bit of tcr or mair here, but just declaring these fields
might be sufficient.

(5) Once the table is attached, the driver can simply write the page
tables and expect the physical IOMMU to observe the mappings without
any additional request. When changing or removing a mapping, however,
the driver must send an invalidate request.

VIRTIO_IOMMU_T_INVALIDATE

struct virtio_iommu_req_invalidate {
        le32    address_space;
        le32    context;
        le32    flags;
        le64    virt_addr;
        le64    range_size;

        u8      opaque[64];
};

'flags' may be:

VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
from 'context' (context is 0 when !F_INDIRECT).

And with context tables only (explained below):

VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
are ignored.

VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
in the table that changed. Device reads the table again, compares it
to previous values, and invalidates all mappings for contexts that
changed. context, virt_addr and range_size are ignored.

IOMMUs may offer hints and quirks in their invalidation packets. The
opaque structure in invalidate would allow transporting those. This
depends on the page table format and, as with architectural page-table
definitions, I really don't want to have those details in the spec itself.


2. Sharing MMU page tables
--------------------------

The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, the guest can
send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.

F_FAULT means that the host communicates page requests from the device to
the guest, and the guest can handle them by mapping the virtual address in
the fault to a page. It is only available with VIRTIO_IOMMU_F_EVENT_QUEUE
(see below.)

F_NATIVE means that the pIOMMU pgtable format is the same as the guest
MMU pgtable format.

F_INDIRECT means that the 'table' pointer is a context table, instead of a
page directory. Each slot in the context table points to a page directory:

64 2 1 0
table ----> +---------------------+
| pgd |0|1|<--- context 0
| --- |0|0|<--- context 1
| pgd |0|1|
| --- |0|0|
| --- |0|0|
+---------------------+
| \___Entry is valid
|______reserved

Question: do we want per-context page table format, or can it stay global
for the whole indirect table?

Having a context table makes it possible to provide multiple address spaces for a
single device. In the simplest form, without F_INDIRECT we have a single
address space per device, but some devices may implement more, for
instance devices with the PCI PASID extension.

A slot's position in the context table gives an ID, between 0 and
nr_contexts. The guest can use this ID to have the device target a
specific address space with DMA. The mechanism to do that is
device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
define a specific way of using them for DMA, it's the device driver's
concern.


3. Fault reporting
------------------

VIRTIO_IOMMU_F_EVENT_QUEUE

With this feature, an event virtqueue (1) is available. For now it will
only be used for fault handling, but I'm calling it eventq so that other
asynchronous features can piggy-back on it. The device may report faults and
page requests by sending buffers via the used ring.

#define VIRTIO_IOMMU_T_FAULT 0x05

struct virtio_iommu_evt_fault {
        struct virtio_iommu_evt_head {
                u8      type;
                u8      reserved[3];
        };

        u32     address_space;
        u32     context;

        u64     vaddr;
        u32     flags;          /* Access details: R/W/X */

        /* In the reply: */
        u32     reply;          /* Fault handled, or failure */
        u64     paddr;
};

The driver must send the reply via the request queue, with the fault status
in 'reply', and the mapped page in 'paddr' on success.

Existing fault handling interfaces such as PRI have a tag (PRG) for
identifying a page request (or group thereof) when sending a reply. I
wonder if this would be useful to us, but it seems like the
(address_space, context, vaddr) tuple is sufficient to identify a page
fault, provided the device doesn't send duplicate faults. Duplicate faults
could be required if they have a side effect, for instance implementing a
poor man's doorbell. If this is desirable, we could add a fault_id field.


4. Host implementation with VFIO
--------------------------------

The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementations will most likely let
the guest manage full context tables (PASID tables) themselves, giving the
context table pointer to the pIOMMU via a VFIO ioctl.

For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow passing single page directories via VFIO, using structures very
similar to those described above, whilst reusing the VFIO channel developed
for Intel vIOMMU.

* VFIO_SVM_INFO: probe page table formats
* VFIO_SVM_BIND: set pgd and arch-specific configuration

There is an inconvenience in letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).


III. Relaxed operations
=======================

VIRTIO_IOMMU_F_RELAXED

Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, the device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex to
implement. Given a MAP([start:end] -> phys, flags) request:

(1) If [start:end] isn't mapped, request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
unmap [max(start, old_start):min(end, old_end)] and replace it with
[start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
request exactly (same flags, same phys address), the old mapping is
kept.

This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.

In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduces the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].

This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.

We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. The driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.


IV. Misc
========

I think we have enough to go on for a while. To improve MAP throughput, I
considered adding a MAP_SG request depending on a feature bit, with
variable size:

struct virtio_iommu_req_map_sg {
        struct virtio_iommu_req_head;
        u32     address_space;
        u32     nr_elems;
        u64     virt_addr;
        u64     size;
        u64     phys_addr[nr_elems];
};

Would create the following mappings:

        virt_addr            -> phys_addr[0]
        virt_addr + size     -> phys_addr[1]
        virt_addr + 2 * size -> phys_addr[2]
        ...

This would avoid the overhead of multiple map commands. We could try to
find a more cunning format to compress virtually-contiguous mappings with
different (phys, size) pairs as well. But Linux drivers rarely prefer
map_sg() functions over regular map(), so I don't know if the whole map_sg
feature is worth the effort. All we would gain is a few bytes anyway.

My current map_sg implementation in the virtio-iommu driver adds a batch
of map requests to the queue and kicks the host once. That might be enough
of an optimization.


Another invasive optimization would be adding grouped requests. By adding
two flags in the header, L and G, we can group sequences of requests
together, and have one status at the end, either 0 if all requests in the
group succeeded, or the status of the first request that failed. This is
all in-order. Requests in a group follow each other; there is no sequence
identifier.

___ L: request is last in the group
/ _ G: request is part of a group
| /
v v
31 9 8 7 0
+--------------------------------+ <------- RO descriptor
| res0 |0|1| type |
+--------------------------------+
| payload |
+--------------------------------+
| res0 |0|1| type |
+--------------------------------+
| payload |
+--------------------------------+
| res0 |0|1| type |
+--------------------------------+
| payload |
+--------------------------------+
| res0 |1|1| type |
+--------------------------------+
| payload |
+--------------------------------+ <------- WO descriptor
| res0 | status |
+--------------------------------+

This adds some complexity on the device, since it must unroll whatever was
done by successful requests in a group as soon as one fails, and reject
all subsequent ones. A group of requests is an atomic operation. As with
map_sg, this change mostly allows saving space and virtio descriptors.


[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
[2] vIOMMU: Efficient IOMMU Emulation
N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster
Tian, Kevin
2017-04-21 08:31:15 UTC
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
Here I propose a few ideas for extensions and optimizations. This is all
very exploratory, feel free to correct mistakes and suggest more things.
[...]
II. Page table sharing
======================
1. Sharing IOMMU page tables
----------------------------
VIRTIO_IOMMU_F_PT_SHARING
This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.
When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature
VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
be using the MAP/UNMAP interface and PT_SHARING at the same time.
Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.
(1) Driver attaches devices to address spaces as usual, but a flag
VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
create page tables for use with the MAP/UNMAP API. The driver intends
to manage the address space itself.
(2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
pg_format array.
VIRTIO_IOMMU_T_PROBE_TABLE
struct virtio_iommu_req_probe_table {
le32 address_space;
le32 flags;
le32 len;
le32 nr_contexts;
struct {
le32 model;
u8 format[64];
} pg_format[len];
};
Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.
(3) Device responds success with all page table formats implemented by the
physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
initialize the array to 0 and deduce from there which entries have
been filled by the device.
Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implement by io-pgtable for example.)
So essentially you need to modify all existing IOMMU drivers to support
page table sharing in pvIOMMU. After the abstraction is done, the core
pvIOMMU files can be kept vendor agnostic. But if we talk about the whole
pvIOMMU module, it actually includes vendor-specific logic, unlike typical
para-virtualized virtio drivers, which are completely vendor agnostic. Is
this understanding accurate?

It also means the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to the Qemu vIOMMU, meaning such format
definitions need to be agreed consistently across all those components.

[...]
2. Sharing MMU page tables
--------------------------
The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, guest can
send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.
F_FAULT means that the host communicates page requests from device to the
guest, and the guest can handle them by mapping virtual address in the
fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
below.)
F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
pgtable format.
F_INDIRECT means that 'table' pointer is a context table, instead of a
64 2 1 0
table ----> +---------------------+
| pgd |0|1|<--- context 0
| --- |0|0|<--- context 1
| pgd |0|1|
| --- |0|0|
| --- |0|0|
+---------------------+
| \___Entry is valid
|______reserved
Question: do we want per-context page table format, or can it stay global
for the whole indirect table?
Are you defining this context table format in software, or following a
hardware definition? At least for VT-d there is a strict hardware-defined
structure (PASID table) which must be used here.

[...]
4. Host implementation with VFIO
--------------------------------
The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementation will most likely let
guest manage full context tables (PASID tables) themselves, giving the
context table pointer to the pIOMMU via a VFIO ioctl.
For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
Since you'll ultimately require vendor-specific page table logic anyway,
why not also abstract this context table, which then wouldn't
require the host-side changes below?
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow to pass single page directories via VFIO, using very similar
structures as described above, whilst reusing the VFIO channel developed
for Intel vIOMMU.
* VFIO_SVM_INFO: probe page table formats
* VFIO_SVM_BIND: set pgd and arch-specific configuration
There is an inconvenient with letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).
Can you elaborate on this flow? The pIOMMU driver doesn't directly manage
the GPA address space, so it's not reasonable for it to arbitrarily
specify a reserved range. It might make more sense for the GPA owner
(e.g. Qemu) to decide and then pass the information to the pIOMMU driver.
III. Relaxed operations
=======================
VIRTIO_IOMMU_F_RELAXED
Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex to
(1) If [start:end] isn't mapped, request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
unmap [max(start, old_start):min(end, old_end)] and replace it with
[start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
request exactly (same flags, same phys address), the old mapping is
kept.
This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.
In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduce the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].
This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit into a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.
We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. Driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.
Even with the above optimization I'd imagine the performance drop is
still significant for kernel map/unmap usages, not to mention when such
optimization is not possible because safety is required (actually I don't
know why an IOMMU is still required if safety can be compromised. Aren't
we using the IOMMU for security purposes?). I think we'd better focus on
higher-value usages, e.g. user space DMA protection (DPDK) and
SVM, while leaving kernel protection with a lower priority (mostly for
functionality verification). Is this strategy aligned with your thought?

btw what about interrupt remapping/posting? Are they also in your
plan for pvIOMMU?

Last, thanks for the very informative write-up! Looks like a long
enabling path is required to get the pvIOMMU feature on par with a real
IOMMU. Starting with a minimal set is relatively easier. :-)

Thanks
Kevin
Jean-Philippe Brucker
2017-04-24 15:05:55 UTC
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
Here I propose a few ideas for extensions and optimizations. This is all
very exploratory, feel free to correct mistakes and suggest more things.
[...]
II. Page table sharing
======================
1. Sharing IOMMU page tables
----------------------------
VIRTIO_IOMMU_F_PT_SHARING
This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.
When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature
VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
be using the MAP/UNMAP interface and PT_SHARING at the same time.
Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.
(1) Driver attaches devices to address spaces as usual, but a flag
VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
create page tables for use with the MAP/UNMAP API. The driver intends
to manage the address space itself.
(2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
pg_format array.
VIRTIO_IOMMU_T_PROBE_TABLE
struct virtio_iommu_req_probe_table {
le32 address_space;
le32 flags;
le32 len;
le32 nr_contexts;
struct {
le32 model;
u8 format[64];
} pg_format[len];
};
Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.
(3) Device responds success with all page table formats implemented by the
physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
initialize the array to 0 and deduce from there which entries have
been filled by the device.
Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implement by io-pgtable for example.)
So essentially you need modify all existing IOMMU drivers to support page
table sharing in pvIOMMU. After abstraction is done the core pvIOMMU files
can be kept vendor agnostic. But if we talk about the whole pvIOMMU
module, it actually includes vendor specific logic thus unlike typical
para-virtualized virtio drivers being completely vendor agnostic. Is this
understanding accurate?
Yes, although kernel modules would be separate. For Linux on ARM we
already have the page-table logic abstracted in iommu/io-pgtable module,
because multiple IOMMUs share the same PT formats (SMMUv2, SMMUv3, Renesas
IPMMU, Qcom MSM, Mediatek). It offers a simple interface:

* When attaching devices to an IOMMU domain, the IOMMU driver registers
its page table format and provides invalidation callbacks.

* On iommu_map/unmap, the IOMMU driver calls into io_pgtable_ops, which
provide map, unmap and iova_to_phys functions.

* Page table operations call back into the driver via iommu_gather_ops
when they need to invalidate TLB entries.

Currently only a few flavors of ARM PT formats are implemented, but
other page table formats could be added if they fit this model.
Post by Tian, Kevin
It also means in the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to Qemu vIOMMU, meaning
such format definitions need be consistently agreed across all those
components.
Yes, that's the icky part. We need to define a format that every OS and
hypervisor implementing virtio-iommu can understand (similarly to the
PASID table sharing interface that Yi L is working on for VFIO, although
that one is contained in Linux UAPI and doesn't require other OSes to know
about it).
Post by Tian, Kevin
2. Sharing MMU page tables
--------------------------
The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, guest can
send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.
F_FAULT means that the host communicates page requests from device to the
guest, and the guest can handle them by mapping virtual address in the
fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
below.)
F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
pgtable format.
F_INDIRECT means that 'table' pointer is a context table, instead of a
64 2 1 0
table ----> +---------------------+
| pgd |0|1|<--- context 0
| --- |0|0|<--- context 1
| pgd |0|1|
| --- |0|0|
| --- |0|0|
+---------------------+
| \___Entry is valid
|______reserved
Question: do we want per-context page table format, or can it stay global
for the whole indirect table?
Are you defining this context table format in software, or following
hardware definition? At least for VT-d there is a strict hardware-defined
structure (PASID table) which must be used here.
This definition is only for virtio-iommu; I didn't follow any hardware
definitions. For SMMUv3 the context tables are completely different. There
may be two levels of tables, and each context gets a 512-bit descriptor
(it has per-context page table format and other info).

To be honest I'm not sure where I was going with this indirect table. I
can't see any advantage in using an indirect table over sending a bunch of
individual ATTACH_TABLE requests, each with a pgd and a pasid. However the
indirect flag could be needed for sharing physical context tables (below).
Post by Tian, Kevin
4. Host implementation with VFIO
--------------------------------
The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementation will most likely let
guest manage full context tables (PASID tables) themselves, giving the
context table pointer to the pIOMMU via a VFIO ioctl.
For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
Since anyway you'll finally require vendor specific page table logic,
why not also abstracting this context table too which then doesn't
require below host-side changes?
I keep going back and forth on that question :) Some pIOMMUs won't have
context tables, so we need an ATTACH_TABLE interface for sharing a single
pgd anyway. Now for SVM, we could either create an additional interface
for vendor-specific context tables, or send individual ATTACH_TABLE
requests.

The disadvantage of sharing context tables is that it requires more
specification work to enumerate all existing context table formats,
similarly to the work needed for defining all page table formats. As I
said earlier this work needs to be done anyway for VFIO, but this time it
would be an interface that needs to suit all OSes and hypervisors, not only
Linux. I think it's a lot more complicated to agree on that since it's not
a matter of sending Linux patches to extend the interface anymore, it is a
wider scope.

So we need to carefully consider whether this additional specification
effort is really needed. We certainly want to share page tables with the
guest to improve performance over the map/unmap interface, but I don't
see a similar performance concern on context tables. Supposedly binding a
device context to a task is a relatively rare event, much less frequent
than updating PT mappings.

In addition page table formats might be more common than context table
formats and therefore easier to abstract. With context tables you will
need one format per IOMMU variant, whereas (on ARM) multiple IOMMUs could
share the same page table format. I'm not sure whether the same argument
applies to x86 (similarity of page tables between Intel and AMD IOMMU
versus differences in PASID/GCR3 table formats).

On the other hand, the clear advantage of sharing context tables with the
guest is that we don't have to do the complicated memory reserve dance
described below.
Post by Tian, Kevin
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow passing single page directories via VFIO, using very similar
structures as described above, whilst reusing the VFIO channel developed
for Intel vIOMMU.
* VFIO_SVM_INFO: probe page table formats
* VFIO_SVM_ATTACH_TABLE: set pgd and arch-specific configuration
There is an inconvenience in letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).
Can you elaborate this flow? pIOMMU driver doesn't directly manage GPA
address space thus it's not reasonable for it to randomly specify a reserved
range. It might make more sense for GPA owner (e.g. Qemu) to decide and
then pass information to pIOMMU driver.
I realized that it's actually more complicated than this, because I didn't
consider hotplugging devices into VM. If you insert new devices at
runtime, you might need more GPA space for storing their context tables,
but only if they don't attach to an existing address space (otherwise on
ARM we could reuse the existing context table).

So GPA space cannot be reserved statically, but must be reclaimed at
runtime. In addition, context tables can become quite big, and with static
reserve we'd have to reserve tonnes of GPA space upfront even if the guest
isn't planning on using context tables at all. And even without
considering SVM, some IOMMUs (namely SMMUv3) would still need a
single-entry table in GPA space for nested translation.

I don't have any pleasant solution so far. One way of doing it is to carry
memory reclaim within the ATTACH_TABLE requests:

(1) Driver sends ATTACH_TABLE(pasid, pgd)
(2) Device relays BIND(pasid, pgd) to pIOMMU via VFIO
(3) pIOMMU needs, say, 512KiB of contiguous GPA for mapping a context
table. Returns this info via VFIO.
(4) Device replies to ATTACH_TABLE with "try again" and, somewhere in the
request buffer, stores the amount of contiguous GPA that the operation
will cost.
(5) Driver re-sends the ATTACH_TABLE request, but this time with a GPA
address that the host can use.

Note that each reclaim for a table should be accompanied by an identifier
for that table, so that if a second ATTACH_TABLE request reaches the
device between (4) and (5) and requires GPA space for the same table, the
device returns the same GPA reclaim with the same identifier and the
driver won't have to allocate GPA twice.

If the pIOMMU needs N > 1 contiguous GPA chunks (for instance, two levels
of context tables) we could do N reclaims (requiring N + 1 ATTACH_TABLE
requests) or put an array in the ATTACH_TABLE request. I prefer the
former; there is little advantage to the latter.
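The retry handshake in steps (1)-(5) can be sketched as a small simulation. This is a minimal sketch, not the actual interface: the status codes, the `attach_table_req` layout and the `alloc_gpa` callback are all hypothetical names; the real request would go through virtio and VFIO.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes and request layout for the ATTACH_TABLE
 * reclaim flow; none of these names come from an actual spec. */
enum { VIOMMU_OK = 0, VIOMMU_TRY_AGAIN = 1 };

struct attach_table_req {
    uint32_t pasid;
    uint64_t pgd;
    uint64_t gpa_hint;   /* GPA chunk offered by the driver, 0 on first try */
    uint64_t gpa_needed; /* filled in by the device on TRY_AGAIN */
    uint32_t reclaim_id; /* identifies the table needing backing memory */
};

/* Device side: pretend the pIOMMU needs 512 KiB of contiguous GPA for a
 * context table before the attach can succeed (step 3-4). */
static int device_attach_table(struct attach_table_req *req)
{
    if (req->gpa_hint == 0) {
        req->gpa_needed = 512 * 1024;
        req->reclaim_id = 1;
        return VIOMMU_TRY_AGAIN;
    }
    return VIOMMU_OK;
}

/* Driver side: retry loop (steps 1 and 5), allocating GPA space on
 * demand via a caller-provided allocator. */
static int driver_attach(uint32_t pasid, uint64_t pgd, uint64_t *gpa_out,
                         uint64_t (*alloc_gpa)(uint64_t size))
{
    struct attach_table_req req = { .pasid = pasid, .pgd = pgd };
    int ret;

    while ((ret = device_attach_table(&req)) == VIOMMU_TRY_AGAIN)
        req.gpa_hint = alloc_gpa(req.gpa_needed);

    *gpa_out = req.gpa_hint;
    return ret;
}

/* Toy allocator standing in for the guest's GPA space management. */
static uint64_t toy_alloc_gpa(uint64_t size)
{
    (void)size;
    return 0x80000000ULL;
}
```

A real driver would also cache `reclaim_id` so that concurrent attaches to the same table reuse one allocation, as noted above.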

Alternatively, this could be a job for something similar to
virtio-balloon, with contiguous chunks instead of pages. The ATTACH_TABLE
would block the primary request queue while the GPA reclaim is serviced by
the guest on an auxiliary queue (which may not be acceptable if the driver
expects MAP/UNMAP/INVALIDATE requests on the same queue to be fast).

In any case, I would greatly appreciate any proposal for a nicer
mechanism, because this feels very fragile.
Post by Tian, Kevin
III. Relaxed operations
=======================
VIRTIO_IOMMU_F_RELAXED
Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex:
(1) If [start:end] isn't mapped, the request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
unmap [max(start, old_start):min(end, old_end)] and replace it with
[start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
request exactly (same flags, same phys address), the old mapping is
kept.
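The three cases can be expressed as a small classification helper. This is only a sketch of the decision logic against a single existing mapping; the struct and enum names are made up, and a real driver would walk an interval tree rather than check one entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One existing mapping; a real implementation keeps an interval tree. */
struct mapping { uint64_t start, end, phys; uint32_t flags; };

enum map_action { MAP_NEW, MAP_REPLACE_OVERLAP, MAP_KEEP_OLD };

/* Classify a relaxed-mode MAP request against an existing mapping,
 * following cases (1)-(3) above. Ranges are inclusive. */
static enum map_action classify_map(const struct mapping *old, bool old_valid,
                                    uint64_t start, uint64_t end,
                                    uint64_t phys, uint32_t flags)
{
    if (!old_valid || end < old->start || start > old->end)
        return MAP_NEW;              /* (1): no overlap */
    if (start == old->start && end == old->end &&
        phys == old->phys && flags == old->flags)
        return MAP_KEEP_OLD;         /* (3): exact match, keep old mapping */
    return MAP_REPLACE_OVERLAP;      /* (2): unmap the overlap, remap */
}
```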
This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.
In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduces the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].
This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.
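The delayed-teardown mitigation can be sketched as a buffered unmap queue. This is an illustrative structure only (names and the fixed-size array are hypothetical); it shows the idea of holding unmaps until a deadline, not an actual driver design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buffered unmap with a teardown delay (paper [2] uses 10 ms): unmap
 * requests are queued with a deadline and only sent once it expires. */
#define TEARDOWN_DELAY_MS 10
#define MAX_PENDING 16

struct pending_unmap { uint64_t iova, size, deadline_ms; };

struct unmap_buffer {
    struct pending_unmap q[MAX_PENDING];
    size_t len;
};

static void queue_unmap(struct unmap_buffer *b, uint64_t iova, uint64_t size,
                        uint64_t now_ms)
{
    if (b->len < MAX_PENDING)
        b->q[b->len++] = (struct pending_unmap){ iova, size,
                                                 now_ms + TEARDOWN_DELAY_MS };
}

/* Send every unmap whose delay has elapsed; returns how many were sent. */
static size_t flush_expired(struct unmap_buffer *b, uint64_t now_ms,
                            void (*send)(uint64_t iova, uint64_t size))
{
    size_t i = 0, flushed = 0;

    while (i < b->len) {
        if (b->q[i].deadline_ms <= now_ms) {
            send(b->q[i].iova, b->q[i].size);
            b->q[i] = b->q[--b->len]; /* unordered removal */
            flushed++;
        } else {
            i++;
        }
    }
    return flushed;
}

/* Test helper counting sent unmaps. */
static int sent_count;
static void count_send(uint64_t iova, uint64_t size)
{
    (void)iova; (void)size;
    sent_count++;
}
```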
We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. Driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.
Even with the above optimization I'd imagine the performance drop is still
significant for kernel map/unmap usages, not to say when such
optimization is not possible if safety is required (actually I don't
know why an IOMMU is still required if safety can be compromised. Aren't
we using IOMMU for security purpose?).
I guess apart from security concerns, a significant use case would be
scatter-gather, avoiding large contiguous (and pinned down) allocations in
guests. It's quite useful when you start doing DMA over MB or GB of
memory. It also allows pass-through to guest userspace, but for that there
are other ways (UIO or vfio-noiommu).
Post by Tian, Kevin
I think we'd better focus on
higher-value usages, e.g. user space DMA protection (DPDK) and
SVM, while leaving kernel protection with a lower priority (mostly for
functionality verification). Is this strategy aligned with your thought?
btw what about interrupt remapping/posting? Are they also in your
plan for pvIOMMU?
I didn't think about this so far, because we don't have a special region
reserved for MSIs in the ARM IOMMUs; all MSI doorbells are accessed with
IOVAs and translated similarly to other regions. In addition with KVM ARM,
MSI injection bypasses the IOMMU altogether, the host doesn't actually
write the MSI. I could take a look at what other hypervisors and
architectures do.
Post by Tian, Kevin
Last, thanks for the very informative write-up! Looks like a long enabling
path is required to get the pvIOMMU feature on par with a real IOMMU. Starting
with a minimal set is relatively easier. :-)
Yes, I described possible improvements in 3/3 in order to see how they
would fit within the baseline device of 2/3. But apart from vhost
prototype, these are a long way off, and I'd like to make sure that the
base is solid before tackling the rest.

Thanks,
Jean-Philippe
Michael S. Tsirkin
2017-04-26 16:24:24 UTC
Permalink
Post by Jean-Philippe Brucker
Here I propose a few ideas for extensions and optimizations. This is all
very exploratory, feel free to correct mistakes and suggest more things.
I. Linux host
1. vhost-iommu
A qemu based implementation would be a first step.
Would allow validating the claim that it's much
simpler to support than e.g. VTD.
Post by Jean-Philippe Brucker
2. VFIO nested translation
II. Page table sharing
1. Sharing IOMMU page tables
2. Sharing MMU page tables (SVM)
3. Fault reporting
4. Host implementation with VFIO
III. Relaxed operations
IV. Misc
I. Linux host
=============
1. vhost-iommu
--------------
An advantage of virtualizing an IOMMU using virtio is that it allows
hoisting a lot of the emulation code into the kernel using vhost, avoiding
a return to userspace for each request. The mainline kernel already
implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
could be reused.
Introducing vhost in a simplified scenario 1 (guest userspace removed):
MEM____pIOMMU________PCI device____________ HARDWARE
| \
----------|-------------+-------------+-----\--------------------------
| : KVM : \
pIOMMU drv : : \ KERNEL
| : : net drv
VFIO : : /
| : : /
vhost-iommu_________________________virtio-iommu-drv
--------------------------------------+-------------------------------
HOST : GUEST
Introducing vhost in scenario 2, userspace now only handles the device:
MEM__pIOMMU___PCI device HARDWARE
| |
-------|---------|------+-------------+-------------------------------
pIOMMU drv | : : KERNEL
_vhost-net________________________virtio-net drv
(2) / : : / (1a)
/ : : /
vhost-iommu________________________________virtio-iommu drv
: : (1b)
------------------------+-------------+-------------------------------
HOST : GUEST
(1) a. Guest virtio driver maps ring and buffers
b. Map requests are relayed to the host the same way.
(2) To access any guest memory, vhost-net must query the IOMMU. We can
reuse the existing TLB protocol for this. TLB commands are written to
and read from the vhost-net fd.
As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure is:
struct vhost_iotlb_msg {
__u64 iova;
__u64 size;
__u64 uaddr;
__u8 perm; /* R/W */
__u8 type;
#define VHOST_IOTLB_MISS
#define VHOST_IOTLB_UPDATE /* MAP */
#define VHOST_IOTLB_INVALIDATE /* UNMAP */
#define VHOST_IOTLB_ACCESS_FAIL
};
struct vhost_msg {
int type;
union {
struct vhost_iotlb_msg iotlb;
__u8 padding[64];
};
};
The vhost-iommu device associates a virtual device ID to a TLB fd. We
should be able to use the same commands for [vhost-net <-> virtio-iommu]
and [virtio-net <-> vhost-iommu] communication. A virtio-net device
would open a socketpair and hand one side to vhost-iommu.
If vhost_msg is ever used for another purpose than TLB, we'll have some
trouble, as there will be multiple clients that want to read/write the
vhost fd. A multicast transport method will be needed. Until then, this
can work.
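The socketpair handoff between the virtio-net device model and vhost-iommu can be demonstrated concretely. The message struct below is a cut-down, hypothetical stand-in for `vhost_iotlb_msg` (real TLB traffic goes through the vhost fd with the exact UAPI layout); the point is only that a `SOCK_SEQPACKET` pair preserves message boundaries between the two sides.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Toy stand-in for struct vhost_iotlb_msg, just to demonstrate the
 * channel: one end goes to the virtio-net device model, the other to
 * vhost-iommu. */
struct toy_iotlb_msg {
    uint64_t iova, size, uaddr;
    uint8_t perm;   /* R/W */
    uint8_t type;   /* miss/update/invalidate */
};

/* Create the TLB channel: a Unix socketpair with datagram semantics, so
 * each write() is received as one whole message. */
static int make_tlb_channel(int fds[2])
{
    return socketpair(AF_UNIX, SOCK_SEQPACKET, 0, fds);
}
```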
(1) Userspace sets up vhost-iommu as with other vhost devices, by using
standard vhost ioctls. Userspace starts by describing the system topology
ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
vhost_iommu_add_device)
#define VHOST_IOMMU_DEVICE_TYPE_VFIO
#define VHOST_IOMMU_DEVICE_TYPE_TLB
struct vhost_iommu_add_device {
__u8 type;
__u32 devid;
union {
struct vhost_iommu_device_vfio {
int vfio_group_fd;
};
struct vhost_iommu_device_tlb {
int fd;
};
};
};
(2) VIRTIO_IOMMU_T_ATTACH(address space, devid)
vhost-iommu creates an address space if necessary, finds the device along
with the relevant operations. If type is VFIO, operations are done on a
container, otherwise they are done on single devices.
(3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)
Turn phys into an hva using the vhost mem table.
- If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
mapping locally and wait for the TLB to ask for it with a
VHOST_IOTLB_MISS.
- If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
introduce a shortcut in the external user API of VFIO).
(4) VIRTIO_IOMMU_T_UNMAP(address space, virt, phys, size, flags)
- If type is TLB, send a VHOST_IOTLB_INVALIDATE.
- If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.
(5) VIRTIO_IOMMU_T_DETACH(address space, devid)
Undo whatever was done in (2).
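Steps (2)-(5) amount to a per-request dispatch with backend-specific map/unmap ops. The sketch below uses invented names (`backend_ops`, `dispatch`) to show the shape of that switch; in a real vhost-iommu the VFIO branch would issue container ioctls and the TLB branch would emit `VHOST_IOTLB_*` messages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request dispatch in a vhost-iommu device. Backend ops
 * differ per device type: VFIO devices get container-wide operations,
 * TLB devices get vhost_iotlb messages. */
enum req { T_ATTACH, T_MAP, T_UNMAP, T_DETACH };

struct backend_ops {
    int (*map)(uint64_t virt, uint64_t hva, uint64_t size);
    int (*unmap)(uint64_t virt, uint64_t size);
};

static int dispatch(const struct backend_ops *ops, enum req type,
                    uint64_t virt, uint64_t hva, uint64_t size)
{
    switch (type) {
    case T_MAP:
        /* phys was already turned into an hva via the vhost mem table */
        return ops->map(virt, hva, size);
    case T_UNMAP:
        return ops->unmap(virt, size);
    case T_ATTACH:
    case T_DETACH:
        return 0; /* address-space bookkeeping elided */
    }
    return -1;
}

/* Test backend recording which op ran last (1 = map, 2 = unmap). */
static int last_op;
static int toy_map(uint64_t v, uint64_t h, uint64_t s)
{
    (void)v; (void)h; (void)s;
    last_op = 1;
    return 0;
}
static int toy_unmap(uint64_t v, uint64_t s)
{
    (void)v; (void)s;
    last_op = 2;
    return 0;
}
```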
2. VFIO nested translation
--------------------------
For my current kvmtool implementation, I am putting each VFIO group in a
different container during initialization. We cannot detach a group from a
container at runtime without first resetting all devices in that group. So
the best way to provide dynamic address spaces right now is one container
per group. The drawback is that we need to maintain multiple sets of page
tables even if the guest wants to put all devices in the same address
space. Another disadvantage is that, when implementing bypass mode, we need to
map the whole address space at the beginning, then unmap everything on
attach. Adding nested support would be a nice way to provide dynamic
address spaces while keeping groups tied to a container at all times.
A physical IOMMU may offer nested translation. In this case, address
spaces are managed by two page directories instead of one. A guest-
virtual address is translated into a guest-physical one using what we'll
call here "stage-1" (s1) page tables, and the guest-physical address is
translated into a host-physical one using "stage-2" (s2) page tables.
s1 s2
GVA --> GPA --> HPA
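The two-stage composition can be shown with toy single-level tables. This is purely illustrative (real walks traverse multiple levels, and `walk`/`translate` are invented names); it only demonstrates that a stage-1 miss or stage-2 miss each abort the translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID  UINT64_MAX
#define PAGE_SHIFT   12

/* Toy single-level "page table": index by page frame number. */
static uint64_t walk(const uint64_t *table, size_t len, uint64_t addr)
{
    uint64_t pfn = addr >> PAGE_SHIFT;

    if (pfn >= len || table[pfn] == TOY_INVALID)
        return TOY_INVALID;
    return (table[pfn] << PAGE_SHIFT) | (addr & ((1 << PAGE_SHIFT) - 1));
}

/* GVA -> GPA via stage-1, then GPA -> HPA via stage-2. */
static uint64_t translate(const uint64_t *s1, size_t l1,
                          const uint64_t *s2, size_t l2, uint64_t gva)
{
    uint64_t gpa = walk(s1, l1, gva);

    return gpa == TOY_INVALID ? TOY_INVALID : walk(s2, l2, gpa);
}
```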
There isn't a lot of support in Linux for nesting IOMMU page directories
at the moment (though SVM support is coming, see II). VFIO does have a
"nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
code uses this to decide whether to manage the container with s2 page
tables instead of s1, but even then we still only have a single stage and
it is assumed that IOVA=GPA.
Another model that would help with dynamically changing address spaces is
Parent <---------- map/unmap
container
/ | \
/ group \
Child Child <--- map/unmap
container container
| | |
group group group
At the beginning all groups are attached to the parent container, and
there is no child container. Doing map/unmap on the parent container maps
stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should
be able to choose whether they want all devices attached to this container
to be able to access GPAs (bypass mode, as it currently is) or simply
block all DMA (in which case there is no need to pin pages here).
At some point the guest wants to create an address space and attaches
children to it. Using an ioctl (to be defined), we can derive a child
container from the parent container, and move groups from parent to child.
This returns a child fd. When the guest maps something in this new address
space, we can do a map ioctl on the child container, which maps stage-1
page tables (map GVA -> GPA).
A page table walk may access multiple levels of tables (pgd, p4d, pud,
pmd, pt). With nested translation, each access to a table during the
stage-1 walk requires a stage-2 walk. This makes a full translation costly
so it is preferable to use a single stage of translation when possible.
Folding two stages into one is simple with a single container, as shown in
the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
fold the full GVA->HVA mapping before sending the VFIO request. With
nested containers however, the IOMMU driver would have to do the folding
work itself. Keeping a copy of stage-2 mapping created on the parent
container, it would fold them into the actual stage-2 page tables when
receiving a map request on the child container (note that software folding
is not possible when stage-1 pgd is managed by the guest, as described in
next section).
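The software-folding step described above can be sketched as a lookup over recorded stage-2 mappings. The `s2_map` record and `fold_map` helper are hypothetical; a real IOMMU driver would keep this in a proper data structure and handle ranges that straddle entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One recorded stage-2 mapping (GPA -> HVA) made on the parent container. */
struct s2_map { uint64_t gpa, hva, size; };

/* Fold a stage-1 map request's target GPA with the recorded stage-2
 * mappings, producing the HVA to hand to VFIO for the combined
 * GVA -> HVA mapping. Returns UINT64_MAX if the GPA isn't backed. */
static uint64_t fold_map(const struct s2_map *s2, size_t n, uint64_t gpa)
{
    for (size_t i = 0; i < n; i++)
        if (gpa >= s2[i].gpa && gpa < s2[i].gpa + s2[i].size)
            return s2[i].hva + (gpa - s2[i].gpa);
    return UINT64_MAX;
}
```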
I don't know if nested VFIO containers are a desirable feature at all. I
find the concept cute on paper, and it would make it easier for userspace
to juggle with address spaces, but it might require some invasive changes
in VFIO, and people have been able to use the current API for IOMMU
virtualization so far.
II. Page table sharing
======================
1. Sharing IOMMU page tables
----------------------------
VIRTIO_IOMMU_F_PT_SHARING
This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.
When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
use the MAP/UNMAP interface and PT_SHARING at the same time.
Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.
(1) Driver attaches devices to address spaces as usual, but a flag
VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
create page tables for use with the MAP/UNMAP API. The driver intends
to manage the address space itself.
(2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
pg_format array.
VIRTIO_IOMMU_T_PROBE_TABLE
struct virtio_iommu_req_probe_table {
le32 address_space;
le32 flags;
le32 len;
le32 nr_contexts;
struct {
le32 model;
u8 format[64];
} pg_format[len];
};
Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.
(3) Device responds success with all page table formats implemented by the
physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
initialize the array to 0 and deduce from there which entries have
been filled by the device.
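The zero-init convention from step (3) gives the driver a simple way to count how many formats the device filled in. A minimal sketch, with a fixed-width `pg_format` mirroring the request above (the helper name is invented):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct pg_format {
    uint32_t model;     /* 0 is invalid, so 0 marks an unfilled entry */
    uint8_t  format[64];
};

/* Driver-side helper: the array was zeroed before PROBE_TABLE; entries
 * the device left at model == 0 mark the end of the valid formats. */
static size_t count_probed_formats(const struct pg_format *fmt, size_t len)
{
    size_t n = 0;

    while (n < len && fmt[n].model != 0)
        n++;
    return n;
}
```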
Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implemented by io-pgtable, for example.)
(4) If the driver is able to use this format, it sends the ATTACH_TABLE
request.
VIRTIO_IOMMU_T_ATTACH_TABLE
struct virtio_iommu_req_attach_table {
le32 address_space;
le32 flags;
le64 table;
le32 nr_contexts;
/* Page-table format description */
le32 model;
u8 config[64]
};
'table' is a pointer to the page directory. 'nr_contexts' isn't used
here.
For both ATTACH and PROBE, 'flags' are the following (detailed below):
VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT (1 << 0)
VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE (1 << 1)
VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT (1 << 2)
Now 'model' is a bit tricky. We need to specify all possible page table
formats and their parameters. I'm not well-versed in x86, s390 or other
IOMMUs, so I'll just focus on the ARM world for this example. We basically
have two formats:
* ARM LPAE
* ARM short descriptor
#define PG_TABLE_ARM 0x1
#define PG_TABLE_X86 0x2
...
And each model would define its own structure. On ARM 'format' could be a
simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
also contain additional capabilities. Then, depending on the variant, the
configuration structure would be:
struct pg_config_v7s {
le32 tcr;
le32 prrr;
le32 nmrr;
le32 asid;
};
struct pg_config_lpae {
le64 tcr;
le64 mair;
le32 asid;
/* And maybe TTB1? */
};
struct pg_config_arm {
le32 variant;
union ...;
};
I am really uneasy with describing all those nasty architectural details
in the virtio-iommu specification. We certainly won't start describing the
content bit-by-bit of tcr or mair here, but just declaring these fields
might be sufficient.
(5) Once the table is attached, the driver can simply write the page
tables and expect the physical IOMMU to observe the mappings without
any additional request. When changing or removing a mapping, however,
the driver must send an invalidate request.
VIRTIO_IOMMU_T_INVALIDATE
struct virtio_iommu_req_invalidate {
le32 address_space;
le32 context;
le32 flags;
le64 virt_addr;
le64 range_size;
u8 opaque[64];
};
VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
from 'context' (context is 0 when !F_INDIRECT).
VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
are ignored.
VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
in the table that changed. The device reads the table again, compares it
to previous values, and invalidates all mappings for contexts that
changed. context, virt_addr and range_size are ignored.
IOMMUs may offer hints and quirks in their invalidation packets. The
opaque structure in invalidate would allow transporting those. This
depends on the page table format and as with architectural page-table
definitions, I really don't want to have those details in the spec itself.
2. Sharing MMU page tables
--------------------------
The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, guest can
send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.
F_FAULT means that the host communicates page requests from device to the
guest, and the guest can handle them by mapping virtual address in the
fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
below.)
F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
pgtable format.
F_INDIRECT means that the 'table' pointer is a context table, instead of a
page directory:
64 2 1 0
table ----> +---------------------+
| pgd |0|1|<--- context 0
| --- |0|0|<--- context 1
| pgd |0|1|
| --- |0|0|
| --- |0|0|
+---------------------+
| \___Entry is valid
|______reserved
Question: do we want per-context page table format, or can it stay global
for the whole indirect table?
Having a context table allows providing multiple address spaces for a
single device. In the simplest form, without F_INDIRECT we have a single
address space per device, but some devices may implement more, for
instance devices with the PCI PASID extension.
A slot's position in the context table gives an ID, between 0 and
nr_contexts. The guest can use this ID to have the device target a
specific address space with DMA. The mechanism to do that is
device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
define a specific way of using them for DMA, it's the device driver's
concern.
3. Fault reporting
------------------
VIRTIO_IOMMU_F_EVENT_QUEUE
With this feature, an event virtqueue (1) is available. For now it will
only be used for fault handling, but I'm calling it eventq so that other
asynchronous features can piggy-back on it. The device may report faults and
page requests by sending buffers via the used ring.
#define VIRTIO_IOMMU_T_FAULT 0x05
struct virtio_iommu_evt_fault {
struct virtio_iommu_evt_head {
u8 type;
u8 reserved[3];
};
u32 address_space;
u32 context;
u64 vaddr;
u32 flags; /* Access details: R/W/X */
/* In the reply: */
u32 reply; /* Fault handled, or failure */
u64 paddr;
};
Driver must send the reply via the request queue, with the fault status
in 'reply', and the mapped page in 'paddr' on success.
Existing fault handling interfaces such as PRI have a tag (PRG) that
identifies a page request (or group thereof) when sending a reply. I
wonder if this would be useful to us, but it seems like the
(address_space, context, vaddr) tuple is sufficient to identify a page
fault, provided the device doesn't send duplicate faults. Duplicate faults
could be required if they have a side effect, for instance implementing a
poor man's doorbell. If this is desirable, we could add a fault_id field.
4. Host implementation with VFIO
--------------------------------
The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementations will most likely let
the guest manage full context tables (PASID tables) themselves, giving the
context table pointer to the pIOMMU via a VFIO ioctl.
For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow passing single page directories via VFIO, using very similar
structures as described above, whilst reusing the VFIO channel developed
for Intel vIOMMU.
* VFIO_SVM_INFO: probe page table formats
* VFIO_SVM_BIND: set pgd and arch-specific configuration
There is an inconvenience in letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).
III. Relaxed operations
=======================
VIRTIO_IOMMU_F_RELAXED
Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex:
(1) If [start:end] isn't mapped, the request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
unmap [max(start, old_start):min(end, old_end)] and replace it with
[start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
request exactly (same flags, same phys address), the old mapping is
kept.
This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.
In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduces the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].
This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.
We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. Driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.
IV. Misc
========
I think we have enough to go on for a while. To improve MAP throughput, I
considered adding a MAP_SG request depending on a feature bit, with
struct virtio_iommu_req_map_sg {
struct virtio_iommu_req_head;
u32 address_space;
u32 nr_elems;
u64 virt_addr;
u64 size;
u64 phys_addr[nr_elems];
};
virt_addr -> phys_addr[0]
virt_addr + size -> phys_addr[1]
virt_addr + 2 * size -> phys_addr[2]
...
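The MAP_SG layout above expands mechanically into individual (virt, phys) pairs. A minimal sketch of that expansion (the function name is invented, and a real implementation would build virtio requests rather than fill arrays):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand a MAP_SG request into individual (virt, phys) pairs: element i
 * maps [virt_addr + i * size, +size) to phys_addr[i], as laid out above. */
static size_t expand_map_sg(uint64_t virt_addr, uint64_t size,
                            const uint64_t *phys_addr, size_t nr_elems,
                            uint64_t *virt_out, uint64_t *phys_out)
{
    for (size_t i = 0; i < nr_elems; i++) {
        virt_out[i] = virt_addr + i * size;
        phys_out[i] = phys_addr[i];
    }
    return nr_elems;
}
```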
This would avoid the overhead of multiple map commands. We could try to
find a more cunning format to compress virtually-contiguous mappings with
different (phys, size) pairs as well. But Linux drivers rarely prefer
map_sg() functions over regular map(), so I don't know if the whole map_sg
feature is worth the effort. All we would gain is a few bytes anyway.
My current map_sg implementation in the virtio-iommu driver adds a batch
of map requests to the queue and kicks the host once. That might be enough
of an optimization.
Another invasive optimization would be adding grouped requests. By adding
two flags in the header, L and G, we can group sequences of requests
together, and have one status at the end, either 0 if all requests in the
group succeeded, or the status of the first request that failed. This is
all in-order. Requests in a group follow each other; there is no sequence
identifier.
___ L: request is last in the group
/ _ G: request is part of a group
| /
v v
31 9 8 7 0
+--------------------------------+ <------- RO descriptor
| res0 |0|1| type |
+--------------------------------+
| payload |
+--------------------------------+
| res0 |0|1| type |
+--------------------------------+
| payload |
+--------------------------------+
| res0 |0|1| type |
+--------------------------------+
| payload |
+--------------------------------+
| res0 |1|1| type |
+--------------------------------+
| payload |
+--------------------------------+ <------- WO descriptor
| res0 | status |
+--------------------------------+
This adds some complexity on the device, since it must unroll whatever was
done by successful requests in a group as soon as one fails, and reject
all subsequent ones. A group of requests is an atomic operation. As with
map_sg, this change mostly allows saving space and virtio descriptors.
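The atomic group semantics (unroll on first failure, reject the rest) can be sketched with apply/undo callbacks. The names are hypothetical; the point is the rollback order: every already-applied request is undone, most recent first, and the group reports the first failing status.

```c
#include <assert.h>
#include <stddef.h>

/* One grouped request, with an undo operation for rollback. */
struct greq {
    int  (*apply)(void);
    void (*undo)(void);
};

/* Process a group atomically: stop at the first failure, undo everything
 * already applied, and return the failing status (0 if all succeeded). */
static int process_group(struct greq *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = reqs[i].apply();

        if (ret) {
            while (i--)          /* unroll successful requests */
                reqs[i].undo();
            return ret;          /* status of the first failed request */
        }
    }
    return 0;
}

/* Test helpers tracking the number of applied-but-not-undone requests. */
static int applied;
static int ok_apply(void)  { applied++; return 0; }
static int bad_apply(void) { return -5; }
static void dec_undo(void) { applied--; }
```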
[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
[2] vIOMMU: Efficient IOMMU Emulation
N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster
Jean-Philippe Brucker
2017-04-07 19:23:14 UTC
Permalink
The virtio IOMMU is a para-virtualized device that allows sending IOMMU
requests such as map/unmap over the virtio-mmio transport. This driver
illustrates the initial proposal for virtio-iommu, which you hopefully
received with it. It handles attach, detach, map and unmap requests.

The bulk of the code creates requests and sends them through virtio.
Implementing the IOMMU API is fairly straightforward since the
virtio-iommu MAP/UNMAP interface is almost identical. I threw in a custom
map_sg() function which takes up some space, but is optional. The core
function would send a sequence of map requests, waiting for a reply
between each mapping. The custom map_sg() instead prepares a batch of
requests in the virtio ring and kicks the host once, avoiding a yield to
the host after each map.

It must be applied on top of the probe deferral work for IOMMU, currently
under discussion. That work makes it possible to dissociate early driver detection from
device probing: device-tree or ACPI is parsed early to find which devices
are translated by the IOMMU, but the IOMMU itself cannot be probed until
the core virtio module is loaded.

Enabling DEBUG makes the driver extremely verbose at the moment, but it
should be calmer in future versions.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
drivers/iommu/Kconfig | 11 +
drivers/iommu/Makefile | 1 +
drivers/iommu/virtio-iommu.c | 980 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_iommu.h | 142 ++++++
6 files changed, 1136 insertions(+)
create mode 100644 drivers/iommu/virtio-iommu.c
create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 37e204f3d9be..8cd56ee9a93a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -359,4 +359,15 @@ config MTK_IOMMU_V1

if unsure, say N here.

+config VIRTIO_IOMMU
+ tristate "Virtio IOMMU driver"
+ depends on VIRTIO_MMIO
+ select IOMMU_API
+ select INTERVAL_TREE
+ select ARM_DMA_USE_IOMMU if ARM
+ help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 195f7b997d8e..1199d8475802 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -27,3 +27,4 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index 000000000000..1cf4f57b7817
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,980 @@
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2017 ARM Limited
+ *
+ * Author: Jean-Philippe Brucker <jean-***@arm.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/amba/bus.h>
+#include <linux/delay.h>
+#include <linux/dma-iommu.h>
+#include <linux/freezer.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/of_iommu.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
+#include <linux/wait.h>
+
+#include <uapi/linux/virtio_iommu.h>
+
+struct viommu_dev {
+ struct iommu_device iommu;
+ struct device *dev;
+ struct virtio_device *vdev;
+
+ struct virtqueue *vq;
+ struct list_head pending_requests;
+ /* Serialize anything touching the vq and the request list */
+ spinlock_t vq_lock;
+
+ struct list_head list;
+
+ /* Device configuration */
+ u64 pgsize_bitmap;
+ u64 aperture_start;
+ u64 aperture_end;
+};
+
+struct viommu_mapping {
+ phys_addr_t paddr;
+ struct interval_tree_node iova;
+};
+
+struct viommu_domain {
+ struct iommu_domain domain;
+ struct viommu_dev *viommu;
+ struct mutex mutex;
+ u64 id;
+
+ spinlock_t mappings_lock;
+ struct rb_root mappings;
+
+ /* Number of devices attached to this domain */
+ unsigned long attached;
+};
+
+struct viommu_endpoint {
+ struct viommu_dev *viommu;
+ struct viommu_domain *vdomain;
+};
+
+struct viommu_request {
+ struct scatterlist head;
+ struct scatterlist tail;
+
+ int written;
+ struct list_head list;
+};
+
+/* TODO: use an IDA */
+static atomic64_t viommu_domain_ids_gen;
+
+#define to_viommu_domain(domain) container_of(domain, struct viommu_domain, domain)
+
+/* Virtio transport */
+
+static int viommu_status_to_errno(u8 status)
+{
+ switch (status) {
+ case VIRTIO_IOMMU_S_OK:
+ return 0;
+ case VIRTIO_IOMMU_S_UNSUPP:
+ return -ENOSYS;
+ case VIRTIO_IOMMU_S_INVAL:
+ return -EINVAL;
+ case VIRTIO_IOMMU_S_RANGE:
+ return -ERANGE;
+ case VIRTIO_IOMMU_S_NOENT:
+ return -ENOENT;
+ case VIRTIO_IOMMU_S_FAULT:
+ return -EFAULT;
+ case VIRTIO_IOMMU_S_IOERR:
+ case VIRTIO_IOMMU_S_DEVERR:
+ default:
+ return -EIO;
+ }
+}
+
+static int viommu_get_req_size(struct virtio_iommu_req_head *req, size_t *head,
+ size_t *tail)
+{
+ size_t size;
+ union virtio_iommu_req r;
+
+ *tail = sizeof(struct virtio_iommu_req_tail);
+
+ switch (req->type) {
+ case VIRTIO_IOMMU_T_ATTACH:
+ size = sizeof(r.attach);
+ break;
+ case VIRTIO_IOMMU_T_DETACH:
+ size = sizeof(r.detach);
+ break;
+ case VIRTIO_IOMMU_T_MAP:
+ size = sizeof(r.map);
+ break;
+ case VIRTIO_IOMMU_T_UNMAP:
+ size = sizeof(r.unmap);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ *head = size - *tail;
+ return 0;
+}
+
+static int viommu_receive_resp(struct viommu_dev *viommu, int nr_expected)
+{
+
+ unsigned int len;
+ int nr_received = 0;
+ struct viommu_request *req, *pending, *next;
+
+ pending = list_first_entry_or_null(&viommu->pending_requests,
+ struct viommu_request, list);
+ if (WARN_ON(!pending))
+ return 0;
+
+ while ((req = virtqueue_get_buf(viommu->vq, &len)) != NULL) {
+ if (req != pending) {
+ dev_warn(viommu->dev, "discarding stale request\n");
+ continue;
+ }
+
+ pending->written = len;
+
+ if (++nr_received == nr_expected) {
+ list_del(&pending->list);
+ /*
+ * In an ideal world, we'd wake up the waiter for this
+ * group of requests here. But everything is painfully
+ * synchronous, so the waiter is the caller.
+ */
+ break;
+ }
+
+ next = list_next_entry(pending, list);
+ list_del(&pending->list);
+
+ if (WARN_ON(list_empty(&viommu->pending_requests)))
+ return 0;
+
+ pending = next;
+ }
+
+ return nr_received;
+}
+
+/* Must be called with vq_lock held */
+static int _viommu_send_reqs_sync(struct viommu_dev *viommu,
+ struct viommu_request *req, int nr,
+ int *nr_sent)
+{
+ int i, ret;
+ ktime_t timeout;
+ int nr_received = 0;
+ struct scatterlist *sg[2];
+ /*
+ * FIXME: as it stands, 1s timeout per request. This is a deliberate
+ * exaggeration because I have no idea how real our ktime is. Are we
+ * using a RTC? Are we aware of steal time? I don't know much about
+ * this, need to do some digging.
+ */
+ unsigned long timeout_ms = 1000;
+
+ *nr_sent = 0;
+
+ for (i = 0; i < nr; i++, req++) {
+ /*
+ * The backend will allocate one indirect descriptor for each
+ * request, which allows doubling the ring consumption, but
+ * might be slower.
+ */
+ req->written = 0;
+
+ sg[0] = &req->head;
+ sg[1] = &req->tail;
+
+ ret = virtqueue_add_sgs(viommu->vq, sg, 1, 1, req,
+ GFP_ATOMIC);
+ if (ret)
+ break;
+
+ list_add_tail(&req->list, &viommu->pending_requests);
+ }
+
+ if (i && !virtqueue_kick(viommu->vq))
+ return -EPIPE;
+
+ /*
+ * Absolutely no wiggle room here. We're not allowed to sleep as callers
+ * might be holding spinlocks, so we have to poll like savages until
+ * something appears. Hopefully the host already handled the request
+ * during the above kick and returned it to us.
+ *
+ * A nice improvement would be for the caller to tell us if we can sleep
+ * whilst mapping, but this has to go through the IOMMU/DMA API.
+ */
+ timeout = ktime_add_ms(ktime_get(), timeout_ms * i);
+ while (nr_received < i && ktime_before(ktime_get(), timeout)) {
+ nr_received += viommu_receive_resp(viommu, i - nr_received);
+ if (nr_received < i) {
+ /*
+ * FIXME: what's a good way to yield to host? A second
+ * virtqueue_kick won't have any effect since we haven't
+ * added any descriptor.
+ */
+ udelay(10);
+ }
+ }
+ dev_dbg(viommu->dev, "request took %lld us\n",
+ ktime_us_delta(ktime_get(), ktime_sub_ms(timeout, timeout_ms * i)));
+
+ if (nr_received != i)
+ ret = -ETIMEDOUT;
+
+ if (ret == -ENOSPC && nr_received)
+ /*
+ * We've freed some space since virtio told us that the ring is
+ * full, tell the caller to come back later (after releasing the
+ * lock first, to be fair to other threads)
+ */
+ ret = -EAGAIN;
+
+ *nr_sent = nr_received;
+
+ return ret;
+}
+
+/**
+ * viommu_send_reqs_sync - add a batch of requests, kick the host and wait for
+ * them to return
+ *
+ * @req: array of requests
+ * @nr: size of the array
+ * @nr_sent: contains the number of requests actually sent after this function
+ * returns
+ *
+ * Return 0 on success, or an error if we failed to send some of the requests.
+ */
+static int viommu_send_reqs_sync(struct viommu_dev *viommu,
+ struct viommu_request *req, int nr,
+ int *nr_sent)
+{
+ int ret;
+ int sent = 0;
+ unsigned long flags;
+
+ *nr_sent = 0;
+ do {
+ spin_lock_irqsave(&viommu->vq_lock, flags);
+ ret = _viommu_send_reqs_sync(viommu, req, nr, &sent);
+ spin_unlock_irqrestore(&viommu->vq_lock, flags);
+
+ *nr_sent += sent;
+ req += sent;
+ nr -= sent;
+ } while (ret == -EAGAIN);
+
+ return ret;
+}
+
+/**
+ * viommu_send_req_sync - send one request and wait for reply
+ *
+ * @head_ptr: pointer to a virtio_iommu_req_* structure
+ *
+ * Returns 0 if the request was successful, or an error number otherwise. No
+ * distinction is made between transport and request errors.
+ */
+static int viommu_send_req_sync(struct viommu_dev *viommu, void *head_ptr)
+{
+ int ret;
+ int nr_sent;
+ struct viommu_request req;
+ size_t head_size, tail_size;
+ struct virtio_iommu_req_tail *tail;
+ struct virtio_iommu_req_head *head = head_ptr;
+
+ ret = viommu_get_req_size(head, &head_size, &tail_size);
+ if (ret)
+ return ret;
+
+ dev_dbg(viommu->dev, "Sending request 0x%x, %zu bytes\n", head->type,
+ head_size + tail_size);
+
+ tail = head_ptr + head_size;
+
+ sg_init_one(&req.head, head, head_size);
+ sg_init_one(&req.tail, tail, tail_size);
+
+ ret = viommu_send_reqs_sync(viommu, &req, 1, &nr_sent);
+ if (ret || !req.written || nr_sent != 1) {
+ dev_err(viommu->dev, "failed to send command\n");
+ return -EIO;
+ }
+
+ ret = viommu_status_to_errno(tail->status);
+
+ if (ret)
+ dev_dbg(viommu->dev, " completed with %d\n", ret);
+
+ return ret;
+}
+
+static int viommu_tlb_map(struct viommu_domain *vdomain, unsigned long iova,
+ phys_addr_t paddr, size_t size)
+{
+ unsigned long flags;
+ struct viommu_mapping *mapping;
+
+ mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+ if (!mapping)
+ return -ENOMEM;
+
+ mapping->paddr = paddr;
+ mapping->iova.start = iova;
+ mapping->iova.last = iova + size - 1;
+
+ spin_lock_irqsave(&vdomain->mappings_lock, flags);
+ interval_tree_insert(&mapping->iova, &vdomain->mappings);
+ spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+ return 0;
+}
+
+static size_t viommu_tlb_unmap(struct viommu_domain *vdomain,
+ unsigned long iova, size_t size)
+{
+ size_t unmapped = 0;
+ unsigned long flags;
+ unsigned long last = iova + size - 1;
+ struct viommu_mapping *mapping = NULL;
+ struct interval_tree_node *node, *next;
+
+ spin_lock_irqsave(&vdomain->mappings_lock, flags);
+ next = interval_tree_iter_first(&vdomain->mappings, iova, last);
+ while (next) {
+ node = next;
+ mapping = container_of(node, struct viommu_mapping, iova);
+
+ next = interval_tree_iter_next(node, iova, last);
+
+ /*
+ * Note that for a partial range, this will return the full
+ * mapping so we avoid sending split requests to the device.
+ */
+ unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+ interval_tree_remove(node, &vdomain->mappings);
+ kfree(mapping);
+ }
+ spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+ return unmapped;
+}
+
+/* IOMMU API */
+
+static bool viommu_capable(enum iommu_cap cap)
+{
+ return false; /* :( */
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type)
+{
+ struct viommu_domain *vdomain;
+
+ if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
+ return NULL;
+
+ vdomain = kzalloc(sizeof(struct viommu_domain), GFP_KERNEL);
+ if (!vdomain)
+ return NULL;
+
+ vdomain->id = atomic64_inc_return_relaxed(&viommu_domain_ids_gen);
+
+ mutex_init(&vdomain->mutex);
+ spin_lock_init(&vdomain->mappings_lock);
+ vdomain->mappings = RB_ROOT;
+
+ pr_debug("alloc domain of type %d -> %llu\n", type, vdomain->id);
+
+ if (type == IOMMU_DOMAIN_DMA &&
+ iommu_get_dma_cookie(&vdomain->domain)) {
+ kfree(vdomain);
+ return NULL;
+ }
+
+ return &vdomain->domain;
+}
+
+static void viommu_domain_free(struct iommu_domain *domain)
+{
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+ pr_debug("free domain %llu\n", vdomain->id);
+
+ iommu_put_dma_cookie(domain);
+
+ /* Free all remaining mappings (size 2^64) */
+ viommu_tlb_unmap(vdomain, 0, 0);
+
+ kfree(vdomain);
+}
+
+static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
+{
+ int i;
+ int ret = 0;
+ struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+ struct viommu_endpoint *vdev = fwspec->iommu_priv;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_attach req = {
+ .head.type = VIRTIO_IOMMU_T_ATTACH,
+ .address_space = cpu_to_le32(vdomain->id),
+ };
+
+ mutex_lock(&vdomain->mutex);
+ if (!vdomain->viommu) {
+ struct viommu_dev *viommu = vdev->viommu;
+
+ vdomain->viommu = viommu;
+
+ domain->pgsize_bitmap = viommu->pgsize_bitmap;
+ domain->geometry.aperture_start = viommu->aperture_start;
+ domain->geometry.aperture_end = viommu->aperture_end;
+ domain->geometry.force_aperture = true;
+
+ } else if (vdomain->viommu != vdev->viommu) {
+ dev_err(dev, "cannot attach to foreign VIOMMU\n");
+ ret = -EXDEV;
+ }
+ mutex_unlock(&vdomain->mutex);
+
+ if (ret)
+ return ret;
+
+ /*
+ * When attaching the device to a new domain, it will be detached from
+ * the old one and, if as a result the old domain isn't attached to
+ * any device, all mappings are removed from the old domain and it is
+ * freed. (Note that we can't use get_domain_for_dev here, it returns
+ * the default domain during initial attach.)
+ *
+ * Take note of the device disappearing, so we can ignore unmap request
+ * on stale domains (that is, between this detach and the upcoming
+ * free.)
+ *
+ * vdev->vdomain is protected by group->mutex
+ */
+ if (vdev->vdomain) {
+ dev_dbg(dev, "detach from domain %llu\n", vdev->vdomain->id);
+ vdev->vdomain->attached--;
+ }
+
+ dev_dbg(dev, "attach to domain %llu\n", vdomain->id);
+
+ for (i = 0; i < fwspec->num_ids; i++) {
+ req.device = cpu_to_le32(fwspec->ids[i]);
+
+ ret = viommu_send_req_sync(vdomain->viommu, &req);
+ if (ret)
+ break;
+ }
+
+ vdomain->attached++;
+ vdev->vdomain = vdomain;
+
+ return ret;
+}
+
+static int viommu_map(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot)
+{
+ int ret;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_map req = {
+ .head.type = VIRTIO_IOMMU_T_MAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .virt_addr = cpu_to_le64(iova),
+ .phys_addr = cpu_to_le64(paddr),
+ .size = cpu_to_le64(size),
+ };
+
+ pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
+ paddr, size);
+
+ if (!vdomain->attached)
+ return -ENODEV;
+
+ if (prot & IOMMU_READ)
+ req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+ if (prot & IOMMU_WRITE)
+ req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+ ret = viommu_tlb_map(vdomain, iova, paddr, size);
+ if (ret)
+ return ret;
+
+ ret = viommu_send_req_sync(vdomain->viommu, &req);
+ if (ret)
+ viommu_tlb_unmap(vdomain, iova, size);
+
+ return ret;
+}
+
+static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
+ size_t size)
+{
+ int ret;
+ size_t unmapped;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_unmap req = {
+ .head.type = VIRTIO_IOMMU_T_UNMAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .virt_addr = cpu_to_le64(iova),
+ };
+
+ pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);
+
+ /* Callers may unmap after detach, but device already took care of it. */
+ if (!vdomain->attached)
+ return size;
+
+ unmapped = viommu_tlb_unmap(vdomain, iova, size);
+ if (unmapped < size)
+ return 0;
+
+ req.size = cpu_to_le64(unmapped);
+
+ ret = viommu_send_req_sync(vdomain->viommu, &req);
+ if (ret)
+ return 0;
+
+ return unmapped;
+}
+
+static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+ struct scatterlist *sg, unsigned int nents, int prot)
+{
+ int i, ret = 0;
+ int nr_sent;
+ size_t mapped;
+ size_t min_pagesz;
+ size_t total_size;
+ struct scatterlist *s;
+ unsigned int flags = 0;
+ unsigned long cur_iova;
+ unsigned long mapped_iova;
+ size_t head_size, tail_size;
+ struct viommu_request reqs[nents];
+ struct virtio_iommu_req_map map_reqs[nents];
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+ if (!vdomain->attached)
+ return 0;
+
+ pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);
+
+ if (prot & IOMMU_READ)
+ flags |= VIRTIO_IOMMU_MAP_F_READ;
+
+ if (prot & IOMMU_WRITE)
+ flags |= VIRTIO_IOMMU_MAP_F_WRITE;
+
+ min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+ tail_size = sizeof(struct virtio_iommu_req_tail);
+ head_size = sizeof(*map_reqs) - tail_size;
+
+ cur_iova = iova;
+
+ for_each_sg(sg, s, nents, i) {
+ size_t size = s->length;
+ phys_addr_t paddr = sg_phys(s);
+ void *tail = (void *)&map_reqs[i] + head_size;
+
+ if (!IS_ALIGNED(paddr | size, min_pagesz)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ /* TODO: merge physically-contiguous mappings if any */
+ map_reqs[i] = (struct virtio_iommu_req_map) {
+ .head.type = VIRTIO_IOMMU_T_MAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .flags = cpu_to_le32(flags),
+ .virt_addr = cpu_to_le64(cur_iova),
+ .phys_addr = cpu_to_le64(paddr),
+ .size = cpu_to_le64(size),
+ };
+
+ ret = viommu_tlb_map(vdomain, cur_iova, paddr, size);
+ if (ret)
+ break;
+
+ sg_init_one(&reqs[i].head, &map_reqs[i], head_size);
+ sg_init_one(&reqs[i].tail, tail, tail_size);
+
+ cur_iova += size;
+ }
+
+ total_size = cur_iova - iova;
+
+ if (ret) {
+ viommu_tlb_unmap(vdomain, iova, total_size);
+ return 0;
+ }
+
+ ret = viommu_send_reqs_sync(vdomain->viommu, reqs, i, &nr_sent);
+
+ if (nr_sent != nents)
+ goto err_rollback;
+
+ for (i = 0; i < nents; i++) {
+ if (!reqs[i].written || map_reqs[i].tail.status)
+ goto err_rollback;
+ }
+
+ return total_size;
+
+err_rollback:
+ /*
+ * Any request in the range might have failed. Unmap what was
+ * successful.
+ */
+ cur_iova = iova;
+ mapped_iova = iova;
+ mapped = 0;
+ for_each_sg(sg, s, nents, i) {
+ size_t size = s->length;
+
+ cur_iova += size;
+
+ if (!reqs[i].written || map_reqs[i].tail.status) {
+ if (mapped)
+ viommu_unmap(domain, mapped_iova, mapped);
+
+ mapped_iova = cur_iova;
+ mapped = 0;
+ } else {
+ mapped += size;
+ }
+ }
+
+ viommu_tlb_unmap(vdomain, iova, total_size);
+
+ return 0;
+}
+
+static phys_addr_t viommu_iova_to_phys(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ u64 paddr = 0;
+ unsigned long flags;
+ struct viommu_mapping *mapping;
+ struct interval_tree_node *node;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+ spin_lock_irqsave(&vdomain->mappings_lock, flags);
+ node = interval_tree_iter_first(&vdomain->mappings, iova, iova);
+ if (node) {
+ mapping = container_of(node, struct viommu_mapping, iova);
+ paddr = mapping->paddr + (iova - mapping->iova.start);
+ }
+ spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+ pr_debug("iova_to_phys %llu 0x%llx->0x%llx\n", vdomain->id, iova,
+ paddr);
+
+ return paddr;
+}
+
+static struct iommu_ops viommu_ops;
+static struct virtio_driver virtio_iommu_drv;
+
+static int viommu_match_node(struct device *dev, void *data)
+{
+ return dev->parent->fwnode == data;
+}
+
+static struct viommu_dev *viommu_get_by_fwnode(struct fwnode_handle *fwnode)
+{
+ struct device *dev = driver_find_device(&virtio_iommu_drv.driver, NULL,
+ fwnode, viommu_match_node);
+ put_device(dev);
+
+ return dev ? dev_to_virtio(dev)->priv : NULL;
+}
+
+static int viommu_add_device(struct device *dev)
+{
+ struct iommu_group *group;
+ struct viommu_endpoint *vdev;
+ struct viommu_dev *viommu = NULL;
+ struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+
+ if (!fwspec || fwspec->ops != &viommu_ops)
+ return -ENODEV;
+
+ viommu = viommu_get_by_fwnode(fwspec->iommu_fwnode);
+ if (!viommu)
+ return -ENODEV;
+
+ vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+ if (!vdev)
+ return -ENOMEM;
+
+ vdev->viommu = viommu;
+ fwspec->iommu_priv = vdev;
+
+ /*
+ * Last step creates a default domain and attaches to it. Everything
+ * must be ready.
+ */
+ group = iommu_group_get_for_dev(dev);
+
+ return PTR_ERR_OR_ZERO(group);
+}
+
+static void viommu_remove_device(struct device *dev)
+{
+ kfree(dev->iommu_fwspec->iommu_priv);
+}
+
+static struct iommu_group *
+viommu_device_group(struct device *dev)
+{
+ if (dev_is_pci(dev))
+ return pci_device_group(dev);
+ else
+ return generic_device_group(dev);
+}
+
+static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
+{
+ u32 *id = args->args;
+
+ dev_dbg(dev, "of_xlate 0x%x\n", *id);
+ return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+/*
+ * (Maybe) temporary hack for device pass-through into guest userspace. On ARM
+ * with an ITS, VFIO will look for a region in which to map the doorbell, even
+ * though the virtual doorbell is never written to by the device, and instead
+ * the host injects interrupts directly. TODO: sort this out in VFIO.
+ */
+#define MSI_IOVA_BASE 0x8000000
+#define MSI_IOVA_LENGTH 0x100000
+
+static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
+{
+ struct iommu_resv_region *region;
+ int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+ region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH, prot,
+ IOMMU_RESV_MSI);
+ if (!region)
+ return;
+
+ list_add_tail(&region->list, head);
+}
+
+static void viommu_put_resv_regions(struct device *dev, struct list_head *head)
+{
+ struct iommu_resv_region *entry, *next;
+
+ list_for_each_entry_safe(entry, next, head, list)
+ kfree(entry);
+}
+
+static struct iommu_ops viommu_ops = {
+ .capable = viommu_capable,
+ .domain_alloc = viommu_domain_alloc,
+ .domain_free = viommu_domain_free,
+ .attach_dev = viommu_attach_dev,
+ .map = viommu_map,
+ .unmap = viommu_unmap,
+ .map_sg = viommu_map_sg,
+ .iova_to_phys = viommu_iova_to_phys,
+ .add_device = viommu_add_device,
+ .remove_device = viommu_remove_device,
+ .device_group = viommu_device_group,
+ .of_xlate = viommu_of_xlate,
+ .get_resv_regions = viommu_get_resv_regions,
+ .put_resv_regions = viommu_put_resv_regions,
+};
+
+static int viommu_init_vq(struct viommu_dev *viommu)
+{
+ struct virtio_device *vdev = dev_to_virtio(viommu->dev);
+ vq_callback_t *callback = NULL;
+ const char *name = "request";
+ int ret;
+
+ ret = vdev->config->find_vqs(vdev, 1, &viommu->vq, &callback,
+ &name, NULL);
+ if (ret)
+ dev_err(viommu->dev, "cannot find VQ\n");
+
+ return ret;
+}
+
+static int viommu_probe(struct virtio_device *vdev)
+{
+ struct device *parent_dev = vdev->dev.parent;
+ struct viommu_dev *viommu = NULL;
+ struct device *dev = &vdev->dev;
+ int ret;
+
+ viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
+ if (!viommu)
+ return -ENOMEM;
+
+ spin_lock_init(&viommu->vq_lock);
+ INIT_LIST_HEAD(&viommu->pending_requests);
+ viommu->dev = dev;
+ viommu->vdev = vdev;
+
+ ret = viommu_init_vq(viommu);
+ if (ret)
+ goto err_free_viommu;
+
+ virtio_cread(vdev, struct virtio_iommu_config, page_sizes,
+ &viommu->pgsize_bitmap);
+
+ viommu->aperture_end = -1UL;
+
+ virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+ struct virtio_iommu_config, input_range.start,
+ &viommu->aperture_start);
+
+ virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+ struct virtio_iommu_config, input_range.end,
+ &viommu->aperture_end);
+
+ if (!viommu->pgsize_bitmap) {
+ ret = -EINVAL;
+ goto err_free_viommu;
+ }
+
+ viommu_ops.pgsize_bitmap = viommu->pgsize_bitmap;
+
+ /*
+ * Not strictly necessary, virtio would enable it later. This allows us to
+ * start using the request queue early.
+ */
+ virtio_device_ready(vdev);
+
+ ret = iommu_device_sysfs_add(&viommu->iommu, dev, NULL, "%s",
+ virtio_bus_name(vdev));
+ if (ret)
+ goto err_free_viommu;
+
+ iommu_device_set_ops(&viommu->iommu, &viommu_ops);
+ iommu_device_set_fwnode(&viommu->iommu, parent_dev->fwnode);
+
+ iommu_device_register(&viommu->iommu);
+
+#ifdef CONFIG_PCI
+ if (pci_bus_type.iommu_ops != &viommu_ops) {
+ pci_request_acs();
+ ret = bus_set_iommu(&pci_bus_type, &viommu_ops);
+ if (ret)
+ goto err_unregister;
+ }
+#endif
+#ifdef CONFIG_ARM_AMBA
+ if (amba_bustype.iommu_ops != &viommu_ops) {
+ ret = bus_set_iommu(&amba_bustype, &viommu_ops);
+ if (ret)
+ goto err_unregister;
+ }
+#endif
+ if (platform_bus_type.iommu_ops != &viommu_ops) {
+ ret = bus_set_iommu(&platform_bus_type, &viommu_ops);
+ if (ret)
+ goto err_unregister;
+ }
+
+ vdev->priv = viommu;
+
+ dev_info(viommu->dev, "probe successful\n");
+
+ return 0;
+
+err_unregister:
+ iommu_device_unregister(&viommu->iommu);
+
+err_free_viommu:
+ kfree(viommu);
+
+ return ret;
+}
+
+static void viommu_remove(struct virtio_device *vdev)
+{
+ struct viommu_dev *viommu = vdev->priv;
+
+ iommu_device_unregister(&viommu->iommu);
+ kfree(viommu);
+
+ dev_info(&vdev->dev, "device removed\n");
+}
+
+static void viommu_config_changed(struct virtio_device *vdev)
+{
+ dev_warn(&vdev->dev, "config changed\n");
+}
+
+static unsigned int features[] = {
+ VIRTIO_IOMMU_F_INPUT_RANGE,
+};
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_IOMMU, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+static struct virtio_driver virtio_iommu_drv = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .feature_table = features,
+ .feature_table_size = ARRAY_SIZE(features),
+ .probe = viommu_probe,
+ .remove = viommu_remove,
+ .config_changed = viommu_config_changed,
+};
+
+module_virtio_driver(virtio_iommu_drv);
+
+IOMMU_OF_DECLARE(viommu, "virtio,mmio", NULL);
+
+MODULE_DESCRIPTION("virtio-iommu driver");
+MODULE_AUTHOR("Jean-Philippe Brucker <jean-***@arm.com>");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 1f25c86374ad..c0cb0f173258 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -467,6 +467,7 @@ header-y += virtio_console.h
header-y += virtio_gpu.h
header-y += virtio_ids.h
header-y += virtio_input.h
+header-y += virtio_iommu.h
header-y += virtio_mmio.h
header-y += virtio_net.h
header-y += virtio_pci.h
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..934ed3d3cd3f 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU 61216 /* virtio IOMMU (temporary) */

#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
new file mode 100644
index 000000000000..ec74c9a727d4
--- /dev/null
+++ b/include/uapi/linux/virtio_iommu.h
@@ -0,0 +1,142 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _UAPI_LINUX_VIRTIO_IOMMU_H
+#define _UAPI_LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE 0
+#define VIRTIO_IOMMU_F_IOASID_BITS 1
+#define VIRTIO_IOMMU_F_MAP_UNMAP 2
+#define VIRTIO_IOMMU_F_BYPASS 3
+
+__packed
+struct virtio_iommu_config {
+ /* Supported page sizes */
+ __u64 page_sizes;
+ struct virtio_iommu_range {
+ __u64 start;
+ __u64 end;
+ } input_range;
+ __u8 ioasid_bits;
+};
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH 0x01
+#define VIRTIO_IOMMU_T_DETACH 0x02
+#define VIRTIO_IOMMU_T_MAP 0x03
+#define VIRTIO_IOMMU_T_UNMAP 0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK 0x00
+#define VIRTIO_IOMMU_S_IOERR 0x01
+#define VIRTIO_IOMMU_S_UNSUPP 0x02
+#define VIRTIO_IOMMU_S_DEVERR 0x03
+#define VIRTIO_IOMMU_S_INVAL 0x04
+#define VIRTIO_IOMMU_S_RANGE 0x05
+#define VIRTIO_IOMMU_S_NOENT 0x06
+#define VIRTIO_IOMMU_S_FAULT 0x07
+
+__packed
+struct virtio_iommu_req_head {
+ __u8 type;
+ __u8 reserved[3];
+};
+
+__packed
+struct virtio_iommu_req_tail {
+ __u8 status;
+ __u8 reserved[3];
+};
+
+__packed
+struct virtio_iommu_req_attach {
+ struct virtio_iommu_req_head head;
+
+ __le32 address_space;
+ __le32 device;
+ __le32 reserved;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+__packed
+struct virtio_iommu_req_detach {
+ struct virtio_iommu_req_head head;
+
+ __le32 device;
+ __le32 reserved;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+#define VIRTIO_IOMMU_MAP_F_READ (1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK (VIRTIO_IOMMU_MAP_F_READ | \
+ VIRTIO_IOMMU_MAP_F_WRITE | \
+ VIRTIO_IOMMU_MAP_F_EXEC)
+
+__packed
+struct virtio_iommu_req_map {
+ struct virtio_iommu_req_head head;
+
+ __le32 address_space;
+ __le32 flags;
+ __le64 virt_addr;
+ __le64 phys_addr;
+ __le64 size;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+__packed
+struct virtio_iommu_req_unmap {
+ struct virtio_iommu_req_head head;
+
+ __le32 address_space;
+ __le32 flags;
+ __le64 virt_addr;
+ __le64 size;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+union virtio_iommu_req {
+ struct virtio_iommu_req_head head;
+
+ struct virtio_iommu_req_attach attach;
+ struct virtio_iommu_req_detach detach;
+ struct virtio_iommu_req_map map;
+ struct virtio_iommu_req_unmap unmap;
+};
+
+#endif
--
2.12.1
Bharat Bhushan
2017-06-16 08:48:00 UTC
Permalink
Hi Jean
-----Original Message-----
open.org] On Behalf Of Jean-Philippe Brucker
Sent: Saturday, April 08, 2017 12:53 AM
Subject: [virtio-dev] [RFC PATCH linux] iommu: Add virtio-iommu driver
The virtio IOMMU is a para-virtualized device that allows IOMMU requests
such as map/unmap to be sent over the virtio-mmio transport. This driver
should illustrate the initial proposal for virtio-iommu, which you hopefully
received with it. It handles attach, detach, map and unmap requests.
The bulk of the code is to create requests and send them through virtio.
Implementing the IOMMU API is fairly straightforward since the virtio-iommu
MAP/UNMAP interface is almost identical. I threw in a custom
map_sg() function which takes up some space, but is optional. The core
function would send a sequence of map requests, waiting for a reply
between each mapping. This optimization avoids yielding to the host after
each map, and instead prepares a batch of requests in the virtio ring and
kicks the host once.
It must be applied on top of the probe deferral work for IOMMU, currently
under discussion. This allows dissociating early driver detection from device
probing: device-tree or ACPI is parsed early to find which devices are
translated by the IOMMU, but the IOMMU itself cannot be probed until the
core virtio module is loaded.
Enabling DEBUG makes it extremely verbose at the moment, but it should be
calmer in next versions.
---
drivers/iommu/Kconfig | 11 +
drivers/iommu/Makefile | 1 +
drivers/iommu/virtio-iommu.c | 980
++++++++++++++++++++++++++++++++++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_iommu.h | 142 ++++++
6 files changed, 1136 insertions(+)
create mode 100644 drivers/iommu/virtio-iommu.c
create mode 100644 include/uapi/linux/virtio_iommu.h
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 37e204f3d9be..8cd56ee9a93a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -359,4 +359,15 @@ config MTK_IOMMU_V1
if unsure, say N here.
+config VIRTIO_IOMMU
+ tristate "Virtio IOMMU driver"
+ depends on VIRTIO_MMIO
+ select IOMMU_API
+ select INTERVAL_TREE
+ select ARM_DMA_USE_IOMMU if ARM
+ help
+ Para-virtualised IOMMU driver with virtio.
+
+ Say Y here if you intend to run this kernel as a guest.
+
endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 195f7b997d8e..1199d8475802 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -27,3 +27,4 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index 000000000000..1cf4f57b7817
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,980 @@
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2017 ARM Limited
+ *
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/amba/bus.h>
+#include <linux/delay.h>
+#include <linux/dma-iommu.h>
+#include <linux/freezer.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/of_iommu.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
+#include <linux/wait.h>
+
+#include <uapi/linux/virtio_iommu.h>
+
+struct viommu_dev {
+ struct iommu_device iommu;
+ struct device *dev;
+ struct virtio_device *vdev;
+
+ struct virtqueue *vq;
+ struct list_head pending_requests;
+ /* Serialize anything touching the vq and the request list */
+ spinlock_t vq_lock;
+
+ struct list_head list;
+
+ /* Device configuration */
+ u64 pgsize_bitmap;
+ u64 aperture_start;
+ u64 aperture_end;
+};
+
+struct viommu_mapping {
+ phys_addr_t paddr;
+ struct interval_tree_node iova;
+};
+
+struct viommu_domain {
+ struct iommu_domain domain;
+ struct viommu_dev *viommu;
+ struct mutex mutex;
+ u64 id;
+
+ spinlock_t mappings_lock;
+ struct rb_root mappings;
+
+ /* Number of devices attached to this domain */
+ unsigned long attached;
+};
+
+struct viommu_endpoint {
+ struct viommu_dev *viommu;
+ struct viommu_domain *vdomain;
+};
+
+struct viommu_request {
+ struct scatterlist head;
+ struct scatterlist tail;
+
+ int written;
+ struct list_head list;
+};
+
+/* TODO: use an IDA */
+static atomic64_t viommu_domain_ids_gen;
+
+#define to_viommu_domain(domain) container_of(domain, struct
+viommu_domain, domain)
+
+/* Virtio transport */
+
+static int viommu_status_to_errno(u8 status)
+{
+ switch (status) {
+ case VIRTIO_IOMMU_S_OK:
+ return 0;
+ case VIRTIO_IOMMU_S_UNSUPP:
+ return -ENOSYS;
+ case VIRTIO_IOMMU_S_INVAL:
+ return -EINVAL;
+ case VIRTIO_IOMMU_S_RANGE:
+ return -ERANGE;
+ case VIRTIO_IOMMU_S_NOENT:
+ return -ENOENT;
+ case VIRTIO_IOMMU_S_FAULT:
+ return -EFAULT;
+ case VIRTIO_IOMMU_S_IOERR:
+ case VIRTIO_IOMMU_S_DEVERR:
+ default:
+ return -EIO;
+ }
+}
+
+static int viommu_get_req_size(struct virtio_iommu_req_head *req, size_t *head,
+ size_t *tail)
+{
+ size_t size;
+ union virtio_iommu_req r;
+
+ *tail = sizeof(struct virtio_iommu_req_tail);
+
+ switch (req->type) {
+ case VIRTIO_IOMMU_T_ATTACH:
+ size = sizeof(r.attach);
+ break;
+ case VIRTIO_IOMMU_T_DETACH:
+ size = sizeof(r.detach);
+ break;
+ case VIRTIO_IOMMU_T_MAP:
+ size = sizeof(r.map);
+ break;
+ case VIRTIO_IOMMU_T_UNMAP:
+ size = sizeof(r.unmap);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ *head = size - *tail;
+ return 0;
+}
+
+static int viommu_receive_resp(struct viommu_dev *viommu, int nr_expected)
+{
+
+ unsigned int len;
+ int nr_received = 0;
+ struct viommu_request *req, *pending, *next;
+
+ pending = list_first_entry_or_null(&viommu->pending_requests,
+ struct viommu_request, list);
+ if (WARN_ON(!pending))
+ return 0;
+
+ while ((req = virtqueue_get_buf(viommu->vq, &len)) != NULL) {
+ if (req != pending) {
+ dev_warn(viommu->dev, "discarding stale request\n");
+ continue;
+ }
+
+ pending->written = len;
+
+ if (++nr_received == nr_expected) {
+ list_del(&pending->list);
+ /*
+ * In an ideal world, we'd wake up the waiter for this
+ * group of requests here. But everything is painfully
+ * synchronous, so waiter is the caller.
+ */
+ break;
+ }
+
+ next = list_next_entry(pending, list);
+ list_del(&pending->list);
+
+ if (WARN_ON(list_empty(&viommu->pending_requests)))
+ return 0;
+
+ pending = next;
+ }
+
+ return nr_received;
+}
+
+/* Must be called with vq_lock held */
+static int _viommu_send_reqs_sync(struct viommu_dev *viommu,
+ struct viommu_request *req, int nr,
+ int *nr_sent)
+{
+ int i, ret;
+ ktime_t timeout;
+ int nr_received = 0;
+ struct scatterlist *sg[2];
+ /*
+ * FIXME: as it stands, 1s timeout per request. This is a voluntary
+ * exaggeration because I have no idea how real our ktime is. Are we
+ * using a RTC? Are we aware of steal time? I don't know much about
+ * this, need to do some digging.
+ */
+ unsigned long timeout_ms = 1000;
+
+ *nr_sent = 0;
+
+ for (i = 0; i < nr; i++, req++) {
+ /*
+ * The backend will allocate one indirect descriptor for each
+ * request, which allows the ring to hold twice as many
+ * requests, but might be slower.
+ */
+ req->written = 0;
+
+ sg[0] = &req->head;
+ sg[1] = &req->tail;
+
+ ret = virtqueue_add_sgs(viommu->vq, sg, 1, 1, req,
+ GFP_ATOMIC);
+ if (ret)
+ break;
+
+ list_add_tail(&req->list, &viommu->pending_requests);
+ }
+
+ if (i && !virtqueue_kick(viommu->vq))
+ return -EPIPE;
+
+ /*
+ * Absolutely no wiggle room here. We're not allowed to sleep as callers
+ * might be holding spinlocks, so we have to poll like savages until
+ * something appears. Hopefully the host already handled the request
+ * during the above kick and returned it to us.
+ *
+ * A nice improvement would be for the caller to tell us if we can sleep
+ * whilst mapping, but this has to go through the IOMMU/DMA API.
+ */
+ timeout = ktime_add_ms(ktime_get(), timeout_ms * i);
+ while (nr_received < i && ktime_before(ktime_get(), timeout)) {
+ nr_received += viommu_receive_resp(viommu, i - nr_received);
+ if (nr_received < i) {
+ /*
+ * FIXME: what's a good way to yield to host? A second
+ * virtqueue_kick won't have any effect since we haven't
+ * added any descriptor.
+ */
+ udelay(10);
+ }
+ }
+ dev_dbg(viommu->dev, "request took %lld us\n",
+ ktime_us_delta(ktime_get(), ktime_sub_ms(timeout, timeout_ms * i)));
+
+ if (nr_received != i)
+ ret = -ETIMEDOUT;
+
+ if (ret == -ENOSPC && nr_received)
+ /*
+ * We've freed some space since virtio told us that the ring is
+ * full, tell the caller to come back later (after releasing the
+ * lock first, to be fair to other threads)
+ */
+ ret = -EAGAIN;
+
+ *nr_sent = nr_received;
+
+ return ret;
+}
+
+/**
+ * viommu_send_reqs_sync - add a batch of requests, kick the host and wait for
+ * them to return
+ *
+ * Return 0 on success, or an error if we failed to send some of the requests.
+ */
+static int viommu_send_reqs_sync(struct viommu_dev *viommu,
+ struct viommu_request *req, int nr,
+ int *nr_sent)
+{
+ int ret;
+ int sent = 0;
+ unsigned long flags;
+
+ *nr_sent = 0;
+ do {
+ spin_lock_irqsave(&viommu->vq_lock, flags);
+ ret = _viommu_send_reqs_sync(viommu, req, nr, &sent);
+ spin_unlock_irqrestore(&viommu->vq_lock, flags);
+
+ *nr_sent += sent;
+ req += sent;
+ nr -= sent;
+ } while (ret == -EAGAIN);
+
+ return ret;
+}
+
+/**
+ * viommu_send_req_sync - send one request and wait for reply
+ *
+ *
+ * Returns 0 if the request was successful, or an error number otherwise. No
+ * distinction is done between transport and request errors.
+ */
+static int viommu_send_req_sync(struct viommu_dev *viommu, void *head_ptr)
+{
+ int ret;
+ int nr_sent;
+ struct viommu_request req;
+ size_t head_size, tail_size;
+ struct virtio_iommu_req_tail *tail;
+ struct virtio_iommu_req_head *head = head_ptr;
+
+ ret = viommu_get_req_size(head, &head_size, &tail_size);
+ if (ret)
+ return ret;
+
+ dev_dbg(viommu->dev, "Sending request 0x%x, %zu bytes\n", head->type,
+ head_size + tail_size);
+
+ tail = head_ptr + head_size;
+
+ sg_init_one(&req.head, head, head_size);
+ sg_init_one(&req.tail, tail, tail_size);
+
+ ret = viommu_send_reqs_sync(viommu, &req, 1, &nr_sent);
+ if (ret || !req.written || nr_sent != 1) {
+ dev_err(viommu->dev, "failed to send command\n");
+ return -EIO;
+ }
+
+ ret = viommu_status_to_errno(tail->status);
+
+ if (ret)
+ dev_dbg(viommu->dev, " completed with %d\n", ret);
+
+ return ret;
+}
+
+static int viommu_tlb_map(struct viommu_domain *vdomain, unsigned long iova,
+ phys_addr_t paddr, size_t size)
+{
+ unsigned long flags;
+ struct viommu_mapping *mapping;
+
+ mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+ if (!mapping)
+ return -ENOMEM;
+
+ mapping->paddr = paddr;
+ mapping->iova.start = iova;
+ mapping->iova.last = iova + size - 1;
+
+ spin_lock_irqsave(&vdomain->mappings_lock, flags);
+ interval_tree_insert(&mapping->iova, &vdomain->mappings);
+ spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+ return 0;
+}
+
+static size_t viommu_tlb_unmap(struct viommu_domain *vdomain,
+ unsigned long iova, size_t size)
+{
+ size_t unmapped = 0;
+ unsigned long flags;
+ unsigned long last = iova + size - 1;
+ struct viommu_mapping *mapping = NULL;
+ struct interval_tree_node *node, *next;
+
+ spin_lock_irqsave(&vdomain->mappings_lock, flags);
+ next = interval_tree_iter_first(&vdomain->mappings, iova, last);
+ while (next) {
+ node = next;
+ mapping = container_of(node, struct viommu_mapping, iova);
+
+ next = interval_tree_iter_next(node, iova, last);
+
+ /*
+ * Note that for a partial range, this will return the full
+ * mapping so we avoid sending split requests to the device.
+ */
+ unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+ interval_tree_remove(node, &vdomain->mappings);
+ kfree(mapping);
+ }
+ spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+ return unmapped;
+}
+
+/* IOMMU API */
+
+static bool viommu_capable(enum iommu_cap cap) {
+ return false; /* :( */
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type) {
+ struct viommu_domain *vdomain;
+
+ if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
+ return NULL;
+
+ vdomain = kzalloc(sizeof(struct viommu_domain), GFP_KERNEL);
+ if (!vdomain)
+ return NULL;
+
+ vdomain->id = atomic64_inc_return_relaxed(&viommu_domain_ids_gen);
+
+ mutex_init(&vdomain->mutex);
+ spin_lock_init(&vdomain->mappings_lock);
+ vdomain->mappings = RB_ROOT;
+
+ pr_debug("alloc domain of type %d -> %llu\n", type, vdomain->id);
+
+ if (type == IOMMU_DOMAIN_DMA &&
+ iommu_get_dma_cookie(&vdomain->domain)) {
+ kfree(vdomain);
+ return NULL;
+ }
+
+ return &vdomain->domain;
+}
+
+static void viommu_domain_free(struct iommu_domain *domain) {
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+ pr_debug("free domain %llu\n", vdomain->id);
+
+ iommu_put_dma_cookie(domain);
+
+ /* Free all remaining mappings (size 2^64) */
+ viommu_tlb_unmap(vdomain, 0, 0);
+
+ kfree(vdomain);
+}
+
+static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
+{
+ int i;
+ int ret = 0;
+ struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+ struct viommu_endpoint *vdev = fwspec->iommu_priv;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_attach req = {
+ .head.type = VIRTIO_IOMMU_T_ATTACH,
+ .address_space = cpu_to_le32(vdomain->id),
+ };
+
+ mutex_lock(&vdomain->mutex);
+ if (!vdomain->viommu) {
+ struct viommu_dev *viommu = vdev->viommu;
+
+ vdomain->viommu = viommu;
+
+ domain->pgsize_bitmap = viommu->pgsize_bitmap;
+ domain->geometry.aperture_start = viommu->aperture_start;
+ domain->geometry.aperture_end = viommu->aperture_end;
+ domain->geometry.force_aperture = true;
+
+ } else if (vdomain->viommu != vdev->viommu) {
+ dev_err(dev, "cannot attach to foreign VIOMMU\n");
+ ret = -EXDEV;
+ }
+ mutex_unlock(&vdomain->mutex);
+
+ if (ret)
+ return ret;
+
+ /*
+ * When attaching the device to a new domain, it will be detached from
+ * the old one and, if as a result the old domain isn't attached to
+ * any device, all mappings are removed from the old domain and it is
+ * freed. (Note that we can't use get_domain_for_dev here, it returns
+ * the default domain during initial attach.)
+ *
+ * Take note of the device disappearing, so we can ignore unmap request
+ * on stale domains (that is, between this detach and the upcoming
+ * free.)
+ *
+ * vdev->vdomain is protected by group->mutex
+ */
+ if (vdev->vdomain) {
+ dev_dbg(dev, "detach from domain %llu\n", vdev->vdomain->id);
+ vdev->vdomain->attached--;
+ }
+
+ dev_dbg(dev, "attach to domain %llu\n", vdomain->id);
+
+ for (i = 0; i < fwspec->num_ids; i++) {
+ req.device = cpu_to_le32(fwspec->ids[i]);
+
+ ret = viommu_send_req_sync(vdomain->viommu, &req);
+ if (ret)
+ break;
+ }
+
+ vdomain->attached++;
+ vdev->vdomain = vdomain;
+
+ return ret;
+}
+
+static int viommu_map(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot)
+{
+ int ret;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_map req = {
+ .head.type = VIRTIO_IOMMU_T_MAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .virt_addr = cpu_to_le64(iova),
+ .phys_addr = cpu_to_le64(paddr),
+ .size = cpu_to_le64(size),
+ };
+
+ pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
+ paddr, size);
A query: when tracing the above prints, I see the same physical address mapped to two different virtual addresses. Do you know why the kernel does this?

Thanks
-Bharat
+
+ if (!vdomain->attached)
+ return -ENODEV;
+
+ if (prot & IOMMU_READ)
+ req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+ if (prot & IOMMU_WRITE)
+ req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+ ret = viommu_tlb_map(vdomain, iova, paddr, size);
+ if (ret)
+ return ret;
+
+ ret = viommu_send_req_sync(vdomain->viommu, &req);
+ if (ret)
+ viommu_tlb_unmap(vdomain, iova, size);
+
+ return ret;
+}
+
+static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
+ size_t size)
+{
+ int ret;
+ size_t unmapped;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_unmap req = {
+ .head.type = VIRTIO_IOMMU_T_UNMAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .virt_addr = cpu_to_le64(iova),
+ };
+
+ pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);
+
+ /* Callers may unmap after detach, but device already took care of it. */
+ if (!vdomain->attached)
+ return size;
+
+ unmapped = viommu_tlb_unmap(vdomain, iova, size);
+ if (unmapped < size)
+ return 0;
+
+ req.size = cpu_to_le64(unmapped);
+
+ ret = viommu_send_req_sync(vdomain->viommu, &req);
+ if (ret)
+ return 0;
+
+ return unmapped;
+}
+
+static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+ struct scatterlist *sg, unsigned int nents, int prot)
+{
+ int i, ret;
+ int nr_sent;
+ size_t mapped;
+ size_t min_pagesz;
+ size_t total_size;
+ struct scatterlist *s;
+ unsigned int flags = 0;
+ unsigned long cur_iova;
+ unsigned long mapped_iova;
+ size_t head_size, tail_size;
+ struct viommu_request reqs[nents];
+ struct virtio_iommu_req_map map_reqs[nents];
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+ if (!vdomain->attached)
+ return 0;
+
+ pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);
+
+ if (prot & IOMMU_READ)
+ flags |= VIRTIO_IOMMU_MAP_F_READ;
+
+ if (prot & IOMMU_WRITE)
+ flags |= VIRTIO_IOMMU_MAP_F_WRITE;
+
+ min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+ tail_size = sizeof(struct virtio_iommu_req_tail);
+ head_size = sizeof(*map_reqs) - tail_size;
+
+ cur_iova = iova;
+
+ for_each_sg(sg, s, nents, i) {
+ size_t size = s->length;
+ phys_addr_t paddr = sg_phys(s);
+ void *tail = (void *)&map_reqs[i] + head_size;
+
+ if (!IS_ALIGNED(paddr | size, min_pagesz)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ /* TODO: merge physically-contiguous mappings if any */
+ map_reqs[i] = (struct virtio_iommu_req_map) {
+ .head.type = VIRTIO_IOMMU_T_MAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .flags = cpu_to_le32(flags),
+ .virt_addr = cpu_to_le64(cur_iova),
+ .phys_addr = cpu_to_le64(paddr),
+ .size = cpu_to_le64(size),
+ };
+
+ ret = viommu_tlb_map(vdomain, cur_iova, paddr, size);
+ if (ret)
+ break;
+
+ sg_init_one(&reqs[i].head, &map_reqs[i], head_size);
+ sg_init_one(&reqs[i].tail, tail, tail_size);
+
+ cur_iova += size;
+ }
+
+ total_size = cur_iova - iova;
+
+ if (ret) {
+ viommu_tlb_unmap(vdomain, iova, total_size);
+ return 0;
+ }
+
+ ret = viommu_send_reqs_sync(vdomain->viommu, reqs, i, &nr_sent);
+
+ if (nr_sent != nents)
+ goto err_rollback;
+
+ for (i = 0; i < nents; i++) {
+ if (!reqs[i].written || map_reqs[i].tail.status)
+ goto err_rollback;
+ }
+
+ return total_size;
+
+err_rollback:
+ /*
+ * Any request in the range might have failed. Unmap what was
+ * successful.
+ */
+ cur_iova = iova;
+ mapped_iova = iova;
+ mapped = 0;
+ for_each_sg(sg, s, nents, i) {
+ size_t size = s->length;
+
+ cur_iova += size;
+
+ if (!reqs[i].written || map_reqs[i].tail.status) {
+ if (mapped)
+ viommu_unmap(domain, mapped_iova, mapped);
+
+ mapped_iova = cur_iova;
+ mapped = 0;
+ } else {
+ mapped += size;
+ }
+ }
+
+ viommu_tlb_unmap(vdomain, iova, total_size);
+
+ return 0;
+}
+
+static phys_addr_t viommu_iova_to_phys(struct iommu_domain *domain,
+ dma_addr_t iova)
+{
+ u64 paddr = 0;
+ unsigned long flags;
+ struct viommu_mapping *mapping;
+ struct interval_tree_node *node;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+ spin_lock_irqsave(&vdomain->mappings_lock, flags);
+ node = interval_tree_iter_first(&vdomain->mappings, iova, iova);
+ if (node) {
+ mapping = container_of(node, struct viommu_mapping, iova);
+ paddr = mapping->paddr + (iova - mapping->iova.start);
+ }
+ spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+ pr_debug("iova_to_phys %llu 0x%llx->0x%llx\n", vdomain->id, iova,
+ paddr);
+
+ return paddr;
+}
+
+static struct iommu_ops viommu_ops;
+static struct virtio_driver virtio_iommu_drv;
+
+static int viommu_match_node(struct device *dev, void *data) {
+ return dev->parent->fwnode == data;
+}
+
+static struct viommu_dev *viommu_get_by_fwnode(struct fwnode_handle *fwnode)
+{
+ struct device *dev = driver_find_device(&virtio_iommu_drv.driver, NULL,
+ fwnode, viommu_match_node);
+ put_device(dev);
+
+ return dev ? dev_to_virtio(dev)->priv : NULL;
+}
+
+static int viommu_add_device(struct device *dev) {
+ struct iommu_group *group;
+ struct viommu_endpoint *vdev;
+ struct viommu_dev *viommu = NULL;
+ struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+
+ if (!fwspec || fwspec->ops != &viommu_ops)
+ return -ENODEV;
+
+ viommu = viommu_get_by_fwnode(fwspec->iommu_fwnode);
+ if (!viommu)
+ return -ENODEV;
+
+ vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+ if (!vdev)
+ return -ENOMEM;
+
+ vdev->viommu = viommu;
+ fwspec->iommu_priv = vdev;
+
+ /*
+ * Last step creates a default domain and attaches to it. Everything
+ * must be ready.
+ */
+ group = iommu_group_get_for_dev(dev);
+
+ return PTR_ERR_OR_ZERO(group);
+}
+
+static void viommu_remove_device(struct device *dev) {
+ kfree(dev->iommu_fwspec->iommu_priv);
+}
+
+static struct iommu_group *
+viommu_device_group(struct device *dev) {
+ if (dev_is_pci(dev))
+ return pci_device_group(dev);
+ else
+ return generic_device_group(dev);
+}
+
+static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
+{
+ u32 *id = args->args;
+
+ dev_dbg(dev, "of_xlate 0x%x\n", *id);
+ return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+/*
+ * (Maybe) temporary hack for device pass-through into guest userspace. On ARM
+ * with an ITS, VFIO will look for a region where to map the doorbell, even
+ * though the virtual doorbell is never written to by the device, and instead
+ * the host injects interrupts directly. TODO: sort this out in VFIO.
+ */
+#define MSI_IOVA_BASE 0x8000000
+#define MSI_IOVA_LENGTH 0x100000
+
+static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
+{
+ struct iommu_resv_region *region;
+ int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+ region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH, prot,
+ IOMMU_RESV_MSI);
+ if (!region)
+ return;
+
+ list_add_tail(&region->list, head);
+}
+
+static void viommu_put_resv_regions(struct device *dev, struct list_head *head)
+{
+ struct iommu_resv_region *entry, *next;
+
+ list_for_each_entry_safe(entry, next, head, list)
+ kfree(entry);
+}
+
+static struct iommu_ops viommu_ops = {
+ .capable = viommu_capable,
+ .domain_alloc = viommu_domain_alloc,
+ .domain_free = viommu_domain_free,
+ .attach_dev = viommu_attach_dev,
+ .map = viommu_map,
+ .unmap = viommu_unmap,
+ .map_sg = viommu_map_sg,
+ .iova_to_phys = viommu_iova_to_phys,
+ .add_device = viommu_add_device,
+ .remove_device = viommu_remove_device,
+ .device_group = viommu_device_group,
+ .of_xlate = viommu_of_xlate,
+ .get_resv_regions = viommu_get_resv_regions,
+ .put_resv_regions = viommu_put_resv_regions,
+};
+
+static int viommu_init_vq(struct viommu_dev *viommu) {
+ struct virtio_device *vdev = dev_to_virtio(viommu->dev);
+ vq_callback_t *callback = NULL;
+ const char *name = "request";
+ int ret;
+
+ ret = vdev->config->find_vqs(vdev, 1, &viommu->vq, &callback,
+ &name, NULL);
+ if (ret)
+ dev_err(viommu->dev, "cannot find VQ\n");
+
+ return ret;
+}
+
+static int viommu_probe(struct virtio_device *vdev) {
+ struct device *parent_dev = vdev->dev.parent;
+ struct viommu_dev *viommu = NULL;
+ struct device *dev = &vdev->dev;
+ int ret;
+
+ viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
+ if (!viommu)
+ return -ENOMEM;
+
+ spin_lock_init(&viommu->vq_lock);
+ INIT_LIST_HEAD(&viommu->pending_requests);
+ viommu->dev = dev;
+ viommu->vdev = vdev;
+
+ ret = viommu_init_vq(viommu);
+ if (ret)
+ goto err_free_viommu;
+
+ virtio_cread(vdev, struct virtio_iommu_config, page_sizes,
+ &viommu->pgsize_bitmap);
+
+ viommu->aperture_end = -1UL;
+
+ virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+ struct virtio_iommu_config, input_range.start,
+ &viommu->aperture_start);
+
+ virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+ struct virtio_iommu_config, input_range.end,
+ &viommu->aperture_end);
+
+ if (!viommu->pgsize_bitmap) {
+ ret = -EINVAL;
+ goto err_free_viommu;
+ }
+
+ viommu_ops.pgsize_bitmap = viommu->pgsize_bitmap;
+
+ /*
+ * Not strictly necessary, virtio would enable it later. This allows the
+ * request queue to be used early.
+ */
+ virtio_device_ready(vdev);
+
+ ret = iommu_device_sysfs_add(&viommu->iommu, dev, NULL, "%s",
+ virtio_bus_name(vdev));
+ if (ret)
+ goto err_free_viommu;
+
+ iommu_device_set_ops(&viommu->iommu, &viommu_ops);
+ iommu_device_set_fwnode(&viommu->iommu, parent_dev->fwnode);
+
+ iommu_device_register(&viommu->iommu);
+
+#ifdef CONFIG_PCI
+ if (pci_bus_type.iommu_ops != &viommu_ops) {
+ pci_request_acs();
+ ret = bus_set_iommu(&pci_bus_type, &viommu_ops);
+ if (ret)
+ goto err_unregister;
+ }
+#endif
+#ifdef CONFIG_ARM_AMBA
+ if (amba_bustype.iommu_ops != &viommu_ops) {
+ ret = bus_set_iommu(&amba_bustype, &viommu_ops);
+ if (ret)
+ goto err_unregister;
+ }
+#endif
+ if (platform_bus_type.iommu_ops != &viommu_ops) {
+ ret = bus_set_iommu(&platform_bus_type, &viommu_ops);
+ if (ret)
+ goto err_unregister;
+ }
+
+ vdev->priv = viommu;
+
+ dev_info(viommu->dev, "probe successful\n");
+
+ return 0;
+
+err_unregister:
+ iommu_device_unregister(&viommu->iommu);
+err_free_viommu:
+ kfree(viommu);
+
+ return ret;
+}
+
+static void viommu_remove(struct virtio_device *vdev) {
+ struct viommu_dev *viommu = vdev->priv;
+
+ iommu_device_unregister(&viommu->iommu);
+ kfree(viommu);
+
+ dev_info(&vdev->dev, "device removed\n");
+}
+
+static void viommu_config_changed(struct virtio_device *vdev) {
+ dev_warn(&vdev->dev, "config changed\n");
+}
+
+static unsigned int features[] = {
+ VIRTIO_IOMMU_F_INPUT_RANGE,
+};
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_IOMMU, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+
+static struct virtio_driver virtio_iommu_drv = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .feature_table = features,
+ .feature_table_size = ARRAY_SIZE(features),
+ .probe = viommu_probe,
+ .remove = viommu_remove,
+ .config_changed = viommu_config_changed,
+};
+
+module_virtio_driver(virtio_iommu_drv);
+
+IOMMU_OF_DECLARE(viommu, "virtio,mmio", NULL);
+
+MODULE_DESCRIPTION("virtio-iommu driver");
+MODULE_AUTHOR("Jean-Philippe Brucker");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 1f25c86374ad..c0cb0f173258 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -467,6 +467,7 @@ header-y += virtio_console.h
header-y += virtio_gpu.h
header-y += virtio_ids.h
header-y += virtio_input.h
+header-y += virtio_iommu.h
header-y += virtio_mmio.h
header-y += virtio_net.h
header-y += virtio_pci.h
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..934ed3d3cd3f 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU 61216 /* virtio IOMMU (temporary) */
#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
new file mode 100644
index 000000000000..ec74c9a727d4
--- /dev/null
+++ b/include/uapi/linux/virtio_iommu.h
@@ -0,0 +1,142 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _UAPI_LINUX_VIRTIO_IOMMU_H
+#define _UAPI_LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE 0
+#define VIRTIO_IOMMU_F_IOASID_BITS 1
+#define VIRTIO_IOMMU_F_MAP_UNMAP 2
+#define VIRTIO_IOMMU_F_BYPASS 3
+
+__packed
+struct virtio_iommu_config {
+ /* Supported page sizes */
+ __u64 page_sizes;
+ struct virtio_iommu_range {
+ __u64 start;
+ __u64 end;
+ } input_range;
+ __u8 ioasid_bits;
+};
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH 0x01
+#define VIRTIO_IOMMU_T_DETACH 0x02
+#define VIRTIO_IOMMU_T_MAP 0x03
+#define VIRTIO_IOMMU_T_UNMAP 0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK 0x00
+#define VIRTIO_IOMMU_S_IOERR 0x01
+#define VIRTIO_IOMMU_S_UNSUPP 0x02
+#define VIRTIO_IOMMU_S_DEVERR 0x03
+#define VIRTIO_IOMMU_S_INVAL 0x04
+#define VIRTIO_IOMMU_S_RANGE 0x05
+#define VIRTIO_IOMMU_S_NOENT 0x06
+#define VIRTIO_IOMMU_S_FAULT 0x07
+
+__packed
+struct virtio_iommu_req_head {
+ __u8 type;
+ __u8 reserved[3];
+};
+
+__packed
+struct virtio_iommu_req_tail {
+ __u8 status;
+ __u8 reserved[3];
+};
+
+__packed
+struct virtio_iommu_req_attach {
+ struct virtio_iommu_req_head head;
+
+ __le32 address_space;
+ __le32 device;
+ __le32 reserved;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+__packed
+struct virtio_iommu_req_detach {
+ struct virtio_iommu_req_head head;
+
+ __le32 device;
+ __le32 reserved;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+#define VIRTIO_IOMMU_MAP_F_READ (1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK (VIRTIO_IOMMU_MAP_F_READ | \
+ VIRTIO_IOMMU_MAP_F_WRITE | \
+ VIRTIO_IOMMU_MAP_F_EXEC)
+
+__packed
+struct virtio_iommu_req_map {
+ struct virtio_iommu_req_head head;
+
+ __le32 address_space;
+ __le32 flags;
+ __le64 virt_addr;
+ __le64 phys_addr;
+ __le64 size;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+__packed
+struct virtio_iommu_req_unmap {
+ struct virtio_iommu_req_head head;
+
+ __le32 address_space;
+ __le32 flags;
+ __le64 virt_addr;
+ __le64 size;
+
+ struct virtio_iommu_req_tail tail;
+};
+
+union virtio_iommu_req {
+ struct virtio_iommu_req_head head;
+
+ struct virtio_iommu_req_attach attach;
+ struct virtio_iommu_req_detach detach;
+ struct virtio_iommu_req_map map;
+ struct virtio_iommu_req_unmap unmap;
+};
+
+#endif
--
2.12.1
---------------------------------------------------------------------
Jean-Philippe Brucker
2017-06-16 11:36:53 UTC
Permalink
Post by Bharat Bhushan
Hi Jean
Post by Jean-Philippe Brucker
+static int viommu_map(struct iommu_domain *domain, unsigned long iova,
+		      phys_addr_t paddr, size_t size, int prot)
+{
+ int ret;
+ struct viommu_domain *vdomain = to_viommu_domain(domain);
+ struct virtio_iommu_req_map req = {
+ .head.type = VIRTIO_IOMMU_T_MAP,
+ .address_space = cpu_to_le32(vdomain->id),
+ .virt_addr = cpu_to_le64(iova),
+ .phys_addr = cpu_to_le64(paddr),
+ .size = cpu_to_le64(size),
+ };
+
+ pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
+ paddr, size);
A query: when tracing the prints above, I see the same physical address mapped at two different virtual addresses. Do you know why the kernel does this?
That really depends on which driver is calling into viommu. iommu_map is
called from the DMA API, which can be used by any device driver. Within
an address space, multiple IOVAs pointing to the same PA aren't forbidden.

For example, looking at MAP requests for a virtio-net device, I get the
following trace:

ioas[1] map 0xfffffff3000 -> 0x8faa0000 (4096)
ioas[1] map 0xfffffff2000 -> 0x8faa0000 (4096)
ioas[1] map 0xfffffff1000 -> 0x8faa0000 (4096)
ioas[1] map 0xfffffff0000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffef000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffee000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffed000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffec000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffeb000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffea000 -> 0x8faa0000 (4096)
ioas[1] map 0xffffffe8000 -> 0x8faa0000 (8192)
...

During initialization, the virtio-net driver primes the rx queue with
receive buffers, which the host will then fill with network packets. It
calls virtqueue_add_inbuf_ctx to create descriptors on the rx virtqueue
for each buffer. Each buffer is 0x180 bytes here, so one 4k page can
contain around 10 of them (11, in fact, with the last one crossing a page
boundary).

I guess the call trace goes like this:
virtnet_open
try_fill_recv
add_recvbuf_mergeable
virtqueue_add_inbuf_ctx
vring_map_one_sg
dma_map_page
__iommu_dma_map

But the IOMMU cannot map fragments of pages, since the granule is 0x1000.
Therefore, when virtqueue_add_inbuf_ctx maps the buffer, __iommu_dma_map
aligns the address and size to full pages. Someone motivated could probably
optimize this by caching mapped pages and reusing IOVAs, but currently
that's how it goes.

Thanks,
Jean

Jean-Philippe Brucker
2017-04-07 19:24:40 UTC
Permalink
Implement a virtio-iommu device and translate DMA traffic from vfio and virtio
devices. Virtio needed some rework to support scatter-gather accesses to vring
and buffers at page granularity. Patch 3 implements the actual virtio-iommu
device.

Adding --viommu on the command-line now inserts a virtual IOMMU in front
of all virtio and vfio devices:

$ lkvm run -k Image --console virtio -p console=hvc0 \
--viommu --vfio 0 --vfio 4 --irqchip gicv3-its
...
[ 2.998949] virtio_iommu virtio0: probe successful
[ 3.007739] virtio_iommu virtio1: probe successful
...
[ 3.165023] iommu: Adding device 0000:00:00.0 to group 0
[ 3.536480] iommu: Adding device 10200.virtio to group 1
[ 3.553643] iommu: Adding device 10600.virtio to group 2
[ 3.570687] iommu: Adding device 10800.virtio to group 3
[ 3.627425] iommu: Adding device 10a00.virtio to group 4
[ 7.823689] iommu: Adding device 0000:00:01.0 to group 5
...

Patches 13 and 14 add debug facilities. Some statistics are gathered for each
address space and can be queried via the debug builtin:

$ lkvm debug -n guest-1210 --iommu stats
iommu 0 "viommu-vfio"
kicks 1255
requests 1256
ioas 1
maps 7
unmaps 4
resident 2101248
ioas 6
maps 623
unmaps 620
resident 16384
iommu 1 "viommu-virtio"
kicks 11426
requests 11431
ioas 2
maps 2836
unmaps 2835
resident 8192
accesses 2836
...

This is based on the VFIO patchset[1], itself based on Andre's ITS work.
The VFIO bits have only been tested on a software model and are unlikely
to work on actual hardware, but I also tested virtio on an ARM Juno.

[1] http://www.spinics.net/lists/kvm/msg147624.html

Jean-Philippe Brucker (15):
virtio: synchronize virtio-iommu headers with Linux
FDT: (re)introduce a dynamic phandle allocator
virtio: add virtio-iommu
Add a simple IOMMU
iommu: describe IOMMU topology in device-trees
irq: register MSI doorbell addresses
virtio: factor virtqueue initialization
virtio: add vIOMMU instance for virtio devices
virtio: access vring and buffers through IOMMU mappings
virtio-pci: translate MSIs with the virtual IOMMU
virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
vfio: add support for virtual IOMMU
virtio-iommu: debug via IPC
virtio-iommu: implement basic debug commands
virtio: use virtio-iommu when available

Makefile | 3 +
arm/gic.c | 4 +
arm/include/arm-common/fdt-arch.h | 2 +-
arm/pci.c | 49 ++-
builtin-debug.c | 8 +-
builtin-run.c | 2 +
fdt.c | 35 ++
include/kvm/builtin-debug.h | 6 +
include/kvm/devices.h | 4 +
include/kvm/fdt.h | 20 +
include/kvm/iommu.h | 105 +++++
include/kvm/irq.h | 3 +
include/kvm/kvm-config.h | 1 +
include/kvm/vfio.h | 2 +
include/kvm/virtio-iommu.h | 15 +
include/kvm/virtio-mmio.h | 1 +
include/kvm/virtio-pci.h | 2 +
include/kvm/virtio.h | 137 +++++-
include/linux/virtio_config.h | 74 ++++
include/linux/virtio_ids.h | 4 +
include/linux/virtio_iommu.h | 142 ++++++
iommu.c | 240 ++++++++++
irq.c | 35 ++
kvm-ipc.c | 43 +-
mips/include/kvm/fdt-arch.h | 2 +-
powerpc/include/kvm/fdt-arch.h | 2 +-
vfio.c | 281 +++++++++++-
virtio/9p.c | 7 +-
virtio/balloon.c | 7 +-
virtio/blk.c | 10 +-
virtio/console.c | 7 +-
virtio/core.c | 240 ++++++++--
virtio/iommu.c | 902 ++++++++++++++++++++++++++++++++++++++
virtio/mmio.c | 44 +-
virtio/net.c | 8 +-
virtio/pci.c | 61 ++-
virtio/rng.c | 6 +-
virtio/scsi.c | 6 +-
x86/include/kvm/fdt-arch.h | 2 +-
39 files changed, 2389 insertions(+), 133 deletions(-)
create mode 100644 fdt.c
create mode 100644 include/kvm/iommu.h
create mode 100644 include/kvm/virtio-iommu.h
create mode 100644 include/linux/virtio_config.h
create mode 100644 include/linux/virtio_iommu.h
create mode 100644 iommu.c
create mode 100644 virtio/iommu.c
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:41 UTC
Permalink
Pull virtio-iommu header (initial proposal) from Linux. Also add
virtio_config.h because it defines VIRTIO_F_IOMMU_PLATFORM, which I'm
going to need soon, and it's not provided by my toolchain.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/linux/virtio_config.h | 74 ++++++++++++++++++++++
include/linux/virtio_ids.h | 4 ++
include/linux/virtio_iommu.h | 142 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 220 insertions(+)
create mode 100644 include/linux/virtio_config.h
create mode 100644 include/linux/virtio_iommu.h

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
new file mode 100644
index 00000000..648b688f
--- /dev/null
+++ b/include/linux/virtio_config.h
@@ -0,0 +1,74 @@
+#ifndef _LINUX_VIRTIO_CONFIG_H
+#define _LINUX_VIRTIO_CONFIG_H
+/* This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE. */
+
+/* Virtio devices use a standardized configuration space to define their
+ * features and pass configuration information, but each implementation can
+ * store and access that space differently. */
+#include <linux/types.h>
+
+/* Status byte for guest to report progress, and synchronize features. */
+/* We have seen device and processed generic fields (VIRTIO_CONFIG_F_VIRTIO) */
+#define VIRTIO_CONFIG_S_ACKNOWLEDGE 1
+/* We have found a driver for the device. */
+#define VIRTIO_CONFIG_S_DRIVER 2
+/* Driver has used its parts of the config, and is happy */
+#define VIRTIO_CONFIG_S_DRIVER_OK 4
+/* Driver has finished configuring features */
+#define VIRTIO_CONFIG_S_FEATURES_OK 8
+/* Device entered invalid state, driver must reset it */
+#define VIRTIO_CONFIG_S_NEEDS_RESET 0x40
+/* We've given up on this device. */
+#define VIRTIO_CONFIG_S_FAILED 0x80
+
+/* Some virtio feature bits (currently bits 28 through 32) are reserved for the
+ * transport being used (eg. virtio_ring), the rest are per-device feature
+ * bits. */
+#define VIRTIO_TRANSPORT_F_START 28
+#define VIRTIO_TRANSPORT_F_END 34
+
+#ifndef VIRTIO_CONFIG_NO_LEGACY
+/* Do we get callbacks when the ring is completely used, even if we've
+ * suppressed them? */
+#define VIRTIO_F_NOTIFY_ON_EMPTY 24
+
+/* Can the device handle any descriptor layout? */
+#define VIRTIO_F_ANY_LAYOUT 27
+#endif /* VIRTIO_CONFIG_NO_LEGACY */
+
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1 32
+
+/*
+ * If clear - device has the IOMMU bypass quirk feature.
+ * If set - use platform tools to detect the IOMMU.
+ *
+ * Note the reverse polarity (compared to most other features),
+ * this is for compatibility with legacy systems.
+ */
+#define VIRTIO_F_IOMMU_PLATFORM 33
+#endif /* _LINUX_VIRTIO_CONFIG_H */
diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
index 5f60aa4b..934ed3d3 100644
--- a/include/linux/virtio_ids.h
+++ b/include/linux/virtio_ids.h
@@ -39,6 +39,10 @@
#define VIRTIO_ID_9P 9 /* 9p virtio console */
#define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
#define VIRTIO_ID_CAIF 12 /* Virtio caif */
+#define VIRTIO_ID_GPU 16 /* virtio GPU */
#define VIRTIO_ID_INPUT 18 /* virtio input */
+#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
+#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU 61216 /* virtio IOMMU (temporary) */

#endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/linux/virtio_iommu.h b/include/linux/virtio_iommu.h
new file mode 100644
index 00000000..beb21d44
--- /dev/null
+++ b/include/linux/virtio_iommu.h
@@ -0,0 +1,142 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _LINUX_VIRTIO_IOMMU_H
+#define _LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE 0
+#define VIRTIO_IOMMU_F_IOASID_BITS 1
+#define VIRTIO_IOMMU_F_MAP_UNMAP 2
+#define VIRTIO_IOMMU_F_BYPASS 3
+
+struct virtio_iommu_config {
+	/* Supported page sizes */
+	__u64 page_sizes;
+	struct virtio_iommu_range {
+		__u64 start;
+		__u64 end;
+	} input_range;
+	__u8 ioasid_bits;
+} __attribute__((packed));
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH 0x01
+#define VIRTIO_IOMMU_T_DETACH 0x02
+#define VIRTIO_IOMMU_T_MAP 0x03
+#define VIRTIO_IOMMU_T_UNMAP 0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK 0x00
+#define VIRTIO_IOMMU_S_IOERR 0x01
+#define VIRTIO_IOMMU_S_UNSUPP 0x02
+#define VIRTIO_IOMMU_S_DEVERR 0x03
+#define VIRTIO_IOMMU_S_INVAL 0x04
+#define VIRTIO_IOMMU_S_RANGE 0x05
+#define VIRTIO_IOMMU_S_NOENT 0x06
+#define VIRTIO_IOMMU_S_FAULT 0x07
+
+struct virtio_iommu_req_head {
+	__u8 type;
+	__u8 reserved[3];
+} __attribute__((packed));
+
+struct virtio_iommu_req_tail {
+	__u8 status;
+	__u8 reserved[3];
+} __attribute__((packed));
+
+struct virtio_iommu_req_attach {
+	struct virtio_iommu_req_head head;
+
+	__le32 address_space;
+	__le32 device;
+	__le32 reserved;
+
+	struct virtio_iommu_req_tail tail;
+} __attribute__((packed));
+
+struct virtio_iommu_req_detach {
+	struct virtio_iommu_req_head head;
+
+	__le32 device;
+	__le32 reserved;
+
+	struct virtio_iommu_req_tail tail;
+} __attribute__((packed));
+
+#define VIRTIO_IOMMU_MAP_F_READ (1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK (VIRTIO_IOMMU_MAP_F_READ | \
+ VIRTIO_IOMMU_MAP_F_WRITE | \
+ VIRTIO_IOMMU_MAP_F_EXEC)
+
+struct virtio_iommu_req_map {
+	struct virtio_iommu_req_head head;
+
+	__le32 address_space;
+	__le32 flags;
+	__le64 virt_addr;
+	__le64 phys_addr;
+	__le64 size;
+
+	struct virtio_iommu_req_tail tail;
+} __attribute__((packed));
+
+struct virtio_iommu_req_unmap {
+	struct virtio_iommu_req_head head;
+
+	__le32 address_space;
+	__le32 flags;
+	__le64 virt_addr;
+	__le64 size;
+
+	struct virtio_iommu_req_tail tail;
+} __attribute__((packed));
+
+union virtio_iommu_req {
+ struct virtio_iommu_req_head head;
+
+ struct virtio_iommu_req_attach attach;
+ struct virtio_iommu_req_detach detach;
+ struct virtio_iommu_req_map map;
+ struct virtio_iommu_req_unmap unmap;
+};
+
+#endif
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:42 UTC
Permalink
The phandle allocator was removed because static values were sufficient
for creating a common irqchip. Now that multiple virtual IOMMUs are added
to the device-tree, phandles need to be allocated dynamically. Add a
simple allocator that returns values above the static ones.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
Makefile | 1 +
arm/include/arm-common/fdt-arch.h | 2 +-
fdt.c | 15 +++++++++++++++
include/kvm/fdt.h | 13 +++++++++++++
mips/include/kvm/fdt-arch.h | 2 +-
powerpc/include/kvm/fdt-arch.h | 2 +-
x86/include/kvm/fdt-arch.h | 2 +-
7 files changed, 33 insertions(+), 4 deletions(-)
create mode 100644 fdt.c

diff --git a/Makefile b/Makefile
index 6d5f5d9d..3e21c597 100644
--- a/Makefile
+++ b/Makefile
@@ -303,6 +303,7 @@ ifeq (y,$(ARCH_WANT_LIBFDT))
CFLAGS_STATOPT += -DCONFIG_HAS_LIBFDT
LIBS_DYNOPT += -lfdt
LIBS_STATOPT += -lfdt
+ OBJS += fdt.o
endif
endif

diff --git a/arm/include/arm-common/fdt-arch.h b/arm/include/arm-common/fdt-arch.h
index 60c2d406..ed4ff3d4 100644
--- a/arm/include/arm-common/fdt-arch.h
+++ b/arm/include/arm-common/fdt-arch.h
@@ -1,6 +1,6 @@
#ifndef ARM__FDT_H
#define ARM__FDT_H

-enum phandles {PHANDLE_RESERVED = 0, PHANDLE_GIC, PHANDLE_MSI, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, PHANDLE_GIC, PHANDLE_MSI, ARCH_PHANDLES_MAX};

#endif /* ARM__FDT_H */
diff --git a/fdt.c b/fdt.c
new file mode 100644
index 00000000..6db03d4e
--- /dev/null
+++ b/fdt.c
@@ -0,0 +1,15 @@
+/*
+ * Commonly used FDT functions.
+ */
+
+#include "kvm/fdt.h"
+
+static u32 next_phandle = PHANDLE_RESERVED;
+
+u32 fdt_alloc_phandle(void)
+{
+ if (next_phandle == PHANDLE_RESERVED)
+ next_phandle = ARCH_PHANDLES_MAX;
+
+ return next_phandle++;
+}
diff --git a/include/kvm/fdt.h b/include/kvm/fdt.h
index beadc7f3..503887f9 100644
--- a/include/kvm/fdt.h
+++ b/include/kvm/fdt.h
@@ -35,4 +35,17 @@ enum irq_type {
} \
} while (0)

+#ifdef CONFIG_HAS_LIBFDT
+
+u32 fdt_alloc_phandle(void);
+
+#else
+
+static inline u32 fdt_alloc_phandle(void)
+{
+ return PHANDLE_RESERVED;
+}
+
+#endif /* CONFIG_HAS_LIBFDT */
+
#endif /* KVM__FDT_H */
diff --git a/mips/include/kvm/fdt-arch.h b/mips/include/kvm/fdt-arch.h
index b0302457..3d004117 100644
--- a/mips/include/kvm/fdt-arch.h
+++ b/mips/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
#ifndef KVM__KVM_FDT_H
#define KVM__KVM_FDT_H

-enum phandles {PHANDLE_RESERVED = 0, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, ARCH_PHANDLES_MAX};

#endif /* KVM__KVM_FDT_H */
diff --git a/powerpc/include/kvm/fdt-arch.h b/powerpc/include/kvm/fdt-arch.h
index d48c0554..4ae4d3a0 100644
--- a/powerpc/include/kvm/fdt-arch.h
+++ b/powerpc/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
#ifndef KVM__KVM_FDT_H
#define KVM__KVM_FDT_H

-enum phandles {PHANDLE_RESERVED = 0, PHANDLE_XICP, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, PHANDLE_XICP, ARCH_PHANDLES_MAX};

#endif /* KVM__KVM_FDT_H */
diff --git a/x86/include/kvm/fdt-arch.h b/x86/include/kvm/fdt-arch.h
index eebd73f9..aba06ad8 100644
--- a/x86/include/kvm/fdt-arch.h
+++ b/x86/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
#ifndef X86__FDT_ARCH_H
#define X86__FDT_ARCH_H

-enum phandles {PHANDLE_RESERVED = 0, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, ARCH_PHANDLES_MAX};

#endif /* KVM__KVM_FDT_H */
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:43 UTC
Permalink
Implement a simple para-virtualized IOMMU for handling device address
spaces in guests.

Four operations are implemented:
* attach/detach: guest creates an address space, symbolized by a unique
identifier (IOASID), and attaches the device to it.
* map/unmap: guest creates a GVA->GPA mapping in an address space. Devices
attached to this address space can then access the GVA.

Each subsystem can register its own IOMMU by calling register/unregister.
A unique device-tree phandle is allocated for each IOMMU. The IOMMU
receives commands from the driver through the virtqueue, and has a set of
callbacks for each device, making it possible to implement different
map/unmap operations for passed-through and emulated devices. Note that a
single virtual IOMMU per guest would be enough; this multi-instance model
is only here for experimenting, and to allow different subsystems to offer
different vIOMMU features.

Add a global --viommu parameter to enable the virtual IOMMU.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
Makefile | 1 +
builtin-run.c | 2 +
include/kvm/devices.h | 4 +
include/kvm/iommu.h | 64 +++++
include/kvm/kvm-config.h | 1 +
include/kvm/virtio-iommu.h | 10 +
virtio/iommu.c | 628 +++++++++++++++++++++++++++++++++++++++++++++
virtio/mmio.c | 11 +
8 files changed, 721 insertions(+)
create mode 100644 include/kvm/iommu.h
create mode 100644 include/kvm/virtio-iommu.h
create mode 100644 virtio/iommu.c

diff --git a/Makefile b/Makefile
index 3e21c597..67953870 100644
--- a/Makefile
+++ b/Makefile
@@ -68,6 +68,7 @@ OBJS += virtio/net.o
OBJS += virtio/rng.o
OBJS += virtio/balloon.o
OBJS += virtio/pci.o
+OBJS += virtio/iommu.o
OBJS += disk/blk.o
OBJS += disk/qcow.o
OBJS += disk/raw.o
diff --git a/builtin-run.c b/builtin-run.c
index b4790ebc..7535b531 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -113,6 +113,8 @@ void kvm_run_set_wrapper_sandbox(void)
OPT_BOOLEAN('\0', "sdl", &(cfg)->sdl, "Enable SDL framebuffer"),\
OPT_BOOLEAN('\0', "rng", &(cfg)->virtio_rng, "Enable virtio" \
" Random Number Generator"), \
+ OPT_BOOLEAN('\0', "viommu", &(cfg)->viommu, \
+ "Enable virtio IOMMU"), \
OPT_CALLBACK('\0', "9p", NULL, "dir_to_share,tag_name", \
"Enable virtio 9p to share files between host and" \
" guest", virtio_9p_rootdir_parser, kvm), \
diff --git a/include/kvm/devices.h b/include/kvm/devices.h
index 405f1952..70a00c5b 100644
--- a/include/kvm/devices.h
+++ b/include/kvm/devices.h
@@ -11,11 +11,15 @@ enum device_bus_type {
DEVICE_BUS_MAX,
};

+struct iommu_ops;
+
struct device_header {
enum device_bus_type bus_type;
void *data;
int dev_num;
struct rb_node node;
+ struct iommu_ops *iommu_ops;
+ void *iommu_data;
};

int device__register(struct device_header *dev);
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
new file mode 100644
index 00000000..925e1993
--- /dev/null
+++ b/include/kvm/iommu.h
@@ -0,0 +1,64 @@
+#ifndef KVM_IOMMU_H
+#define KVM_IOMMU_H
+
+#include <stdlib.h>
+
+#include "devices.h"
+
+#define IOMMU_PROT_NONE 0x0
+#define IOMMU_PROT_READ 0x1
+#define IOMMU_PROT_WRITE 0x2
+#define IOMMU_PROT_EXEC 0x4
+
+struct iommu_ops {
+ const struct iommu_properties *(*get_properties)(struct device_header *);
+
+ void *(*alloc_address_space)(struct device_header *);
+ void (*free_address_space)(void *);
+
+ int (*attach)(void *, struct device_header *, int flags);
+ int (*detach)(void *, struct device_header *);
+ int (*map)(void *, u64 virt_addr, u64 phys_addr, u64 size, int prot);
+ int (*unmap)(void *, u64 virt_addr, u64 size, int flags);
+};
+
+struct iommu_properties {
+ const char *name;
+ u32 phandle;
+
+ size_t input_addr_size;
+ u64 pgsize_mask;
+};
+
+/*
+ * All devices presented to the system have a device ID, that allows the IOMMU
+ * to identify them. Since multiple buses can share an IOMMU, this device ID
+ * must be unique system-wide. We define it here as:
+ *
+ * (bus_type << 16) + dev_num
+ *
+ * Where dev_num is the device number on the bus as allocated by devices.c
+ *
+ * TODO: enforce this limit, by checking that the device number allocator
+ * doesn't overflow BUS_SIZE.
+ */
+
+#define BUS_SIZE 0x10000
+
+static inline long device_to_iommu_id(struct device_header *dev)
+{
+ return dev->bus_type * BUS_SIZE + dev->dev_num;
+}
+
+#define iommu_id_to_bus(device_id) ((device_id) / BUS_SIZE)
+#define iommu_id_to_devnum(device_id) ((device_id) % BUS_SIZE)
+
+static inline struct device_header *iommu_get_device(u32 device_id)
+{
+ enum device_bus_type bus = iommu_id_to_bus(device_id);
+ u32 dev_num = iommu_id_to_devnum(device_id);
+
+ return device__find_dev(bus, dev_num);
+}
+
+#endif /* KVM_IOMMU_H */
diff --git a/include/kvm/kvm-config.h b/include/kvm/kvm-config.h
index 62dc6a2f..9678065b 100644
--- a/include/kvm/kvm-config.h
+++ b/include/kvm/kvm-config.h
@@ -60,6 +60,7 @@ struct kvm_config {
bool no_dhcp;
bool ioport_debug;
bool mmio_debug;
+ bool viommu;
};

#endif
diff --git a/include/kvm/virtio-iommu.h b/include/kvm/virtio-iommu.h
new file mode 100644
index 00000000..5532c82b
--- /dev/null
+++ b/include/kvm/virtio-iommu.h
@@ -0,0 +1,10 @@
+#ifndef KVM_VIRTIO_IOMMU_H
+#define KVM_VIRTIO_IOMMU_H
+
+#include "virtio.h"
+
+const struct iommu_properties *viommu_get_properties(void *dev);
+void *viommu_register(struct kvm *kvm, struct iommu_properties *props);
+void viommu_unregister(struct kvm *kvm, void *cookie);
+
+#endif
diff --git a/virtio/iommu.c b/virtio/iommu.c
new file mode 100644
index 00000000..c72e7322
--- /dev/null
+++ b/virtio/iommu.c
@@ -0,0 +1,628 @@
+#include <errno.h>
+#include <stdbool.h>
+
+#include <linux/compiler.h>
+
+#include <linux/bitops.h>
+#include <linux/byteorder.h>
+#include <linux/err.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_iommu.h>
+
+#include "kvm/guest_compat.h"
+#include "kvm/iommu.h"
+#include "kvm/threadpool.h"
+#include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
+
+/* Maximum number of descriptors on the request virtqueue */
+#define VIOMMU_DEFAULT_QUEUE_SIZE 256
+
+struct viommu_endpoint {
+ struct device_header *dev;
+ struct viommu_ioas *ioas;
+ struct list_head list;
+};
+
+struct viommu_ioas {
+ u32 id;
+
+ struct mutex devices_mutex;
+ struct list_head devices;
+ size_t nr_devices;
+ struct rb_node node;
+
+ struct iommu_ops *ops;
+ void *priv;
+};
+
+struct viommu_dev {
+ struct virtio_device vdev;
+ struct virtio_iommu_config config;
+
+ const struct iommu_properties *properties;
+
+ struct virt_queue vq;
+ size_t queue_size;
+ struct thread_pool__job job;
+
+ struct rb_root address_spaces;
+ struct kvm *kvm;
+};
+
+static int compat_id = -1;
+
+static struct viommu_ioas *viommu_find_ioas(struct viommu_dev *viommu,
+ u32 ioasid)
+{
+ struct rb_node *node;
+ struct viommu_ioas *ioas;
+
+ node = viommu->address_spaces.rb_node;
+ while (node) {
+ ioas = container_of(node, struct viommu_ioas, node);
+ if (ioas->id > ioasid)
+ node = node->rb_left;
+ else if (ioas->id < ioasid)
+ node = node->rb_right;
+ else
+ return ioas;
+ }
+
+ return NULL;
+}
+
+static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
+ struct device_header *device,
+ u32 ioasid)
+{
+ struct rb_node **node, *parent = NULL;
+ struct viommu_ioas *new_ioas, *ioas;
+ struct iommu_ops *ops = device->iommu_ops;
+
+ if (!ops || !ops->get_properties || !ops->alloc_address_space ||
+ !ops->free_address_space || !ops->attach || !ops->detach ||
+ !ops->map || !ops->unmap) {
+ /* Catch programming mistakes early */
+ pr_err("Invalid IOMMU ops");
+ return NULL;
+ }
+
+ new_ioas = calloc(1, sizeof(*new_ioas));
+ if (!new_ioas)
+ return NULL;
+
+ INIT_LIST_HEAD(&new_ioas->devices);
+ mutex_init(&new_ioas->devices_mutex);
+ new_ioas->id = ioasid;
+ new_ioas->ops = ops;
+ new_ioas->priv = ops->alloc_address_space(device);
+
+ /* A NULL priv pointer is valid. */
+
+ node = &viommu->address_spaces.rb_node;
+ while (*node) {
+ ioas = container_of(*node, struct viommu_ioas, node);
+ parent = *node;
+
+ if (ioas->id > ioasid) {
+ node = &((*node)->rb_left);
+ } else if (ioas->id < ioasid) {
+ node = &((*node)->rb_right);
+ } else {
+ pr_err("IOAS exists!");
+ free(new_ioas);
+ return NULL;
+ }
+ }
+
+ rb_link_node(&new_ioas->node, parent, node);
+ rb_insert_color(&new_ioas->node, &viommu->address_spaces);
+
+ return new_ioas;
+}
+
+static void viommu_free_ioas(struct viommu_dev *viommu,
+ struct viommu_ioas *ioas)
+{
+ if (ioas->priv)
+ ioas->ops->free_address_space(ioas->priv);
+
+ rb_erase(&ioas->node, &viommu->address_spaces);
+ free(ioas);
+}
+
+static int viommu_ioas_add_device(struct viommu_ioas *ioas,
+ struct viommu_endpoint *vdev)
+{
+ mutex_lock(&ioas->devices_mutex);
+ list_add_tail(&vdev->list, &ioas->devices);
+ ioas->nr_devices++;
+ vdev->ioas = ioas;
+ mutex_unlock(&ioas->devices_mutex);
+
+ return 0;
+}
+
+static int viommu_ioas_del_device(struct viommu_ioas *ioas,
+ struct viommu_endpoint *vdev)
+{
+ mutex_lock(&ioas->devices_mutex);
+ list_del(&vdev->list);
+ ioas->nr_devices--;
+ vdev->ioas = NULL;
+ mutex_unlock(&ioas->devices_mutex);
+
+ return 0;
+}
+
+static struct viommu_endpoint *viommu_alloc_device(struct device_header *device)
+{
+	struct viommu_endpoint *vdev = calloc(1, sizeof(*vdev));
+
+	if (!vdev)
+		return NULL;
+
+	device->iommu_data = vdev;
+	vdev->dev = device;
+
+	return vdev;
+}
+
+static int viommu_detach_device(struct viommu_dev *viommu,
+ struct viommu_endpoint *vdev)
+{
+ int ret;
+ struct viommu_ioas *ioas = vdev->ioas;
+ struct device_header *device = vdev->dev;
+
+ if (!ioas)
+ return -EINVAL;
+
+ pr_debug("detaching device %#lx from IOAS %u",
+ device_to_iommu_id(device), ioas->id);
+
+ ret = device->iommu_ops->detach(ioas->priv, device);
+ if (!ret)
+ ret = viommu_ioas_del_device(ioas, vdev);
+
+ if (!ioas->nr_devices)
+ viommu_free_ioas(viommu, ioas);
+
+ return ret;
+}
+
+static int viommu_handle_attach(struct viommu_dev *viommu,
+ struct virtio_iommu_req_attach *attach)
+{
+ int ret;
+ struct viommu_ioas *ioas;
+ struct device_header *device;
+ struct viommu_endpoint *vdev;
+
+ u32 device_id = le32_to_cpu(attach->device);
+ u32 ioasid = le32_to_cpu(attach->address_space);
+
+ device = iommu_get_device(device_id);
+ if (IS_ERR_OR_NULL(device)) {
+ pr_err("could not find device %#x", device_id);
+ return -ENODEV;
+ }
+
+ pr_debug("attaching device %#x to IOAS %u", device_id, ioasid);
+
+ vdev = device->iommu_data;
+ if (!vdev) {
+ vdev = viommu_alloc_device(device);
+ if (!vdev)
+ return -ENOMEM;
+ }
+
+ ioas = viommu_find_ioas(viommu, ioasid);
+ if (!ioas) {
+ ioas = viommu_alloc_ioas(viommu, device, ioasid);
+ if (!ioas)
+ return -ENOMEM;
+ } else if (ioas->ops->map != device->iommu_ops->map ||
+ ioas->ops->unmap != device->iommu_ops->unmap) {
+ return -EINVAL;
+ }
+
+ if (vdev->ioas) {
+ ret = viommu_detach_device(viommu, vdev);
+ if (ret)
+ return ret;
+ }
+
+ ret = device->iommu_ops->attach(ioas->priv, device, 0);
+ if (!ret)
+ ret = viommu_ioas_add_device(ioas, vdev);
+
+ if (ret && ioas->nr_devices == 0)
+ viommu_free_ioas(viommu, ioas);
+
+ return ret;
+}
+
+static int viommu_handle_detach(struct viommu_dev *viommu,
+ struct virtio_iommu_req_detach *detach)
+{
+ struct device_header *device;
+ struct viommu_endpoint *vdev;
+
+ u32 device_id = le32_to_cpu(detach->device);
+
+ device = iommu_get_device(device_id);
+ if (IS_ERR_OR_NULL(device)) {
+ pr_err("could not find device %#x", device_id);
+ return -ENODEV;
+ }
+
+ vdev = device->iommu_data;
+ if (!vdev)
+ return -ENODEV;
+
+ return viommu_detach_device(viommu, vdev);
+}
+
+static int viommu_handle_map(struct viommu_dev *viommu,
+ struct virtio_iommu_req_map *map)
+{
+ int prot = 0;
+ struct viommu_ioas *ioas;
+
+ u32 ioasid = le32_to_cpu(map->address_space);
+ u64 virt_addr = le64_to_cpu(map->virt_addr);
+ u64 phys_addr = le64_to_cpu(map->phys_addr);
+ u64 size = le64_to_cpu(map->size);
+	u32 flags = le32_to_cpu(map->flags);
+
+ ioas = viommu_find_ioas(viommu, ioasid);
+ if (!ioas) {
+ pr_err("could not find address space %u", ioasid);
+ return -ESRCH;
+ }
+
+ if (flags & ~VIRTIO_IOMMU_MAP_F_MASK)
+ return -EINVAL;
+
+ if (flags & VIRTIO_IOMMU_MAP_F_READ)
+ prot |= IOMMU_PROT_READ;
+
+ if (flags & VIRTIO_IOMMU_MAP_F_WRITE)
+ prot |= IOMMU_PROT_WRITE;
+
+ if (flags & VIRTIO_IOMMU_MAP_F_EXEC)
+ prot |= IOMMU_PROT_EXEC;
+
+ pr_debug("map %#llx -> %#llx (%llu) to IOAS %u", virt_addr,
+ phys_addr, size, ioasid);
+
+ return ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+}
+
+static int viommu_handle_unmap(struct viommu_dev *viommu,
+ struct virtio_iommu_req_unmap *unmap)
+{
+ struct viommu_ioas *ioas;
+
+ u32 ioasid = le32_to_cpu(unmap->address_space);
+ u64 virt_addr = le64_to_cpu(unmap->virt_addr);
+ u64 size = le64_to_cpu(unmap->size);
+
+ ioas = viommu_find_ioas(viommu, ioasid);
+ if (!ioas) {
+ pr_err("could not find address space %u", ioasid);
+ return -ESRCH;
+ }
+
+ pr_debug("unmap %#llx (%llu) from IOAS %u", virt_addr, size,
+ ioasid);
+
+ return ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+}
+
+static size_t viommu_get_req_len(union virtio_iommu_req *req)
+{
+ switch (req->head.type) {
+ case VIRTIO_IOMMU_T_ATTACH:
+ return sizeof(req->attach);
+ case VIRTIO_IOMMU_T_DETACH:
+ return sizeof(req->detach);
+ case VIRTIO_IOMMU_T_MAP:
+ return sizeof(req->map);
+ case VIRTIO_IOMMU_T_UNMAP:
+ return sizeof(req->unmap);
+ default:
+ pr_err("unknown request type %x", req->head.type);
+ return 0;
+ }
+}
+
+static int viommu_errno_to_status(int err)
+{
+ switch (err) {
+ case 0:
+ return VIRTIO_IOMMU_S_OK;
+ case EIO:
+ return VIRTIO_IOMMU_S_IOERR;
+ case ENOSYS:
+ return VIRTIO_IOMMU_S_UNSUPP;
+ case ERANGE:
+ return VIRTIO_IOMMU_S_RANGE;
+ case EFAULT:
+ return VIRTIO_IOMMU_S_FAULT;
+ case EINVAL:
+ return VIRTIO_IOMMU_S_INVAL;
+ case ENOENT:
+ case ENODEV:
+ case ESRCH:
+ return VIRTIO_IOMMU_S_NOENT;
+ case ENOMEM:
+ case ENOSPC:
+ default:
+ return VIRTIO_IOMMU_S_DEVERR;
+ }
+}
+
+static ssize_t viommu_dispatch_commands(struct viommu_dev *viommu,
+ struct iovec *iov, int nr_in, int nr_out)
+{
+ u32 op;
+ int i, ret;
+ ssize_t written_len = 0;
+ size_t len, expected_len;
+ union virtio_iommu_req *req;
+ struct virtio_iommu_req_tail *tail;
+
+ /*
+ * Are we picking up in the middle of a request buffer? Keep a running
+ * count.
+ *
+ * Here we assume that a request is always made of two descriptors, a
+ * head and a tail. TODO: get rid of framing assumptions by keeping
+ * track of request fragments.
+ */
+ static bool is_head = true;
+ static int cur_status = 0;
+
+ for (i = 0; i < nr_in + nr_out; i++, is_head = !is_head) {
+ len = iov[i].iov_len;
+ if (is_head && len < sizeof(req->head)) {
+ pr_err("invalid command length (%zu)", len);
+ cur_status = EIO;
+ continue;
+ } else if (!is_head && len < sizeof(*tail)) {
+ pr_err("invalid tail length (%zu)", len);
+ cur_status = 0;
+ continue;
+ }
+
+ if (!is_head) {
+ int status = viommu_errno_to_status(cur_status);
+
+ tail = iov[i].iov_base;
+ tail->status = cpu_to_le32(status);
+ written_len += sizeof(tail->status);
+ cur_status = 0;
+ continue;
+ }
+
+ req = iov[i].iov_base;
+ op = req->head.type;
+ expected_len = viommu_get_req_len(req) - sizeof(*tail);
+ if (expected_len != len) {
+ pr_err("invalid command %x length (%zu != %zu)", op,
+ len, expected_len);
+ cur_status = EIO;
+ continue;
+ }
+
+ switch (op) {
+ case VIRTIO_IOMMU_T_ATTACH:
+ ret = viommu_handle_attach(viommu, &req->attach);
+ break;
+
+ case VIRTIO_IOMMU_T_DETACH:
+ ret = viommu_handle_detach(viommu, &req->detach);
+ break;
+
+ case VIRTIO_IOMMU_T_MAP:
+ ret = viommu_handle_map(viommu, &req->map);
+ break;
+
+ case VIRTIO_IOMMU_T_UNMAP:
+ ret = viommu_handle_unmap(viommu, &req->unmap);
+ break;
+
+ default:
+ pr_err("unhandled command %x", op);
+ ret = -ENOSYS;
+ }
+
+ if (ret)
+ cur_status = -ret;
+ }
+
+ return written_len;
+}
+
+static void viommu_command(struct kvm *kvm, void *dev)
+{
+ int len;
+ u16 head;
+ u16 out, in;
+
+ struct virt_queue *vq;
+ struct viommu_dev *viommu = dev;
+ struct iovec iov[VIOMMU_DEFAULT_QUEUE_SIZE];
+
+ vq = &viommu->vq;
+
+ while (virt_queue__available(vq)) {
+ head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+
+ len = viommu_dispatch_commands(viommu, iov, in, out);
+ if (len < 0) {
+ /* Critical error, abort everything */
+ pr_err("failed to dispatch viommu command");
+ return;
+ }
+
+ virt_queue__set_used_elem(vq, head, len);
+ }
+
+ if (virtio_queue__should_signal(vq))
+ viommu->vdev.ops->signal_vq(kvm, &viommu->vdev, 0);
+}
+
+/* Virtio API */
+static u8 *viommu_get_config(struct kvm *kvm, void *dev)
+{
+ struct viommu_dev *viommu = dev;
+
+ return (u8 *)&viommu->config;
+}
+
+static u32 viommu_get_host_features(struct kvm *kvm, void *dev)
+{
+ return 1ULL << VIRTIO_RING_F_EVENT_IDX
+ | 1ULL << VIRTIO_RING_F_INDIRECT_DESC
+ | 1ULL << VIRTIO_IOMMU_F_INPUT_RANGE;
+}
+
+static void viommu_set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+}
+
+static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
+ u32 align, u32 pfn)
+{
+ void *ptr;
+ struct virt_queue *queue;
+ struct viommu_dev *viommu = dev;
+
+ if (vq != 0)
+ return -ENODEV;
+
+ compat__remove_message(compat_id);
+
+ queue = &viommu->vq;
+ queue->pfn = pfn;
+ ptr = virtio_get_vq(kvm, queue->pfn, page_size);
+
+ vring_init(&queue->vring, viommu->queue_size, ptr, align);
+ virtio_init_device_vq(&viommu->vdev, queue);
+
+ thread_pool__init_job(&viommu->job, kvm, viommu_command, viommu);
+
+ return 0;
+}
+
+static int viommu_get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct viommu_dev *viommu = dev;
+
+ return viommu->vq.pfn;
+}
+
+static int viommu_get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct viommu_dev *viommu = dev;
+
+ return viommu->queue_size;
+}
+
+static int viommu_set_size_vq(struct kvm *kvm, void *dev, u32 vq, int size)
+{
+ struct viommu_dev *viommu = dev;
+
+ if (viommu->vq.pfn)
+ /* Already init, can't resize */
+ return viommu->queue_size;
+
+ viommu->queue_size = size;
+
+ return size;
+}
+
+static int viommu_notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+ struct viommu_dev *viommu = dev;
+
+ thread_pool__do_job(&viommu->job);
+
+ return 0;
+}
+
+static void viommu_notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+ /* TODO: when implementing vhost */
+}
+
+static void viommu_notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 fd)
+{
+ /* TODO: when implementing vhost */
+}
+
+static struct virtio_ops iommu_dev_virtio_ops = {
+ .get_config = viommu_get_config,
+ .get_host_features = viommu_get_host_features,
+ .set_guest_features = viommu_set_guest_features,
+ .init_vq = viommu_init_vq,
+ .get_pfn_vq = viommu_get_pfn_vq,
+ .get_size_vq = viommu_get_size_vq,
+ .set_size_vq = viommu_set_size_vq,
+ .notify_vq = viommu_notify_vq,
+ .notify_vq_gsi = viommu_notify_vq_gsi,
+ .notify_vq_eventfd = viommu_notify_vq_eventfd,
+};
+
+const struct iommu_properties *viommu_get_properties(void *dev)
+{
+ struct viommu_dev *viommu = dev;
+
+ return viommu->properties;
+}
+
+void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
+{
+ struct viommu_dev *viommu;
+ u64 pgsize_mask = ~(PAGE_SIZE - 1);
+
+ if (!kvm->cfg.viommu)
+ return NULL;
+
+ props->phandle = fdt_alloc_phandle();
+
+ viommu = calloc(1, sizeof(struct viommu_dev));
+ if (!viommu)
+ return NULL;
+
+ viommu->queue_size = VIOMMU_DEFAULT_QUEUE_SIZE;
+ viommu->address_spaces = (struct rb_root)RB_ROOT;
+ viommu->properties = props;
+
+ viommu->config.page_sizes = props->pgsize_mask ?: pgsize_mask;
+ viommu->config.input_range.end = props->input_addr_size % BITS_PER_LONG ?
+ (1UL << props->input_addr_size) - 1 :
+ -1UL;
+
+ if (virtio_init(kvm, viommu, &viommu->vdev, &iommu_dev_virtio_ops,
+ VIRTIO_MMIO, 0, VIRTIO_ID_IOMMU, 0)) {
+ free(viommu);
+ return NULL;
+ }
+
+ pr_info("Loaded virtual IOMMU %s", props->name);
+
+ if (compat_id == -1)
+ compat_id = virtio_compat_add_message("virtio-iommu",
+ "CONFIG_VIRTIO_IOMMU");
+
+ return viommu;
+}
+
+void viommu_unregister(struct kvm *kvm, void *viommu)
+{
+ free(viommu);
+}
diff --git a/virtio/mmio.c b/virtio/mmio.c
index f0af4bd1..b3dea51a 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -1,14 +1,17 @@
#include "kvm/devices.h"
#include "kvm/virtio-mmio.h"
#include "kvm/ioeventfd.h"
+#include "kvm/iommu.h"
#include "kvm/ioport.h"
#include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
#include "kvm/kvm.h"
#include "kvm/kvm-cpu.h"
#include "kvm/irq.h"
#include "kvm/fdt.h"

#include <linux/virtio_mmio.h>
+#include <linux/virtio_ids.h>
#include <string.h>

static u32 virtio_mmio_io_space_blocks = KVM_VIRTIO_MMIO_AREA;
@@ -237,6 +240,7 @@ void generate_virtio_mmio_fdt_node(void *fdt,
u8 irq,
enum irq_type))
{
+ const struct iommu_properties *props;
char dev_name[DEVICE_NAME_MAX_LEN];
struct virtio_mmio *vmmio = container_of(dev_hdr,
struct virtio_mmio,
@@ -254,6 +258,13 @@ void generate_virtio_mmio_fdt_node(void *fdt,
_FDT(fdt_property(fdt, "reg", reg_prop, sizeof(reg_prop)));
_FDT(fdt_property(fdt, "dma-coherent", NULL, 0));
generate_irq_prop(fdt, vmmio->irq, IRQ_TYPE_EDGE_RISING);
+
+ if (vmmio->hdr.device_id == VIRTIO_ID_IOMMU) {
+ props = viommu_get_properties(vmmio->dev);
+ _FDT(fdt_property_cell(fdt, "phandle", props->phandle));
+ _FDT(fdt_property_cell(fdt, "#iommu-cells", 1));
+ }
+
_FDT(fdt_end_node(fdt));
}
#else
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:44 UTC
Permalink
Add an rb-tree-based IOMMU with support for map, unmap and access
operations. It will be used to store mappings for virtio devices and MSI
doorbells. If needed, it could also be extended with a TLB implementation.
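For readers unfamiliar with the interval-based approach, the core idea can
be sketched in a few lines of standalone C. This uses a plain linked list
instead of kvmtool's rb-tree, and the demo_* names are made up for
illustration; it only shows the map/translate contract, not the real code:

```c
#include <stdint.h>
#include <stdlib.h>

/* One IOVA range mapped to a physical base, like struct iommu_mapping */
struct demo_mapping {
	uint64_t low, high;	/* inclusive IOVA interval */
	uint64_t phys;		/* physical base of the range */
	struct demo_mapping *next;
};

/* Record a mapping of @size bytes from @iova to @phys */
static struct demo_mapping *demo_map(struct demo_mapping *head,
				     uint64_t iova, uint64_t phys,
				     uint64_t size)
{
	struct demo_mapping *m = malloc(sizeof(*m));

	if (!m)
		return head;

	m->low = iova;
	m->high = iova + size - 1;
	m->phys = phys;
	m->next = head;
	return m;
}

/* Translate @iova into a physical address; return 0 on fault */
static uint64_t demo_translate(struct demo_mapping *head, uint64_t iova)
{
	for (; head; head = head->next)
		if (iova >= head->low && iova <= head->high)
			return head->phys + (iova - head->low);
	return 0;
}
```

The rb-tree in the patch replaces the linear walk with an O(log n)
interval lookup, but the translation arithmetic is the same.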

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
Makefile | 1 +
include/kvm/iommu.h | 9 +++
iommu.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 172 insertions(+)
create mode 100644 iommu.c

diff --git a/Makefile b/Makefile
index 67953870..0c369206 100644
--- a/Makefile
+++ b/Makefile
@@ -73,6 +73,7 @@ OBJS += disk/blk.o
OBJS += disk/qcow.o
OBJS += disk/raw.o
OBJS += ioeventfd.o
+OBJS += iommu.o
OBJS += net/uip/core.o
OBJS += net/uip/arp.o
OBJS += net/uip/icmp.o
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 925e1993..4164ba20 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -61,4 +61,13 @@ static inline struct device_header *iommu_get_device(u32 device_id)
return device__find_dev(bus, dev_num);
}

+void *iommu_alloc_address_space(struct device_header *dev);
+void iommu_free_address_space(void *address_space);
+
+int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr, u64 size,
+ int prot);
+int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags);
+u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
+ int prot);
+
#endif /* KVM_IOMMU_H */
diff --git a/iommu.c b/iommu.c
new file mode 100644
index 00000000..0a662404
--- /dev/null
+++ b/iommu.c
@@ -0,0 +1,162 @@
+/*
+ * Implement basic IOMMU operations - map, unmap and translate
+ */
+#include <errno.h>
+
+#include "kvm/iommu.h"
+#include "kvm/kvm.h"
+#include "kvm/mutex.h"
+#include "kvm/rbtree-interval.h"
+
+struct iommu_mapping {
+ struct rb_int_node iova_range;
+ u64 phys;
+ int prot;
+};
+
+struct iommu_ioas {
+ struct rb_root mappings;
+ struct mutex mutex;
+};
+
+void *iommu_alloc_address_space(struct device_header *unused)
+{
+ struct iommu_ioas *ioas = calloc(1, sizeof(*ioas));
+
+ if (!ioas)
+ return NULL;
+
+ ioas->mappings = (struct rb_root)RB_ROOT;
+ mutex_init(&ioas->mutex);
+
+ return ioas;
+}
+
+void iommu_free_address_space(void *address_space)
+{
+ struct iommu_ioas *ioas = address_space;
+ struct rb_int_node *int_node;
+ struct rb_node *node, *next;
+ struct iommu_mapping *map;
+
+	/* Postorder traversal frees leaves before their parents. */
+ node = rb_first_postorder(&ioas->mappings);
+ while (node) {
+ next = rb_next_postorder(node);
+
+ int_node = rb_int(node);
+ map = container_of(int_node, struct iommu_mapping, iova_range);
+ free(map);
+
+ node = next;
+ }
+
+ free(ioas);
+}
+
+int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr,
+ u64 size, int prot)
+{
+ struct iommu_ioas *ioas = address_space;
+ struct iommu_mapping *map;
+
+ if (!ioas)
+ return -ENODEV;
+
+ map = malloc(sizeof(struct iommu_mapping));
+ if (!map)
+ return -ENOMEM;
+
+ map->phys = phys_addr;
+ map->iova_range = RB_INT_INIT(virt_addr, virt_addr + size - 1);
+ map->prot = prot;
+
+ mutex_lock(&ioas->mutex);
+ rb_int_insert(&ioas->mappings, &map->iova_range);
+ mutex_unlock(&ioas->mutex);
+
+ return 0;
+}
+
+int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
+{
+ int ret = 0;
+ struct rb_int_node *node;
+ struct iommu_mapping *map;
+ struct iommu_ioas *ioas = address_space;
+
+ if (!ioas)
+ return -ENODEV;
+
+ mutex_lock(&ioas->mutex);
+ node = rb_int_search_single(&ioas->mappings, virt_addr);
+ while (node && size) {
+ struct rb_node *next = rb_next(&node->node);
+ size_t node_size = node->high - node->low + 1;
+ map = container_of(node, struct iommu_mapping, iova_range);
+
+ if (node_size > size) {
+ pr_debug("cannot split mapping");
+ ret = -EINVAL;
+ break;
+ }
+
+ size -= node_size;
+ virt_addr += node_size;
+
+ rb_erase(&node->node, &ioas->mappings);
+ free(map);
+ node = next ? container_of(next, struct rb_int_node, node) : NULL;
+ }
+
+ if (size && !ret) {
+ pr_debug("mapping not found");
+ ret = -ENXIO;
+ }
+ mutex_unlock(&ioas->mutex);
+
+ return ret;
+}
+
+/*
+ * Translate a virtual address into a physical one. Perform an access of @size
+ * bytes with protection @prot. If @addr isn't mapped in @address_space, return
+ * 0. If the permissions of the mapping don't match, return 0. If the access
+ * range specified by (addr, size) spans over multiple mappings, only access the
+ * first mapping and return the accessed size in @out_size. It is up to the
+ * caller to complete the access by calling the function again on the remaining
+ * range. Subsequent accesses are not guaranteed to succeed.
+ */
+u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
+ int prot)
+{
+ struct iommu_ioas *ioas = address_space;
+ struct iommu_mapping *map;
+ struct rb_int_node *node;
+ u64 out_addr = 0;
+
+ mutex_lock(&ioas->mutex);
+ node = rb_int_search_single(&ioas->mappings, addr);
+ if (!node) {
+ pr_err("fault at IOVA %#llx %zu", addr, size);
+ errno = EFAULT;
+		goto out_unlock; /* SEGV incoming */
+ }
+
+ map = container_of(node, struct iommu_mapping, iova_range);
+ if (prot & ~map->prot) {
+ pr_err("permission fault at IOVA %#llx", addr);
+ errno = EPERM;
+ goto out_unlock;
+ }
+
+ out_addr = map->phys + (addr - node->low);
+ *out_size = min_t(size_t, node->high - addr + 1, size);
+
+ pr_debug("access %llx %zu/%zu %x -> %#llx", addr, *out_size, size,
+ prot, out_addr);
+out_unlock:
+ mutex_unlock(&ioas->mutex);
+
+ return out_addr;
+}
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:45 UTC
Permalink
Add an "iommu-map" property to the PCI host controller, describing which
iommus translate which devices. We describe individual devices in
iommu-map, not ranges. This patch is incompatible with current mainline
Linux, which requires *all* devices under a host controller to be
described by the iommu-map property when present. Unfortunately all PCI
devices in kvmtool are under the same root complex, and we have to omit
RIDs of devices that aren't behind the virtual IOMMU from iommu-map. Fixing
this requires either a simple patch in Linux or implementing multiple host
controllers in kvmtool.

Add an "iommus" property to platform devices that are behind an IOMMU.
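The iommu-map entries built in arm/pci.c encode PCI requester IDs. Since
all kvmtool devices sit on bus 0, the RID is just the devfn, so rid_base is
dev_num << 3 and a length of 1 << 3 covers the eight functions of one slot.
The arithmetic, as a standalone sketch (demo_rid is an illustrative name):

```c
#include <stdint.h>

/* PCI requester ID layout: bus[15:8] | device[7:3] | function[2:0] */
static uint32_t demo_rid(uint32_t bus, uint32_t dev, uint32_t fn)
{
	return (bus << 8) | (dev << 3) | fn;
}
```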

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
arm/pci.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
fdt.c | 20 ++++++++++++++++++++
include/kvm/fdt.h | 7 +++++++
virtio/mmio.c | 1 +
4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/arm/pci.c b/arm/pci.c
index 557cfa98..968cbf5b 100644
--- a/arm/pci.c
+++ b/arm/pci.c
@@ -1,9 +1,11 @@
#include "kvm/devices.h"
#include "kvm/fdt.h"
+#include "kvm/iommu.h"
#include "kvm/kvm.h"
#include "kvm/of_pci.h"
#include "kvm/pci.h"
#include "kvm/util.h"
+#include "kvm/virtio-iommu.h"

#include "arm-common/pci.h"

@@ -24,11 +26,20 @@ struct of_interrupt_map_entry {
struct of_gic_irq gic_irq;
} __attribute__((packed));

+struct of_iommu_map_entry {
+ u32 rid_base;
+ u32 iommu_phandle;
+ u32 iommu_base;
+ u32 length;
+} __attribute__((packed));
+
void pci__generate_fdt_nodes(void *fdt)
{
struct device_header *dev_hdr;
struct of_interrupt_map_entry irq_map[OF_PCI_IRQ_MAP_MAX];
- unsigned nentries = 0;
+ struct of_iommu_map_entry *iommu_map;
+ unsigned nentries = 0, ntranslated = 0;
+ unsigned i;
/* Bus range */
u32 bus_range[] = { cpu_to_fdt32(0), cpu_to_fdt32(1), };
/* Configuration Space */
@@ -99,6 +110,9 @@ void pci__generate_fdt_nodes(void *fdt)
},
};

+ if (dev_hdr->iommu_ops)
+ ntranslated++;
+
nentries++;
dev_hdr = device__next_dev(dev_hdr);
}
@@ -121,5 +135,38 @@ void pci__generate_fdt_nodes(void *fdt)
sizeof(irq_mask)));
}

+ if (ntranslated) {
+ const struct iommu_properties *props;
+
+ iommu_map = malloc(ntranslated * sizeof(struct of_iommu_map_entry));
+ if (!iommu_map) {
+ pr_err("cannot allocate iommu_map.");
+ return;
+ }
+
+ dev_hdr = device__first_dev(DEVICE_BUS_PCI);
+ for (i = 0; i < ntranslated; dev_hdr = device__next_dev(dev_hdr)) {
+ struct of_iommu_map_entry *entry = &iommu_map[i];
+
+ if (!dev_hdr->iommu_ops)
+ continue;
+
+ props = dev_hdr->iommu_ops->get_properties(dev_hdr);
+
+ *entry = (struct of_iommu_map_entry) {
+ .rid_base = cpu_to_fdt32(dev_hdr->dev_num << 3),
+ .iommu_phandle = cpu_to_fdt32(props->phandle),
+ .iommu_base = cpu_to_fdt32(device_to_iommu_id(dev_hdr)),
+ .length = cpu_to_fdt32(1 << 3),
+ };
+
+ i++;
+ }
+
+ _FDT(fdt_property(fdt, "iommu-map", iommu_map,
+ ntranslated * sizeof(struct of_iommu_map_entry)));
+ free(iommu_map);
+ }
+
_FDT(fdt_end_node(fdt));
}
diff --git a/fdt.c b/fdt.c
index 6db03d4e..15d7bb29 100644
--- a/fdt.c
+++ b/fdt.c
@@ -2,7 +2,10 @@
* Commonly used FDT functions.
*/

+#include "kvm/devices.h"
#include "kvm/fdt.h"
+#include "kvm/iommu.h"
+#include "kvm/util.h"

static u32 next_phandle = PHANDLE_RESERVED;

@@ -13,3 +16,20 @@ u32 fdt_alloc_phandle(void)

return next_phandle++;
}
+
+void fdt_generate_iommus_prop(void *fdt, struct device_header *dev_hdr)
+{
+ const struct iommu_properties *props;
+
+ if (!dev_hdr->iommu_ops)
+ return;
+
+ props = dev_hdr->iommu_ops->get_properties(dev_hdr);
+
+ u32 iommus[] = {
+ cpu_to_fdt32(props->phandle),
+ cpu_to_fdt32(device_to_iommu_id(dev_hdr)),
+ };
+
+ _FDT(fdt_property(fdt, "iommus", iommus, sizeof(iommus)));
+}
diff --git a/include/kvm/fdt.h b/include/kvm/fdt.h
index 503887f9..c64fe8a3 100644
--- a/include/kvm/fdt.h
+++ b/include/kvm/fdt.h
@@ -37,7 +37,10 @@ enum irq_type {

#ifdef CONFIG_HAS_LIBFDT

+struct device_header;
+
u32 fdt_alloc_phandle(void);
+void fdt_generate_iommus_prop(void *fdt, struct device_header *dev);

#else

@@ -46,6 +49,10 @@ static inline u32 fdt_alloc_phandle(void)
return PHANDLE_RESERVED;
}

+static inline void fdt_generate_iommus_prop(void *fdt, struct device_header *dev)
+{
+}
+
#endif /* CONFIG_HAS_LIBFDT */

#endif /* KVM__FDT_H */
diff --git a/virtio/mmio.c b/virtio/mmio.c
index b3dea51a..16b44fbb 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -258,6 +258,7 @@ void generate_virtio_mmio_fdt_node(void *fdt,
_FDT(fdt_property(fdt, "reg", reg_prop, sizeof(reg_prop)));
_FDT(fdt_property(fdt, "dma-coherent", NULL, 0));
generate_irq_prop(fdt, vmmio->irq, IRQ_TYPE_EDGE_RISING);
+ fdt_generate_iommus_prop(fdt, dev_hdr);

if (vmmio->hdr.device_id == VIRTIO_ID_IOMMU) {
props = viommu_get_properties(vmmio->dev);
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:46 UTC
Permalink
For passed-through devices behind a vIOMMU, we'll need to translate writes
to MSI vectors. Let the IRQ code register MSI doorbells, and add a simple
way for other systems to check if an address is a doorbell.
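The doorbell check is a simple interval-containment test over the
registered regions. A minimal standalone version (an array instead of
kvmtool's linked list; names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* One registered doorbell region, like irq__add_msi_doorbell() records */
struct demo_doorbell {
	uint64_t start, end;	/* inclusive range */
};

/* Return true if @addr falls within any registered doorbell region */
static bool demo_is_doorbell(const struct demo_doorbell *db, int n,
			     uint64_t addr)
{
	for (int i = 0; i < n; i++)
		if (addr >= db[i].start && addr <= db[i].end)
			return true;
	return false;
}
```

With the end stored as start + size - 1, as in the patch, both boundary
addresses of a region are correctly treated as doorbell writes.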

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
arm/gic.c | 4 ++++
include/kvm/irq.h | 3 +++
irq.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+)

diff --git a/arm/gic.c b/arm/gic.c
index bf7a22a9..c708031e 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -108,6 +108,10 @@ static int gic__create_its_frame(struct kvm *kvm, u64 its_frame_addr)
};
int err;

+ err = irq__add_msi_doorbell(kvm, its_frame_addr, KVM_VGIC_V3_ITS_SIZE);
+ if (err)
+ return err;
+
err = ioctl(kvm->vm_fd, KVM_CREATE_DEVICE, &its_device);
if (err) {
fprintf(stderr,
diff --git a/include/kvm/irq.h b/include/kvm/irq.h
index a188a870..2a59257e 100644
--- a/include/kvm/irq.h
+++ b/include/kvm/irq.h
@@ -24,6 +24,9 @@ int irq__allocate_routing_entry(void);
int irq__add_msix_route(struct kvm *kvm, struct msi_msg *msg, u32 device_id);
void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg);

+int irq__add_msi_doorbell(struct kvm *kvm, u64 addr, u64 size);
+bool irq__addr_is_msi_doorbell(struct kvm *kvm, u64 addr);
+
/*
* The function takes two eventfd arguments, trigger_fd and resample_fd. If
* resample_fd is <= 0, resampling is disabled and the IRQ is edge-triggered
diff --git a/irq.c b/irq.c
index a4ef75e4..a04f4d37 100644
--- a/irq.c
+++ b/irq.c
@@ -8,6 +8,14 @@
#include "kvm/irq.h"
#include "kvm/kvm-arch.h"

+struct kvm_msi_doorbell_region {
+ u64 start;
+ u64 end;
+ struct list_head head;
+};
+
+static LIST_HEAD(msi_doorbells);
+
static u8 next_line = KVM_IRQ_OFFSET;
static int allocated_gsis = 0;

@@ -147,6 +155,33 @@ void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg)
die_perror("KVM_SET_GSI_ROUTING");
}

+int irq__add_msi_doorbell(struct kvm *kvm, u64 addr, u64 size)
+{
+ struct kvm_msi_doorbell_region *doorbell = malloc(sizeof(*doorbell));
+
+ if (!doorbell)
+ return -ENOMEM;
+
+ doorbell->start = addr;
+ doorbell->end = addr + size - 1;
+
+ list_add(&doorbell->head, &msi_doorbells);
+
+ return 0;
+}
+
+bool irq__addr_is_msi_doorbell(struct kvm *kvm, u64 addr)
+{
+ struct kvm_msi_doorbell_region *doorbell;
+
+ list_for_each_entry(doorbell, &msi_doorbells, head) {
+ if (addr >= doorbell->start && addr <= doorbell->end)
+ return true;
+ }
+
+ return false;
+}
+
int irq__common_add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
int resample_fd)
{
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:47 UTC
Permalink
All virtio devices perform the same few operations when initializing
their virtqueues. Move these operations into the virtio core, since vring
initialization will become more complex when we implement a virtual IOMMU.
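The duplicated step being factored out is mostly the placement of the
legacy vring: the guest programs a page frame number, and the host derives
the ring's guest-physical address from it. A one-line sketch of that
computation (demo_vring_addr is a made-up name):

```c
#include <stdint.h>

/* Guest-physical address of a legacy vring, from its page frame number */
static uint64_t demo_vring_addr(uint32_t pfn, uint32_t page_size)
{
	return (uint64_t)pfn * page_size;
}
```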

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/virtio.h | 16 +++++++++-------
virtio/9p.c | 7 ++-----
virtio/balloon.c | 7 +++----
virtio/blk.c | 10 ++--------
virtio/console.c | 7 ++-----
virtio/iommu.c | 10 ++--------
virtio/net.c | 8 ++------
virtio/rng.c | 6 ++----
virtio/scsi.c | 6 ++----
9 files changed, 26 insertions(+), 51 deletions(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 00a791ac..24c0c487 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -169,15 +169,17 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
int virtio_compat_add_message(const char *device, const char *config);
const char* virtio_trans_name(enum virtio_trans trans);

-static inline void *virtio_get_vq(struct kvm *kvm, u32 pfn, u32 page_size)
+static inline void virtio_init_device_vq(struct kvm *kvm,
+ struct virtio_device *vdev,
+ struct virt_queue *vq, size_t nr_descs,
+ u32 page_size, u32 align, u32 pfn)
{
- return guest_flat_to_host(kvm, (u64)pfn * page_size);
-}
+ void *p = guest_flat_to_host(kvm, (u64)pfn * page_size);

-static inline void virtio_init_device_vq(struct virtio_device *vdev,
- struct virt_queue *vq)
-{
- vq->endian = vdev->endian;
+ vq->endian = vdev->endian;
+ vq->pfn = pfn;
+
+ vring_init(&vq->vring, nr_descs, p, align);
}

#endif /* KVM__VIRTIO_H */
diff --git a/virtio/9p.c b/virtio/9p.c
index 69fdc4be..acd09bdd 100644
--- a/virtio/9p.c
+++ b/virtio/9p.c
@@ -1388,17 +1388,14 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
struct p9_dev *p9dev = dev;
struct p9_dev_job *job;
struct virt_queue *queue;
- void *p;

compat__remove_message(compat_id);

queue = &p9dev->vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);
job = &p9dev->jobs[vq];

- vring_init(&queue->vring, VIRTQUEUE_NUM, p, align);
- virtio_init_device_vq(&p9dev->vdev, queue);
+ virtio_init_device_vq(kvm, &p9dev->vdev, queue, VIRTQUEUE_NUM,
+ page_size, align, pfn);

*job = (struct p9_dev_job) {
.vq = queue,
diff --git a/virtio/balloon.c b/virtio/balloon.c
index 9564aa39..9182cae6 100644
--- a/virtio/balloon.c
+++ b/virtio/balloon.c
@@ -198,16 +198,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
{
struct bln_dev *bdev = dev;
struct virt_queue *queue;
- void *p;

compat__remove_message(compat_id);

queue = &bdev->vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);
+
+ virtio_init_device_vq(kvm, &bdev->vdev, queue, VIRTIO_BLN_QUEUE_SIZE,
+ page_size, align, pfn);

thread_pool__init_job(&bdev->jobs[vq], kvm, virtio_bln_do_io, queue);
- vring_init(&queue->vring, VIRTIO_BLN_QUEUE_SIZE, p, align);

return 0;
}
diff --git a/virtio/blk.c b/virtio/blk.c
index c485e4fc..8c6e59ba 100644
--- a/virtio/blk.c
+++ b/virtio/blk.c
@@ -178,17 +178,11 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
u32 pfn)
{
struct blk_dev *bdev = dev;
- struct virt_queue *queue;
- void *p;

compat__remove_message(compat_id);

- queue = &bdev->vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);
-
- vring_init(&queue->vring, VIRTIO_BLK_QUEUE_SIZE, p, align);
- virtio_init_device_vq(&bdev->vdev, queue);
+ virtio_init_device_vq(kvm, &bdev->vdev, &bdev->vqs[vq],
+ VIRTIO_BLK_QUEUE_SIZE, page_size, align, pfn);

return 0;
}
diff --git a/virtio/console.c b/virtio/console.c
index f1c0a190..610962c4 100644
--- a/virtio/console.c
+++ b/virtio/console.c
@@ -143,18 +143,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
u32 pfn)
{
struct virt_queue *queue;
- void *p;

BUG_ON(vq >= VIRTIO_CONSOLE_NUM_QUEUES);

compat__remove_message(compat_id);

queue = &cdev.vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);

- vring_init(&queue->vring, VIRTIO_CONSOLE_QUEUE_SIZE, p, align);
- virtio_init_device_vq(&cdev.vdev, queue);
+ virtio_init_device_vq(kvm, &cdev.vdev, queue, VIRTIO_CONSOLE_QUEUE_SIZE,
+ page_size, align, pfn);

if (vq == VIRTIO_CONSOLE_TX_QUEUE) {
thread_pool__init_job(&cdev.jobs[vq], kvm, virtio_console_handle_callback, queue);
diff --git a/virtio/iommu.c b/virtio/iommu.c
index c72e7322..2e5a23ee 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -497,8 +497,6 @@ static void viommu_set_guest_features(struct kvm *kvm, void *dev, u32 features)
static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
u32 align, u32 pfn)
{
- void *ptr;
- struct virt_queue *queue;
struct viommu_dev *viommu = dev;

if (vq != 0)
@@ -506,12 +504,8 @@ static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,

compat__remove_message(compat_id);

- queue = &viommu->vq;
- queue->pfn = pfn;
- ptr = virtio_get_vq(kvm, queue->pfn, page_size);
-
- vring_init(&queue->vring, viommu->queue_size, ptr, align);
- virtio_init_device_vq(&viommu->vdev, queue);
+ virtio_init_device_vq(kvm, &viommu->vdev, &viommu->vq,
+ viommu->queue_size, page_size, align, pfn);

thread_pool__init_job(&viommu->job, kvm, viommu_command, viommu);

diff --git a/virtio/net.c b/virtio/net.c
index 529b4111..957cca09 100644
--- a/virtio/net.c
+++ b/virtio/net.c
@@ -505,17 +505,13 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
struct vhost_vring_addr addr;
struct net_dev *ndev = dev;
struct virt_queue *queue;
- void *p;
int r;

compat__remove_message(compat_id);

queue = &ndev->vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);
-
- vring_init(&queue->vring, VIRTIO_NET_QUEUE_SIZE, p, align);
- virtio_init_device_vq(&ndev->vdev, queue);
+ virtio_init_device_vq(kvm, &ndev->vdev, queue, VIRTIO_NET_QUEUE_SIZE,
+ page_size, align, pfn);

mutex_init(&ndev->io_lock[vq]);
pthread_cond_init(&ndev->io_cond[vq], NULL);
diff --git a/virtio/rng.c b/virtio/rng.c
index 9b9e1283..5f525540 100644
--- a/virtio/rng.c
+++ b/virtio/rng.c
@@ -92,17 +92,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
struct rng_dev *rdev = dev;
struct virt_queue *queue;
struct rng_dev_job *job;
- void *p;

compat__remove_message(compat_id);

queue = &rdev->vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);

job = &rdev->jobs[vq];

- vring_init(&queue->vring, VIRTIO_RNG_QUEUE_SIZE, p, align);
+ virtio_init_device_vq(kvm, &rdev->vdev, queue, VIRTIO_RNG_QUEUE_SIZE,
+ page_size, align, pfn);

*job = (struct rng_dev_job) {
.vq = queue,
diff --git a/virtio/scsi.c b/virtio/scsi.c
index a429ac85..e0fd85f6 100644
--- a/virtio/scsi.c
+++ b/virtio/scsi.c
@@ -57,16 +57,14 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
struct vhost_vring_addr addr;
struct scsi_dev *sdev = dev;
struct virt_queue *queue;
- void *p;
int r;

compat__remove_message(compat_id);

queue = &sdev->vqs[vq];
- queue->pfn = pfn;
- p = virtio_get_vq(kvm, queue->pfn, page_size);

- vring_init(&queue->vring, VIRTIO_SCSI_QUEUE_SIZE, p, align);
+ virtio_init_device_vq(kvm, &sdev->vdev, queue, VIRTIO_SCSI_QUEUE_SIZE,
+ page_size, align, pfn);

if (sdev->vhost_fd == 0)
return 0;
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:48 UTC
Permalink
Virtio devices can now opt in to using an IOMMU, by setting the use_iommu
field. None of this will work in the current state, since virtio devices
still access memory linearly. A subsequent patch implements sg accesses.
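The 44-bit input address size chosen in virtio/core.c below follows from
legacy-MMIO vring addressing: the register holds a 32-bit PFN, which with
4k pages reaches (32 + 12) = 44 bits of guest address space. A quick check
of that arithmetic (demo_max_iova is an illustrative name):

```c
#include <stdint.h>

/* Highest address reachable through a 32-bit vring PFN */
static uint64_t demo_max_iova(uint32_t max_pfn, uint32_t page_shift)
{
	return (((uint64_t)max_pfn + 1) << page_shift) - 1;
}
```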

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/virtio-mmio.h | 1 +
include/kvm/virtio-pci.h | 1 +
include/kvm/virtio.h | 13 ++++++++++++
virtio/core.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++
virtio/mmio.c | 27 ++++++++++++++++++++++++
virtio/pci.c | 26 ++++++++++++++++++++++++
6 files changed, 120 insertions(+)

diff --git a/include/kvm/virtio-mmio.h b/include/kvm/virtio-mmio.h
index 835f421b..c25a4fd7 100644
--- a/include/kvm/virtio-mmio.h
+++ b/include/kvm/virtio-mmio.h
@@ -44,6 +44,7 @@ struct virtio_mmio_hdr {
struct virtio_mmio {
u32 addr;
void *dev;
+ struct virtio_device *vdev;
struct kvm *kvm;
u8 irq;
struct virtio_mmio_hdr hdr;
diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
index b70cadd8..26772f74 100644
--- a/include/kvm/virtio-pci.h
+++ b/include/kvm/virtio-pci.h
@@ -22,6 +22,7 @@ struct virtio_pci {
struct pci_device_header pci_hdr;
struct device_header dev_hdr;
void *dev;
+ struct virtio_device *vdev;
struct kvm *kvm;

u16 port_addr;
diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 24c0c487..9f2ff237 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -9,6 +9,7 @@
#include <linux/types.h>
#include <sys/uio.h>

+#include "kvm/iommu.h"
#include "kvm/kvm.h"

#define VIRTIO_IRQ_LOW 0
@@ -137,10 +138,12 @@ enum virtio_trans {
};

struct virtio_device {
+ bool use_iommu;
bool use_vhost;
void *virtio;
struct virtio_ops *ops;
u16 endian;
+ void *iotlb;
};

struct virtio_ops {
@@ -182,4 +185,14 @@ static inline void virtio_init_device_vq(struct kvm *kvm,
vring_init(&vq->vring, nr_descs, p, align);
}

+/*
+ * These are callbacks for IOMMU operations on virtio devices. They are not
+ * operations on the virtio-iommu device. Confusing, I know.
+ */
+const struct iommu_properties *
+virtio__iommu_get_properties(struct device_header *dev);
+
+int virtio__iommu_attach(void *, struct virtio_device *vdev, int flags);
+int virtio__iommu_detach(void *, struct virtio_device *vdev);
+
#endif /* KVM__VIRTIO_H */
diff --git a/virtio/core.c b/virtio/core.c
index d6ac289d..32bd4ebc 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -6,11 +6,16 @@
#include "kvm/guest_compat.h"
#include "kvm/barrier.h"
#include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
#include "kvm/virtio-pci.h"
#include "kvm/virtio-mmio.h"
#include "kvm/util.h"
#include "kvm/kvm.h"

+static void *iommu = NULL;
+static struct iommu_properties iommu_props = {
+ .name = "viommu-virtio",
+};

const char* virtio_trans_name(enum virtio_trans trans)
{
@@ -198,6 +203,41 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
return false;
}

+const struct iommu_properties *
+virtio__iommu_get_properties(struct device_header *dev)
+{
+ return &iommu_props;
+}
+
+int virtio__iommu_attach(void *priv, struct virtio_device *vdev, int flags)
+{
+ struct virtio_tlb *iotlb = priv;
+
+ if (!iotlb)
+ return -ENOMEM;
+
+ if (vdev->iotlb) {
+ pr_err("device already attached");
+ return -EINVAL;
+ }
+
+ vdev->iotlb = iotlb;
+
+ return 0;
+}
+
+int virtio__iommu_detach(void *priv, struct virtio_device *vdev)
+{
+ if (vdev->iotlb != priv) {
+ pr_err("wrong iotlb"); /* bug */
+ return -EINVAL;
+ }
+
+ vdev->iotlb = NULL;
+
+ return 0;
+}
+
int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
struct virtio_ops *ops, enum virtio_trans trans,
int device_id, int subsys_id, int class)
@@ -233,6 +273,18 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
return -1;
};

+ if (!iommu && vdev->use_iommu) {
+ iommu_props.pgsize_mask = ~(PAGE_SIZE - 1);
+ /*
+ * With legacy MMIO, we only have 32 bits to hold the vring PFN.
+ * This limits the IOVA size to (32 + 12) = 44 bits when using
+ * 4k pages.
+ */
+ iommu_props.input_addr_size = 44;
+ iommu = viommu_register(kvm, &iommu_props);
+ }
+
+
return 0;
}

diff --git a/virtio/mmio.c b/virtio/mmio.c
index 16b44fbb..24a14a71 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -1,4 +1,5 @@
#include "kvm/devices.h"
+#include "kvm/virtio-iommu.h"
#include "kvm/virtio-mmio.h"
#include "kvm/ioeventfd.h"
#include "kvm/iommu.h"
@@ -286,6 +287,30 @@ void virtio_mmio_assign_irq(struct device_header *dev_hdr)
vmmio->irq = irq__alloc_line();
}

+#define mmio_dev_to_virtio(dev_hdr) \
+ container_of(dev_hdr, struct virtio_mmio, dev_hdr)->vdev
+
+static int virtio_mmio_iommu_attach(void *priv, struct device_header *dev_hdr,
+ int flags)
+{
+ return virtio__iommu_attach(priv, mmio_dev_to_virtio(dev_hdr), flags);
+}
+
+static int virtio_mmio_iommu_detach(void *priv, struct device_header *dev_hdr)
+{
+ return virtio__iommu_detach(priv, mmio_dev_to_virtio(dev_hdr));
+}
+
+static struct iommu_ops virtio_mmio_iommu_ops = {
+ .get_properties = virtio__iommu_get_properties,
+ .alloc_address_space = iommu_alloc_address_space,
+ .free_address_space = iommu_free_address_space,
+ .attach = virtio_mmio_iommu_attach,
+ .detach = virtio_mmio_iommu_detach,
+ .map = iommu_map,
+ .unmap = iommu_unmap,
+};
+
int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
int device_id, int subsys_id, int class)
{
@@ -294,6 +319,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
vmmio->addr = virtio_mmio_get_io_space_block(VIRTIO_MMIO_IO_SIZE);
vmmio->kvm = kvm;
vmmio->dev = dev;
+ vmmio->vdev = vdev;

kvm__register_mmio(kvm, vmmio->addr, VIRTIO_MMIO_IO_SIZE,
false, virtio_mmio_mmio_callback, vdev);
@@ -309,6 +335,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
vmmio->dev_hdr = (struct device_header) {
.bus_type = DEVICE_BUS_MMIO,
.data = generate_virtio_mmio_fdt_node,
+ .iommu_ops = vdev->use_iommu ? &virtio_mmio_iommu_ops : NULL,
};

device__register(&vmmio->dev_hdr);
diff --git a/virtio/pci.c b/virtio/pci.c
index b6ef389e..674d5143 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -408,6 +408,30 @@ static void virtio_pci__io_mmio_callback(struct kvm_cpu *vcpu,
kvm__emulate_io(vcpu, port, data, direction, len, 1);
}

+#define pci_dev_to_virtio(dev_hdr) \
+ (container_of(dev_hdr, struct virtio_pci, dev_hdr)->vdev)
+
+static int virtio_pci_iommu_attach(void *priv, struct device_header *dev_hdr,
+ int flags)
+{
+ return virtio__iommu_attach(priv, pci_dev_to_virtio(dev_hdr), flags);
+}
+
+static int virtio_pci_iommu_detach(void *priv, struct device_header *dev_hdr)
+{
+ return virtio__iommu_detach(priv, pci_dev_to_virtio(dev_hdr));
+}
+
+static struct iommu_ops virtio_pci_iommu_ops = {
+ .get_properties = virtio__iommu_get_properties,
+ .alloc_address_space = iommu_alloc_address_space,
+ .free_address_space = iommu_free_address_space,
+ .attach = virtio_pci_iommu_attach,
+ .detach = virtio_pci_iommu_detach,
+ .map = iommu_map,
+ .unmap = iommu_unmap,
+};
+
int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
int device_id, int subsys_id, int class)
{
@@ -416,6 +440,7 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,

vpci->kvm = kvm;
vpci->dev = dev;
+ vpci->vdev = vdev;

r = ioport__register(kvm, IOPORT_EMPTY, &virtio_pci__io_ops, IOPORT_SIZE, vdev);
if (r < 0)
@@ -461,6 +486,7 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
vpci->dev_hdr = (struct device_header) {
.bus_type = DEVICE_BUS_PCI,
.data = &vpci->pci_hdr,
+ .iommu_ops = vdev->use_iommu ? &virtio_pci_iommu_ops : NULL,
};

vpci->pci_hdr.msix.cap = PCI_CAP_ID_MSIX;
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:49 UTC
Permalink
Teach the virtio core how to access scattered vring structures. When
presenting a virtual IOMMU to the guest in front of virtio devices, the
virtio ring and buffers will be scattered across discontiguous guest-
physical pages. The device has to translate all IOVAs to host-virtual
addresses and gather the pages before accessing any structure.

Buffers described by vring.desc are already returned to the device via an
iovec. We simply have to fill them at a finer granularity and hope that:

1. The driver doesn't provide too many descriptors at a time, since the
iovec is only as big as the number of descriptors, and an overflow is now
possible.

2. The device doesn't make assumptions about message framing based on
vectors (i.e. a message can now span more vectors than before). Such
assumptions are forbidden by virtio 1.0 (and by legacy with ANY_LAYOUT),
but our virtio-net, for instance, assumes that the first vector always
contains a full vnet header. In practice this is fine, but it remains
extremely fragile.

For accessing vring and indirect descriptor tables, we now allocate an
iovec describing the IOMMU mappings of the structure, and make all
accesses via this iovec.

***

A more elegant way to do it would be to create a subprocess per
address-space, and remap fragments of guest memory in a contiguous manner:

.---- virtio-blk process
/
viommu process ----+------ virtio-net process
\
'---- some other device

(0) Initially, parent forks for each emulated device. Each child reserves
a large chunk of virtual memory with mmap (base), representing the
IOVA space, but doesn't populate it.
(1) virtio-dev wants to access guest memory, for instance read the vring.
It sends a TLB miss for an IOVA to the parent via pipe or socket.
(2) Parent viommu checks its translation table, and returns an offset in
guest memory.
(3) Child does a mmap in its IOVA space, using the fd that backs guest
memory: mmap(base + iova, pgsize, SHARED|FIXED, fd, offset)

This would be really cool, but I suspect it adds a lot of complexity,
since it's not clear which devices are entirely self-contained and which
need to access parent memory. So we stay with scatter-gather accesses for
now.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/virtio.h | 108 +++++++++++++++++++++++++++++--
virtio/core.c | 179 ++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 252 insertions(+), 35 deletions(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 9f2ff237..cdc960cd 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -29,12 +29,16 @@

struct virt_queue {
struct vring vring;
+ struct iovec *vring_sg;
+ size_t vring_nr_sg;
u32 pfn;
/* The last_avail_idx field is an index to ->ring of struct vring_avail.
It's where we assume the next request index is at. */
u16 last_avail_idx;
u16 last_used_signalled;
u16 endian;
+
+ struct virtio_device *vdev;
};

/*
@@ -96,26 +100,91 @@ static inline __u64 __virtio_h2g_u64(u16 endian, __u64 val)

#endif

+void *virtio_guest_access(struct kvm *kvm, struct virtio_device *vdev,
+ u64 addr, size_t size, size_t *out_size, int prot);
+int virtio_populate_sg(struct kvm *kvm, struct virtio_device *vdev, u64 addr,
+ size_t size, int prot, u16 cur_sg, u16 max_sg,
+ struct iovec iov[]);
+
+/*
+ * Access element in a virtio structure. If @iov is NULL, access is linear and
+ * @ptr represents a Host-Virtual Address (HVA).
+ *
+ * Otherwise, the structure is scattered in the guest-physical space, and is
+ * made virtually-contiguous by the virtual IOMMU. @iov describes the
+ * structure's IOVA->HVA fragments, @base is the IOVA of the structure, and @ptr
+ * an IOVA inside the structure. @max is the number of elements in @iov.
+ *
+ * HVA
+ * IOVA .----> +---+ iov[0].base
+ * @base-> +---+ ----' | |
+ * | | +---+
+ * +---+ ----. : :
+ * | | '----> +---+ iov[1].base
+ * @ptr-> | | | |
+ * +---+ | |--> out
+ * +---+
+ */
+static void *virtio_access_sg(struct iovec *iov, int max, void *base, void *ptr)
+{
+ int i;
+ size_t off = ptr - base;
+
+ if (!iov)
+ return ptr;
+
+ for (i = 0; i < max; i++) {
+ size_t sz = iov[i].iov_len;
+ if (off < sz)
+ return iov[i].iov_base + off;
+ off -= sz;
+ }
+
+ pr_err("virtio_access_sg overflow");
+ return NULL;
+}
+
+/*
+ * We only implement legacy virtio, so the vring is a single
+ * virtually-contiguous structure starting at the descriptor table.
+ * Differentiating the accesses eases a future move to virtio 1.0.
+ */
+#define vring_access_avail(vq, ptr) \
+ virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+#define vring_access_desc(vq, ptr) \
+ virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+#define vring_access_used(vq, ptr) \
+ virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+
static inline u16 virt_queue__pop(struct virt_queue *queue)
{
+ void *ptr;
__u16 guest_idx;

- guest_idx = queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+ ptr = &queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+ guest_idx = *(u16 *)vring_access_avail(queue, ptr);
+
return virtio_guest_to_host_u16(queue, guest_idx);
}

static inline struct vring_desc *virt_queue__get_desc(struct virt_queue *queue, u16 desc_ndx)
{
- return &queue->vring.desc[desc_ndx];
+ return vring_access_desc(queue, &queue->vring.desc[desc_ndx]);
}

static inline bool virt_queue__available(struct virt_queue *vq)
{
+ u16 *evt, *idx;
+
if (!vq->vring.avail)
return 0;

- vring_avail_event(&vq->vring) = virtio_host_to_guest_u16(vq, vq->last_avail_idx);
- return virtio_guest_to_host_u16(vq, vq->vring.avail->idx) != vq->last_avail_idx;
+ /* Disgusting casts under the hood: &(*&used[size]) */
+ evt = vring_access_used(vq, &vring_avail_event(&vq->vring));
+ idx = vring_access_avail(vq, &vq->vring.avail->idx);
+
+ *evt = virtio_host_to_guest_u16(vq, vq->last_avail_idx);
+ return virtio_guest_to_host_u16(vq, *idx) != vq->last_avail_idx;
}

void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump);
@@ -177,10 +246,39 @@ static inline void virtio_init_device_vq(struct kvm *kvm,
struct virt_queue *vq, size_t nr_descs,
u32 page_size, u32 align, u32 pfn)
{
- void *p = guest_flat_to_host(kvm, (u64)pfn * page_size);
+ void *p;

vq->endian = vdev->endian;
vq->pfn = pfn;
+ vq->vdev = vdev;
+ vq->vring_sg = NULL;
+
+ if (vdev->iotlb) {
+ u64 addr = (u64)pfn * page_size;
+ size_t size = vring_size(nr_descs, align);
+ /* Our IOMMU maps at PAGE_SIZE granularity */
+ size_t nr_sg = size / PAGE_SIZE;
+ int flags = IOMMU_PROT_READ | IOMMU_PROT_WRITE;
+
+ vq->vring_sg = calloc(nr_sg, sizeof(struct iovec));
+ if (!vq->vring_sg) {
+ pr_err("could not allocate vring_sg");
+ return; /* Explode later. */
+ }
+
+ vq->vring_nr_sg = virtio_populate_sg(kvm, vdev, addr, size,
+ flags, 0, nr_sg,
+ vq->vring_sg);
+ if (!vq->vring_nr_sg) {
+ pr_err("could not map vring");
+ free(vq->vring_sg);
+ }
+
+ /* vring is described with its IOVA */
+ p = (void *)addr;
+ } else {
+ p = guest_flat_to_host(kvm, (u64)pfn * page_size);
+ }

vring_init(&vq->vring, nr_descs, p, align);
}
diff --git a/virtio/core.c b/virtio/core.c
index 32bd4ebc..ba35e5f1 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -28,7 +28,8 @@ const char* virtio_trans_name(enum virtio_trans trans)

void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
{
- u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
+ u16 *ptr = vring_access_used(queue, &queue->vring.used->idx);
+ u16 idx = virtio_guest_to_host_u16(queue, *ptr);

/*
* Use wmb to assure that used elem was updated with head and len.
@@ -37,7 +38,7 @@ void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
*/
wmb();
idx += jump;
- queue->vring.used->idx = virtio_host_to_guest_u16(queue, idx);
+ *ptr = virtio_host_to_guest_u16(queue, idx);

/*
* Use wmb to assure used idx has been increased before we signal the guest.
@@ -52,10 +53,12 @@ virt_queue__set_used_elem_no_update(struct virt_queue *queue, u32 head,
u32 len, u16 offset)
{
struct vring_used_elem *used_elem;
- u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
+ u16 *ptr = vring_access_used(queue, &queue->vring.used->idx);
+ u16 idx = virtio_guest_to_host_u16(queue, *ptr);

- idx += offset;
- used_elem = &queue->vring.used->ring[idx % queue->vring.num];
+ idx = (idx + offset) % queue->vring.num;
+
+ used_elem = vring_access_used(queue, &queue->vring.used->ring[idx]);
used_elem->id = virtio_host_to_guest_u32(queue, head);
used_elem->len = virtio_host_to_guest_u32(queue, len);

@@ -84,16 +87,17 @@ static inline bool virt_desc__test_flag(struct virt_queue *vq,
* at the end.
*/
static unsigned next_desc(struct virt_queue *vq, struct vring_desc *desc,
- unsigned int i, unsigned int max)
+ unsigned int max)
{
unsigned int next;

/* If this descriptor says it doesn't chain, we're done. */
- if (!virt_desc__test_flag(vq, &desc[i], VRING_DESC_F_NEXT))
+ if (!virt_desc__test_flag(vq, desc, VRING_DESC_F_NEXT))
return max;

+ next = virtio_guest_to_host_u16(vq, desc->next);
/* Check they're not leading us off end of descriptors. */
- next = virtio_guest_to_host_u16(vq, desc[i].next);
+ next = min(next, max);
/* Make sure compiler knows to grab that: we don't want it changing! */
wmb();

@@ -102,32 +106,76 @@ static unsigned next_desc(struct virt_queue *vq, struct vring_desc *desc,

u16 virt_queue__get_head_iov(struct virt_queue *vq, struct iovec iov[], u16 *out, u16 *in, u16 head, struct kvm *kvm)
{
- struct vring_desc *desc;
+ struct vring_desc *desc_base, *desc;
+ bool indirect, is_write;
+ struct iovec *desc_sg;
+ size_t len, nr_sg;
+ u64 addr;
u16 idx;
u16 max;

idx = head;
*out = *in = 0;
max = vq->vring.num;
- desc = vq->vring.desc;
+ desc_base = vq->vring.desc;
+ desc_sg = vq->vring_sg;
+ nr_sg = vq->vring_nr_sg;
+
+ desc = vring_access_desc(vq, &desc_base[idx]);
+ indirect = virt_desc__test_flag(vq, desc, VRING_DESC_F_INDIRECT);
+ if (indirect) {
+ len = virtio_guest_to_host_u32(vq, desc->len);
+ max = len / sizeof(struct vring_desc);
+ addr = virtio_guest_to_host_u64(vq, desc->addr);
+ if (desc_sg) {
+ desc_sg = calloc(len / PAGE_SIZE + 1, sizeof(struct iovec));
+ if (!desc_sg)
+ return 0;
+
+ nr_sg = virtio_populate_sg(kvm, vq->vdev, addr, len,
+ IOMMU_PROT_READ, 0, max,
+ desc_sg);
+ if (!nr_sg) {
+ pr_err("failed to populate indirect table");
+ free(desc_sg);
+ return 0;
+ }
+
+ desc_base = (void *)addr;
+ } else {
+ desc_base = guest_flat_to_host(kvm, addr);
+ }

- if (virt_desc__test_flag(vq, &desc[idx], VRING_DESC_F_INDIRECT)) {
- max = virtio_guest_to_host_u32(vq, desc[idx].len) / sizeof(struct vring_desc);
- desc = guest_flat_to_host(kvm, virtio_guest_to_host_u64(vq, desc[idx].addr));
idx = 0;
}

do {
+ u16 nr_io;
+
+ desc = virtio_access_sg(desc_sg, nr_sg, desc_base, &desc_base[idx]);
+ is_write = virt_desc__test_flag(vq, desc, VRING_DESC_F_WRITE);
+
/* Grab the first descriptor, and check it's OK. */
- iov[*out + *in].iov_len = virtio_guest_to_host_u32(vq, desc[idx].len);
- iov[*out + *in].iov_base = guest_flat_to_host(kvm,
- virtio_guest_to_host_u64(vq, desc[idx].addr));
+ len = virtio_guest_to_host_u32(vq, desc->len);
+ addr = virtio_guest_to_host_u64(vq, desc->addr);
+
+ /*
+ * dodgy assumption alert: device uses vring.desc.num iovecs.
+ * True in practice, but they are not obligated to do so.
+ */
+ nr_io = virtio_populate_sg(kvm, vq->vdev, addr, len, is_write ?
+ IOMMU_PROT_WRITE : IOMMU_PROT_READ,
+ *out + *in, vq->vring.num, iov);
+
/* If this is an input descriptor, increment that count. */
- if (virt_desc__test_flag(vq, &desc[idx], VRING_DESC_F_WRITE))
- (*in)++;
+ if (is_write)
+ (*in) += nr_io;
else
- (*out)++;
- } while ((idx = next_desc(vq, desc, idx, max)) != max);
+ (*out) += nr_io;
+ } while ((idx = next_desc(vq, desc, max)) != max);
+
+ if (indirect && desc_sg)
+ free(desc_sg);

return head;
}
@@ -147,23 +195,35 @@ u16 virt_queue__get_inout_iov(struct kvm *kvm, struct virt_queue *queue,
u16 *in, u16 *out)
{
struct vring_desc *desc;
+ struct iovec *iov;
u16 head, idx;
+ bool is_write;
+ size_t len;
+ u64 addr;
+ int prot;
+ u16 *cur;

idx = head = virt_queue__pop(queue);
*out = *in = 0;
do {
- u64 addr;
desc = virt_queue__get_desc(queue, idx);
+ is_write = virt_desc__test_flag(queue, desc, VRING_DESC_F_WRITE);
+ len = virtio_guest_to_host_u32(queue, desc->len);
addr = virtio_guest_to_host_u64(queue, desc->addr);
- if (virt_desc__test_flag(queue, desc, VRING_DESC_F_WRITE)) {
- in_iov[*in].iov_base = guest_flat_to_host(kvm, addr);
- in_iov[*in].iov_len = virtio_guest_to_host_u32(queue, desc->len);
- (*in)++;
+ if (is_write) {
+ prot = IOMMU_PROT_WRITE;
+ iov = in_iov;
+ cur = in;
} else {
- out_iov[*out].iov_base = guest_flat_to_host(kvm, addr);
- out_iov[*out].iov_len = virtio_guest_to_host_u32(queue, desc->len);
- (*out)++;
+ prot = IOMMU_PROT_READ;
+ iov = out_iov;
+ cur = out;
}
+
+ /* dodgy assumption alert: device uses vring.desc.num iovecs */
+ *cur += virtio_populate_sg(kvm, queue->vdev, addr, len, prot,
+ *cur, queue->vring.num, iov);
+
if (virt_desc__test_flag(queue, desc, VRING_DESC_F_NEXT))
idx = virtio_guest_to_host_u16(queue, desc->next);
else
@@ -191,9 +251,12 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
{
u16 old_idx, new_idx, event_idx;

+ u16 *new_ptr = vring_access_used(vq, &vq->vring.used->idx);
+ u16 *event_ptr = vring_access_avail(vq, &vring_used_event(&vq->vring));
+
old_idx = vq->last_used_signalled;
- new_idx = virtio_guest_to_host_u16(vq, vq->vring.used->idx);
- event_idx = virtio_guest_to_host_u16(vq, vring_used_event(&vq->vring));
+ new_idx = virtio_guest_to_host_u16(vq, *new_ptr);
+ event_idx = virtio_guest_to_host_u16(vq, *event_ptr);

if (vring_need_event(event_idx, new_idx, old_idx)) {
vq->last_used_signalled = new_idx;
@@ -238,6 +301,62 @@ int virtio__iommu_detach(void *priv, struct virtio_device *vdev)
return 0;
}

+void *virtio_guest_access(struct kvm *kvm, struct virtio_device *vdev,
+ u64 addr, size_t size, size_t *out_size, int prot)
+{
+ u64 paddr;
+
+ if (!vdev->iotlb) {
+ *out_size = size;
+ paddr = addr;
+ } else {
+ paddr = iommu_access(vdev->iotlb, addr, size, out_size, prot);
+ }
+
+ return guest_flat_to_host(kvm, paddr);
+}
+
+/*
+ * Fill @iov starting at index @cur_vec with translations of the (@addr, @size)
+ * range. If @vdev doesn't have a TLB, fill a single vector with the
+ * corresponding HVA. Otherwise, fill vectors with IOVA->GPA->HVA translations.
+ * Since the IOVA range may span multiple IOMMU mappings, multiple vectors may
+ * be needed. @nr_vec is the size of the @iov array.
+ */
+int virtio_populate_sg(struct kvm *kvm, struct virtio_device *vdev, u64 addr,
+ size_t size, int prot, u16 cur_vec, u16 nr_vec,
+ struct iovec iov[])
+{
+ void *ptr;
+ int vec = cur_vec;
+ size_t consumed = 0;
+
+ while (size > 0 && vec < nr_vec) {
+ ptr = virtio_guest_access(kvm, vdev, addr, size, &consumed,
+ prot);
+ if (!ptr)
+ break;
+
+ iov[vec].iov_len = consumed;
+ iov[vec].iov_base = ptr;
+
+ size -= consumed;
+ addr += consumed;
+ vec++;
+ }
+
+ if (cur_vec == nr_vec && size)
+ /*
+ * This is bad. Devices used to offer as many iovecs as vring
+ * descriptors, so there was no chance of filling up the array.
+ * But with the IOMMU, buffers may be fragmented and use
+ * multiple iovecs per descriptor.
+ */
+ pr_err("reached end of iovec, incomplete buffer");
+
+ return vec - cur_vec;
+}
+
int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
struct virtio_ops *ops, enum virtio_trans trans,
int device_id, int subsys_id, int class)
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:50 UTC
Permalink
When the virtio device is behind a virtual IOMMU, the doorbell address
written into the MSI-X table by the guest is an IOVA, not a physical one.
When injecting an MSI, KVM needs a physical address to recognize the
doorbell and the associated IRQ chip. Translate the address given by the
guest into a physical one, and store it in a secondary table for easy
access.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/iommu.h | 4 ++++
include/kvm/virtio-pci.h | 1 +
iommu.c | 23 +++++++++++++++++++++++
virtio/pci.c | 33 ++++++++++++++++++++++++---------
4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 4164ba20..8f87ce5a 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -70,4 +70,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags);
u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
int prot);

+struct msi_msg;
+
+int iommu_translate_msi(void *address_space, struct msi_msg *msi);
+
#endif /* KVM_IOMMU_H */
diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
index 26772f74..cb5225d6 100644
--- a/include/kvm/virtio-pci.h
+++ b/include/kvm/virtio-pci.h
@@ -47,6 +47,7 @@ struct virtio_pci {
u32 msix_io_block;
u64 msix_pba;
struct msix_table msix_table[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];
+ struct msi_msg msix_msgs[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];

/* virtio queue */
u16 queue_selector;
diff --git a/iommu.c b/iommu.c
index 0a662404..c10a3f0b 100644
--- a/iommu.c
+++ b/iommu.c
@@ -5,6 +5,7 @@

#include "kvm/iommu.h"
#include "kvm/kvm.h"
+#include "kvm/msi.h"
#include "kvm/mutex.h"
#include "kvm/rbtree-interval.h"

@@ -160,3 +161,25 @@ out_unlock:

return out_addr;
}
+
+int iommu_translate_msi(void *address_space, struct msi_msg *msg)
+{
+ size_t size = 4, out_size;
+ u64 addr = ((u64)msg->address_hi << 32) | msg->address_lo;
+
+ if (!address_space)
+ return 0;
+
+ addr = iommu_access(address_space, addr, size, &out_size,
+ IOMMU_PROT_WRITE);
+
+ if (!addr || out_size != size) {
+ pr_err("could not translate MSI doorbell");
+ return -EFAULT;
+ }
+
+ msg->address_lo = addr & 0xffffffff;
+ msg->address_hi = addr >> 32;
+
+ return 0;
+}
diff --git a/virtio/pci.c b/virtio/pci.c
index 674d5143..88b1a129 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -156,6 +156,7 @@ static void update_msix_map(struct virtio_pci *vpci,
struct msix_table *msix_entry, u32 vecnum)
{
u32 gsi, i;
+ struct msi_msg *msg;

/* Find the GSI number used for that vector */
if (vecnum == vpci->config_vector) {
@@ -172,14 +173,20 @@ static void update_msix_map(struct virtio_pci *vpci,
if (gsi == 0)
return;

- msix_entry = &msix_entry[vecnum];
- irq__update_msix_route(vpci->kvm, gsi, &msix_entry->msg);
+ msg = &vpci->msix_msgs[vecnum];
+ *msg = msix_entry[vecnum].msg;
+
+ if (iommu_translate_msi(vpci->vdev->iotlb, msg))
+ return;
+
+ irq__update_msix_route(vpci->kvm, gsi, msg);
}

static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *vdev, u16 port,
void *data, int size, int offset)
{
struct virtio_pci *vpci = vdev->virtio;
+ struct msi_msg *msg;
u32 config_offset, vec;
int gsi;
int type = virtio__get_dev_specific_field(offset - 20, virtio_pci__msix_enabled(vpci),
@@ -191,8 +198,12 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *v
if (vec == VIRTIO_MSI_NO_VECTOR)
break;

- gsi = irq__add_msix_route(kvm,
- &vpci->msix_table[vec].msg,
+ msg = &vpci->msix_msgs[vec];
+ *msg = vpci->msix_table[vec].msg;
+ if (iommu_translate_msi(vdev->iotlb, msg))
+ break;
+
+ gsi = irq__add_msix_route(kvm, msg,
vpci->dev_hdr.dev_num << 3);
if (gsi >= 0) {
vpci->config_gsi = gsi;
@@ -210,8 +221,12 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *v
if (vec == VIRTIO_MSI_NO_VECTOR)
break;

- gsi = irq__add_msix_route(kvm,
- &vpci->msix_table[vec].msg,
+ msg = &vpci->msix_msgs[vec];
+ *msg = vpci->msix_table[vec].msg;
+ if (iommu_translate_msi(vdev->iotlb, msg))
+ break;
+
+ gsi = irq__add_msix_route(kvm, msg,
vpci->dev_hdr.dev_num << 3);
if (gsi < 0) {
if (gsi == -ENXIO &&
@@ -328,9 +343,9 @@ static void virtio_pci__signal_msi(struct kvm *kvm, struct virtio_pci *vpci,
{
static int needs_devid = 0;
struct kvm_msi msi = {
- .address_lo = vpci->msix_table[vec].msg.address_lo,
- .address_hi = vpci->msix_table[vec].msg.address_hi,
- .data = vpci->msix_table[vec].msg.data,
+ .address_lo = vpci->msix_msgs[vec].address_lo,
+ .address_hi = vpci->msix_msgs[vec].address_hi,
+ .data = vpci->msix_msgs[vec].data,
};

if (needs_devid == 0) {
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:51 UTC
Permalink
Pass the VIRTIO_F_IOMMU_PLATFORM feature to tell the guest when a device is
behind an IOMMU.

Other feature bits in virtio do not depend on the device type and could be
factored the same way. For instance our vring implementation always
supports indirect descriptors (VIRTIO_RING_F_INDIRECT_DESC), so we could
advertise it for all devices at once (only net, scsi and blk at the
moment). However, this might modify guest behaviour: in Linux, whenever the
driver attempts to add a chain of descriptors, it allocates an indirect
table and uses a single ring descriptor, which might slightly reduce
performance. Cowardly ignore this.

VIRTIO_RING_F_EVENT_IDX is another feature of the vring, but that one
needs the device to call virtio_queue__should_signal before signaling to
the guest. Arguably we could factor all calls to signal_vq, but let's keep
this patch simple.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/virtio.h | 2 ++
virtio/core.c | 6 ++++++
virtio/mmio.c | 4 +++-
virtio/pci.c | 1 +
4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index cdc960cd..97bd5bdb 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -293,4 +293,6 @@ virtio__iommu_get_properties(struct device_header *dev);
int virtio__iommu_attach(void *, struct virtio_device *vdev, int flags);
int virtio__iommu_detach(void *, struct virtio_device *vdev);

+u32 virtio_get_common_features(struct kvm *kvm, struct virtio_device *vdev);
+
#endif /* KVM__VIRTIO_H */
diff --git a/virtio/core.c b/virtio/core.c
index ba35e5f1..66e0cecb 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -1,3 +1,4 @@
+#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>
#include <linux/types.h>
#include <sys/uio.h>
@@ -266,6 +267,11 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
return false;
}

+u32 virtio_get_common_features(struct kvm *kvm, struct virtio_device *vdev)
+{
+ return vdev->use_iommu ? VIRTIO_F_IOMMU_PLATFORM : 0;
+}
+
const struct iommu_properties *
virtio__iommu_get_properties(struct device_header *dev)
{
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 24a14a71..699d4403 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -127,9 +127,11 @@ static void virtio_mmio_config_in(struct kvm_cpu *vcpu,
ioport__write32(data, *(u32 *)(((void *)&vmmio->hdr) + addr));
break;
case VIRTIO_MMIO_HOST_FEATURES:
- if (vmmio->hdr.host_features_sel == 0)
+ if (vmmio->hdr.host_features_sel == 0) {
val = vdev->ops->get_host_features(vmmio->kvm,
vmmio->dev);
+ val |= virtio_get_common_features(vmmio->kvm, vdev);
+ }
ioport__write32(data, val);
break;
case VIRTIO_MMIO_QUEUE_PFN:
diff --git a/virtio/pci.c b/virtio/pci.c
index 88b1a129..c9f0e558 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -126,6 +126,7 @@ static bool virtio_pci__io_in(struct ioport *ioport, struct kvm_cpu *vcpu, u16 p
switch (offset) {
case VIRTIO_PCI_HOST_FEATURES:
val = vdev->ops->get_host_features(kvm, vpci->dev);
+ val |= virtio_get_common_features(kvm, vdev);
ioport__write32(data, val);
break;
case VIRTIO_PCI_QUEUE_PFN:
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:52 UTC
Permalink
Currently all passed-through devices must access the same guest-physical
address space. Register an IOMMU to offer individual address spaces to
devices. The way we do this is to allocate one container per group, and
add mappings on demand.

Since the guest cannot access a device unless that device is attached to a
container, and we cannot change containers at runtime without resetting
the device, this implementation is limited. To implement bypass mode, we'd
need to map the whole guest-physical memory first, and unmap everything
when attaching to a new address space. It is also not possible for devices
to share an address space: they all have different page tables.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/iommu.h | 6 ++
include/kvm/vfio.h | 2 +
iommu.c | 7 +-
vfio.c | 281 ++++++++++++++++++++++++++++++++++++++++++++++++----
4 files changed, 273 insertions(+), 23 deletions(-)

diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 8f87ce5a..45a20f3b 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -10,6 +10,12 @@
#define IOMMU_PROT_WRITE 0x2
#define IOMMU_PROT_EXEC 0x4

+/*
+ * Test if a mapping is present. If not, return an error but do not report
+ * it to stderr.
+ */
+#define IOMMU_UNMAP_SILENT 0x1
+
struct iommu_ops {
const struct iommu_properties *(*get_properties)(struct device_header *);

diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
index 71dfa8f7..84126eb9 100644
--- a/include/kvm/vfio.h
+++ b/include/kvm/vfio.h
@@ -55,6 +55,7 @@ struct vfio_device {
struct device_header dev_hdr;

int fd;
+ struct vfio_group *group;
struct vfio_device_info info;
struct vfio_irq_info irq_info;
struct vfio_region *regions;
@@ -65,6 +66,7 @@ struct vfio_device {
struct vfio_group {
unsigned long id; /* iommu_group number in sysfs */
int fd;
+ struct vfio_guest_container *container;
};

int vfio_group_parser(const struct option *opt, const char *arg, int unset);
diff --git a/iommu.c b/iommu.c
index c10a3f0b..2220e4b2 100644
--- a/iommu.c
+++ b/iommu.c
@@ -85,6 +85,7 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
struct rb_int_node *node;
struct iommu_mapping *map;
struct iommu_ioas *ioas = address_space;
+ bool silent = flags & IOMMU_UNMAP_SILENT;

if (!ioas)
return -ENODEV;
@@ -97,7 +98,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
map = container_of(node, struct iommu_mapping, iova_range);

if (node_size > size) {
- pr_debug("cannot split mapping");
+ if (!silent)
+ pr_debug("cannot split mapping");
ret = -EINVAL;
break;
}
@@ -111,7 +113,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
}

if (size && !ret) {
- pr_debug("mapping not found");
+ if (!silent)
+ pr_debug("mapping not found");
ret = -ENXIO;
}
mutex_unlock(&ioas->mutex);
diff --git a/vfio.c b/vfio.c
index f4fd4090..406d0781 100644
--- a/vfio.c
+++ b/vfio.c
@@ -1,10 +1,13 @@
+#include "kvm/iommu.h"
#include "kvm/irq.h"
#include "kvm/kvm.h"
#include "kvm/kvm-cpu.h"
#include "kvm/pci.h"
#include "kvm/util.h"
#include "kvm/vfio.h"
+#include "kvm/virtio-iommu.h"

+#include <linux/bitops.h>
#include <linux/kvm.h>
#include <linux/pci_regs.h>

@@ -25,7 +28,16 @@ struct vfio_irq_eventfd {
int fd;
};

-static int vfio_container;
+struct vfio_guest_container {
+ struct kvm *kvm;
+ int fd;
+
+ void *msi_doorbells;
+};
+
+static void *viommu = NULL;
+
+static int vfio_host_container;

int vfio_group_parser(const struct option *opt, const char *arg, int unset)
{
@@ -43,6 +55,7 @@ int vfio_group_parser(const struct option *opt, const char *arg, int unset)

cur = strtok(buf, ",");
group->id = strtoul(cur, NULL, 0);
+ group->container = NULL;

kvm->cfg.num_vfio_groups = ++idx;
free(buf);
@@ -68,11 +81,13 @@ static void vfio_pci_msix_pba_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
u32 len, u8 is_write, void *ptr)
{
+ struct msi_msg msg;
struct kvm *kvm = vcpu->kvm;
struct vfio_pci_device *pdev = ptr;
struct vfio_pci_msix_entry *entry;
struct vfio_pci_msix_table *table = &pdev->msix_table;
struct vfio_device *device = container_of(pdev, struct vfio_device, pci);
+ struct vfio_guest_container *container = device->group->container;

u64 offset = addr - table->guest_phys_addr;

@@ -88,11 +103,16 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,

memcpy((void *)&entry->config + field, data, len);

- if (field != PCI_MSIX_ENTRY_VECTOR_CTRL)
+ if (field != PCI_MSIX_ENTRY_VECTOR_CTRL || entry->config.ctrl & 1)
+ return;
+
+ msg = entry->config.msg;
+
+ if (container && iommu_translate_msi(container->msi_doorbells, &msg))
return;

if (entry->gsi < 0) {
- int ret = irq__add_msix_route(kvm, &entry->config.msg,
+ int ret = irq__add_msix_route(kvm, &msg,
device->dev_hdr.dev_num << 3);
if (ret < 0) {
pr_err("cannot create MSI-X route");
@@ -111,7 +131,7 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
return;
}

- irq__update_msix_route(kvm, entry->gsi, &entry->config.msg);
+ irq__update_msix_route(kvm, entry->gsi, &msg);
}

static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
@@ -122,6 +142,7 @@ static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
struct msi_msg msi;
struct vfio_pci_msix_entry *entry;
struct vfio_pci_device *pdev = &device->pci;
+ struct vfio_guest_container *container = device->group->container;
struct msi_cap_64 *msi_cap_64 = (void *)&pdev->hdr + pdev->msi.pos;

/* Only modify routes when guest sets the enable bit */
@@ -144,6 +165,9 @@ static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
msi.data = msi_cap_32->data;
}

+ if (container && iommu_translate_msi(container->msi_doorbells, &msi))
+ return;
+
for (i = 0; i < nr_vectors; i++) {
u32 devid = device->dev_hdr.dev_num << 3;

@@ -870,6 +894,154 @@ static int vfio_configure_dev_irqs(struct kvm *kvm, struct vfio_device *device)
return ret;
}

+static struct iommu_properties vfio_viommu_props = {
+ .name = "viommu-vfio",
+
+ .input_addr_size = 64,
+};
+
+static const struct iommu_properties *
+vfio_viommu_get_properties(struct device_header *dev)
+{
+ return &vfio_viommu_props;
+}
+
+static void *vfio_viommu_alloc(struct device_header *dev_hdr)
+{
+ struct vfio_device *vdev = container_of(dev_hdr, struct vfio_device,
+ dev_hdr);
+ struct vfio_guest_container *container = vdev->group->container;
+
+ container->msi_doorbells = iommu_alloc_address_space(NULL);
+ if (!container->msi_doorbells) {
+ pr_err("Failed to create MSI address space");
+ return NULL;
+ }
+
+ return container;
+}
+
+static void vfio_viommu_free(void *priv)
+{
+ struct vfio_guest_container *container = priv;
+
+ /* Half the address space */
+ size_t size = 1UL << (BITS_PER_LONG - 1);
+ unsigned long virt_addr = 0;
+ int i;
+
+ /*
+ * Remove all mappings in two passes, since 2^64 doesn't fit in
+ * unmap.size
+ */
+ for (i = 0; i < 2; i++, virt_addr += size) {
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .iova = virt_addr,
+ .size = size,
+ };
+
+ ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
+ }
+
+ iommu_free_address_space(container->msi_doorbells);
+ container->msi_doorbells = NULL;
+}
+
+static int vfio_viommu_attach(void *priv, struct device_header *dev_hdr, int flags)
+{
+ struct vfio_guest_container *container = priv;
+ struct vfio_device *vdev = container_of(dev_hdr, struct vfio_device,
+ dev_hdr);
+
+ if (!container)
+ return -ENODEV;
+
+ if (container->fd != vdev->group->container->fd)
+ /*
+ * TODO: We don't support multiple devices in the same address
+ * space at the moment. It should be easy to implement, just
+ * create an address space structure that holds multiple
+ * container fds and multiplex map/unmap requests.
+ */
+ return -EINVAL;
+
+ return 0;
+}
+
+static int vfio_viommu_detach(void *priv, struct device_header *dev_hdr)
+{
+ return 0;
+}
+
+static int vfio_viommu_map(void *priv, u64 virt_addr, u64 phys_addr, u64 size,
+ int prot)
+{
+ int ret;
+ struct vfio_guest_container *container = priv;
+ struct vfio_iommu_type1_dma_map map = {
+ .argsz = sizeof(map),
+ .iova = virt_addr,
+ .size = size,
+ };
+
+ map.vaddr = (u64)guest_flat_to_host(container->kvm, phys_addr);
+ if (!map.vaddr) {
+ if (irq__addr_is_msi_doorbell(container->kvm, phys_addr)) {
+ ret = iommu_map(container->msi_doorbells, virt_addr,
+ phys_addr, size, prot);
+ if (ret) {
+ pr_err("could not map MSI");
+ return ret;
+ }
+
+ /* TODO: silence guest_flat_to_host */
+ pr_info("Nevermind, all is well. Mapped MSI %llx->%llx",
+ virt_addr, phys_addr);
+ return 0;
+ } else {
+ return -ERANGE;
+ }
+ }
+
+ if (prot & IOMMU_PROT_READ)
+ map.flags |= VFIO_DMA_MAP_FLAG_READ;
+
+ if (prot & IOMMU_PROT_WRITE)
+ map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+
+ if (prot & IOMMU_PROT_EXEC) {
+ pr_err("VFIO does not support PROT_EXEC");
+ return -ENOSYS;
+ }
+
+ return ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map);
+}
+
+static int vfio_viommu_unmap(void *priv, u64 virt_addr, u64 size, int flags)
+{
+ struct vfio_guest_container *container = priv;
+ struct vfio_iommu_type1_dma_unmap unmap = {
+ .argsz = sizeof(unmap),
+ .iova = virt_addr,
+ .size = size,
+ };
+
+ if (!iommu_unmap(container->msi_doorbells, virt_addr, size,
+ flags | IOMMU_UNMAP_SILENT))
+ return 0;
+
+ return ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
+}
+
+static struct iommu_ops vfio_iommu_ops = {
+ .get_properties = vfio_viommu_get_properties,
+ .alloc_address_space = vfio_viommu_alloc,
+ .free_address_space = vfio_viommu_free,
+ .attach = vfio_viommu_attach,
+ .detach = vfio_viommu_detach,
+ .map = vfio_viommu_map,
+ .unmap = vfio_viommu_unmap,
+};
+
static int vfio_configure_reserved_regions(struct kvm *kvm,
struct vfio_group *group)
{
@@ -912,6 +1084,8 @@ static int vfio_configure_device(struct kvm *kvm, struct vfio_group *group,
return -ENOMEM;
}

+ device->group = group;
+
device->fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, dirent->d_name);
if (device->fd < 0) {
pr_err("Failed to get FD for device %s in group %lu",
@@ -945,6 +1119,7 @@ static int vfio_configure_device(struct kvm *kvm, struct vfio_group *group,
device->dev_hdr = (struct device_header) {
.bus_type = DEVICE_BUS_PCI,
.data = &device->pci.hdr,
+ .iommu_ops = viommu ? &vfio_iommu_ops : NULL,
};

ret = device__register(&device->dev_hdr);
@@ -1009,13 +1184,13 @@ static int vfio_configure_iommu_groups(struct kvm *kvm)
/* TODO: this should be an arch callback, so arm can return HYP only if vsmmu */
static int vfio_get_iommu_type(void)
{
- if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU))
+ if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU))
return VFIO_TYPE1_NESTING_IOMMU;

- if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
+ if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
return VFIO_TYPE1v2_IOMMU;

- if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
+ if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
return VFIO_TYPE1_IOMMU;

return -ENODEV;
@@ -1033,7 +1208,7 @@ static int vfio_map_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void *d
};

/* Map the guest memory for DMA (i.e. provide isolation) */
- if (ioctl(vfio_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+ if (ioctl(vfio_host_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
ret = -errno;
pr_err("Failed to map 0x%llx -> 0x%llx (%llu) for DMA",
dma_map.iova, dma_map.vaddr, dma_map.size);
@@ -1050,14 +1225,15 @@ static int vfio_unmap_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void
.iova = bank->guest_phys_addr,
};

- ioctl(vfio_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+ ioctl(vfio_host_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);

return 0;
}

static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
{
- int ret;
+ int ret = 0;
+ int container;
char group_node[VFIO_PATH_MAX_LEN];
struct vfio_group_status group_status = {
.argsz = sizeof(group_status),
@@ -1066,6 +1242,25 @@ static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
snprintf(group_node, VFIO_PATH_MAX_LEN, VFIO_DEV_DIR "/%lu",
group->id);

+ if (kvm->cfg.viommu) {
+ container = open(VFIO_DEV_NODE, O_RDWR);
+ if (container < 0) {
+ ret = -errno;
+ pr_err("cannot initialize private container");
+ return ret;
+ }
+
+ group->container = malloc(sizeof(struct vfio_guest_container));
+ if (!group->container) {
+ close(container);
+ return -ENOMEM;
+ }
+
+ group->container->fd = container;
+ group->container->kvm = kvm;
+ group->container->msi_doorbells = NULL;
+ } else {
+ container = vfio_host_container;
+ }
+
group->fd = open(group_node, O_RDWR);
if (group->fd == -1) {
ret = -errno;
@@ -1085,29 +1280,52 @@ static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
return -EINVAL;
}

- if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &vfio_container)) {
+ if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container)) {
ret = -errno;
pr_err("Failed to add IOMMU group %s to VFIO container",
group_node);
return ret;
}

- return 0;
+ if (container != vfio_host_container) {
+ struct vfio_iommu_type1_info info = {
+ .argsz = sizeof(info),
+ };
+
+ /* We really need v2 semantics for unmap-all */
+ ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
+ if (ret) {
+ ret = -errno;
+ pr_err("Failed to set IOMMU");
+ return ret;
+ }
+
+ ret = ioctl(container, VFIO_IOMMU_GET_INFO, &info);
+ if (ret)
+ pr_err("Failed to get IOMMU info");
+ else if (info.flags & VFIO_IOMMU_INFO_PGSIZES)
+ vfio_viommu_props.pgsize_mask = info.iova_pgsizes;
+ }
+
+ return ret;
}

-static int vfio_container_init(struct kvm *kvm)
+static int vfio_groups_init(struct kvm *kvm)
{
int api, i, ret, iommu_type;

- /* Create a container for our IOMMU groups */
- vfio_container = open(VFIO_DEV_NODE, O_RDWR);
- if (vfio_container == -1) {
+ /*
+ * Create a container for our IOMMU groups. Even when using a viommu, we
+ * still use this one for probing capabilities.
+ */
+ vfio_host_container = open(VFIO_DEV_NODE, O_RDWR);
+ if (vfio_host_container == -1) {
ret = errno;
pr_err("Failed to open %s", VFIO_DEV_NODE);
return ret;
}

- api = ioctl(vfio_container, VFIO_GET_API_VERSION);
+ api = ioctl(vfio_host_container, VFIO_GET_API_VERSION);
if (api != VFIO_API_VERSION) {
pr_err("Unknown VFIO API version %d", api);
return -ENODEV;
@@ -1119,15 +1337,20 @@ static int vfio_container_init(struct kvm *kvm)
return iommu_type;
}

- /* Sanity check our groups and add them to the container */
for (i = 0; i < kvm->cfg.num_vfio_groups; ++i) {
ret = vfio_group_init(kvm, &kvm->cfg.vfio_group[i]);
if (ret)
return ret;
}

+ if (kvm->cfg.viommu) {
+ close(vfio_host_container);
+ vfio_host_container = -1;
+ return 0;
+ }
+
/* Finalise the container */
- if (ioctl(vfio_container, VFIO_SET_IOMMU, iommu_type)) {
+ if (ioctl(vfio_host_container, VFIO_SET_IOMMU, iommu_type)) {
ret = -errno;
pr_err("Failed to set IOMMU type %d for VFIO container",
iommu_type);
@@ -1147,10 +1370,16 @@ static int vfio__init(struct kvm *kvm)
if (!kvm->cfg.num_vfio_groups)
return 0;

- ret = vfio_container_init(kvm);
+ ret = vfio_groups_init(kvm);
if (ret)
return ret;

+ if (kvm->cfg.viommu) {
+ viommu = viommu_register(kvm, &vfio_viommu_props);
+ if (!viommu)
+ pr_err("could not register viommu");
+ }
+
ret = vfio_configure_iommu_groups(kvm);
if (ret)
return ret;
@@ -1162,17 +1391,27 @@ dev_base_init(vfio__init);
static int vfio__exit(struct kvm *kvm)
{
int i, fd;
+ struct vfio_guest_container *container;

if (!kvm->cfg.num_vfio_groups)
return 0;

for (i = 0; i < kvm->cfg.num_vfio_groups; ++i) {
+ container = kvm->cfg.vfio_group[i].container;
fd = kvm->cfg.vfio_group[i].fd;
ioctl(fd, VFIO_GROUP_UNSET_CONTAINER);
close(fd);
+
+ if (container != NULL) {
+ close(container->fd);
+ free(container);
+ }
}

+ if (vfio_host_container == -1)
+ return 0;
+
kvm__for_each_mem_bank(kvm, KVM_MEM_TYPE_RAM, vfio_unmap_mem_bank, NULL);
- return close(vfio_container);
+ return close(vfio_host_container);
}
dev_base_exit(vfio__exit);
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:53 UTC
Add a new parameter to lkvm debug, '-i' or '--iommu'. Commands will be
added later. For the moment, rework the debug builtin to share dump
facilities with the '-d'/'--dump' parameter.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
builtin-debug.c | 8 +++++++-
include/kvm/builtin-debug.h | 6 ++++++
include/kvm/iommu.h | 5 +++++
include/kvm/virtio-iommu.h | 5 +++++
kvm-ipc.c | 43 ++++++++++++++++++++++++-------------------
virtio/iommu.c | 14 ++++++++++++++
6 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/builtin-debug.c b/builtin-debug.c
index 4ae51d20..e39e2d09 100644
--- a/builtin-debug.c
+++ b/builtin-debug.c
@@ -5,6 +5,7 @@
#include <kvm/parse-options.h>
#include <kvm/kvm-ipc.h>
#include <kvm/read-write.h>
+#include <kvm/virtio-iommu.h>

#include <stdio.h>
#include <string.h>
@@ -17,6 +18,7 @@ static int nmi = -1;
static bool dump;
static const char *instance_name;
static const char *sysrq;
+static const char *iommu;

static const char * const debug_usage[] = {
"lkvm debug [--all] [-n name] [-d] [-m vcpu]",
@@ -28,6 +30,7 @@ static const struct option debug_options[] = {
OPT_BOOLEAN('d', "dump", &dump, "Generate a debug dump from guest"),
OPT_INTEGER('m', "nmi", &nmi, "Generate NMI on VCPU"),
OPT_STRING('s', "sysrq", &sysrq, "sysrq", "Inject a sysrq"),
+ OPT_STRING('i', "iommu", &iommu, "params", "Debug virtual IOMMU"),
OPT_GROUP("Instance options:"),
OPT_BOOLEAN('a', "all", &all, "Debug all instances"),
OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
@@ -68,11 +71,14 @@ static int do_debug(const char *name, int sock)
cmd.sysrq = sysrq[0];
}

+ if (iommu && !viommu_parse_debug_string(iommu, &cmd.iommu))
+ cmd.dbg_type |= KVM_DEBUG_CMD_TYPE_IOMMU;
+
r = kvm_ipc__send_msg(sock, KVM_IPC_DEBUG, sizeof(cmd), (u8 *)&cmd);
if (r < 0)
return r;

- if (!dump)
+ if (!(cmd.dbg_type & KVM_DEBUG_CMD_DUMP_MASK))
return 0;

do {
diff --git a/include/kvm/builtin-debug.h b/include/kvm/builtin-debug.h
index efa02684..cd2155ae 100644
--- a/include/kvm/builtin-debug.h
+++ b/include/kvm/builtin-debug.h
@@ -2,16 +2,22 @@
#define KVM__DEBUG_H

#include <kvm/util.h>
+#include <kvm/iommu.h>
#include <linux/types.h>

#define KVM_DEBUG_CMD_TYPE_DUMP (1 << 0)
#define KVM_DEBUG_CMD_TYPE_NMI (1 << 1)
#define KVM_DEBUG_CMD_TYPE_SYSRQ (1 << 2)
+#define KVM_DEBUG_CMD_TYPE_IOMMU (1 << 3)
+
+#define KVM_DEBUG_CMD_DUMP_MASK \
+ (KVM_DEBUG_CMD_TYPE_IOMMU | KVM_DEBUG_CMD_TYPE_DUMP)

struct debug_cmd_params {
u32 dbg_type;
u32 cpu;
char sysrq;
+ struct iommu_debug_params iommu;
};

int kvm_cmd_debug(int argc, const char **argv, const char *prefix);
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 45a20f3b..60857fa5 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -1,6 +1,7 @@
#ifndef KVM_IOMMU_H
#define KVM_IOMMU_H

+#include <stdbool.h>
#include <stdlib.h>

#include "devices.h"
@@ -10,6 +11,10 @@
#define IOMMU_PROT_WRITE 0x2
#define IOMMU_PROT_EXEC 0x4

+struct iommu_debug_params {
+ bool print_enabled;
+};
+
/*
* Test if mapping is present. If not, return an error but do not report it to
* stderr
diff --git a/include/kvm/virtio-iommu.h b/include/kvm/virtio-iommu.h
index 5532c82b..c9e36fb6 100644
--- a/include/kvm/virtio-iommu.h
+++ b/include/kvm/virtio-iommu.h
@@ -7,4 +7,9 @@ const struct iommu_properties *viommu_get_properties(void *dev);
void *viommu_register(struct kvm *kvm, struct iommu_properties *props);
void viommu_unregister(struct kvm *kvm, void *cookie);

+struct iommu_debug_params;
+
+int viommu_parse_debug_string(const char *options, struct iommu_debug_params *);
+int viommu_debug(int fd, struct iommu_debug_params *);
+
#endif
diff --git a/kvm-ipc.c b/kvm-ipc.c
index e07ad105..a8b56543 100644
--- a/kvm-ipc.c
+++ b/kvm-ipc.c
@@ -14,6 +14,7 @@
#include "kvm/strbuf.h"
#include "kvm/kvm-cpu.h"
#include "kvm/8250-serial.h"
+#include "kvm/virtio-iommu.h"

struct kvm_ipc_head {
u32 type;
@@ -424,31 +425,35 @@ static void handle_debug(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
pthread_kill(kvm->cpus[vcpu]->thread, SIGUSR1);
}

- if (!(dbg_type & KVM_DEBUG_CMD_TYPE_DUMP))
- return;
+ if (dbg_type & KVM_DEBUG_CMD_TYPE_IOMMU)
+ viommu_debug(fd, &params->iommu);

- for (i = 0; i < kvm->nrcpus; i++) {
- struct kvm_cpu *cpu = kvm->cpus[i];
+ if (dbg_type & KVM_DEBUG_CMD_TYPE_DUMP) {
+ for (i = 0; i < kvm->nrcpus; i++) {
+ struct kvm_cpu *cpu = kvm->cpus[i];

- if (!cpu)
- continue;
+ if (!cpu)
+ continue;

- printout_done = 0;
+ printout_done = 0;
+
+ kvm_cpu__set_debug_fd(fd);
+ pthread_kill(cpu->thread, SIGUSR1);
+ /*
+ * Wait for the vCPU to dump state before signalling
+ * the next thread. Since this is debug code it does
+ * not matter that we are burning CPU time a bit:
+ */
+ while (!printout_done)
+ sleep(0);
+ }

- kvm_cpu__set_debug_fd(fd);
- pthread_kill(cpu->thread, SIGUSR1);
- /*
- * Wait for the vCPU to dump state before signalling
- * the next thread. Since this is debug code it does
- * not matter that we are burning CPU time a bit:
- */
- while (!printout_done)
- sleep(0);
+ serial8250__inject_sysrq(kvm, 'p');
}

- close(fd);
-
- serial8250__inject_sysrq(kvm, 'p');
+ if (dbg_type & KVM_DEBUG_CMD_DUMP_MASK)
+ /* builtin-debug is reading, signal EOT */
+ close(fd);
}

int kvm_ipc__init(struct kvm *kvm)
diff --git a/virtio/iommu.c b/virtio/iommu.c
index 2e5a23ee..5973cef1 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -620,3 +620,17 @@ void viommu_unregister(struct kvm *kvm, void *viommu)
{
free(viommu);
}
+
+int viommu_parse_debug_string(const char *cmdline, struct iommu_debug_params *params)
+{
+ /* show instances numbers */
+ /* send command to instance */
+ /* - dump mappings */
+ /* - statistics */
+ return -ENOSYS;
+}
+
+int viommu_debug(int sock, struct iommu_debug_params *params)
+{
+ return -ENOSYS;
+}
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:54 UTC
Using debug printf with the virtual IOMMU can be extremely verbose. To
ease debugging, add a few commands that can be sent via IPC. The command
format is "cmd [iommu [address_space]]" (or "cmd:[iommu:[address_space]]")

$ lkvm debug -a -i list
iommu 0 "viommu-vfio"
ioas 1
device 0x2 # PCI bus
ioas 2
device 0x3
iommu 1 "viommu-virtio"
ioas 3
device 0x10003 # MMIO bus
ioas 4
device 0x6

$ lkvm debug -a -i stats:0 # stats for viommu-vfio
iommu 0 "viommu-virtio"
kicks 510 # virtio kicks from driver
requests 510 # requests received
ioas 3
maps 1 # number of map requests
unmaps 0 # " unmap "
resident 8192 # bytes currently mapped
accesses 1 # number of device accesses
ioas 4
maps 290
unmaps 4
resident 1335296
accesses 982

$ lkvm debug -a -i "print 1, 2" # Start debug print for
... # ioas 2 in iommu 1
...
Info: VIOMMU map 0xffffffff000 -> 0x8f4e0000 (4096) to IOAS 2
...
$ lkvm debug -a -i noprint # Stop all debug print

We don't use atomics for statistics at the moment, since there is no
concurrent write on most of them. Only 'accesses' might be incremented
concurrently, so we might get imprecise values.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
include/kvm/iommu.h | 17 +++
iommu.c | 56 +++++++++-
virtio/iommu.c | 312 ++++++++++++++++++++++++++++++++++++++++++++++++----
virtio/mmio.c | 1 +
virtio/pci.c | 1 +
5 files changed, 362 insertions(+), 25 deletions(-)

diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 60857fa5..70a09306 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -11,7 +11,20 @@
#define IOMMU_PROT_WRITE 0x2
#define IOMMU_PROT_EXEC 0x4

+enum iommu_debug_action {
+ IOMMU_DEBUG_LIST,
+ IOMMU_DEBUG_STATS,
+ IOMMU_DEBUG_SET_PRINT,
+ IOMMU_DEBUG_DUMP,
+
+ IOMMU_DEBUG_NUM_ACTIONS,
+};
+
+#define IOMMU_DEBUG_SELECTOR_INVALID ((unsigned int)-1)
+
struct iommu_debug_params {
+ enum iommu_debug_action action;
+ unsigned int selector[2];
bool print_enabled;
};

@@ -31,6 +44,8 @@ struct iommu_ops {
int (*detach)(void *, struct device_header *);
int (*map)(void *, u64 virt_addr, u64 phys_addr, u64 size, int prot);
int (*unmap)(void *, u64 virt_addr, u64 size, int flags);
+
+ int (*debug_address_space)(void *, int fd, struct iommu_debug_params *);
};

struct iommu_properties {
@@ -74,6 +89,8 @@ static inline struct device_header *iommu_get_device(u32 device_id)

void *iommu_alloc_address_space(struct device_header *dev);
void iommu_free_address_space(void *address_space);
+int iommu_debug_address_space(void *address_space, int fd,
+ struct iommu_debug_params *params);

int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr, u64 size,
int prot);
diff --git a/iommu.c b/iommu.c
index 2220e4b2..bc9fc631 100644
--- a/iommu.c
+++ b/iommu.c
@@ -9,6 +9,10 @@
#include "kvm/mutex.h"
#include "kvm/rbtree-interval.h"

+struct iommu_ioas_stats {
+ u64 accesses;
+};
+
struct iommu_mapping {
struct rb_int_node iova_range;
u64 phys;
@@ -18,8 +22,31 @@ struct iommu_mapping {
struct iommu_ioas {
struct rb_root mappings;
struct mutex mutex;
+
+ struct iommu_ioas_stats stats;
+ bool debug_enabled;
};

+static void iommu_dump(struct iommu_ioas *ioas, int fd)
+{
+ struct rb_node *node;
+ struct iommu_mapping *map;
+
+ mutex_lock(&ioas->mutex);
+
+ dprintf(fd, "START IOMMU DUMP [[[\n"); /* You did ask for it. */
+ for (node = rb_first(&ioas->mappings); node; node = rb_next(node)) {
+ struct rb_int_node *int_node = rb_int(node);
+ map = container_of(int_node, struct iommu_mapping, iova_range);
+
+ dprintf(fd, "%#llx-%#llx -> %#llx %#x\n", int_node->low,
+ int_node->high, map->phys, map->prot);
+ }
+ dprintf(fd, "]]] END IOMMU DUMP\n");
+
+ mutex_unlock(&ioas->mutex);
+}
+
void *iommu_alloc_address_space(struct device_header *unused)
{
struct iommu_ioas *ioas = calloc(1, sizeof(*ioas));
@@ -33,6 +60,27 @@ void *iommu_alloc_address_space(struct device_header *unused)
return ioas;
}

+int iommu_debug_address_space(void *address_space, int fd,
+ struct iommu_debug_params *params)
+{
+ struct iommu_ioas *ioas = address_space;
+
+ switch (params->action) {
+ case IOMMU_DEBUG_STATS:
+ dprintf(fd, " accesses %llu\n", ioas->stats.accesses);
+ break;
+ case IOMMU_DEBUG_SET_PRINT:
+ ioas->debug_enabled = params->print_enabled;
+ break;
+ case IOMMU_DEBUG_DUMP:
+ iommu_dump(ioas, fd);
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
void iommu_free_address_space(void *address_space)
{
struct iommu_ioas *ioas = address_space;
@@ -157,8 +205,12 @@ u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
out_addr = map->phys + (addr - node->low);
*out_size = min_t(size_t, node->high - addr + 1, size);

- pr_debug("access %llx %zu/%zu %x -> %#llx", addr, *out_size, size,
- prot, out_addr);
+ if (ioas->debug_enabled)
+ pr_info("access %llx %zu/%zu %s%s -> %#llx", addr, *out_size,
+ size, prot & IOMMU_PROT_READ ? "R" : "",
+ prot & IOMMU_PROT_WRITE ? "W" : "", out_addr);
+
+ ioas->stats.accesses++;
out_unlock:
mutex_unlock(&ioas->mutex);

diff --git a/virtio/iommu.c b/virtio/iommu.c
index 5973cef1..153b537a 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -20,6 +20,17 @@
/* Max size */
#define VIOMMU_DEFAULT_QUEUE_SIZE 256

+struct viommu_ioas_stats {
+ u64 map;
+ u64 unmap;
+ u64 resident;
+};
+
+struct viommu_stats {
+ u64 kicks;
+ u64 requests;
+};
+
struct viommu_endpoint {
struct device_header *dev;
struct viommu_ioas *ioas;
@@ -36,9 +47,14 @@ struct viommu_ioas {

struct iommu_ops *ops;
void *priv;
+
+ bool debug_enabled;
+ struct viommu_ioas_stats stats;
};

struct viommu_dev {
+ u32 id;
+
struct virtio_device vdev;
struct virtio_iommu_config config;

@@ -49,29 +65,77 @@ struct viommu_dev {
struct thread_pool__job job;

struct rb_root address_spaces;
+ struct mutex address_spaces_mutex;
struct kvm *kvm;
+
+ struct list_head list;
+
+ bool debug_enabled;
+ struct viommu_stats stats;
};

static int compat_id = -1;

+static long long viommu_ids;
+static LIST_HEAD(viommus);
+static DEFINE_MUTEX(viommus_mutex);
+
+#define ioas_debug(ioas, fmt, ...) \
+ do { \
+ if ((ioas)->debug_enabled) \
+ pr_info("ioas[%d] " fmt, (ioas)->id, ##__VA_ARGS__); \
+ } while (0)
+
static struct viommu_ioas *viommu_find_ioas(struct viommu_dev *viommu,
u32 ioasid)
{
struct rb_node *node;
- struct viommu_ioas *ioas;
+ struct viommu_ioas *ioas, *found = NULL;

+ mutex_lock(&viommu->address_spaces_mutex);
node = viommu->address_spaces.rb_node;
while (node) {
ioas = container_of(node, struct viommu_ioas, node);
- if (ioas->id > ioasid)
+ if (ioas->id > ioasid) {
node = node->rb_left;
- else if (ioas->id < ioasid)
+ } else if (ioas->id < ioasid) {
node = node->rb_right;
- else
- return ioas;
+ } else {
+ found = ioas;
+ break;
+ }
}
+ mutex_unlock(&viommu->address_spaces_mutex);

- return NULL;
+ return found;
+}
+
+static int viommu_for_each_ioas(struct viommu_dev *viommu,
+ int (*fun)(struct viommu_dev *viommu,
+ struct viommu_ioas *ioas,
+ void *data),
+ void *data)
+{
+ int ret = 0;
+ struct viommu_ioas *ioas;
+ struct rb_node *node, *next;
+
+ mutex_lock(&viommu->address_spaces_mutex);
+ node = rb_first(&viommu->address_spaces);
+ while (node) {
+ next = rb_next(node);
+ ioas = container_of(node, struct viommu_ioas, node);
+
+ ret = fun(viommu, ioas, data);
+ if (ret)
+ break;
+
+ node = next;
+ }
+
+ mutex_unlock(&viommu->address_spaces_mutex);
+
+ return ret;
}

static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
@@ -99,9 +163,12 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
new_ioas->id = ioasid;
new_ioas->ops = ops;
new_ioas->priv = ops->alloc_address_space(device);
+ new_ioas->debug_enabled = viommu->debug_enabled;

/* A NULL priv pointer is valid. */

+ mutex_lock(&viommu->address_spaces_mutex);
+
node = &viommu->address_spaces.rb_node;
while (*node) {
ioas = container_of(*node, struct viommu_ioas, node);
@@ -114,6 +181,7 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
} else {
pr_err("IOAS exists!");
free(new_ioas);
+ mutex_unlock(&viommu->address_spaces_mutex);
return NULL;
}
}
@@ -121,6 +189,8 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
rb_link_node(&new_ioas->node, parent, node);
rb_insert_color(&new_ioas->node, &viommu->address_spaces);

+ mutex_unlock(&viommu->address_spaces_mutex);
+
return new_ioas;
}

@@ -130,7 +200,9 @@ static void viommu_free_ioas(struct viommu_dev *viommu,
if (ioas->priv)
ioas->ops->free_address_space(ioas->priv);

+ mutex_lock(&viommu->address_spaces_mutex);
rb_erase(&ioas->node, &viommu->address_spaces);
+ mutex_unlock(&viommu->address_spaces_mutex);
free(ioas);
}

@@ -178,8 +250,7 @@ static int viommu_detach_device(struct viommu_dev *viommu,
if (!ioas)
return -EINVAL;

- pr_debug("detaching device %#lx from IOAS %u",
- device_to_iommu_id(device), ioas->id);
+ ioas_debug(ioas, "detaching device %#lx", device_to_iommu_id(device));

ret = device->iommu_ops->detach(ioas->priv, device);
if (!ret)
@@ -208,8 +279,6 @@ static int viommu_handle_attach(struct viommu_dev *viommu,
return -ENODEV;
}

- pr_debug("attaching device %#x to IOAS %u", device_id, ioasid);
-
vdev = device->iommu_data;
if (!vdev) {
vdev = viommu_alloc_device(device);
@@ -240,6 +309,9 @@ static int viommu_handle_attach(struct viommu_dev *viommu,
if (ret && ioas->nr_devices == 0)
viommu_free_ioas(viommu, ioas);

+ if (!ret)
+ ioas_debug(ioas, "attached device %#x", device_id);
+
return ret;
}

@@ -267,6 +339,7 @@ static int viommu_handle_detach(struct viommu_dev *viommu,
static int viommu_handle_map(struct viommu_dev *viommu,
struct virtio_iommu_req_map *map)
{
+ int ret;
int prot = 0;
struct viommu_ioas *ioas;

@@ -294,15 +367,21 @@ static int viommu_handle_map(struct viommu_dev *viommu,
if (flags & VIRTIO_IOMMU_MAP_F_EXEC)
prot |= IOMMU_PROT_EXEC;

- pr_debug("map %#llx -> %#llx (%llu) to IOAS %u", virt_addr,
- phys_addr, size, ioasid);
+ ioas_debug(ioas, "map %#llx -> %#llx (%llu)", virt_addr, phys_addr, size);
+
+ ret = ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+ if (!ret) {
+ ioas->stats.resident += size;
+ ioas->stats.map++;
+ }

- return ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+ return ret;
}

static int viommu_handle_unmap(struct viommu_dev *viommu,
struct virtio_iommu_req_unmap *unmap)
{
+ int ret;
struct viommu_ioas *ioas;

u32 ioasid = le32_to_cpu(unmap->address_space);
@@ -315,10 +394,15 @@ static int viommu_handle_unmap(struct viommu_dev *viommu,
return -ESRCH;
}

- pr_debug("unmap %#llx (%llu) from IOAS %u", virt_addr, size,
- ioasid);
+ ioas_debug(ioas, "unmap %#llx (%llu)", virt_addr, size);
+
+ ret = ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+ if (!ret) {
+ ioas->stats.resident -= size;
+ ioas->stats.unmap++;
+ }

- return ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+ return ret;
}

static size_t viommu_get_req_len(union virtio_iommu_req *req)
@@ -407,6 +491,8 @@ static ssize_t viommu_dispatch_commands(struct viommu_dev *viommu,
continue;
}

+ viommu->stats.requests++;
+
req = iov[i].iov_base;
op = req->head.type;
expected_len = viommu_get_req_len(req) - sizeof(*tail);
@@ -458,6 +544,8 @@ static void viommu_command(struct kvm *kvm, void *dev)

vq = &viommu->vq;

+ viommu->stats.kicks++;
+
while (virt_queue__available(vq)) {
head = virt_queue__get_iov(vq, iov, &out, &in, kvm);

@@ -594,6 +682,7 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)

viommu->queue_size = VIOMMU_DEFAULT_QUEUE_SIZE;
viommu->address_spaces = (struct rb_root)RB_ROOT;
+ viommu->address_spaces_mutex = (struct mutex)MUTEX_INITIALIZER;
viommu->properties = props;

viommu->config.page_sizes = props->pgsize_mask ?: pgsize_mask;
@@ -607,6 +696,11 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
return NULL;
}

+ mutex_lock(&viommus_mutex);
+ viommu->id = viommu_ids++;
+ list_add_tail(&viommu->list, &viommus);
+ mutex_unlock(&viommus_mutex);
+
pr_info("Loaded virtual IOMMU %s", props->name);

if (compat_id == -1)
@@ -616,21 +710,193 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
return viommu;
}

-void viommu_unregister(struct kvm *kvm, void *viommu)
+void viommu_unregister(struct kvm *kvm, void *dev)
{
+ struct viommu_dev *viommu = dev;
+
+ mutex_lock(&viommus_mutex);
+ list_del(&viommu->list);
+ mutex_unlock(&viommus_mutex);
+
free(viommu);
}

+const char *debug_usage =
+" list [iommu [ioas]] list iommus and address spaces\n"
+" stats [iommu [ioas]] display statistics\n"
+" dump [iommu [ioas]] dump mappings\n"
+" print [iommu [ioas]] enable debug print\n"
+" noprint [iommu [ioas]] disable debug print\n"
+;
+
int viommu_parse_debug_string(const char *cmdline, struct iommu_debug_params *params)
{
- /* show instances numbers */
- /* send command to instance */
- /* - dump mappings */
- /* - statistics */
- return -ENOSYS;
+ int pos = 0;
+ int ret = -EINVAL;
+ char *cur, *args = strdup(cmdline);
+ params->action = IOMMU_DEBUG_NUM_ACTIONS;
+
+ if (!args)
+ return -ENOMEM;
+
+ params->selector[0] = IOMMU_DEBUG_SELECTOR_INVALID;
+ params->selector[1] = IOMMU_DEBUG_SELECTOR_INVALID;
+
+ cur = strtok(args, " ,:");
+ while (cur) {
+ if (pos > 2)
+ break;
+
+ if (pos > 0) {
+ errno = 0;
+ params->selector[pos - 1] = strtoul(cur, NULL, 0);
+ if (errno) {
+ ret = -errno;
+ pr_err("Invalid number '%s'", cur);
+ break;
+ }
+ } else if (strncmp(cur, "list", 4) == 0) {
+ params->action = IOMMU_DEBUG_LIST;
+ } else if (strncmp(cur, "stats", 5) == 0) {
+ params->action = IOMMU_DEBUG_STATS;
+ } else if (strncmp(cur, "dump", 4) == 0) {
+ params->action = IOMMU_DEBUG_DUMP;
+ } else if (strncmp(cur, "print", 5) == 0) {
+ params->action = IOMMU_DEBUG_SET_PRINT;
+ params->print_enabled = true;
+ } else if (strncmp(cur, "noprint", 7) == 0) {
+ params->action = IOMMU_DEBUG_SET_PRINT;
+ params->print_enabled = false;
+ } else {
+ pr_err("Invalid command '%s'", cur);
+ break;
+ }
+
+ cur = strtok(NULL, " ,:");
+ pos++;
+ ret = 0;
+ }
+
+ if (cur && cur[0])
+ pr_err("Ignoring argument '%s'", cur);
+
+ free(args);
+
+ if (ret)
+ pr_info("Usage:\n%s", debug_usage);
+
+ return ret;
+}
+
+struct viommu_debug_context {
+ int sock;
+ struct iommu_debug_params *params;
+ bool disp;
+};
+
+static int viommu_debug_ioas(struct viommu_dev *viommu,
+ struct viommu_ioas *ioas,
+ void *data)
+{
+ int ret = 0;
+ struct viommu_endpoint *vdev;
+ struct viommu_debug_context *ctx = data;
+
+ if (ctx->disp)
+ dprintf(ctx->sock, " ioas %u\n", ioas->id);
+
+ switch (ctx->params->action) {
+ case IOMMU_DEBUG_LIST:
+ mutex_lock(&ioas->devices_mutex);
+ list_for_each_entry(vdev, &ioas->devices, list) {
+ dprintf(ctx->sock, " device 0x%lx\n",
+ device_to_iommu_id(vdev->dev));
+ }
+ mutex_unlock(&ioas->devices_mutex);
+ break;
+ case IOMMU_DEBUG_STATS:
+ dprintf(ctx->sock, " maps %llu\n",
+ ioas->stats.map);
+ dprintf(ctx->sock, " unmaps %llu\n",
+ ioas->stats.unmap);
+ dprintf(ctx->sock, " resident %llu\n",
+ ioas->stats.resident);
+ break;
+ case IOMMU_DEBUG_SET_PRINT:
+ ioas->debug_enabled = ctx->params->print_enabled;
+ break;
+ default:
+ ret = -ENOSYS;
+ }
+
+ if (ioas->ops->debug_address_space)
+ ret = ioas->ops->debug_address_space(ioas->priv, ctx->sock,
+ ctx->params);
+
+ return ret;
+}
+
+static int viommu_debug_iommu(struct viommu_dev *viommu,
+ struct viommu_debug_context *ctx)
+{
+ struct viommu_ioas *ioas;
+
+ if (ctx->disp)
+ dprintf(ctx->sock, "iommu %u \"%s\"\n", viommu->id,
+ viommu->properties->name);
+
+ if (ctx->params->selector[1] != IOMMU_DEBUG_SELECTOR_INVALID) {
+ ioas = viommu_find_ioas(viommu, ctx->params->selector[1]);
+ return ioas ? viommu_debug_ioas(viommu, ioas, ctx) : -ESRCH;
+ }
+
+ switch (ctx->params->action) {
+ case IOMMU_DEBUG_STATS:
+ dprintf(ctx->sock, " kicks %llu\n",
+ viommu->stats.kicks);
+ dprintf(ctx->sock, " requests %llu\n",
+ viommu->stats.requests);
+ break;
+ case IOMMU_DEBUG_SET_PRINT:
+ viommu->debug_enabled = ctx->params->print_enabled;
+ break;
+ default:
+ break;
+ }
+
+ return viommu_for_each_ioas(viommu, viommu_debug_ioas, ctx);
}

int viommu_debug(int sock, struct iommu_debug_params *params)
{
- return -ENOSYS;
+ int ret = -ESRCH;
+ bool match;
+ struct viommu_dev *viommu;
+ bool any = (params->selector[0] == IOMMU_DEBUG_SELECTOR_INVALID);
+
+ struct viommu_debug_context ctx = {
+ .sock = sock,
+ .params = params,
+ };
+
+ if (params->action == IOMMU_DEBUG_LIST ||
+ params->action == IOMMU_DEBUG_STATS)
+ ctx.disp = true;
+
+ mutex_lock(&viommus_mutex);
+ list_for_each_entry(viommu, &viommus, list) {
+ match = (params->selector[0] == viommu->id);
+ if (match || any) {
+ ret = viommu_debug_iommu(viommu, &ctx);
+ if (ret || match)
+ break;
+ }
+ }
+ mutex_unlock(&viommus_mutex);
+
+ if (ret)
+ dprintf(sock, "error: %s\n", strerror(-ret));
+
+ return ret;
}
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 699d4403..7d39120a 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -307,6 +307,7 @@ static struct iommu_ops virtio_mmio_iommu_ops = {
.get_properties = virtio__iommu_get_properties,
.alloc_address_space = iommu_alloc_address_space,
.free_address_space = iommu_free_address_space,
+ .debug_address_space = iommu_debug_address_space,
.attach = virtio_mmio_iommu_attach,
.detach = virtio_mmio_iommu_detach,
.map = iommu_map,
diff --git a/virtio/pci.c b/virtio/pci.c
index c9f0e558..c5d30eb2 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -442,6 +442,7 @@ static struct iommu_ops virtio_pci_iommu_ops = {
.get_properties = virtio__iommu_get_properties,
.alloc_address_space = iommu_alloc_address_space,
.free_address_space = iommu_free_address_space,
+ .debug_address_space = iommu_debug_address_space,
.attach = virtio_pci_iommu_attach,
.detach = virtio_pci_iommu_detach,
.map = iommu_map,
--
2.12.1
Jean-Philippe Brucker
2017-04-07 19:24:55 UTC
Permalink
This is for development only. Virtual devices might blow up unexpectedly.
In general it seems to work (slowing devices down by a factor of two, of
course). virtio-scsi, virtio-rng and virtio-balloon are still untested.

Signed-off-by: Jean-Philippe Brucker <jean-***@arm.com>
---
virtio/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/virtio/core.c b/virtio/core.c
index 66e0cecb..4ca632f9 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -1,4 +1,5 @@
#include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
#include <linux/virtio_ring.h>
#include <linux/types.h>
#include <sys/uio.h>
@@ -369,6 +370,8 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
{
void *virtio;

+ vdev->use_iommu = kvm->cfg.viommu && subsys_id != VIRTIO_ID_IOMMU;
+
switch (trans) {
case VIRTIO_PCI:
virtio = calloc(sizeof(struct virtio_pci), 1);
--
2.12.1
Bharat Bhushan
2017-05-22 08:26:12 UTC
Permalink
Hi Jean,

I am trying to run and review this on my side, but I see the Linux patches are not based on the latest kernel version.
Will it be possible for you to share your Linux and kvmtool git repository reference?

Thanks
-Bharat
-----Original Message-----
From: Jean-Philippe Brucker
Sent: Saturday, April 08, 2017 12:55 AM
Subject: [RFC PATCH kvmtool 00/15] Add virtio-iommu
Implement a virtio-iommu device and translate DMA traffic from vfio and
virtio devices. Virtio needed some rework to support scatter-gather accesses
to vring and buffers at page granularity. Patch 3 implements the actual
virtio-iommu device.
Adding --viommu on the command-line now inserts a virtual IOMMU in front
$ lkvm run -k Image --console virtio -p console=hvc0 \
--viommu --vfio 0 --vfio 4 --irqchip gicv3-its
...
[ 2.998949] virtio_iommu virtio0: probe successful
[ 3.007739] virtio_iommu virtio1: probe successful
...
[ 3.165023] iommu: Adding device 0000:00:00.0 to group 0
[ 3.536480] iommu: Adding device 10200.virtio to group 1
[ 3.553643] iommu: Adding device 10600.virtio to group 2
[ 3.570687] iommu: Adding device 10800.virtio to group 3
[ 3.627425] iommu: Adding device 10a00.virtio to group 4
[ 7.823689] iommu: Adding device 0000:00:01.0 to group 5
...
Patches 13 and 14 add debug facilities. Some statistics are gathered for each
$ lkvm debug -n guest-1210 --iommu stats
iommu 0 "viommu-vfio"
kicks 1255
requests 1256
ioas 1
maps 7
unmaps 4
resident 2101248
ioas 6
maps 623
unmaps 620
resident 16384
iommu 1 "viommu-virtio"
kicks 11426
requests 11431
ioas 2
maps 2836
unmaps 2835
resident 8192
accesses 2836
...
This is based on the VFIO patchset[1], itself based on Andre's ITS work.
The VFIO bits have only been tested on a software model and are unlikely to
work on actual hardware, but I also tested virtio on an ARM Juno.
[1] http://www.spinics.net/lists/kvm/msg147624.html
virtio: synchronize virtio-iommu headers with Linux
FDT: (re)introduce a dynamic phandle allocator
virtio: add virtio-iommu
Add a simple IOMMU
iommu: describe IOMMU topology in device-trees
irq: register MSI doorbell addresses
virtio: factor virtqueue initialization
virtio: add vIOMMU instance for virtio devices
virtio: access vring and buffers through IOMMU mappings
virtio-pci: translate MSIs with the virtual IOMMU
virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
vfio: add support for virtual IOMMU
virtio-iommu: debug via IPC
virtio-iommu: implement basic debug commands
virtio: use virtio-iommu when available
Makefile | 3 +
arm/gic.c | 4 +
arm/include/arm-common/fdt-arch.h | 2 +-
arm/pci.c | 49 ++-
builtin-debug.c | 8 +-
builtin-run.c | 2 +
fdt.c | 35 ++
include/kvm/builtin-debug.h | 6 +
include/kvm/devices.h | 4 +
include/kvm/fdt.h | 20 +
include/kvm/iommu.h | 105 +++++
include/kvm/irq.h | 3 +
include/kvm/kvm-config.h | 1 +
include/kvm/vfio.h | 2 +
include/kvm/virtio-iommu.h | 15 +
include/kvm/virtio-mmio.h | 1 +
include/kvm/virtio-pci.h | 2 +
include/kvm/virtio.h | 137 +++++-
include/linux/virtio_config.h | 74 ++++
include/linux/virtio_ids.h | 4 +
include/linux/virtio_iommu.h | 142 ++++++
iommu.c | 240 ++++++++++
irq.c | 35 ++
kvm-ipc.c | 43 +-
mips/include/kvm/fdt-arch.h | 2 +-
powerpc/include/kvm/fdt-arch.h | 2 +-
vfio.c | 281 +++++++++++-
virtio/9p.c | 7 +-
virtio/balloon.c | 7 +-
virtio/blk.c | 10 +-
virtio/console.c | 7 +-
virtio/core.c | 240 ++++++++--
virtio/iommu.c | 902 ++++++++++++++++++++++++++++++++++++++++++
virtio/mmio.c | 44 +-
virtio/net.c | 8 +-
virtio/pci.c | 61 ++-
virtio/rng.c | 6 +-
virtio/scsi.c | 6 +-
x86/include/kvm/fdt-arch.h | 2 +-
39 files changed, 2389 insertions(+), 133 deletions(-)
create mode 100644 fdt.c
create mode 100644 include/kvm/iommu.h
create mode 100644 include/kvm/virtio-iommu.h
create mode 100644 include/linux/virtio_config.h
create mode 100644 include/linux/virtio_iommu.h
create mode 100644 iommu.c
create mode 100644 virtio/iommu.c
--
2.12.1
_______________________________________________
Virtualization mailing list
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Jean-Philippe Brucker
2017-05-22 14:01:45 UTC
Permalink
Hi Bharat,
Post by Bharat Bhushan
Hi Jean,
I am trying to run and review on my side but I see Linux patches are not with latest kernel version.
Will it be possible for you to share your Linux and kvmtool git repository reference?
Please find linux and kvmtool patches at the following repos:

git://linux-arm.org/kvmtool-jpb.git virtio-iommu/base
git://linux-arm.org/linux-jpb.git virtio-iommu/base

Note that these branches are unstable, subject to fixes and rebase. I'll
try to keep them in sync with upstream.

Thanks,
Jean
Michael S. Tsirkin
2017-04-07 21:19:22 UTC
Permalink
Post by Jean-Philippe Brucker
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
Thanks, this is very interesting. I am ready to read it all, but I really
would like you to expand some more on the motivation for this work.
Productising this would be quite a bit of work. Spending just 6 lines on
motivation seems somewhat disproportionate. In particular, do you have
any specific efficiency measurements or estimates that you can share?
--
MST
Jean-Philippe Brucker
2017-04-10 18:39:24 UTC
Permalink
Post by Michael S. Tsirkin
Post by Jean-Philippe Brucker
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
Thanks, this is very interesting. I am ready to read it all, but I really
would like you to expand some more on the motivation for this work.
Productising this would be quite a bit of work. Spending just 6 lines on
motivation seems somewhat disproportionate. In particular, do you have
any specific efficiency measurements or estimates that you can share?
The main motivation for this work is to bring IOMMU virtualization to the
ARM world. We don't have any at the moment, and a full ARM SMMU
virtualization solution would be counter-productive. We would have to do
it for SMMUv2, for the completely orthogonal SMMUv3, and for any future
version of the architecture. Doing so in userspace might be acceptable,
but then for performance reasons people will want in-kernel emulation of
every IOMMU variant out there, which is a maintenance and security
nightmare. A single generic vIOMMU is preferable because it reduces
maintenance cost and attack surface.

The transport code is the same as any virtio device, both for userspace
and in-kernel implementations. So instead of rewriting everything from
scratch (and the lot of bugs that go with it) for each IOMMU variation, we
reuse well-tested code for transport and write the emulation layer once
and for all.

Note that this work applies to any architecture with an IOMMU, not only
ARM and their partners'. Introducing an IOMMU specially designed for
virtualization allows us to get rid of complex state tracking inherent to
full IOMMU emulations. With a full emulation, all guest accesses to page
table and configuration structures have to be trapped and interpreted. A
Virtio interface provides well-defined semantics and doesn't need to guess
what the guest is trying to do. It transmits requests made from guest
device drivers to host IOMMU almost unaltered, removing the intermediate
layer of arch-specific configuration structures and page tables.

Using a portable standard like Virtio also allows for efficient IOMMU
virtualization when guest and host are built for different architectures
(for instance when using Qemu TCG.) In-kernel emulation would still work
with vhost-iommu, but platform-specific vIOMMUs would have to stay in
userspace.

I don't have any measurements at the moment, it is a bit early for that.
The kvmtool example was developed on a software model and is mostly here
for illustrative purposes; a Qemu implementation would be more suitable for
performance analysis. I wouldn't be able to give meaning to these numbers
anyway, since on ARM we don't have any existing solution to compare it
against. One could compare the complexity of handling guest accesses and
parsing page tables in Qemu's VT-d emulation with reading a chain of
buffers in Virtio, for a very rough estimate.

Thanks,
Jean-Philippe
Michael S. Tsirkin
2017-04-10 20:04:45 UTC
Permalink
Post by Jean-Philippe Brucker
Post by Michael S. Tsirkin
Post by Jean-Philippe Brucker
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
Thanks, this is very interesting. I am ready to read it all, but I really
would like you to expand some more on the motivation for this work.
Productising this would be quite a bit of work. Spending just 6 lines on
motivation seems somewhat disproportionate. In particular, do you have
any specific efficiency measurements or estimates that you can share?
The main motivation for this work is to bring IOMMU virtualization to the
ARM world. We don't have any at the moment, and a full ARM SMMU
virtualization solution would be counter-productive. We would have to do
it for SMMUv2, for the completely orthogonal SMMUv3, and for any future
version of the architecture. Doing so in userspace might be acceptable,
but then for performance reasons people will want in-kernel emulation of
every IOMMU variant out there, which is a maintenance and security
nightmare. A single generic vIOMMU is preferable because it reduces
maintenance cost and attack surface.
The transport code is the same as any virtio device, both for userspace
and in-kernel implementations. So instead of rewriting everything from
scratch (and the lot of bugs that go with it) for each IOMMU variation, we
reuse well-tested code for transport and write the emulation layer once
and for all.
Note that this work applies to any architecture with an IOMMU, not only
ARM and their partners'. Introducing an IOMMU specially designed for
virtualization allows us to get rid of complex state tracking inherent to
full IOMMU emulations. With a full emulation, all guest accesses to page
table and configuration structures have to be trapped and interpreted. A
Virtio interface provides well-defined semantics and doesn't need to guess
what the guest is trying to do. It transmits requests made from guest
device drivers to host IOMMU almost unaltered, removing the intermediate
layer of arch-specific configuration structures and page tables.
Using a portable standard like Virtio also allows for efficient IOMMU
virtualization when guest and host are built for different architectures
(for instance when using Qemu TCG.) In-kernel emulation would still work
with vhost-iommu, but platform-specific vIOMMUs would have to stay in
userspace.
I don't have any measurements at the moment, it is a bit early for that.
The kvmtool example was developed on a software model and is mostly here
for illustrative purposes; a Qemu implementation would be more suitable for
performance analysis. I wouldn't be able to give meaning to these numbers
anyway, since on ARM we don't have any existing solution to compare it
against. One could compare the complexity of handling guest accesses and
parsing page tables in Qemu's VT-d emulation with reading a chain of
buffers in Virtio, for a very rough estimate.
Thanks,
Jean-Philippe
This last suggestion sounds very reasonable.
--
MST
Alex Williamson
2017-04-10 04:19:45 UTC
Permalink
On Mon, 10 Apr 2017 08:00:45 +0530
Hi All,
We have drivers/vfio/vfio_iommu_type1.c. What is a type1 IOMMU? Is it
named w.r.t. the VFIO layer?
Is there a type2 IOMMU w.r.t. VFIO? If so, what is it?
type1 is the 1st type. It's an arbitrary name. There is no type2, yet.
Jason Wang
2017-04-12 09:06:43 UTC
Permalink
Post by Jean-Philippe Brucker
This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio), or passed-through
(VFIO) devices.
In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
I like the idea. Considering the complexity of IOMMU hardware, I believe we
don't want to have, and fight bugs in, three or more different IOMMU
implementations in either userspace or the kernel.

Thanks
Post by Jean-Philippe Brucker
When designing it and writing the kvmtool device, I considered two main
scenarios, illustrated below.
Scenario 1: a hardware device passed through twice via VFIO
MEM____pIOMMU________PCI device________________________ HARDWARE
| (2b) \
----------|-------------+-------------+------------------\-------------
| : KVM : \
| : : \
pIOMMU drv : _______virtio-iommu drv \ KERNEL
| : | : | \
VFIO : | : VFIO \
| : | : | \
| : | : | /
----------|-------------+--------|----+----------|------------/--------
| | : | /
| (1c) (1b) | : (1a) | / (2a)
| | : | /
| | : | / USERSPACE
|___virtio-iommu dev___| : net drv___/
--------------------------------------+--------------------------------
HOST : GUEST
(1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
buffer with mmap, obtaining virtual address VA. It then sends a
VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
b. The mapping request is relayed to the host through virtio
(VIRTIO_IOMMU_T_MAP).
c. The mapping request is relayed to the physical IOMMU through VFIO.
(2) a. The guest userspace driver can now instruct the device to directly
access the buffer at IOVA
b. IOVA accesses from the device are translated into physical
addresses by the IOMMU.
Scenario 2: a virtual net device behind a virtual IOMMU.
MEM__pIOMMU___PCI device HARDWARE
| |
-------|---------|------+-------------+-------------------------------
\ | : _____________virtio-net drv KERNEL
\_net drv : | : / (1a)
| : | : /
tap : | ________virtio-iommu drv
| : | | : (1b)
-----------------|------+-----|---|---+-------------------------------
/ | : USERSPACE
--------------------------------------+-------------------------------
HOST : GUEST
(1) a. Guest virtio-net driver maps the virtio ring and a buffer
b. The mapping requests are relayed to the host through virtio.
(2) The virtio-net device now needs to access any guest memory via the
IOMMU.
Physical and virtual IOMMUs are completely dissociated. The net driver is
mapping its own buffers via DMA/IOMMU API, and buffers are copied between
virtio-net and tap.
The description itself seemed too long for a single email, so I split it
into three documents, and will attach Linux and kvmtool patches to this
email.
1. Firmware note,
2. device operations (draft for the virtio specification),
3. future work/possible improvements.
pIOMMU physical IOMMU, controlling DMA accesses from physical devices
vIOMMU virtual IOMMU (virtio-iommu), controlling DMA accesses from
physical and virtual devices to guest memory.
GVA, GPA, HVA, HPA
Guest/Host Virtual/Physical Address
IOVA I/O Virtual Address, the address accessed by a device doing DMA
through an IOMMU. In the context of a guest OS, IOVA is GVA.
Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
virtio-iommu.h header, which is BSD 3-clause. For the time being, the
specification draft in RFC 2/3 is also BSD 3-clause.
This proposal may be unintentionally centered around ARM architectures at
times. Any feedback would be appreciated, especially regarding other IOMMU
architectures.
Thanks,
Jean-Philippe
Tian, Kevin
2017-04-13 08:16:26 UTC
Permalink
From: Jason Wang
Sent: Wednesday, April 12, 2017 5:07 PM
Post by Jean-Philippe Brucker
This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio), or passed-through
(VFIO) devices.
In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
I like the idea. Considering the complexity of IOMMU hardware, I believe we
don't want to have, and fight bugs in, three or more different IOMMU
implementations in either userspace or the kernel.
Though there are definitely positive things around the pvIOMMU approach,
it also has some limitations:

- Existing IOMMU implementations have been in old distros for quite some
time, while the pvIOMMU driver will only land in future distros. Supporting
pvIOMMU only means we completely drop support for old distros in VMs;

- The situation is similar for other guest OSes, e.g. Windows. The IOMMU is
a key kernel component, and I'm not sure a pvIOMMU through virtio can be
recognized in those OSes (unlike a plain virtio device driver);

I would imagine both fully emulated IOMMUs and pvIOMMU will co-exist
for some time, for the above reasons. Someday, when pvIOMMU is mature and
widespread enough in the ecosystem (and feature-wise comparable to fully
emulated IOMMUs for all vendors), we may make that call.

Thanks,
Kevin
Jean-Philippe Brucker
2017-04-13 13:12:19 UTC
Permalink
Post by Tian, Kevin
From: Jason Wang
Sent: Wednesday, April 12, 2017 5:07 PM
Post by Jean-Philippe Brucker
This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio), or passed-through
(VFIO) devices.
In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
I like the idea. Considering the complexity of IOMMU hardware, I believe we
don't want to have, and fight bugs in, three or more different IOMMU
implementations in either userspace or the kernel.
Though there are definitely positive things around pvIOMMU approach,
- Existing IOMMU implementations have been in old distros for quite some
time, while the pvIOMMU driver will only land in future distros. Supporting
pvIOMMU only means we completely drop support for old distros in VMs;
- The situation is similar for other guest OSes, e.g. Windows. The IOMMU is
a key kernel component, and I'm not sure a pvIOMMU through virtio can be
recognized in those OSes (unlike a plain virtio device driver);
I can't talk about other OSes, but on Linux virtio-iommu is implemented
the same way as other IOMMU drivers and doesn't require core modifications.
Post by Tian, Kevin
I would imagine both fully emulated IOMMUs and pvIOMMU will co-exist
for some time, for the above reasons. Someday, when pvIOMMU is mature and
widespread enough in the ecosystem (and feature-wise comparable to fully
emulated IOMMUs for all vendors), we may make that call.
Agreed. The main inconvenience of any paravirtualized device is that it
needs additional support in the guest. It is not our intention to disrupt
all the work done on IOMMU virtualization for x86 and other architectures.
Even for ARM, people might want to provide SMMU emulations to unmodified
guests, implemented in userspace. What we intend to avoid, as detailed in
my other reply, is in-kernel emulation of all possible ARM-based IOMMU
variations for Linux. So we propose a generic alternative from the start,
that others can reuse later.

Thanks,
Jean-Philippe
Tian, Kevin
2017-04-13 08:41:01 UTC
Permalink
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio), or passed-through
(VFIO) devices.
In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
When designing it and writing the kvmtool device, I considered two main
scenarios, illustrated below.
Scenario 1: a hardware device passed through twice via VFIO
MEM____pIOMMU________PCI device________________________
HARDWARE
| (2b) \
----------|-------------+-------------+------------------\-------------
| : KVM : \
| : : \
pIOMMU drv : _______virtio-iommu drv \ KERNEL
| : | : | \
VFIO : | : VFIO \
| : | : | \
| : | : | /
----------|-------------+--------|----+----------|------------/--------
| | : | /
| (1c) (1b) | : (1a) | / (2a)
| | : | /
| | : | / USERSPACE
|___virtio-iommu dev___| : net drv___/
--------------------------------------+--------------------------------
HOST : GUEST
Usually people draw such layers in reverse order, e.g. hw in the
bottom then kernel in the middle then user in the top. :-)
(1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
buffer with mmap, obtaining virtual address VA. It then sends a
VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
b. The mapping request is relayed to the host through virtio
(VIRTIO_IOMMU_T_MAP).
c. The mapping request is relayed to the physical IOMMU through VFIO.
(2) a. The guest userspace driver can now instruct the device to directly
access the buffer at IOVA
b. IOVA accesses from the device are translated into physical
addresses by the IOMMU.
Scenario 2: a virtual net device behind a virtual IOMMU.
MEM__pIOMMU___PCI device HARDWARE
| |
-------|---------|------+-------------+-------------------------------
\ | : _____________virtio-net drv KERNEL
\_net drv : | : / (1a)
| : | : /
tap : | ________virtio-iommu drv
| : | | : (1b)
-----------------|------+-----|---|---+-------------------------------
/ | : USERSPACE
--------------------------------------+-------------------------------
HOST : GUEST
(1) a. Guest virtio-net driver maps the virtio ring and a buffer
b. The mapping requests are relayed to the host through virtio.
(2) The virtio-net device now needs to access any guest memory via the
IOMMU.
Physical and virtual IOMMUs are completely dissociated. The net driver is
mapping its own buffers via DMA/IOMMU API, and buffers are copied between
virtio-net and tap.
The description itself seemed too long for a single email, so I split it
into three documents, and will attach Linux and kvmtool patches to this
email.
1. Firmware note,
2. device operations (draft for the virtio specification),
3. future work/possible improvements.
pIOMMU physical IOMMU, controlling DMA accesses from physical
devices
vIOMMU virtual IOMMU (virtio-iommu), controlling DMA accesses
from
physical and virtual devices to guest memory.
Maybe it's clearer to say controlling 'virtual' DMA accesses, since we're
essentially doing DMA virtualization here. Otherwise I find it a bit
confusing, since DMA accesses from a physical device should be controlled
by the pIOMMU.
GVA, GPA, HVA, HPA
Guest/Host Virtual/Physical Address
IOVA I/O Virtual Address, the address accessed by a device doing DMA
through an IOMMU. In the context of a guest OS, IOVA is GVA.
This statement is not accurate. For kernel DMA protection, it is a
per-device standalone address space (definitely nothing to do with GVA).
For user DMA protection, the user space driver decides how it wants to
construct the IOVA address space: it could be a standalone one, or reuse
GVA. In the virtualization case it is either GPA (w/o vIOMMU) or guest
IOVA (w/ vIOMMU, where the guest creates the IOVA space).

Anyway, the IOVA concept is clear; it would probably still be clear with
the example just removed. :-)
Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
virtio-iommu.h header, which is BSD 3-clause. For the time being, the
specification draft in RFC 2/3 is also BSD 3-clause.
This proposal may be involuntarily centered around ARM architectures at
times. Any feedback would be appreciated, especially regarding other IOMMU
architectures.
Thanks for doing this. I will definitely look at them in detail and give feedback.

Thanks
Kevin
Jean-Philippe Brucker
2017-04-13 13:12:59 UTC
Permalink
Post by Tian, Kevin
From: Jean-Philippe Brucker
Sent: Saturday, April 8, 2017 3:18 AM
This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio), or passed-through
(VFIO) devices.
In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.
There are a number of advantages in a paravirtualized IOMMU over a full
emulation. It is portable and could be reused on different architectures.
It is easier to implement than a full emulation, with less state tracking.
It might be more efficient in some cases, with less context switches to
the host and the possibility of in-kernel emulation.
When designing it and writing the kvmtool device, I considered two main
scenarios, illustrated below.
Scenario 1: a hardware device passed through twice via VFIO
MEM____pIOMMU________PCI device________________________
HARDWARE
| (2b) \
----------|-------------+-------------+------------------\-------------
| : KVM : \
| : : \
pIOMMU drv : _______virtio-iommu drv \ KERNEL
| : | : | \
VFIO : | : VFIO \
| : | : | \
| : | : | /
----------|-------------+--------|----+----------|------------/--------
| | : | /
| (1c) (1b) | : (1a) | / (2a)
| | : | /
| | : | / USERSPACE
|___virtio-iommu dev___| : net drv___/
--------------------------------------+--------------------------------
HOST : GUEST
Usually people draw such layers in reverse order, e.g. hw in the
bottom then kernel in the middle then user in the top. :-)
Alright, I'll keep that in mind.
Post by Tian, Kevin
(1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
buffer with mmap, obtaining virtual address VA. It then sends a
VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
b. The maping request is relayed to the host through virtio
(VIRTIO_IOMMU_T_MAP).
c. The mapping request is relayed to the physical IOMMU through VFIO.
(2) a. The guest userspace driver can now instruct the device to directly
access the buffer at IOVA
b. IOVA accesses from the device are translated into physical
addresses by the IOMMU.
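Step (1a) is the only part of this chain visible to the guest userspace
driver. A minimal sketch of the request it builds, assuming the standard
VFIO type1 UAPI (the struct below mirrors struct vfio_iommu_type1_dma_map
from <linux/vfio.h>, redefined here only to keep the example
self-contained; real code should include that header):

```c
#include <stdint.h>
#include <assert.h>

/* Mirror of struct vfio_iommu_type1_dma_map from <linux/vfio.h>. */
#define VFIO_DMA_MAP_FLAG_READ  (1u << 0)
#define VFIO_DMA_MAP_FLAG_WRITE (1u << 1)

struct vfio_iommu_type1_dma_map {
	uint32_t argsz;
	uint32_t flags;
	uint64_t vaddr;	/* process virtual address of the mmap'd buffer */
	uint64_t iova;	/* address the device will use for DMA */
	uint64_t size;	/* length of the mapping, page aligned */
};

/* Build the map request of step (1a), choosing IOVA == VA. */
static struct vfio_iommu_type1_dma_map make_map(uint64_t va, uint64_t len)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = va,
		.iova  = va,
		.size  = len,
	};
	/* The driver would then issue
	 *   ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	 * which the guest's virtio-iommu driver relays as
	 * VIRTIO_IOMMU_T_MAP (1b), and host VFIO programs into the
	 * physical IOMMU (1c). */
	return map;
}
```

The ioctl itself is shown only in the comment, since issuing it needs an
open VFIO container fd.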
Scenario 2: a virtual net device behind a virtual IOMMU.
MEM__pIOMMU___PCI device HARDWARE
| |
-------|---------|------+-------------+-------------------------------
\ | : _____________virtio-net drv KERNEL
\_net drv : | : / (1a)
| : | : /
tap : | ________virtio-iommu drv
| : | | : (1b)
-----------------|------+-----|---|---+-------------------------------
/ | : USERSPACE
--------------------------------------+-------------------------------
HOST : GUEST
(1) a. Guest virtio-net driver maps the virtio ring and a buffer
b. The mapping requests are relayed to the host through virtio.
(2) The virtio-net device now needs to access any guest memory via the
IOMMU.
Physical and virtual IOMMUs are completely dissociated. The net driver is
mapping its own buffers via the DMA/IOMMU API, and buffers are copied
between virtio-net and tap.
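The relay in (1b) leaves the host with a simple bookkeeping job: record
which IOVA ranges the guest has mapped, and translate the emulated
device's accesses through them. A toy host-side model of that (names and
the fixed-size table are my own, not from the spec draft; a real device
implementation would use an interval tree):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Toy model of the virtio-iommu device side for scenario 2: the host
 * records IOVA -> GPA mappings received over the virtqueue (1b) and
 * translates each access from the emulated virtio-net device (2). */
#define MAX_MAPS 16

struct mapping { uint64_t iova, gpa, size; };

static struct mapping maps[MAX_MAPS];
static size_t nmaps;

/* Handle one map request (layout simplified for illustration). */
static int viommu_map(uint64_t iova, uint64_t gpa, uint64_t size)
{
	if (nmaps == MAX_MAPS)
		return -1;
	maps[nmaps++] = (struct mapping){ iova, gpa, size };
	return 0;
}

/* Translate one device access; returns -1 on translation fault. */
static int64_t viommu_translate(uint64_t iova)
{
	for (size_t i = 0; i < nmaps; i++)
		if (iova >= maps[i].iova && iova < maps[i].iova + maps[i].size)
			return maps[i].gpa + (iova - maps[i].iova);
	return -1; /* unmapped: the access must be rejected */
}
```

An access that falls outside every recorded range faults, which is
exactly the protection the guest gains by putting virtio-net behind the
vIOMMU.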
The description itself seemed too long for a single email, so I split it
into three documents, and will attach Linux and kvmtool patches to this
email.
1. Firmware note,
2. device operations (draft for the virtio specification),
3. future work/possible improvements.
pIOMMU  physical IOMMU, controlling DMA accesses from physical
        devices
vIOMMU  virtual IOMMU (virtio-iommu), controlling DMA accesses from
        physical and virtual devices to guest memory.
maybe clearer to say controlling 'virtual' DMA accesses, since we're
essentially doing DMA virtualization here. Otherwise it reads a bit
confusingly, since DMA accesses from physical devices should be
controlled by the pIOMMU.
GVA, GPA, HVA, HPA
Guest/Host Virtual/Physical Address
IOVA I/O Virtual Address, the address accessed by a device doing DMA
through an IOMMU. In the context of a guest OS, IOVA is GVA.
This statement is not accurate. For kernel DMA protection, the IOVA
space is a standalone per-device address space (definitely nothing to do
with GVA). For user DMA protection, the user-space driver decides how it
wants to construct the IOVA address space: it could be a standalone one,
or it could reuse GVA. In the virtualization case it is either GPA (w/o
vIOMMU) or guest IOVA (w/ vIOMMU, where the guest creates the IOVA
space).
anyway the IOVA concept is clear. possibly just removing the example
would still be clear. :-)
Ok, I dropped most IOVA references from the RFC to avoid ambiguity anyway.
I'll tidy up my so-called clarifications next time :)

Thanks,
Jean-Philippe
Post by Tian, Kevin
Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
virtio-iommu.h header, which is BSD 3-clause. For the time being, the
specification draft in RFC 2/3 is also BSD 3-clause.
This proposal may be involuntarily centered around ARM architectures at
times. Any feedback would be appreciated, especially regarding other IOMMU
architectures.
thanks for doing this. will definitely look at them in detail and send feedback.
Thanks
Kevin