rfc: vhost user enhancements for vm2vm communication

Discussion:

Michael S. Tsirkin

2015-08-31 14:11:02 UTC

Hello!
During the KVM forum, we discussed supporting virtio on top
of ivshmem. I have considered it, and came up with an alternative
that has several advantages over that - please see below.
Comments welcome.

-----

Existing solutions to userspace switching between VMs on the
same host are vhost-user and ivshmem.

vhost-user works by mapping memory of all VMs being bridged into the
switch memory space.

By comparison, ivshmem works by exposing a shared region of memory to all VMs.
VMs are required to use this region to store packets. The switch only
needs access to this region.

Another difference between vhost-user and ivshmem surfaces when polling
is used. With vhost-user, the switch is required to handle
data movement between VMs. If polling is used, this means that one host CPU
needs to be sacrificed for this task.

This is easiest to understand when one of the VMs is
used with VF pass-through. This can be schematically shown below:

+-- VM1 --------------+ +---VM2-----------+
| virtio-pci +-vhost-user-+ virtio-pci -- VF | -- VFIO -- IOMMU -- NIC
+---------------------+ +-----------------+

With ivshmem in theory communication can happen directly, with two VMs
polling the shared memory region.

I won't spend time listing advantages of vhost-user over ivshmem.
Instead, having identified two advantages of ivshmem over vhost-user,
below is a proposal to extend vhost-user to gain the advantages
of ivshmem.

1. virtio in guest can be extended to allow support
for IOMMUs. This provides guest with full flexibility
about memory which is readable or writable by each device.
By setting up a virtio device for each other VM we need to
communicate to, guest gets full control of its security, from
mapping all memory (like with current vhost-user) to only
mapping buffers used for networking (like ivshmem) to
transient mappings for the duration of data transfer only.
This also allows use of VFIO within guests, for improved
security.

vhost-user would need to be extended to send the
mappings programmed by the guest IOMMU.
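
To make the range of policies in point 1 concrete, here is a minimal sketch of a per-device guest IOMMU table. It is an illustration only: GuestIommu, Mapping and the 4 KiB page size are assumptions made for this example, not part of virtio, vhost-user or QEMU. A guest could install one large static mapping, or map a single buffer just for the duration of a transfer.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass(frozen=True)
class Mapping:
    iova: int       # bus address programmed by the guest (page aligned)
    length: int     # size in bytes (multiple of PAGE_SIZE)
    writable: bool  # whether the device may write to the range


class GuestIommu:
    """Toy per-device IOMMU table; every name here is hypothetical."""

    def __init__(self):
        self._maps = {}  # iova -> Mapping

    def map(self, iova, length, writable=False):
        if iova % PAGE_SIZE or length % PAGE_SIZE or length <= 0:
            raise ValueError("mappings must be page aligned")
        for m in self._maps.values():
            if iova < m.iova + m.length and m.iova < iova + length:
                raise ValueError("overlapping mapping")
        mapping = Mapping(iova, length, writable)
        self._maps[iova] = mapping
        # In the proposal, an update like this would be forwarded to the
        # peer over vhost-user; that step is not modelled here.
        return mapping

    def unmap(self, iova):
        return self._maps.pop(iova)

    def allows(self, addr, size, write):
        """True if [addr, addr + size) lies inside one mapping with enough rights."""
        for m in self._maps.values():
            if m.iova <= addr and addr + size <= m.iova + m.length:
                return m.writable or not write
        return False


# A transient mapping: valid only for the duration of one transfer.
iommu = GuestIommu()
iommu.map(0x10000, PAGE_SIZE, writable=True)
ok_during = iommu.allows(0x10000, 1500, write=True)
iommu.unmap(0x10000)
ok_after = iommu.allows(0x10000, 1500, write=True)
```

Mapping all of guest memory once corresponds to the current vhost-user behaviour, while the map/unmap pair around a single transfer corresponds to the transient case.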

2. qemu can be extended to serve as a vhost-user client:
receive remote VM mappings over the vhost-user protocol, and
map them into another VM's memory.
This mapping can take, for example, the form of
a BAR of a PCI device, which I'll call vhost-pci here,
with bus addresses allowed
by VM1's IOMMU mappings being translated into
offsets within this BAR within VM2's physical
memory space.

Since the translation can be a simple one, VM2
can perform it within its vhost-pci device driver.
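
As a rough illustration of how simple that translation could be, the sketch below maps a VM1 bus address to an offset inside the BAR using a sorted region table. BarTranslator and the (bus_addr, length, bar_offset) layout are assumptions for this example only; the actual layout would be whatever vhost-pci ends up defining.

```python
from bisect import bisect_right


class BarTranslator:
    """Translate VM1 bus addresses into offsets within a hypothetical vhost-pci BAR."""

    def __init__(self, regions):
        # regions: iterable of (bus_addr, length, bar_offset), non-overlapping
        self._regions = sorted(regions)
        self._starts = [r[0] for r in self._regions]

    def to_bar_offset(self, bus_addr, size=1):
        # Find the last region that starts at or below bus_addr.
        i = bisect_right(self._starts, bus_addr) - 1
        if i >= 0:
            start, length, bar_off = self._regions[i]
            if bus_addr + size <= start + length:
                return bar_off + (bus_addr - start)
        # Not covered by any mapping VM1 has exposed.
        raise LookupError("bus address 0x%x not mapped" % bus_addr)


translator = BarTranslator([
    (0x10000, 0x2000, 0x0),     # two pages of VM1 placed at BAR offset 0
    (0x80000, 0x1000, 0x2000),  # one more page placed right after them
])
offset = translator.to_bar_offset(0x80004, size=8)
```

Addresses that VM1's IOMMU has not mapped simply fail the lookup, which is the point at which a driver would reject the descriptor.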

While this setup would be the most useful with polling,
VM1's ioeventfd can also be mapped to
VM2's irqfd, and vice versa, such that VMs
can trigger interrupts to each other without the need
for a helper thread on the host.

The resulting channel might look something like the following:

+-- VM1 --------------+ +---VM2-----------+
| virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
+---------------------+ +-----------------+

Comparing the two diagrams, a vhost-user thread on the host is
no longer required, reducing the host CPU utilization when
polling is active. At the same time, VM2 cannot access all of VM1's
memory - it is limited by the IOMMU configuration set up by VM1.

Advantages over ivshmem:

- more flexibility: endpoint VMs do not have to place data at any
  specific locations to use the device. In practice this likely
  means fewer data copies.
- better standardization/code reuse:
  virtio changes within guests would be fairly easy to implement
  and would also benefit other backends besides vhost-user.
  Standard hotplug interfaces can be used to add and remove these
  channels as VMs are added or removed.
- migration support:
  it's easy to implement since ownership of memory is well defined.
  For example, during migration VM2 can notify the hypervisor of VM1
  by updating the dirty bitmap each time it writes into VM1's memory.
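
The migration point can be sketched as follows: every write VM2 performs into VM1's memory also sets the corresponding bits in a dirty bitmap. This is only a model under assumed names (DirtyBitmap, write_to_peer) with one bit per 4 KiB page; it does not reflect any existing QEMU or KVM interface.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


class DirtyBitmap:
    """One bit per page of the peer's memory (hypothetical layout)."""

    def __init__(self, mem_size):
        npages = (mem_size + (1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT
        self.bits = bytearray((npages + 7) // 8)

    def mark(self, addr, length):
        if length <= 0:
            return
        first = addr >> PAGE_SHIFT
        last = (addr + length - 1) >> PAGE_SHIFT
        for page in range(first, last + 1):
            self.bits[page // 8] |= 1 << (page % 8)

    def is_dirty(self, page):
        return bool(self.bits[page // 8] & (1 << (page % 8)))


def write_to_peer(peer_mem, bitmap, addr, data):
    """Copy data into the peer's memory and record which pages were touched."""
    if addr < 0 or addr + len(data) > len(peer_mem):
        raise ValueError("write outside peer memory")
    peer_mem[addr:addr + len(data)] = data
    bitmap.mark(addr, len(data))


vm1_mem = bytearray(4 << PAGE_SHIFT)   # stand-in for VM1's mapped memory
vm1_dirty = DirtyBitmap(len(vm1_mem))
write_to_peer(vm1_mem, vm1_dirty, 4090, b"x" * 10)  # straddles pages 0 and 1
```

A write that straddles a page boundary marks both pages, so the migration code on VM1's side would resend both.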

Thanks,

--
MST

Nakajima, Jun

2015-08-31 18:35:55 UTC


Post by Michael S. Tsirkin
Hello!
During the KVM forum, we discussed supporting virtio on top
of ivshmem. I have considered it, and came up with an alternative
that has several advantages over that - please see below.
Comments welcome.

Hi Michael,

I like this, and it should be able to achieve what I presented at KVM
Forum (vhost-user-shmem).
Comments below.

I assume that you meant VFIO only for virtio by "use of VFIO". To get
VFIO working for general direct-I/O (including VFs) in guests, as you
know, we need to virtualize IOMMU (e.g. VT-d) and the interrupt
remapping table on x86 (i.e. nested VT-d).

Post by Michael S. Tsirkin
By setting up a virtio device for each other VM we need to
communicate to, guest gets full control of its security, from
mapping all memory (like with current vhost-user) to only
mapping buffers used for networking (like ivshmem) to
transient mappings for the duration of data transfer only.

And I think that we can use VMFUNC to have such transient mappings.

Post by Michael S. Tsirkin
This also allows use of VFIO within guests, for improved
security.
vhost user would need to be extended to send the
mappings programmed by guest IOMMU.

Right. We need to think about cases where other VMs (VM3, etc.) join
the group or some existing VM leaves.
PCI hot-plug should work there (as you point out in "Advantages over
ivshmem" below).

Post by Michael S. Tsirkin
remote VM mappings over the vhost-user protocol, and
map them into another VM's memory.
This mapping can take, for example, the form of
a BAR of a pci device, which I'll call here vhost-pci -
with bus address allowed
by VM1's IOMMU mappings being translated into
offsets within this BAR within VM2's physical
memory space.

I think it's sensible.

Post by Michael S. Tsirkin
Since the translation can be a simple one, VM2
can perform it within its vhost-pci device driver.
While this setup would be the most useful with polling,
VM1's ioeventfd can also be mapped to
another VM2's irqfd, and vice versa, such that VMs
can trigger interrupts to each other without need
for a helper thread on the host.
+-- VM1 --------------+ +---VM2-----------+
| virtio-pci -- iommu +--+ vhost-pci -- VF | -- VFIO -- IOMMU -- NIC
+---------------------+ +-----------------+
comparing the two diagrams, a vhost-user thread on the host is
no longer required, reducing the host CPU utilization when
polling is active. At the same time, VM2 can not access all of VM1's
memory - it is limited by the iommu configuration setup by VM1.
- more flexibility, endpoint VMs do not have to place data at any
specific locations to use the device, in practice this likely
means less data copies.
- better standardization/code reuse
virtio changes within guests would be fairly easy to implement
and would also benefit other backends, besides vhost-user
standard hotplug interfaces can be used to add and remove these
channels as VMs are added or removed.
- migration support
It's easy to implement since ownership of memory is well defined.
For example, during migration VM2 can notify hypervisor of VM1
by updating dirty bitmap each time is writes into VM1 memory.

Also, the ivshmem functionality could be implemented by this proposal:
- vswitch (or some VM) allocates memory regions in its address space, and
- it sets up the IOMMU mappings on the VMs so that they translate into those regions

Post by Michael S. Tsirkin
Thanks,
--
MST
_______________________________________________
Virtualization mailing list
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

--
Jun
Intel Open Source Technology Center

Varun Sethi

2015-09-01 03:03:12 UTC


Hi Michael,
When you talk about VFIO in the guest, is it with a purely emulated IOMMU in QEMU?
Also, I am not clear on the following points:
1. How would transient memory be mapped using a BAR in the backend VM?
2. How would the backend VM update the dirty page bitmap for the frontend VM?

Regards
Varun


Michael S. Tsirkin

2015-09-01 08:30:02 UTC


Post by Varun Sethi
Hi Michael,
When you talk about VFIO in guest, is it with a purely emulated IOMMU in Qemu?

This can use the emulated IOMMU in QEMU.
That's probably fast enough if mappings are mostly static.
We can also add a PV-IOMMU if necessary.

Post by Varun Sethi
1. How transient memory would be mapped using BAR in the backend VM

The simplest way is for each update to send a vhost-user message.
The backend receives it, mmaps the memory into the backend QEMU, and
makes it part of a RAM memory slot.

Or the backend QEMU could detect a page fault on access and get the
IOMMU mapping from the frontend QEMU, using vhost-user messages or
shared memory.

Post by Varun Sethi
2. How would the backend VM update the dirty page bitmap for the frontend VM
Regards
Varun

Probably the easiest way to implement this is for the backend QEMU to set up dirty tracking
for the relevant slot (upon getting a vhost-user message
from the frontend), then retrieve the dirty map
from KVM and record it in a shared memory region.
(When to do it? We could have an eventfd and/or a vhost-user message to
trigger this from the frontend QEMU, or just use a timer.)
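
The "retrieve and record" step could look roughly like the sketch below: a per-slot dirty log is OR-ed into a log kept in shared memory and then cleared. This only models the merge; merge_dirty_log and the bytearray-based logs are assumptions for illustration, and fetching the log from KVM as well as the trigger (eventfd, vhost-user message or timer) are left out.

```python
def merge_dirty_log(shared_log, slot_log):
    """OR slot_log into shared_log, clear slot_log, return the newly dirty page count.

    Both arguments are bytearrays of equal length with one bit per page.
    shared_log stands in for the shared memory region, slot_log for the
    dirty map retrieved for the relevant memory slot.
    """
    if len(shared_log) != len(slot_log):
        raise ValueError("logs must cover the same number of pages")
    newly_dirty = 0
    for i, bits in enumerate(slot_log):
        added = bits & ~shared_log[i]          # bits not yet set in the shared log
        newly_dirty += bin(added).count("1")
        shared_log[i] |= bits
        slot_log[i] = 0                        # start a fresh tracking interval
    return newly_dirty


shared = bytearray([0b0001, 0b0000])
slot = bytearray([0b0011, 0b1000])
count = merge_dirty_log(shared, slot)  # would run on each trigger
```

Because the merge is a plain OR, running it more often than needed is harmless, which makes a simple timer a workable trigger.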

An alternative is for the backend VM to get access to the dirty log
(e.g. map it within a BAR) and update it directly in shared memory.
That seems like more work.

Marc-André Lureau recently sent patches to support passing
the dirty log around; these would be useful.


Michael S. Tsirkin

2015-09-01 08:17:20 UTC


Post by Nakajima, Jun

Hi Michael,
I like this, and it should be able to achieve what I presented at KVM
Forum (vhost-user-shmem).
Comments below.

Not necessarily: if a PMD (poll-mode driver) is used, mappings stay mostly static,
and there are no interrupts, so the existing IOMMU emulation in qemu
will do the job.

Post by Nakajima, Jun

And I think that we can use VMFUNC to have such transient mappings.

Interesting. There are two points to make here:

1. To create transient mappings, VMFUNC isn't strictly required.
Instead, mappings can be created when the first access by VM2
within the BAR triggers a page fault.
I guess VMFUNC could remove this first page fault: the hypervisor maps
the host PTE into the alternative view, then VMFUNC makes the
VM2 PTE valid. This might be important if mappings are very dynamic,
so that there are many page faults.

2. To invalidate mappings, VMFUNC isn't sufficient, since the
translation caches of other CPUs need to be invalidated.
I don't think VMFUNC can do this.
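
Both points can be illustrated with a toy model of per-CPU translation caches. It does not model VMFUNC or real TLBs; TranslationCaches and its lookup callback are assumptions for this sketch. A first access on each CPU takes a "fault" that installs the mapping, and revoking a mapping is only effective once every CPU's cached copy is dropped.

```python
class TranslationCaches:
    """Per-vCPU caches of page -> backing translations (a toy stand-in for TLBs)."""

    def __init__(self, ncpus, lookup):
        self._lookup = lookup  # callable: page -> backing, or None if unmapped
        self._caches = [dict() for _ in range(ncpus)]
        self.faults = 0

    def access(self, cpu, page):
        cache = self._caches[cpu]
        if page not in cache:
            # First touch on this CPU: a "page fault" resolved by asking
            # the owner of the mappings (here, the lookup callback).
            self.faults += 1
            backing = self._lookup(page)
            if backing is None:
                raise PermissionError("page %d is not mapped" % page)
            cache[page] = backing
        return cache[page]

    def invalidate(self, page):
        # Revocation has to reach every CPU, not just the one that asks.
        for cache in self._caches:
            cache.pop(page, None)


owner_table = {5: "host-page-A"}        # mappings granted by VM1 (made up)
caches = TranslationCaches(2, owner_table.get)
caches.access(0, 5)
caches.access(1, 5)                     # each CPU faults once
del owner_table[5]                      # VM1 revokes the mapping
stale = caches.access(1, 5)             # still served from CPU 1's cache
caches.invalidate(5)                    # only now is the mapping really gone
```

Pre-populating the caches would correspond to removing the first fault, while the invalidate loop corresponds to the cross-CPU step described in point 2.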

Post by Nakajima, Jun

Post by Michael S. Tsirkin
This also allows use of VFIO within guests, for improved
security.
vhost user would need to be extended to send the
mappings programmed by guest IOMMU.

Right. We need to think about cases where other VMs (VM3, etc.) join
the group or some existing VM leaves.
PCI hot-plug should work there (as you point out at "Advantages over
ivshmem" below).

I think it's sensible.

- vswitch (or some VM) allocates memory regions in its address space, and
- it sets up that IOMMU mappings on the VMs be translated into the regions

I agree it's possible, but that's not something that exists on real
hardware. It's not clear to me what the security implications are
of having VM2 control the IOMMU of VM1. Having each VM control its own IOMMU
seems more straightforward.


Nakajima, Jun

2015-09-01 22:56:32 UTC


My previous email was bounced by virtio-***@lists.oasis-open.org.
I tried to subscribe to it, but to no avail...