Discussion: packed ring layout proposal v2
Michael S. Tsirkin
2017-02-08 03:20:14 UTC
This is an update from the v1 version.
Changes:
- update event suppression mechanism
- separate options for indirect and direct s/g
- lots of new features

---

Performance analysis of this is in my kvm forum 2016 presentation.
The idea is to have a read/write descriptor in a ring structure,
replacing the used and available rings, indices, and descriptor
table.

* Descriptor ring:

Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW
clear. Flags are always set/cleared last.

#define DESC_HW 0x0080

struct desc {
        __le64 addr;
        __le32 len;
        __le16 index;
        __le16 flags;
};

When DESC_HW is set, the descriptor belongs to the device. When it is
clear, it belongs to the driver.
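
To make the ordering rule concrete, here is a minimal sketch (not part
of the proposal) of the driver-side publish path in portable C: the
fence stands in for the kernel's smp_wmb(), native-endian integers
stand in for the __le types, and strictly portable C11 would make
flags _Atomic.

#include <stdatomic.h>
#include <stdint.h>

struct desc {
        uint64_t addr;   /* __le64 in the proposal; native-endian here */
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

#define DESC_HW 0x0080

/* Write the payload fields first and flip ownership last, so the
   device never observes a half-initialized descriptor. */
static void driver_publish(struct desc *d, uint64_t addr, uint32_t len,
                           uint16_t id)
{
        d->addr = addr;
        d->len = len;
        d->index = id;
        atomic_thread_fence(memory_order_release); /* fields, then flag */
        d->flags = DESC_HW;
}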

We can use 1 bit to set the direction:
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2

* Scatter/gather support

We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1

Unlike virtio 1.0, all descriptors must have distinct ID values.

Also unlike virtio 1.0, use of this flag will be an optional feature
(e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
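
As a sketch (reusing struct desc, DESC_HW and the fence idiom from the
sketch above, and ignoring ring wrap-around for brevity), a two-entry
chain could be submitted as follows; since the device is only required
to check DESC_HW on the first descriptor, the tail entry is filled in
first:

/* Chained entries are adjacent in the ring (no "next" index as in
   virtio 1.0); IDs are distinct; the head is published last. */
static void driver_add_sg2(struct desc *ring, uint16_t head,
                           uint64_t addr0, uint32_t len0, uint16_t id0,
                           uint64_t addr1, uint32_t len1, uint16_t id1)
{
        ring[head + 1].addr = addr1;
        ring[head + 1].len = len1;
        ring[head + 1].index = id1;
        ring[head + 1].flags = DESC_HW;   /* no NEXT: ends the chain */

        ring[head].addr = addr0;
        ring[head].len = len0;
        ring[head].index = id0;
        atomic_thread_fence(memory_order_release);
        ring[head].flags = DESC_HW | VRING_DESC_F_NEXT;
}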

* Indirect buffers

Can be marked like in virtio 1.0:

/* This means the buffer contains a table of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4

Unlike virtio 1.0, this is a table, not a list:
struct indirect_descriptor_table {
        /* The actual descriptors (16 bytes each) */
        struct virtq_desc desc[len / 16];
};

The first descriptor is located at the start of the indirect
descriptor table; additional indirect descriptors come immediately
afterwards. VRING_DESC_F_WRITE is the only valid flag for descriptors
in the indirect table. Others should be set to 0 and are ignored. id is
also set to 0 and should be ignored.

virtio 1.0 seems to allow an s/g entry followed by
an indirect descriptor. This does not seem useful,
so we do not allow that anymore.

This support would be an optional feature, same as in virtio 1.0.
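
Here is a sketch of filling such a table for one device-readable and
one device-writable buffer (struct desc as in the earlier sketch; the
table's physical address and the final publish step are assumed to
exist elsewhere):

#include <stdlib.h>

/* Build a 2-entry indirect table.  Inside the table only
   VRING_DESC_F_WRITE is valid; index and other flags stay 0. */
static struct desc *make_indirect(uint64_t in_addr, uint32_t in_len,
                                  uint64_t out_addr, uint32_t out_len)
{
        struct desc *t = calloc(2, sizeof(*t)); /* zeroes index/flags */

        if (!t)
                return NULL;
        t[0].addr = in_addr;             /* device reads this buffer  */
        t[0].len = in_len;
        t[1].addr = out_addr;            /* device writes this buffer */
        t[1].len = out_len;
        t[1].flags = VRING_DESC_F_WRITE;
        /* a ring descriptor then points at this table, with
           len = 2 * sizeof(struct desc) and DESC_HW | F_INDIRECT set */
        return t;
}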

* Batching descriptors:

virtio 1.0 allows passing a batch of descriptors in both directions, by
incrementing the used/avail index by values > 1. We can support this by
chaining a list of descriptors through a bit in the flags field.
To allow use together with s/g, a different bit will be used.

#define VRING_DESC_F_BATCH_NEXT 0x0010

Batching works for both driver and device descriptors.



* Processing descriptors in and out of order

A device processing all descriptors in order can simply flip
the DESC_HW bit as it is done with each descriptor.

A device can also write descriptors out as they are used,
overwriting the descriptors that are already there.

The device must not use a descriptor until DESC_HW is set.
It is only required to look at the first descriptor
submitted.

The driver must not overwrite a descriptor until DESC_HW is clear.
It, too, is only required to look at the first descriptor
submitted.
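
A sketch of the in-order device-side loop under these rules
(handle_buffer() and ring_size are hypothetical, and a real device
would use proper atomics for the flags accesses):

uint32_t handle_buffer(struct desc *d); /* hypothetical consumer */

/* Poll DESC_HW on the current head only; write back len, then clear
   DESC_HW last to hand the descriptor back to the driver. */
static void device_poll(struct desc *ring, uint16_t ring_size,
                        uint16_t *head)
{
        while (ring[*head].flags & DESC_HW) {
                uint32_t written = handle_buffer(&ring[*head]);

                ring[*head].len = written;
                atomic_thread_fence(memory_order_release);
                ring[*head].flags = 0;            /* back to the driver */
                *head = (*head + 1) % ring_size;  /* the ring wraps */
        }
}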

* Device specific descriptor flags
We have a lot of unused space in the descriptor. This can be put to
good use by reserving some flag bits for device use.
For example, a network device can set a bit to request
that the header in the descriptor is suppressed
(in case it's all 0s anyway). This reduces cache utilization.

Note: this feature can be supported in virtio 1.0 as well,
as we have unused bits in both descriptor and used ring there.

* Descriptor length in device descriptors

virtio 1.0 places strict requirements on descriptor length. For example,
it must be 0 in the used ring of the TX VQ of a network device, since
nothing is written. In practice guests do not seem to use this, so we
can simplify devices a bit by removing this requirement - if length is
unused it should be ignored by the driver.

Some devices use identically-sized buffers in all descriptors.
Ignoring length for driver descriptors there could be an option too.

* Writing at an offset

Some devices might want to write into some descriptors
at an offset; the offset would be in config space,
and a descriptor flag could indicate this:

#define VRING_DESC_F_OFFSET 0x0020

How exactly to use the offset would be device specific;
for example, it can be used to align the beginning of the packet
in the 1st buffer to a cache line boundary for mergeable buffers
while also aligning the rest of the buffers.

* Non power-of-2 ring sizes

As the ring simply wraps around, there's no reason to
require the ring size to be a power of two.
It can be made a separate feature though.


* Interrupt/event suppression

virtio 1.0 has two mechanisms for suppression, but only
one can be used at a time. We pack them together
in a structure - one for interrupts, one for notifications:

struct event {
        __le16 idx;
        __le16 flags;
};

Both fields would be optional, with a feature bit:
VIRTIO_F_EVENT_IDX
VIRTIO_F_EVENT_FLAGS

* Flags can be used like in virtio 1.0, by storing a special
value there:

#define VRING_F_EVENT_ENABLE 0x0

#define VRING_F_EVENT_DISABLE 0x1

* Event index would be in the range 0 to 2 * Queue Size
(to detect wrap arounds) and wrap to 0 after that.

The assumption is that each side maintains an internal
descriptor counter 0 to 2 * Queue Size that wraps to 0.
In that case, the interrupt triggers when the counter reaches
the given value.
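
As a sketch, the resulting notification test follows the shape of
virtio 1.0's vring_need_event(), but with distances computed modulo
2 * Queue Size rather than 2^16; the arguments are the free-running
counters described above, all in the range 0 to wrap - 1:

/* Fire an event iff the counter passed event_idx while moving from
   old_idx to new_idx; wrap == 2 * Queue Size. */
static inline int need_event(int event_idx, int new_idx, int old_idx,
                             int wrap)
{
        int processed = (new_idx - old_idx + wrap) % wrap;
        int to_event = (new_idx - event_idx - 1 + wrap) % wrap;

        return to_event < processed;
}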

* If both features are negotiated, a special flags value
can be used to switch to event idx:

#define VRING_F_EVENT_IDX 0x2


* Prototype

A partial prototype can be found under
tools/virtio/ringtest/ring.c

Test run:
[***@tuck ringtest]$ time ./ring
real 0m0.399s
user 0m0.791s
sys 0m0.000s
[***@tuck ringtest]$ time ./virtio_ring_0_9
real 0m0.503s
user 0m0.999s
sys 0m0.000s

It is planned to update it to this interface. Future changes and
enhancements can (and should) be tested against this prototype.

* Feature sets
In particular, are we going overboard with feature bits? It becomes hard
to support all combinations in drivers and devices. Maybe we should
document reasonable feature sets to be supported for each device.

* Known issues/ideas

This layout is optimized for host/guest communication,
in a sense even more aggressively than virtio 1.0.
It might be suboptimal for PCI hardware implementations.
However, one notes that current virtio pci drivers are
unlikely to work with PCI hardware implementations anyway
(e.g. due to use of SMP barriers for ordering).

Suggestions for improving this are welcome but need to be tested to make
sure our main use case does not regress. Of course some improvements
might be made optional, but if we add too many of these the driver
becomes unmanageable.

---

Note: should this proposal be accepted and approved, one or more
claims disclosed to the TC admin and listed on the Virtio TC
IPR page https://www.oasis-open.org/committees/virtio/ipr.php
might become Essential Claims.
--
MST
Christian Borntraeger
2017-02-08 13:37:57 UTC
On 02/08/2017 04:20 AM, Michael S. Tsirkin wrote:
[...]
Post by Michael S. Tsirkin
* Prototype
A partial prototype can be found under
tools/virtio/ringtest/ring.c
real 0m0.399s
user 0m0.791s
sys 0m0.000s
real 0m0.503s
user 0m0.999s
sys 0m0.000s
I see similar improvements on s390, so I think this would be a very nice
improvement.

[...]
Post by Michael S. Tsirkin
Note: should this proposal be accepted and approved, one or more
claims disclosed to the TC admin and listed on the Virtio TC
IPR page https://www.oasis-open.org/committees/virtio/ipr.php
might become Essential Claims.
I have trouble parsing that legal stuff. Do I read that right, that
these claims can be implemented as part of a virtio implementation without
any worries (e.g. non open source HW implementation or non open source
hypervisor)?
Michael S. Tsirkin
2017-02-09 17:43:09 UTC
Post by Christian Borntraeger
[...]
Post by Michael S. Tsirkin
* Prototype
A partial prototype can be found under
tools/virtio/ringtest/ring.c
real 0m0.399s
user 0m0.791s
sys 0m0.000s
real 0m0.503s
user 0m0.999s
sys 0m0.000s
I see similar improvements on s390, so I think this would be a very nice
improvement.
[...]
Post by Michael S. Tsirkin
Note: should this proposal be accepted and approved, one or more
claims disclosed to the TC admin and listed on the Virtio TC
IPR page https://www.oasis-open.org/committees/virtio/ipr.php
might become Essential Claims.
I have trouble parsing that legal stuff. Do I read that right, that
these claims can be implemented as part of a virtio implementation without
any worries (e.g. non open source HW implementation or non open source
hypervisor)?
By that legal stuff do you mean the IPR or the Note?

Not representing Red Hat here, and definitely not legal advice, below is
just my personal understanding of the IPR requirements.

Virtio TC operates under the Non-Assertion Mode of the OASIS IPR Policy:
https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode

As far as I can tell, the hardware and software question is covered
by that policy, since it states:

Covered Product - includes only those specific portions of a
product (hardware, software or combinations thereof)

Also, as far as I can tell the IPR does not mention source at all, so I
think the virtio IPR would apply to open and closed source software equally.

The Note is included to satisfy the disclosure requirements.

Does this answer the question?
--
MST
Christian Borntraeger
2017-02-09 18:27:39 UTC
On 02/09/2017 06:43 PM, Michael S. Tsirkin wrote:
[...]
Post by Michael S. Tsirkin
Post by Christian Borntraeger
[...]
Post by Michael S. Tsirkin
Note: should this proposal be accepted and approved, one or more
claims disclosed to the TC admin and listed on the Virtio TC
IPR page https://www.oasis-open.org/committees/virtio/ipr.php
might become Essential Claims.
I have trouble parsing that legal stuff. Do I read that right, that
these claims can be implemented as part of a virtio implementation without
any worries (e.g. non open source HW implementation or non open source
hypervisor)?
By that legal stuff do you mean the IPR or the Note?
Both in combination.
Post by Michael S. Tsirkin
Not representing Red Hat here, and definitely not legal advice, below is
just my personal understanding of the IPR requirements.
https://www.oasis-open.org/policies-guidelines/ipr#Non-Assertion-Mode
As far as I can tell, the hardware and software question is covered
Covered Product - includes only those specific portions of a
product (hardware, software or combinations thereof)
Also as far as I can tell IPR does not mention source at all, so I think
virtio IPR would apply to open and closed source software equally.
The Note is included to satisfy the disclosure requirements.
Does this answer the question?
Yes, thanks.
Paolo Bonzini
2017-02-08 17:41:40 UTC
Post by Michael S. Tsirkin
* Scatter/gather support
/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1
Unlike virtio 1.0, all descriptors must have distinct ID values.
Also unlike virtio 1.0, use of this flag will be an optional feature
(e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
I would still prefer that we had _either_ single-direct or
multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT. I can propose
my idea for this in a separate message.
Post by Michael S. Tsirkin
virtio 1.0 allows passing a batch of descriptors in both directions, by
incrementing the used/avail index by values > 1. We can support this by
chaining a list of descriptors through a bit in the flags field.
To allow use together with s/g, a different bit will be used.
#define VRING_DESC_F_BATCH_NEXT 0x0010
Batching works for both driver and device descriptors.
I'm still not sure how this would be useful. It cannot be mandatory to
set the bit, I think, because you don't know when the host/guest is
going to read descriptors. So both host and guest always have to look
ahead one element in any case.
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size,
Post by Michael S. Tsirkin
* Event index would be in the range 0 to 2 * Queue Size
(to detect wrap arounds) and wrap to 0 after that.
The assumption is that each side maintains an internal
descriptor counter 0 to 2 * Queue Size that wraps to 0.
In that case, interrupt triggers when counter reaches
the given value.
but it seems more complicated than just forcing power-of-2 and ignoring
the high bits.

Thanks,

Paolo
Michael S. Tsirkin
2017-02-08 19:59:09 UTC
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Scatter/gather support
/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1
Unlike virtio 1.0, all descriptors must have distinct ID values.
Also unlike virtio 1.0, use of this flag will be an optional feature
(e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.
I would still prefer that we had _either_ single-direct or
multiple-indirect descriptors, i.e. no VRING_DESC_F_NEXT. I can propose
my idea for this in a separate message.
All it costs us spec-wise is a single bit :)

The cost of indirect is an extra cache miss.

We couldn't decide what's better for everyone in 1.0 days and I doubt
we'll be able to now, but yes, benchmarking is needed to make
sure it's required. Very easy to remove or not to use/support in
drivers/devices though.
Post by Paolo Bonzini
Post by Michael S. Tsirkin
virtio 1.0 allows passing a batch of descriptors in both directions, by
incrementing the used/avail index by values > 1. We can support this by
chaining a list of descriptors through a bit in the flags field.
To allow use together with s/g, a different bit will be used.
#define VRING_DESC_F_BATCH_NEXT 0x0010
Batching works for both driver and device descriptors.
I'm still not sure how this would be useful.
So this is used at least by virtio-net mergeable buffers to combine
many buffers into a single packet.

Similarly, on transmit Linux sometimes supplies packets in batches
(the XMIT_MORE flag); if the other side processes them, it seems nice to
tell it: there's more to come soon, and if you see this it is wise to poll now.

That's why I kind of felt it's better as a standard bit.
Post by Paolo Bonzini
It cannot be mandatory to
set the bit, I think, because you don't know when the host/guest is
going to read descriptors. So both host and guest always have to look
ahead one element in any case.
Right, but the point is what to do if you find nothing there?
If you saw VRING_DESC_F_BATCH_NEXT it's a hint that
you should poll, there's more to come soon.
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So

if (unlikely(++idx >= size))
        idx = 0;

OTOH a ring size that's twice as large as necessary
because of the power-of-two requirement wastes cache.
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Event index would be in the range 0 to 2 * Queue Size
(to detect wrap arounds) and wrap to 0 after that.
The assumption is that each side maintains an internal
descriptor counter 0 to 2 * Queue Size that wraps to 0.
In that case, interrupt triggers when counter reaches
the given value.
but it seems more complicated than just forcing power-of-2 and ignoring
the high bits.
Thanks,
Paolo
Absolutely, power of 2 lets you save a branch.
At this stage I'm just recording all the ideas
and then as a next step we can micro-benchmark prototypes
and compare.
--
MST
Paolo Bonzini
2017-02-09 15:48:53 UTC
Post by Michael S. Tsirkin
We couldn't decide what's better for everyone in 1.0 days and I doubt
we'll be able to now, but yes, benchmarking is needed to make
sure it's required. Very easy to remove or not to use/support in
drivers/devices though.
Fair enough, but of course then we must specify that devices MUST
support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
SHOULD support both (or use neither).

The drivers' part adds to the implementation burden, which is why I
wanted to remove it. Alternatively we can say that indirect is
mandatory for both devices and drivers (and save a feature bit), while
VRING_DESC_F_NEXT is optional.
Post by Michael S. Tsirkin
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So
if (unlikely(++idx >= size))
        idx = 0;
OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.

If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
don't have any problem with it.

Paolo
Cornelia Huck
2017-02-09 16:11:05 UTC
On Thu, 9 Feb 2017 16:48:53 +0100
Post by Paolo Bonzini
Post by Michael S. Tsirkin
We couldn't decide what's better for everyone in 1.0 days and I doubt
we'll be able to now, but yes, benchmarking is needed to make
sure it's required. Very easy to remove or not to use/support in
drivers/devices though.
Fair enough, but of course then we must specify that devices MUST
support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
SHOULD support both (or use neither).
The drivers' part adds to the implementation burden, which is why I
wanted to remove it. Alternatively we can say that indirect is
mandatory for both devices and drivers (and save a feature bit), while
VRING_DESC_F_NEXT is optional.
I think this (INDIRECT mandatory, NEXT optional) makes sense. But we
really need some benchmarking.
Post by Paolo Bonzini
Post by Michael S. Tsirkin
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So
if (unlikely(++idx >= size))
        idx = 0;
OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.
Michael S. Tsirkin
2017-02-22 16:43:05 UTC
Post by Cornelia Huck
Post by Paolo Bonzini
Post by Michael S. Tsirkin
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So
if (unlikely(++idx >= size))
        idx = 0;
OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.
I recalled why I came up with this. The issue is cache associativity.
Recall that besides the ring we have event suppression
structures - if we are lucky and things run at the same speed,
everything can work by polling, keeping events disabled; then the
event suppression structures are never written to, they are read-only.

However, if the ring and event suppression map to the same cache set,
ring accesses have a chance to push the event suppression out of cache,
causing misses on read.

This can happen if they are at the same offset in the set.
E.g. with L1 caches, 4Kbyte sets are common, hence the same offset
within a 4K page.

We can fix this by making event suppression adjacent in memory, e.g.:


[interrupt suppress]
[descriptor ring]
[kick suppress]

If this whole structure fits in a single set, ring accesses will
not push kick or interrupt suppress out of cache.
Specific layout can be left for drivers, but as set size is
a power of two this might require a non-power of two ring size.
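
Expressed as a sketch (struct desc and struct event as defined in the
proposal, RING_SIZE hypothetical), the adjacency described above could
look like:

#define RING_SIZE 256 /* hypothetical; sized so the struct fits a set */

struct ring_mem {
        struct event irq_suppress;  /* written by driver, read by device */
        struct desc ring[RING_SIZE];
        struct event kick_suppress; /* written by device, read by driver */
};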

I conclude that this is an optimization that needs to be
benchmarked.

I also note that the generic description does not have to force
powers of two *even if devices actually require it*.
I would be inclined to word the text in a way that makes
relaxing the restriction easier.

For example, we can say "free running 16 bit index" and this forces a
power of two, but we can also say "free running index wrapping to 0
after (N*queue-size - 1) with N chosen such that the value fits in 16
bit" and this is exactly the same if queue size is a power of 2.

So we can add text saying "ring size MUST be a power of two"
and later it will be easy to relax just by adding a feature bit.
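
A sketch of that wording in code, with wrap == N * queue-size and N
chosen so that wrap - 1 fits in 16 bits; for a power-of-2 queue size
one can pick wrap == 65536 and the modulo disappears:

static inline uint16_t idx_next(uint16_t idx, uint32_t wrap)
{
        return (uint16_t)((idx + 1) % wrap);
}
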
--
MST
Cornelia Huck
2017-03-07 15:53:53 UTC
On Wed, 22 Feb 2017 18:43:05 +0200
Post by Michael S. Tsirkin
Post by Cornelia Huck
Post by Paolo Bonzini
Post by Michael S. Tsirkin
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So
if (unlikely(++idx >= size))
        idx = 0;
OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.
I recalled why I came up with this. The issue is cache associativity.
Recall that besides the ring we have event suppression
structures - if we are lucky and things run at the same speed
everything can work by polling keeping events disabled, then
event suppression structures are never written to, they are read-only.
However if ring and event suppression share a cache line ring accesses
have a chance to push the event suppression out of cache, causing
misses on read.
This can happen if they are at the same offset in the set.
E.g. with L1 cache 4Kbyte sets are common, so same offset
within a 4K page.
[interrupt suppress]
[descriptor ring]
[kick suppress]
If this whole structure fits in a single set, ring accesses will
not push kick or interrupt suppress out of cache.
Specific layout can be left for drivers, but as set size is
a power of two this might require a non-power of two ring size.
I conclude that this is an optimization that needs to be
benchmarked.
This makes sense. But wouldn't the optimum layout depend on the
platform?
Post by Michael S. Tsirkin
I also note that the generic description does not have to force
powers of two *even if devices actually require it*.
I would be inclined to word the text in a way that makes
relaxing the restriction easier.
For example, we can say "free running 16 bit index" and this forces a
power of two, but we can also say "free running index wrapping to 0
after (N*queue-size - 1) with N chosen such that the value fits in 16
bit" and this is exactly the same if queue size is a power of 2.
So we can add text saying "ring size MUST be a power of two"
and later it will be easy to relax just by adding a feature bit.
A later feature bit sounds good.
Michael S. Tsirkin
2017-03-07 20:33:57 UTC
Post by Cornelia Huck
On Wed, 22 Feb 2017 18:43:05 +0200
Post by Michael S. Tsirkin
Post by Cornelia Huck
Post by Paolo Bonzini
Post by Michael S. Tsirkin
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So
if (unlikely(++idx >= size))
        idx = 0;
OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I agree. I don't think dropping the power of 2 requirement buys us so
much that it makes up for the added complexity.
I recalled why I came up with this. The issue is cache associativity.
Recall that besides the ring we have event suppression
structures - if we are lucky and things run at the same speed
everything can work by polling keeping events disabled, then
event suppression structures are never written to, they are read-only.
However if ring and event suppression share a cache line ring accesses
have a chance to push the event suppression out of cache, causing
misses on read.
This can happen if they are at the same offset in the set.
E.g. with L1 cache 4Kbyte sets are common, so same offset
within a 4K page.
[interrupt suppress]
[descriptor ring]
[kick suppress]
If this whole structure fits in a single set, ring accesses will
not push kick or interrupt suppress out of cache.
Specific layout can be left for drivers, but as set size is
a power of two this might require a non-power of two ring size.
I conclude that this is an optimization that needs to be
benchmarked.
This makes sense. But wouldn't the optimum layout depend on the
platform?
There's generally a tradeoff between performance and portability.
Whether it's worth it would need to be tested.
Further, it might be better to have platform-specific optimization
tied to a given platform rather than a feature bit.
Post by Cornelia Huck
Post by Michael S. Tsirkin
I also note that the generic description does not have to force
powers of two *even if devices actually require it*.
I would be inclined to word the text in a way that makes
relaxing the restriction easier.
For example, we can say "free running 16 bit index" and this forces a
power of two, but we can also say "free running index wrapping to 0
after (N*queue-size - 1) with N chosen such that the value fits in 16
bit" and this is exactly the same if queue size is a power of 2.
So we can add text saying "ring size MUST be a power of two"
and later it will be easy to relax just by adding a feature bit.
A later feature bit sounds good.
No need to delay benchmarking if someone has the time though :)
--
MST
Amnon Ilan
2017-07-10 16:27:17 UTC
+Lior

----- Original Message -----
Sent: Tuesday, March 7, 2017 10:33:57 PM
Subject: Re: [virtio-dev] packed ring layout proposal v2
[...]
Michael S. Tsirkin
2017-02-09 18:24:06 UTC
Post by Paolo Bonzini
Post by Michael S. Tsirkin
We couldn't decide what's better for everyone in 1.0 days and I doubt
we'll be able to now, but yes, benchmarking is needed to make
sure it's required. Very easy to remove or not to use/support in
drivers/devices though.
Fair enough, but of course then we must specify that devices MUST
support either VRING_DESC_F_NEXT or VRING_DESC_F_INDIRECT, and drivers
SHOULD support both (or use neither).
The drivers' part adds to the implementation burden, which is why I
wanted to remove it. Alternatively we can say that indirect is
mandatory for both devices and drivers (and save a feature bit), while
VRING_DESC_F_NEXT is optional.
Post by Michael S. Tsirkin
Post by Paolo Bonzini
Post by Michael S. Tsirkin
* Non power-of-2 ring sizes
As the ring simply wraps around, there's no reason to
require ring size to be power of two.
It can be made a separate feature though.
Power of 2 ring sizes are required in order to ignore the high bits of
the indices. With non-power-of-2 sizes you are forced to keep the
indices less than the ring size.
Right. So
if (unlikely(++idx >= size))
        idx = 0;
OTOH ring size that's twice larger than necessary
because of power of two requirements wastes cache.
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.
Post by Paolo Bonzini
If batching is mostly advisory (with exceptions such as mrg-rxbuf) I
don't have any problem with it.
Paolo
Paolo Bonzini
2017-02-10 11:32:49 UTC
Post by Michael S. Tsirkin
Post by Paolo Bonzini
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.
The reason is that it allows indices to be free running. This is an
example of QEMU code that requires that:

    nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
    /* Check it isn't doing strange things with descriptor numbers. */
    if (nheads > vdev->vq[i].vring.num) {
        error_report("VQ %d size 0x%x Guest index 0x%x "
                     "inconsistent with Host index 0x%x: delta 0x%x",
                     i, vdev->vq[i].vring.num,
                     vring_avail_idx(&vdev->vq[i]),
                     vdev->vq[i].last_avail_idx, nheads);
        return -1;
    }

Paolo
Michael S. Tsirkin
2017-02-10 15:20:17 UTC
Post by Paolo Bonzini
Post by Michael S. Tsirkin
Post by Paolo Bonzini
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.
The reason is that it allows indices to be free running.
Well what I meant is that with qsize not a power of 2 you can still do
this but have to do everything mod N*qsize as opposed to mod 2^16. So
you need a branch there - easiest to do if you do signed math.

    int nheads = avail - last_avail;
    /* Check and handle index wrap-around */
    if (unlikely(nheads < 0)) {
        nheads += N_qsize;
    }

    if (nheads < 0 || nheads > vdev->vq[i].vring.num) {
        error_report(...);
        return -1;
    }

This can only catch bugs if N > 1
Post by Paolo Bonzini
This is an example of QEMU code that requires that:
    nheads = vring_avail_idx(&vdev->vq[i]) - vdev->vq[i].last_avail_idx;
    /* Check it isn't doing strange things with descriptor numbers. */
    if (nheads > vdev->vq[i].vring.num) {
        error_report("VQ %d size 0x%x Guest index 0x%x "
                     "inconsistent with Host index 0x%x: delta 0x%x",
                     i, vdev->vq[i].vring.num,
                     vring_avail_idx(&vdev->vq[i]),
                     vdev->vq[i].last_avail_idx, nheads);
        return -1;
    }
Paolo
Same thing here, this never triggers if vring.num == 2^16
--
MST
Paolo Bonzini
2017-02-10 16:17:30 UTC
----- Original Message -----
Sent: Friday, February 10, 2017 4:20:17 PM
Subject: Re: [virtio-dev] packed ring layout proposal v2
Post by Paolo Bonzini
Post by Michael S. Tsirkin
Post by Paolo Bonzini
I don't know. Power of 2 ring size is pretty standard, I'd rather avoid
the complication and the gratuitous difference with 1.0.
I thought originally there's a reason 1.0 rings had to be powers of two
but now I don't see why. OK, we can make it a feature flag later if we
want to.
The reason is that it allows indices to be free running.
Well what I meant is that with qsize not a power of 2 you can still do
this but have to do everything mod N*qsize as opposed to mod 2^16. So
you need a branch there - easiest to do if you do signed math.
    int nheads = avail - last_avail;
    /* Check and handle index wrap-around */
    if (unlikely(nheads < 0)) {
        nheads += N_qsize;
    }
    if (nheads < 0 || nheads > vdev->vq[i].vring.num) {
        error_report(...);
        return -1;
    }
This can only catch bugs if N > 1
Agreed.
Same thing here, this never triggers if vring.num == 2^16
Free running indices require the counter range to be bigger than the
size of the vring (otherwise you confuse an empty vring with a full
one), so the maximum size of the vring in virtio <=1.0 is 2^15.

Paolo
Michael S. Tsirkin
2017-02-22 04:27:11 UTC
Here is an attempt to summarise the list of things
we need to do with respect to this proposal.


Stage 1: update prototype and finalize proposal

At this stage we need to update the prototype under
tools/virtio/ringtest/ring.c to make it match the
latest proposal, adding in various options under discussion.
Then we will measure performance. While more ideas are welcome,
they won't be useful without the ability to test!

Tasks:

- update tools/virtio/ringtest/ring.c to proposal, adding
options as listed below.
- compare performance with and without indirect buffers
issue: how to estimate cost of memory allocations?
- batching descriptors - is it worth it?
- three ways to suppress interrupts/exits
which works better?
issue: how to estimate cost of interrupts/exits?


More ideas:

- current tests all indicate that cache synchronizations due to r/w descriptors
cost as much as those for read-only/write-only descriptors,
and there are fewer of these synchronizations.
Disagree? Write a patch and benchmark.
- some devices could use smaller descriptors. For example,
if you don't need length, id (e.g. using in-order, fixed length) or flags, you can
use a single 64 bit pointer as a descriptor. This can sometimes work
e.g. for networking rx rings.
Not clear whether this gains/costs us anything. Disagree? Write a patch and benchmark.

Stage 2: prototype guest/host drivers

At this stage we need real guest and host drivers
to be able to test real life performance.
I suggest dpdk drivers + a minimal hack in qemu to
pass features around.

Tasks:

- implement vhost-user support in dpdk
- implement virtio support in dpdk
- implement minimal stub in qemu
- test performance. Possibly revisit questions from stage 1
if any are left open


Stage 3: complete guest/host drivers

At this stage we need to add linux support in virtio and vhost,
and complete qemu support.

Tasks:

- implement vhost support in kernel
- implement virtio support in kernel
- complete virtio support in qemu
- test performance. Possibly revisit questions from stage 2
if any are left open



Stage X: Finalizing the spec

When are we ready to put out a spec draft? Surely not before Stage 1 is
done. Surely no later than stage 3. One driver could be a wish by some
party to productize an implementation.

Interested? Join TC and start discussion on timing and which
features should be included.
--
MST
Michael S. Tsirkin
2017-02-22 15:13:19 UTC
From: virtio-***@lists.oasis-open.org [mailto:virtio-***@lists.oasis-open.org] On Behalf Of Michael S. Tsirkin
Here is an attempt to summarise the list of things we need to do with respect
to this proposal.
Thanks for outlining this, it makes it a lot clearer.
Stage 1: update prototype and finalize proposal
At this stage we need to update the prototype under
tools/virtio/ringtest/ring.c to make it match the latest proposal, adding in various
Would a DPDK-based prototype be good as well, or would you rather
everything be prototyped here?
DPDK is good.

You would then
- code up current proposal
- code up an alternative you are suggesting
- compare

tools/virtio/ringtest/ is only there to make prototyping
easier. I like keeping it alive for this purpose but it
is not a must.
options under discussion.
Then we will measure performance. While more ideas are welcome, they
won't be useful without the ability to test!
- update tools/virtio/ringtest/ring.c to proposal, adding
options as listed below.
- compare performance with and without indirect buffers
issue: how to estimate cost of memory allocations?
- batching descriptors - is it worth it?
- three ways to suppress interrupts/exits
which works better?
issue: how to estimate cost of interrupts/exits?
We need to ensure we have a proposal that is suitable for
implementation in hardware.
Right. Obviously the system has to work well as a whole though; e.g. if
you save PCI Express bandwidth, speeding up hardware, but pay in cache
misses, slowing down software, it's not clear you have won anything at all.
[...]
Michael S. Tsirkin
2017-03-01 01:07:29 UTC
Hi Michael,
Again, as usual, sorry for being late :/
Post by Michael S. Tsirkin
Stage 2: prototype guest/host drivers
At this stage we need real guest and host drivers
to be able to test real life performance.
I suggest dpdk drivers + a minimal hack in qemu to
pass features around.
I have already done that, last Nov. I made a very rough (yet hacky)
version (only with the Tx path) in one day while accompanying my wife in
the hospital.
Any performance data?
If someone is interested, I could share the code soon. I could
even clean up the code a bit if necessary.
Especially if you don't have time to benchmark, I think sharing it
might help.
Post by Michael S. Tsirkin
- implement vhost-user support in dpdk
- implement virtio support in dpdk
- implement minimal stub in qemu
I didn't hack QEMU; instead, I hacked the DPDK virtio-user, yet
another virtio-net emulation. It's simpler and quicker for me.
Sure, I merely meant a stub for negotiating new feature bits
between host and guest. But I guess an environment set
the same way in host and guest would serve too.
- Look deeper inside the virtio net performance issues (WIP)
It's basically a job of digging deeper into the DPDK vhost/virtio
code: how exactly the cache behaves while Tx/Rx'ing pkts, what can be
optimized in the implementation, and what could be improved with the
help of a spec extension.
Please note that I often got interrupted on this task: it didn't
go as smoothly as I would have expected.
- Try to accelerate vhost/virtio with vector instructions.
That's interesting. What kind of optimizations would you say
vector instructions enable, and why?
Something I will look into when the above item is done. Currently,
* what kind of vring and desc layout could make the vector
implementation easier.
* what kind of hint we need from virtio spec for (dynamically)
enabling the vector path.
Besides that, I don't have too much clue yet.
--yliu
Yuanhan Liu
2017-03-08 07:09:48 UTC
Post by Michael S. Tsirkin
Hi Michael,
Again, as usual, sorry for being late :/
Post by Michael S. Tsirkin
Stage 2: prototype guest/host drivers
At this stage we need real guest and host drivers
to be able to test real life performance.
I suggest dpdk drivers + a minimal hack in qemu to
pass features around.
I have already done that, last Nov. I made a very rough (yet hacky)
version (only with the Tx path) in one day while accompanying my wife in
the hospital.
Any performance data?
A straightforward implementation only brings a 10% performance boost in a
txonly micro benchmark. But I'm sure there is still plenty of room
for improvement.
Post by Michael S. Tsirkin
If someone is interested, I could share the code soon. I could
even clean up the code a bit if necessary.
Especially if you don't have time to benchmark, I think sharing it
might help.
Here it is (check the README-virtio-1.1 for howto):

git://fridaybit.com/git/dpdk.git virtio-1.1-v0.1
Post by Michael S. Tsirkin
- Try to accelerate vhost/virtio with vector instructions.
That's interesting. What kind of optimizations would you say
vector instructions enable, and why?
If we have made the cache impact minimal, what's left to
optimize is the instruction cycles. SIMD instructions (like AVX)
should then help with this.

--yliu
Yuanhan Liu
2017-03-08 07:56:24 UTC
[...]
Post by Michael S. Tsirkin
If someone is interested, I could share the code soon. I could
even clean up the code a bit if necessary.
Especially if you don't have time to benchmark, I think sharing it
might help.
git://fridaybit.com/git/dpdk.git virtio-1.1-v0.1
Well, I was told it may not be proper to share code this way. So
this channel is closed. I will check how to find a proper way. Sorry
for the inconvenience!

--yliu
Michael S. Tsirkin
2017-03-29 12:39:20 UTC
Post by Yuanhan Liu
[...]
Well, I was told it may not be proper to share code this way. So
this channel is closed. I will check how to find a proper way. Sorry
for the inconvenience!
--yliu
Where are you going to re-post it?
I don't see what the issue could be, frankly.
Care to elaborate?
--
MST
Yuanhan Liu
2017-04-01 07:30:11 UTC
Post by Michael S. Tsirkin
[...]
Where are you going to re-post it?
It's at

git://dpdk.org/next/dpdk-next-virtio for-testing
Post by Michael S. Tsirkin
I don't see what the issue could be frankly.
Care to elaborate?
Honestly, I don't know either.

--yliu
Chien, Roger S
2017-02-22 14:46:47 UTC
Here is some feedback w.r.t. the virtio v1.1 proposal from an FPGA implementation viewpoint.

1) As Gary presented, depending on flags to determine ownership of descriptors is not efficient. And for the QPI FPGA platform, it may also lead to race conditions, because the current design only supports whole cache line reads/writes with no partial byte enables. Even with partial byte enables, it would cause the CPU/FPGA to invalidate the whole cache line even when you just toggle one bit (for RX, returning the descriptor to SW....)

2) S/G is a nightmare for HW, because the next descriptor is only available once you have fetched the current one. So the HW is unable to parallelize descriptor fetching & buffer fetching. An indirect table is better (compared to an indirect linked list). But it would be even better if we could put chained descriptors inside the original descriptor table, using a flag to indicate that their buffers belong to one large piece (though this may conflict with the flexibility to increase/decrease descriptors at run time).

3) Better to keep power-of-2 ring sizes from the HW viewpoint, but it is not so painful if we need to support non-power-of-2 ring sizes (a single AND mask versus a comparator). A sketch of this tradeoff is below.
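
A sketch of that tradeoff in C (assuming <stdint.h>): a power-of-2
ring wraps with a single AND mask, while an arbitrary size needs a
compare-and-reset, i.e. a comparator in hardware.

static inline uint16_t next_pow2(uint16_t idx, uint16_t size)
{
        return (idx + 1) & (size - 1);   /* size == 2^n */
}

static inline uint16_t next_any(uint16_t idx, uint16_t size)
{
        return (idx + 1 == size) ? 0 : idx + 1;
}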


-----Original Message-----
From: virtio-***@lists.oasis-open.org [mailto:virtio-***@lists.oasis-open.org] On Behalf Of Michael S. Tsirkin
Sent: Wednesday, February 8, 2017 11:20 AM
To: virtio-***@lists.oasis-open.org
Cc: ***@lists.linux-foundation.org
Subject: [virtio-dev] packed ring layout proposal v2

This is an update from v1 version.
Changes:
- update event suppression mechanism
- separate options for indirect and direct s/g
- lots of new features

---

Performance analysis of this is in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure, replacing the used and available ring, index and descriptor buffer.

* Descriptor ring:

Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW clear. Flags are always set/cleared last.

#define DESC_HW 0x0080

struct desc {
__le64 addr;
__le32 len;
__le16 index;
__le16 flags;
};

When DESC_HW is set, descriptor belongs to device. When it is clear, it belongs to the driver.

We can use 1 bit to set direction
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2

* Scatter/gather support

We can use 1 bit to chain s/g entries in a request, same as virtio 1.0:

/* This marks a buffer as continuing via the next field. */
#define VRING_DESC_F_NEXT 1

Unlike virtio 1.0, all descriptors must have distinct ID values.

Also unlike virtio 1.0, use of this flag will be an optional feature (e.g. VIRTIO_F_DESC_NEXT) so both devices and drivers can opt out of it.

* Indirect buffers

Can be marked like in virtio 1.0:

/* This means the buffer contains a table of buffer descriptors. */
#define VRING_DESC_F_INDIRECT 4

Unlike virtio 1.0, this is a table, not a list:
struct indirect_descriptor_table {
/* The actual descriptors (16 bytes each) */
struct virtq_desc desc[len / 16]; };

The first descriptor is located at start of the indirect descriptor table, additional indirect descriptors come immediately afterwards.
DESC_F_WRITE is the only valid flag for descriptors in the indirect table. Others should be set to 0 and are ignored. id is also set to 0 and should be ignored.

virtio 1.0 seems to allow a s/g entry followed by an indirect descriptor. This does not seem useful, so we do not allow that anymore.

This support would be an optional feature, same as in virtio 1.0

* Batching descriptors:

virtio 1.0 allows passing a batch of descriptors in both directions, by incrementing the used/avail index by values > 1. We can support this by chaining a list of descriptors through a bit the flags field.
To allow use together with s/g, a different bit will be used.

#define VRING_DESC_F_BATCH_NEXT 0x0010

Batching works for both driver and device descriptors.



* Processing descriptors in and out of order

Device processing all descriptors in order can simply flip the DESC_HW bit as it is done with descriptors.

Device can write descriptors out in order as they are used, overwriting descriptors that are there.

Device must not use a descriptor until DESC_HW is set.
It is only required to look at the first descriptor submitted.

Driver must not overwrite a descriptor until DESC_HW is clear.
It is only required to look at the first descriptor submitted.

* Device specific descriptor flags
We have a lot of unused space in the descriptor. This can be put to good use by reserving some flag bits for device use.
For example, network device can set a bit to request that header in the descriptor is suppressed (in case it's all 0s anyway). This reduces cache utilization.

Note: this feature can be supported in virtio 1.0 as well, as we have unused bits in both descriptor and used ring there.

* Descriptor length in device descriptors

virtio 1.0 places strict requirements on descriptor length. For example it must be 0 in used ring of TX VQ of a network device since nothing is written. In practice guests do not seem to use this, so we can simplify devices a bit by removing this requirement - if length is unused it should be ignored by driver.

Some devices use identically-sized buffers in all descriptors.
Ignoring length for driver descriptors there could be an option too.

* Writing at an offset

Some devices might want to write into some descriptors at an offset, the offset would be in config space, and a descriptor flag could indicate this:

#define VRING_DESC_F_OFFSET 0x0020

How exactly to use the offset would be device specific, for example it can be used to align beginning of packet in the 1st buffer for mergeable buffers to cache line boundary while also aligning rest of buffers.

* Non power-of-2 ring sizes

As the ring simply wraps around, there's no reason to require ring size to be power of two.
It can be made a separate feature though.


* Interrupt/event suppression

virtio 1.0 has two mechanisms for suppression but only one can be used at a time. we pack them together in a structure - one for interrupts, one for notifications:

struct event {
__le16 idx;
__le16 flags;
}

Both fields would be optional, with a feature bit:
VIRTIO_F_EVENT_IDX
VIRTIO_F_EVENT_FLAGS

* Flags can be used like in virtio 1.0, by storing a special value there:

#define VRING_F_EVENT_ENABLE 0x0

#define VRING_F_EVENT_DISABLE 0x1

* Event index would be in the range 0 to 2 * Queue Size (to detect wrap arounds) and wrap to 0 after that.

The assumption is that each side maintains an internal descriptor counter 0 to 2 * Queue Size that wraps to 0.
In that case, interrupt triggers when counter reaches the given value.

* If both features are negotiated, a special flags value can be used to switch to event idx:

#define VRING_F_EVENT_IDX 0x2


* Prototype

A partial prototype can be found under
tools/virtio/ringtest/ring.c

Test run:
[***@tuck ringtest]$ time ./ring
real 0m0.399s
user 0m0.791s
sys 0m0.000s
[***@tuck ringtest]$ time ./virtio_ring_0_9
real 0m0.503s
user 0m0.999s
sys 0m0.000s

It is planned to update it to this interface. Future changes and enhancements can (and should) be tested against this prototype.

* Feature sets
In particular are we going overboard with feature bits? It becomes hard to support all combinations in drivers and devices. Maybe we should document reasonable feature sets to be supported for each device.

* Known issues/ideas

This layout is optimized for host/guest communication, in a sense even more aggressively than virtio 1.0.
It might be suboptimal for PCI hardware implementations.
However, one notes that current virtio pci drivers are unlikely to work with PCI hardware implementations anyway (e.g. due to use of SMP barriers for ordering).

Suggestions for improving this are welcome but need to be tested to make sure our main use case does not regress. Of course some improvements might be made optional, but if we add too many of these driver becomes unmanageable.

---

Note: should this proposal be accepted and approved, one or more
claims disclosed to the TC admin and listed on the Virtio TC
IPR page https://www.oasis-open.org/committees/virtio/ipr.php
might become Essential Claims.

--
MST

Michael S. Tsirkin
2017-03-01 01:02:29 UTC
Permalink
Post by Michael S. Tsirkin
This is an update from v1 version.
- update event suppression mechanism
- separate options for indirect and direct s/g
- lots of new features
---
Performance analysis of this is in my kvm forum 2016 presentation.
The idea is to have a r/w descriptor in a ring structure,
replacing the used and available ring, index and descriptor
buffer.
Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW
clear. Flags are always set/cleared last.
May I know what's the index intended for? Back referencing a pkt buffer?
Yes, and generally to identify which descriptor completed. Recall that
even though vhost net completes in order at the moment,
virtio rings serve devices (e.g. storage) that complete out of order.
Post by Michael S. Tsirkin
#define DESC_HW 0x0080
struct desc {
__le64 addr;
__le32 len;
__le16 index;
__le16 flags;
};
...
Post by Michael S. Tsirkin
virtio 1.0 allows passing a batch of descriptors in both directions, by
incrementing the used/avail index by values > 1. We can support this by
chaining a list of descriptors through a bit the flags field.
To allow use together with s/g, a different bit will be used.
#define VRING_DESC_F_BATCH_NEXT 0x0010
Batching works for both driver and device descriptors.
Honestly, I don't think it's an efficient way of batching. Neither
DESC_F_NEXT nor BATCH_NEXT tells us how many new descs there are
for processing: it's just a hint that there is one more. And you have
to follow the links one by one.
I was thinking maybe we could sub-divide some fields of the desc, and
thus introduce more. For example, a 32-bit "len" may be too much, at
least for virtio-net: the biggest pkt size is 64K, which fits in 16 bits.
If we use 16 bits for len, we could use the extra 16 bits to carry
the batch count.
struct desc {
__le64 addr;
__le16 len;
__le16 batch;
__le16 index;
__le16 flags;
};
Only the heading desc needs to set the batch count and the DESC_HW flag.
With the two used together, we don't have to set/clear the DESC_HW flag
on every desc on the driver/device side (see the sketch below).
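If I read this right, the device-side consumption might look roughly like this (a sketch of the suggested layout, not an agreed interface):

#include <stdint.h>

#define DESC_HW 0x0080

struct desc16 {          /* the "16 : 16" split suggested above */
        uint64_t addr;
        uint16_t len;
        uint16_t batch;  /* set in the heading desc only */
        uint16_t index;
        uint16_t flags;
};

/* Device side: one DESC_HW test plus a batch count covers the burst. */
static uint16_t take_batch(struct desc16 *ring, uint16_t size, uint16_t head)
{
        const struct desc16 *d = &ring[head];
        uint16_t i;

        if (!(d->flags & DESC_HW))
                return 0;                        /* nothing new */
        for (i = 0; i < d->batch; i++) {
                const struct desc16 *cur = &ring[(head + i) % size];
                (void)cur;                       /* process addr/len here */
        }
        return d->batch;
}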
If 64K is too small for other devices (say, blk), we could re-allocate
the bits as "24 : 8", i.e. 24 for len (16M at most) and 8 for batch.
OTOH, 8 bits of batch should be plenty, judging that the default vring
size is 256 and a typical burst size is normally way less than that.
That said, if it's "16 : 16" and we use only 8 bits for batch, we
could still have another 8 bits for something else, say the number of
descs for a single pkt. With that, the num_buffers field of the mergeable
Rx header could be replaced. More importantly, we could save a cache
line write when no offloads are used in the mergeable Rx path.
Why do you bother with mergeable Rx without offloads? Make each buffer
MTU sized and it'll fit without merging. Linux used not to do this; it
only started in order to save memory aggressively. I don't think
DPDK cares about this.
struct desc {
__le64 addr;
__le16 len;
__u8 batch;
__u8 num_buffers;
__le16 index;
__le16 flags;
};
Interesting. How about a benchmark for these ideas?
Post by Michael S. Tsirkin
* Device specific descriptor flags
We have a lot of unused space in the descriptor. This can be put to
good use by reserving some flag bits for device use.
For example, network device can set a bit to request
that header in the descriptor is suppressed
(in case it's all 0s anyway). This reduces cache utilization.
Good proposal. But I think it could only work for Tx, where the driver
knows whether the header is all 0s while filling the desc. For
Rx, where the driver has to pre-fill the desc, it doesn't know. Thus
it still requires filling a header desc to store it.
I don't see why - I don't think drivers pre-fill the header buffers for RX
right now. Why would they start?
Maybe we could introduce a global feature? When that's negotiated, no
header desc need be filled and processed. I'm thinking this could also
help the vector implementation I mentioned in another email.
It's possible of course - it's a subset of what I said.
Though it makes it less useful in the general case.
Post by Michael S. Tsirkin
Note: this feature can be supported in virtio 1.0 as well,
as we have unused bits in both descriptor and used ring there.
Agreed.
--yliu
Michael S. Tsirkin
2017-03-01 04:14:21 UTC
Permalink
Post by Michael S. Tsirkin
Post by Michael S. Tsirkin
Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW
clear. Flags are always set/cleared last.
May I know what's the index intended for? Back referencing a pkt buffer?
Yes and generally identify which descriptor completed. Recall that
even though vhost net completes in order at the moment,
virtio rings serve devices (e.g. storage) that complete out of order.
I see, and thanks.
Post by Michael S. Tsirkin
That said, if it's "16: 16" and if we use only 8 bits for batch, we
could still have another 8 bit for anything else, say the number of
desc for a single pkt. With that, the num_buffers of mergeable Rx
header could be replaced. More importantly, we could reduce a cache
line write if non offload are supported in mergeable Rx path.
Why do you bother with mergeable Rx without offloads?
Oh, my bad. I actually meant "without offloads __being used__". Just
assume the pkt size is 64B and no offloads are used. When mergeable
Rx is negotiated (which is the default case), num_buffers has to be
set to 1. That means an extra cache line write. For the non-mergeable
case, the cache line write can be avoided, in the case when no offloads
are used, by a trick like the one in this DPDK commit:
http://dpdk.org/browse/dpdk/commit/?id=c9ea670c1dc7e3f111d8139f915082b60c9c1ffe
So while writing this email, I was thinking maybe we could avoid setting
num_buffers to 1 when there is only one desc (let it be 0, and let
num_buffers == 0 imply num_buffers = 1). I wasn't quite sure we can
do that now, so I checked the DPDK code and found it's okay:
	seg_num = header->num_buffers;

	if (seg_num == 0)
		seg_num = 1;
I then also checked the Linux kernel code, and found it's not okay
(see the marked lines):

==>	u16 num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
	struct page *page = virt_to_head_page(buf);
	int offset = buf - page_address(page);
	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));

	struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
					       truesize);
	struct sk_buff *curr_skb = head_skb;

	if (unlikely(!curr_skb))
		goto err_skb;
==>	while (--num_buf) {
That means, if we want to do that, we need an extra feature flag
(either a global feature flag or a desc flag), something like
ALLOW_ZERO_NUM_BUFFERS. Or even make it allowable in virtio 1.1
(virtio 0.95/1.0 won't benefit from it, though).
Does it make sense to you?
Right, and then we could use a descriptor flag "header is all 0s".
For virtio 1.0 we could put these in the used ring instead.
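For concreteness, the driver-side interpretation being discussed is tiny (a sketch; ALLOW_ZERO_NUM_BUFFERS is only a suggested name):

#include <stdint.h>

/* With the suggested ALLOW_ZERO_NUM_BUFFERS semantics, num_buffers == 0
 * in the mergeable Rx header means a single buffer. */
static inline uint16_t effective_num_buffers(uint16_t num_buffers)
{
        return num_buffers ? num_buffers : 1;
}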
Post by Michael S. Tsirkin
Make each buffer
MTU sized and it'll fit without merging. Linux used not to, it only
started doing this to save memory aggressively. I don't think
DPDK cares about this.
struct desc {
__le64 addr;
__le16 len;
__u8 batch;
__u8 num_buffers;
__le16 index;
__le16 flags;
};
Interesting. How about a benchmark for these ideas?
Sure, I would like to benchmark it; it won't take me long. But
currently I'm still focusing on analysing the performance behaviour
of virtio 0.95/1.0 (when I get some time), to see what hurts
performance and what can be improved.
Besides that, as said, I often get interrupted. Moreover, the DPDK v17.05
code freeze deadline is coming. So this is just a reminder that I may not
have time for it soon. Sorry for that.
Post by Michael S. Tsirkin
Post by Michael S. Tsirkin
* Device specific descriptor flags
We have a lot of unused space in the descriptor. This can be put to
good use by reserving some flag bits for device use.
For example, network device can set a bit to request
that header in the descriptor is suppressed
(in case it's all 0s anyway). This reduces cache utilization.
Good proposal. But I think it could only work with Tx, where the driver
knows whether the headers are all 0s while filling the desc. But for
Rx, whereas the driver has to pre-fill the desc, it doesn't know. Thus
it still requires filling an header desc for storing it.
I don't see why - I don't think drivers pre-fill buffers in header for RX
right now. Why would they start?
Again, my bad, I meant "prepare", not "pre-fill". Let me try to explain:
- For Tx, when the header is all 0s, the header could be discarded. We
could use one desc for transferring a packet (with a NO_HEADER
or HEADER_ALL_ZERO flag bit set).
- For Rx, the header is filled on the device (or vhost) side. And the
driver has to prepare the header desc for each pkt, because the Rx
driver has no idea whether it will be all 0s.
That means the header cannot be discarded.
If such a global feature is negotiated, we could also discard the header
desc as well.
--yliu
Right and again, flags could be added to the used ring to pass extra
info.
Yuanhan Liu
2017-03-01 04:57:09 UTC
Permalink
Post by Michael S. Tsirkin
Right, and then we could use a descriptor flag "header is all 0s".
For virtio 1.0 we could put these in the used ring instead.
Great.

--yliu
Michael S. Tsirkin
2017-03-01 01:28:25 UTC
Permalink
Hi,
For virtio-net, we use 2 descs for representing a (small) pkt: one for
the virtio-net header, another for the pkt data. This has issues:
- the desc buffer for storing pkt data is halved
Though we later introduced 2 more options to overcome this: ANY_LAYOUT
and indirect desc. The indirect desc has another issue: it introduces
an extra cache line visit.
So if we don't care about this part, we could maybe just add
a descriptor flag that puts the whole header in the descriptor.
- the virtio-net header could be scattered
Assume the ANY_LAYOUT case, where the header is prepended to
each mbuf (or skb in the kernel). In DPDK, a burst receive of 32 pkts
in the vhost pmd means 32 different cache visits for the virtio headers.
For the legacy layout and indirect desc, the cache issue could be somewhat
diminished: we could arrange the virtio headers in the same memory
block and let each header desc point to the right one.
But it's still not good enough: the virtio-net headers aren't accessed
in a batch; they have to be accessed one by one (by reading the desc).
That said, it's still not that good for cache utilization.
So here is a proposal (see the sketch after this list):
- put all virtio-net headers in one memory block
A burst of 32 pkts then only needs to access (32 * 12) / 64 = 6 cache lines,
while before it could be 32 cache lines.
- introduce a header desc to reference the above memory block
desc->addr = starting addr of the net headers mem block
desc->len = size of all the virtio net headers (burst size * header size)
Thus, for a burst of 32, we only need 33 descs: one for the headers and
the others for storing the corresponding pkt data. More importantly, we
could use the "len" field to compute the batch size. We could then load the
virtio net headers at once; we could also prefetch all the descs at once.
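A sketch of that Tx-side layout (sizes and the address cast are illustrative; endian conversions omitted):

#include <stdint.h>

#define HDR_SIZE   12   /* virtio-net header size used in the math above */
#define BURST_SIZE 32

struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

/* One contiguous block holding the headers for a whole burst. */
struct net_hdr_block {
        uint8_t hdr[BURST_SIZE][HDR_SIZE];
};

/* Driver side, Tx: one header desc for the block, then one desc per pkt.
 * The device recovers the batch size from len / HDR_SIZE. */
static void fill_header_desc(struct desc *d, const struct net_hdr_block *blk,
                             uint16_t nr_pkts)
{
        d->addr = (uint64_t)(uintptr_t)blk;
        d->len  = (uint32_t)nr_pkts * HDR_SIZE;
}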
Note it could also be adapted to virtio 0.95 and 1.0. I also made a simple
prototype with DPDK (yet again, Tx path only), and I saw an impressive
boost (about 30%) in a micro benchmark.
I think such a proposal should also help other devices, if they too
have a small header for each data buffer.
Thoughts?
--yliu
That's great. An alternative might be to add an array of headers parallel
to the array of descriptors and indexed by head. A bit in the descriptor
would then be enough to mark such a header as valid (roughly as sketched below).
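A sketch of that parallel-array alternative (the valid bit and header size are assumptions for illustration):

#include <stdint.h>

#define DESC_HW                 0x0080
/* Hypothetical flag: the parallel header slot for this entry is valid. */
#define VRING_DESC_F_HDR_VALID  0x0040

struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t index;
        uint16_t flags;
};

struct vnet_hdr {
        uint8_t raw[12];        /* virtio-net header bytes */
};

/* The header array sits beside the ring and is indexed by the same
 * head index, so header i is reached without following desc[i].addr. */
struct ring {
        struct desc     *desc;  /* size entries */
        struct vnet_hdr *hdr;   /* size entries, parallel to desc */
        uint16_t         size;
};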

It's also an alternative way to pass in batches for virtio 1.1.

This has the advantage that it helps non-batched workloads as well
if enough packets end up in the ring, but maybe it behaves worse
with CPU prefetching/prediction. Worth benchmarking?

I hope the above thoughts are helpful, but -

code walks - if you can show real gains, I'd be inclined
to say let's go with it. You don't necessarily need to implement and
benchmark all possible ideas others can come up with :)
(though that's just me, not speaking for anyone else -
we'll have to put it on the TC ballot of course)

--
MST
Michael S. Tsirkin
2017-07-18 16:23:11 UTC
Permalink
Post by Chien, Roger S
Guest adds descriptors with unique index values and DESC_HW set in flags.
Host overwrites used descriptors with correct len, index, and DESC_HW
clear. Flags are always set/cleared last.
#define DESC_HW 0x0080
struct desc {
__le64 addr;
__le32 len;
__le16 index;
__le16 flags;
};
When DESC_HW is set, descriptor belongs to device. When it is clear,
it belongs to the driver.
We can use 1 bit to set direction
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2
A valid bit per descriptor does not let the device do an efficient prefetch.
An alternative is to use a producer index (PI).
Using the PI posted by the driver and the consumer index (CI) maintained by the device, the device knows how much work it has outstanding, so it can prefetch accordingly.
There are a few options for the device to acquire the PI.
The most efficient will be to write the PI into the doorbell together with the queue number.
Right. This was suggested in "Fwd: Virtio-1.1 Ring Layout".
Or just the PI if we don't need the queue number.
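For illustration, a doorbell carrying the PI (and optionally the queue number) might look like this on the driver side - the register layout is an assumption:

#include <stdint.h>

/* Hypothetical 32-bit doorbell: queue number in the high half,
 * producer index in the low half - one MMIO write publishes both. */
static inline void ring_doorbell(volatile uint32_t *db,
                                 uint16_t queue, uint16_t pi)
{
        *db = ((uint32_t)queue << 16) | pi;
}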
I would like to raise the need for a Completion Queue(CQ).
Multiple Work Queues(hold the work descriptors, WQ in short) can be connected to a single CQ.
So when the device completes the work on the descriptor, it writes a Completion Queue Entry (CQE) to the CQ.
This seems similar to the design we currently have with a separate used
ring. It seems to underperform writing into the available ring
at least on a microbenchmark. The reason seems to be that
more cache lines need to get invalidated and re-fetched.
CQEs are contiguous in memory so prefetching by the driver is efficient, although the device might complete work descriptors out of order.
That's no different from the proposal you are replying to - descriptors
can be used and written out in any order; they are contiguous
so the driver can prefetch. I'd like to add that attempts to
add prefetch, e.g. for the virtio used ring, never showed any
conclusive gains - some workloads would get a minor speedup, others
lose a bit of performance. I do not think anyone looked
deeply into why that was the case. It would be easy for you to add this
code to current virtio drivers and/or devices and try it out.
The interrupt handler is connected to the CQ, so an allocation of a single CQ per core, with a single interrupt handler, is possible even though this core might be using multiple WQs.
Sending a single interrupt from multiple rings might save some
cycles. Event index/interrupt disable are currently in
RAM, so access is very cheap for the guest.
If you are going to share, just disable all interrupts
when you start processing.

I was wondering how people actually want to do this in hardware.
Are you going to read the event index after each descriptor?

It might make sense to move event index/flags into device memory. And
once you do this, writing these out might become slower, and then some
kind of sharing of the event index might help.

No one suggested anything like this so far - maybe it's less of an issue
than I think, e.g. because of interrupt coalescing (which virtio also
does not have, but could be added if there is interest) or maybe people
just mostly care about polling performance.
One application for multiple WQs with a single CQ is Quality of Service(QoS).
A user can open a WQ per QoS value(pcp value for example), and the device will schedule the work accordingly.
A ring per QOS level might make sense e.g. in a network device. I don't
see why it's helpful to write out completed entries into a separate
ring for that though.

As we don't have QOS support in virtio net at all right now,
would that be best discussed once we have a working prototype
and everyone can see what the problem is?
Lior Narkis
2017-07-19 07:41:55 UTC
Permalink
Post by Chien, Roger S
I would like to raise the need for a Completion Queue(CQ).
Multiple Work Queues(hold the work descriptors, WQ in short) can be
connected to a single CQ.
So when the device completes the work on the descriptor, it writes a
Completion Queue Entry (CQE) to the CQ.
This seems similar to the design we currently have with a separate used
ring. It seems to underperform writing into the available ring
at least on a microbenchmark. The reason seems to be that
more cache lines need to get invalidated and re-fetched.
A few points on that:
Each PCIe write will cause an invalidation of a cache line.
Writing less than a cache line is inefficient, so it is better to put all the metadata together and allocate a cache line for it.
Putting the metadata in the data buffer means two writes of less than a cache line each; both will be accessed by the driver, so potentially two misses.
Post by Chien, Roger S
CQEs are continuous in memory so prefetching by the driver is efficient,
although the device might complete work descriptors out of order.
That's not different from proposal you are replying to - descriptors
can be used and written out in any order, they are contigious
so driver can prefetch.
The point is that if descriptors 1, 2, 4 are completed in that order, in the proposed layout the completion indications will be placed at entries 1, 2, 4 of the virtq desc buffer.
With a CQ, completions of 1, 2, 4 will be placed at CQ indexes 1, 2, 3.
This is why it is better for prefetching.
Post by Chien, Roger S
I'd like to add that attempts to
add prefetch e.g. for the virtio used ring never showed any
conclusive gains - some workloads would get minor a speedup, others
lose a bit of performance. I do not think anyone looked
deeply into why that was the case. It would be easy for you to add this
code to current virtio drivers and/or devices and try it out.
Noted.
I will say though that mlx5 uses prefetch and gets good performance because of it...
Post by Chien, Roger S
The interrupt handler is connected to the CQ, so an allocation of a single CQ
per core, with a single interrupt handler is possible although this core might be
using multiple WQs.
Sending a single interrupt from multiple rings might save some
cycles. event index/interrupt disable are currently in
RAM so access is very cheap for the guest.
If you are going to share, just disable all interrupts
when you start processing.
I was wondering how do people want to do this in hardware
in fact. Are you going to read event index after each descriptor?
Not sure I got you here.
Are you asking how the device decides to write an MSI-X interrupt, and how interrupt moderation might work?
Post by Chien, Roger S
It might make sense to move event index/flags into device memory. And
once you do this, writing these out might become slower, and then some
kind of sharing of the event index might help.
No one suggested anything like this so far - maybe it's less of an issue
than I think, e.g. because of interrupt coalescing (which virtio also
does not have, but could be added if there is interest) or maybe people
just mostly care about polling performance.
One application for multiple WQs with a single CQ is Quality of Service(QoS).
A user can open a WQ per QoS value(pcp value for example), and the device
will schedule the work accordingly.
A ring per QOS level might make sense e.g. in a network device. I don't
see why it's helpful to write out completed entries into a separate
ring for that though.
I would like to add that for an rdma device there are many queues (QPs); understanding which QP completed work by traversing all the QPs is not efficient.

Another advantage of having a CQ connected to multiple WQs is that interrupt moderation can be based on this single CQ,
so the decision whether to write an interrupt is based on all the aggregated work completed on that CQ.
Post by Chien, Roger S
As we don't have QOS support in virtio net at all right now,
would that be best discussed once we have a working prototype
and everyone can see what the problem is?
Understood.
Although, I think the layout should not change frequently.
Michael S. Tsirkin
2017-07-20 13:06:38 UTC
Permalink
Post by Lior Narkis
I would like to raise the need for a Completion Queue(CQ).
Multiple Work Queues(hold the work descriptors, WQ in short) can be
connected to a single CQ.
So when the device completes the work on the descriptor, it writes a
Completion Queue Entry (CQE) to the CQ.
This seems similar to the design we currently have with a separate used
ring. It seems to underperform writing into the available ring
at least on a microbenchmark. The reason seems to be that
more cache lines need to get invalidated and re-fetched.
Each PCIe write will cause invalidation to a cache line.
Writing less than a cache line is inefficient, so it is better to put all metadata together and allocate a cache line for it.
Putting the metadata in the data buffer means two writes of less than a cache line each, both will be accessed by the driver, so potential two misses.
I'm not sure how this is related to your suggestion - sorry about being
dense. You suggested a separate used ring that can also be shared by
multiple available rings. The current design in effect makes the used and
available rings overlap instead. One side effect is that each entry is bigger
now (16 bytes as opposed to 8 bytes previously), so it should be easier
to move metadata from e.g. the virtio net header to the descriptor entry.
Post by Lior Narkis
Post by Chien, Roger S
CQEs are continuous in memory so prefetching by the driver is efficient,
although the device might complete work descriptors out of order.
That's not different from proposal you are replying to - descriptors
can be used and written out in any order, they are contigious
so driver can prefetch.
Point is that if descriptors 1, 2, 4 are being completed in that order, in the proposed layout the completion indication will be placed at 1, 2, 4 in the virtq desc buffer.
Note: I think you mean used, not completed. Let's use the standard virtio terminology.
The answer is - not necessarily. The device can write them out at entries
1, 2, 3. The only requirement is really that eventually all of entries 1-4
are switched to driver ownership.

v2 says:
Device can write descriptors out in order as they are used, overwriting
descriptors that are there.

I think it would be clearer if it said:

Device can write descriptors out in the order in which they are used, overwriting
descriptors that are there.
Post by Lior Narkis
With a CQ, completions on 1, 2, 4 will be placed at 1, 2, 3 CQ indexes.
This is why it is better for prefetching.
Looks like a misunderstanding then.
Post by Lior Narkis
Post by Chien, Roger S
I'd like to add that attempts to
add prefetch e.g. for the virtio used ring never showed any
conclusive gains - some workloads would get minor a speedup, others
lose a bit of performance. I do not think anyone looked
deeply into why that was the case. It would be easy for you to add this
code to current virtio drivers and/or devices and try it out.
Noted.
I will say though that mlx5 uses prefetch and gets good performance because of it...
It is pretty popular with drivers, worth revisiting if someone has the
time.
Post by Lior Narkis
Post by Chien, Roger S
The interrupt handler is connected to the CQ, so an allocation of a single CQ
per core, with a single interrupt handler is possible although this core might be
using multiple WQs.
Sending a single interrupt from multiple rings might save some
cycles. event index/interrupt disable are currently in
RAM so access is very cheap for the guest.
If you are going to share, just disable all interrupts
when you start processing.
I was wondering how do people want to do this in hardware
in fact. Are you going to read event index after each descriptor?
Not sure I got you here.
Do you ask about how the device decides to write MSIX? And how interrupt moderation might work?
virtio has flags/event-index based interrupt suppression. It relies on
the device reading the flags/index after writing out a batch of descriptors.
Is this too costly, so that we should switch to the driver writing the index,
or is it OK since it's only once per batch?