[SRU][Xenial][PATCH 0/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.


[SRU][Xenial][PATCH 0/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Joseph Salisbury-3
BugLink: http://bugs.launchpad.net/bugs/1692538

== SRU Justification ==
Commit 66aa0678ef is a request to fix four issues in the ibmveth driver.
The issues are as follows:
- Issue 1: ibmveth doesn't support largesend and checksum offload features when configured as "Trunk".
- Issue 2: SYN packet drops seen at the destination VM. When the packet
originates, it has the CHECKSUM_PARTIAL flag set; as it is delivered to the IO
server's inbound Trunk ibmveth, the receive routine validates the "checksum
good" bits and sets the SKB's ip_summed field to CHECKSUM_UNNECESSARY.
- Issue 3: First packet of a TCP connection will be dropped, if there is
no OVS flow cached in datapath.
- Issue 4: ibmveth driver doesn't have support for SKBs with a frag_list.

The details of the fixes for these issues are described in the commit's git log.

This fix has already been included in Zesty and Artful, but Xenial required a backport.  

Commit 66aa0678ef is in mainline as of v4.13-rc1.

== Fix ==
commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c
Author: Sivakumar Krishnasamy <[hidden email]>
Date:   Fri May 19 05:30:38 2017 -0400

    ibmveth: Support to enable LSO/CSO for Trunk VEA.

== Regression Potential ==
The changes are specific to the ibmveth driver, so regression risk is low.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

Sivakumar Krishnasamy (1):
  ibmveth: Support to enable LSO/CSO for Trunk VEA.

 drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
 drivers/net/ethernet/ibm/ibmveth.h |   1 +
 2 files changed, 90 insertions(+), 18 deletions(-)

--
2.7.4


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[SRU][Xenial][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Joseph Salisbury-3
From: Sivakumar Krishnasamy <[hidden email]>

BugLink: http://bugs.launchpad.net/bugs/1692538

Current largesend and checksum offload feature in ibmveth driver,
 - Source VM sends the TCP packets with ip_summed field set as
   CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
   checksum field
 - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
   "no checksum" and "checksum good" bits in transmit buffer descriptor
   before the packet is delivered to pseries PowerVM Hypervisor
 - If ibmveth has largesend capability enabled, transmit buffer descriptors
   are marked accordingly before the packet is delivered to the Hypervisor
   (along with mss value for packets with length > MSS)
 - Destination VM's ibmveth driver receives the packet with "checksum good"
   bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
 - If "largesend" bit was on, mss value is copied from receive descriptor
   into SKB's gso_size and other flags are appropriately set for
   packets > MSS size
 - The packet is now successfully delivered up the stack in destination VM

The offloads described above work fine for TCP communication among VMs in
the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B )

We are now enabling support for OVS in the pseries PowerVM environment. One of
our requirements is to have the ibmveth drivers configured in "Trunk" mode when
they are used with OVS. This is because the PowerVM Hypervisor will no longer
bridge the packets between VMs; instead, the packets are delivered to the
IO Server, which hosts OVS to bridge them between VMs or to external
networks (flow shown below),
  VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
                                                                   <=> VM B
In the IO server, the packet is received by the inbound Trunk ibmveth and then
delivered to OVS, which bridges it to the outbound Trunk ibmveth (shown
below),
        Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth

In this model, we hit the following issues which impacted the VM
communication performance,

 - Issue 1: ibmveth doesn't support largesend and checksum offload features
   when configured as "Trunk". The driver has explicit checks to prevent
   enabling these offloads.

 - Issue 2: SYN packet drops seen at the destination VM. When the packet
   originates, it has the CHECKSUM_PARTIAL flag set. As it is delivered to
   the IO server's inbound Trunk ibmveth, the ibmveth receive routine
   validates the "checksum good" bits and sets the SKB's ip_summed field
   to CHECKSUM_UNNECESSARY. This packet is then bridged by OVS (or Linux
   Bridge) and delivered to the outbound Trunk ibmveth. At this point the
   outbound ibmveth transmit routine will not set the "no checksum" and
   "checksum good" bits in the transmit buffer descriptor, as it does so
   only when the ip_summed field is CHECKSUM_PARTIAL. When this packet
   reaches the destination VM, the TCP layer sees a checksum value of 0
   and no checksum-related flags in the ip_summed field. This leads to
   packet drops, so TCP connections are never established.

 - Issue 3: The first packet of a TCP connection will be dropped if there is
   no OVS flow cached in the datapath. While trying to identify the flow,
   OVS computes the checksum. The computed checksum will be invalid at the
   receiving end, because the ibmveth transmit routine zeroes out the
   pseudo-header checksum value in the packet. This leads to packet drops.

 - Issue 4: ibmveth driver doesn't have support for SKBs with a frag_list.
   When the physical NIC has GRO enabled and OVS bridges these packets,
   the OVS vport send code ends up calling dev_queue_xmit, which in turn
   calls validate_xmit_skb.
   In validate_xmit_skb, larger packets are segmented into MSS-sized
   segments if the SKB has a frag_list and the driver they are delivered
   to doesn't support the NETIF_F_FRAGLIST feature.
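The Issue 2 failure mode can be sketched in a few lines of illustrative C (the names mirror the kernel's ip_summed states for illustration only; this is not kernel code):

```c
#include <stdbool.h>

/* Names mirror the kernel's ip_summed states for illustration only. */
enum ip_summed { CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_PARTIAL };

/* The ibmveth transmit routine marks the "no checksum"/"checksum good"
 * descriptor bits only for CHECKSUM_PARTIAL packets. */
static bool tx_marks_csum_good(enum ip_summed s)
{
    return s == CHECKSUM_PARTIAL;
}

/* Before the fix, the trunk receive path re-marked every good packet
 * CHECKSUM_UNNECESSARY, so the outbound trunk stopped offloading it;
 * the fix marks trunk-mode packets CHECKSUM_PARTIAL instead. */
static enum ip_summed rx_mark(bool csum_good, bool is_active_trunk)
{
    if (!csum_good)
        return CHECKSUM_NONE;
    return is_active_trunk ? CHECKSUM_PARTIAL : CHECKSUM_UNNECESSARY;
}
```

With is_active_trunk false (the old behavior), a good packet comes back out of rx_mark as CHECKSUM_UNNECESSARY and the outbound transmit routine never sets the descriptor bits, which is exactly the SYN drop described above.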

This patch addresses the above four issues, thereby enabling end-to-end
largesend and checksum offload support for better performance.

 - Fix for Issue 1: Remove the checks that prevent enabling TCP largesend and
   checksum offloads.
 - Fix for Issue 2: When ibmveth receives a packet with the "checksum good"
   bit set and it is configured in Trunk mode, set the appropriate SKB fields
   using skb_partial_csum_set (the ip_summed field is set to
   CHECKSUM_PARTIAL).
 - Fix for Issue 3: Recompute the pseudo-header checksum before sending the
   SKB up the stack.
 - Fix for Issue 4: Linearize SKBs with a frag_list. Though we end up
   allocating buffers and copying data, this fix gives
   up to a 4X throughput increase.

Note: All these fixes need to be applied together, as fixing just one of
them will immediately lead to other issues (especially for Issues 1, 2 & 3).
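For reference, the pseudo-header checksum that Fix 3 recomputes can be sketched in plain userspace C (illustrative only; the kernel does this with csum_tcpudp_magic and csum_ipv6_magic, as in the hunk below):

```c
#include <stdint.h>

/* Fold a 32-bit running sum into 16 bits, ones'-complement style. */
static uint16_t csum_fold(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* IPv4 TCP pseudo-header checksum: source and destination addresses,
 * protocol number, and TCP segment length, summed 16 bits at a time.
 * This is the value the receive path seeds into tcph->check so a later
 * hop (or software fallback) can finish the checksum over the payload. */
static uint16_t pseudo_hdr_csum(uint32_t saddr, uint32_t daddr,
                                uint16_t len, uint8_t proto)
{
    uint32_t sum = 0;

    sum += (saddr >> 16) + (saddr & 0xffff);
    sum += (daddr >> 16) + (daddr & 0xffff);
    sum += proto;
    sum += len;
    return csum_fold(sum);
}
```

Without this seed, OVS's software checksum of the first (uncached-flow) packet comes out wrong, which is why Issue 3 manifests as a first-packet drop.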

Signed-off-by: Sivakumar Krishnasamy <[hidden email]>
Signed-off-by: David S. Miller <[hidden email]>
(back ported from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
Signed-off-by: Joseph Salisbury <[hidden email]>
---
 drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
 drivers/net/ethernet/ibm/ibmveth.h |   1 +
 2 files changed, 90 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 2f9b12c..0482491 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -46,6 +46,8 @@
 #include <asm/vio.h>
 #include <asm/iommu.h>
 #include <asm/firmware.h>
+#include <net/tcp.h>
+#include <net/ip6_checksum.h>
 
 #include "ibmveth.h"
 
@@ -804,8 +806,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
 
  ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
 
- if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
-    !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
+ if (ret == H_SUCCESS &&
     (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
  ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
  set_attr, &ret_attr);
@@ -1035,6 +1036,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
  dma_addr_t dma_addr;
  unsigned long mss = 0;
 
+ /* veth doesn't handle frag_list, so linearize the skb.
+ * When GRO is enabled SKB's can have frag_list.
+ */
+ if (adapter->is_active_trunk &&
+    skb_has_frag_list(skb) && __skb_linearize(skb)) {
+ netdev->stats.tx_dropped++;
+ goto out;
+ }
+
  /*
  * veth handles a maximum of 6 segments including the header, so
  * we have to linearize the skb if there are more than this.
@@ -1059,9 +1069,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 
  desc_flags = IBMVETH_BUF_VALID;
 
- if (skb_is_gso(skb) && adapter->fw_large_send_support)
- desc_flags |= IBMVETH_BUF_LRG_SND;
-
  if (skb->ip_summed == CHECKSUM_PARTIAL) {
  unsigned char *buf = skb_transport_header(skb) +
  skb->csum_offset;
@@ -1071,6 +1078,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
  /* Need to zero out the checksum */
  buf[0] = 0;
  buf[1] = 0;
+
+ if (skb_is_gso(skb) && adapter->fw_large_send_support)
+ desc_flags |= IBMVETH_BUF_LRG_SND;
  }
 
 retry_bounce:
@@ -1123,7 +1133,7 @@ retry_bounce:
  descs[i+1].fields.address = dma_addr;
  }
 
- if (skb_is_gso(skb)) {
+ if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
  if (adapter->fw_large_send_support) {
  mss = (unsigned long)skb_shinfo(skb)->gso_size;
  adapter->tx_large_packets++;
@@ -1224,6 +1234,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
  }
 }
 
+static void ibmveth_rx_csum_helper(struct sk_buff *skb,
+   struct ibmveth_adapter *adapter)
+{
+ struct iphdr *iph = NULL;
+ struct ipv6hdr *iph6 = NULL;
+ __be16 skb_proto = 0;
+ u16 iphlen = 0;
+ u16 iph_proto = 0;
+ u16 tcphdrlen = 0;
+
+ skb_proto = be16_to_cpu(skb->protocol);
+
+ if (skb_proto == ETH_P_IP) {
+ iph = (struct iphdr *)skb->data;
+
+ /* If the IP checksum is not offloaded and if the packet
+ *  is large send, the checksum must be rebuilt.
+ */
+ if (iph->check == 0xffff) {
+ iph->check = 0;
+ iph->check = ip_fast_csum((unsigned char *)iph,
+  iph->ihl);
+ }
+
+ iphlen = iph->ihl * 4;
+ iph_proto = iph->protocol;
+ } else if (skb_proto == ETH_P_IPV6) {
+ iph6 = (struct ipv6hdr *)skb->data;
+ iphlen = sizeof(struct ipv6hdr);
+ iph_proto = iph6->nexthdr;
+ }
+
+ /* In OVS environment, when a flow is not cached, specifically for a
+ * new TCP connection, the first packet information is passed up
+ * the user space for finding a flow. During this process, OVS computes
+ * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
+ *
+ * Given that we zeroed out TCP checksum field in transmit path
+ * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
+ * OVS computed checksum will be incorrect w/o TCP pseudo checksum
+ * in the packet. This leads to OVS dropping the packet and hence
+ * TCP retransmissions are seen.
+ *
+ * So, re-compute TCP pseudo header checksum.
+ */
+ if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
+ struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
+
+ tcphdrlen = skb->len - iphlen;
+
+ /* Recompute TCP pseudo header checksum */
+ if (skb_proto == ETH_P_IP)
+ tcph->check = ~csum_tcpudp_magic(iph->saddr,
+ iph->daddr, tcphdrlen, iph_proto, 0);
+ else if (skb_proto == ETH_P_IPV6)
+ tcph->check = ~csum_ipv6_magic(&iph6->saddr,
+ &iph6->daddr, tcphdrlen, iph_proto, 0);
+
+ /* Setup SKB fields for checksum offload */
+ skb_partial_csum_set(skb, iphlen,
+     offsetof(struct tcphdr, check));
+ skb_reset_network_header(skb);
+ }
+}
+
 static int ibmveth_poll(struct napi_struct *napi, int budget)
 {
  struct ibmveth_adapter *adapter =
@@ -1231,7 +1306,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
  struct net_device *netdev = adapter->netdev;
  int frames_processed = 0;
  unsigned long lpar_rc;
- struct iphdr *iph;
  u16 mss = 0;
 
 restart_poll:
@@ -1289,17 +1363,7 @@ restart_poll:
 
  if (csum_good) {
  skb->ip_summed = CHECKSUM_UNNECESSARY;
- if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
- iph = (struct iphdr *)skb->data;
-
- /* If the IP checksum is not offloaded and if the packet
- *  is large send, the checksum must be rebuilt.
- */
- if (iph->check == 0xffff) {
- iph->check = 0;
- iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
- }
- }
+ ibmveth_rx_csum_helper(skb, adapter);
  }
 
  if (length > netdev->mtu + ETH_HLEN) {
@@ -1621,6 +1685,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
  netdev->hw_features |= NETIF_F_TSO;
  }
 
+ adapter->is_active_trunk = false;
+ if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
+ adapter->is_active_trunk = true;
+ netdev->hw_features |= NETIF_F_FRAGLIST;
+ netdev->features |= NETIF_F_FRAGLIST;
+ }
+
  memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
 
  if (firmware_has_feature(FW_FEATURE_CMO))
diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
index 7acda04..de6e381 100644
--- a/drivers/net/ethernet/ibm/ibmveth.h
+++ b/drivers/net/ethernet/ibm/ibmveth.h
@@ -157,6 +157,7 @@ struct ibmveth_adapter {
     int pool_config;
     int rx_csum;
     int large_send;
+    bool is_active_trunk;
     void *bounce_buffer;
     dma_addr_t bounce_buffer_dma;
 
--
2.7.4


--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

Re: [SRU][Xenial][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Daniel Axtens
Hi,

Sorry if I'm a little late to the party with this...

In support we've seen a case where ibmveth + openvswitch + bnx2x has
led to some issues, which you should probably be aware of before
turning on large segments in more places.

Here's my summary from that support case, appropriately sanitised.

==========

[Issue: we see a firmware assertion from an IBM branded bnx2x card.
Decoding the dump with the help of upstream shows that the assert is
caused by a packet with GSO on and gso_size > ~9700 bytes being passed
to the card. We traced the packets through the system, and came up
with this root cause. The system uses ibmveth to talk to AIX LPARs, a
bnx2x network card to talk to the world, and Open vSwitch to tie them
together. There is no VIOS involvement - the card is attached to the
Linux partition.]

The packets causing the issue come through the ibmveth interface -
from the AIX LPAR. The veth protocol is 'special' - communication
between LPARs on the same chassis can use very large (64k) frames to
reduce overhead. Normal networks cannot handle such large packets, so
traditionally, the VIOS partition would signal to the AIX partitions
that it was 'special', and AIX would send regular, ethernet-sized
packets to VIOS, which VIOS would then send out.

This signalling between VIOS and AIX is done in a way that is not
standards-compliant, and so was never made part of Linux. Instead, the
Linux driver has always understood large frames and passed them up the
network stack.

In some cases (e.g. with TCP), multiple TCP segments are coalesced
into one large packet. In Linux, this goes through the generic receive
offload code, using a similar mechanism to GSO. These segments can be
very large which presents as a very large MSS (maximum segment size)
or gso_size.

Normally, the large packet is simply passed to whatever network
application on Linux is going to consume it, and everything is OK.

However, in this case, the packets go through Open vSwitch, and are
then passed to the bnx2x driver. The bnx2x driver/hardware supports
TSO and GSO, but with a restriction: the maximum segment size is
limited to around 9700 bytes. Normally this is more than adequate as
jumbo frames are limited to 9000 bytes. However, if a large packet
with large (>9700 byte) TCP segments arrives through ibmveth, and is
passed to bnx2x, the hardware will panic.

Turning off TSO prevents the crash as the kernel resegments the data
and assembles the packets in software. This has a performance cost.

Clearly at the very least, bnx2x should not crash in this case, and I
am working towards a patch for that.

However, this still leaves us with some issues. The only thing the
bnx2x driver can sensibly do is drop the packet, which will prevent
the crash. However, there will still be issues with large packets:
when they are dropped, the other side will eventually realise that the
data is missing and ask for a retransmit, but the retransmit might
also be too big - there's no way of signalling back to the AIX LPAR
that it should reduce the MSS. Even if the data eventually gets
through there will be a latency/throughput/performance hit.

==========
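A minimal sketch of the kind of guard being suggested, assuming a device limit near the observed ~9700 bytes (the constant is illustrative, not bnx2x's documented maximum):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative limit near the ~9700-byte maximum segment size at which
 * the firmware assert was observed; not the card's documented value. */
#define HW_MAX_GSO_SEG 9700u

/* A driver could validate gso_size before handing a GSO packet to the
 * hardware: anything larger should be dropped (or software-segmented)
 * rather than passed through to trigger a firmware assert. */
static bool gso_size_ok(uint32_t gso_size)
{
    return gso_size > 0 && gso_size <= HW_MAX_GSO_SEG;
}
```

A 9000-byte jumbo-frame segment passes, while the 64k-class segment sizes that ibmveth can deliver do not, which is the case the driver currently mishandles.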

Seeing as IBM seems to be actively developing in this area (indeed this
code explicitly deals with ibmveth + OVS), could we gently push
for this issue to be considered and fixed as well?

I'm happy to engage on launchpad or whatever the most appropriate channel is.

Regards,
Daniel

On Sat, Aug 26, 2017 at 2:23 AM, Joseph Salisbury
<[hidden email]> wrote:

> [full patch quoted in reply; trimmed]

--
kernel-team mailing list
[hidden email]
https://lists.ubuntu.com/mailman/listinfo/kernel-team

[RESEND][SRU][Xenial][PATCH 0/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Joseph Salisbury-3
In reply to this post by Joseph Salisbury-3
If possible, can this SRU request be reviewed?

Thanks,

Joe



On 08/25/2017 12:23 PM, Joseph Salisbury wrote:

> BugLink: http://bugs.launchpad.net/bugs/1692538
>
> == SRU Justification ==
> Commit 66aa0678ef is request to fix four issues with the ibmveth driver.  
> The issues are as follows:
> - Issue 1: ibmveth doesn't support largesend and checksum offload features when configured as "Trunk".
> - Issue 2: SYN packet drops seen at destination VM. When the packet
> originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO
> server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth
> receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag.
> - Issue 3: First packet of a TCP connection will be dropped, if there is
> no OVS flow cached in datapath.
> - Issue 4: ibmveth driver doesn't have support for SKBs with frag_list.
>
> The details for the fixes to these issues are described in the commit's git log.
>
> This fix has already been included in Zesty and Artful, but Xenial required a backport.  
>
> Commit 66aa0678ef is in mainline as of v4.13-rc1.
>
> == Fix ==
> commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c
> Author: Sivakumar Krishnasamy <[hidden email]>
> Date:   Fri May 19 05:30:38 2017 -0400
>
>     ibmveth: Support to enable LSO/CSO for Trunk VEA.
>
> == Regression Potential ==
> The changes are specific to the ibmveth driver, so regression risk is low.
>
> == Test Case ==
> A test kernel was built with this patch and tested by the original bug reporter.
> The bug reporter states the test kernel resolved the bug.
>
> Sivakumar Krishnasamy (1):
>   ibmveth: Support to enable LSO/CSO for Trunk VEA.
>
>  drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
>  drivers/net/ethernet/ibm/ibmveth.h |   1 +
>  2 files changed, 90 insertions(+), 18 deletions(-)
>



Re: [RESEND][SRU][Xenial][PATCH 0/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Stefan Bader-2
On 14.11.2017 17:15, Joseph Salisbury wrote:
> If possible, can this SRU request be reviewed?

There had been some feedback from Daniel to which I did not see any reply from
you. The possible issues sounded a bit worrying.

-Stefan

>
> Thanks,
>
> Joe
>
>
>
> On 08/25/2017 12:23 PM, Joseph Salisbury wrote:
>> BugLink: http://bugs.launchpad.net/bugs/1692538
>> [...]



Re: [RESEND][SRU][Xenial][PATCH 0/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Joseph Salisbury-3
On 11/15/2017 08:49 AM, Stefan Bader wrote:
> On 14.11.2017 17:15, Joseph Salisbury wrote:
>> If possible, can this SRU request be reviewed?
> There had been some feedback from Daniel for which I did not see any reply from
> you. Possible issues sounded a bit worrying.
>
> -Stefan
Thanks, Stefan.  I'll re-read the feedback from Daniel and reply.

Joe

>
>> Thanks,
>>
>> Joe
>>
>>
>>
>> On 08/25/2017 12:23 PM, Joseph Salisbury wrote:
>>> BugLink: http://bugs.launchpad.net/bugs/1692538
>>> [...]



Re: [SRU][Xenial][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Joseph Salisbury-3
In reply to this post by Daniel Axtens
On 08/25/2017 12:42 PM, Daniel Axtens wrote:

> Hi,
>
> Sorry if I'm a little late to the party with this...
>
> In support we've seen a case where ibmveth + openvswitch + bnx2x has
> lead to some issues, which you should probably be aware of before
> turning on large segments in more places.
>
> Here's my summary from that support case, appropriately sanitised.
>
> ==========
>
> [Issue: we see a firmware assertion from an IBM branded bnx2x card.
> Decoding the dump with the help of upstream shows that the assert is
> caused by a packet with GSO on and gso_size > ~9700 bytes being passed
> to the card. We traced the packets through the system, and came up
> with this root cause. The system uses ibmveth to talk to AIX LPARs, a
> bnx2x network card to talk to the world, and Open vSwitch to tie them
> together. There is no VIOS involvement - the card is attached to the
> Linux partition.]
>
> The packets causing the issue come through the ibmveth interface -
> from the AIX LPAR. The veth protocol is 'special' - communication
> between LPARs on the same chassis can use very large (64k) frames to
> reduce overhead. Normal networks cannot handle such large packets, so
> traditionally, the VIOS partition would signal to the AIX partitions
> that it was 'special', and AIX would send regular, ethernet-sized
> packets to VIOS, which VIOS would then send out.
>
> This signalling between VIOS and AIX is done in a way that is not
> standards-compliant, and so was never made part of Linux. Instead, the
> Linux driver has always understood large frames and passed them up the
> network stack.
>
> In some cases (e.g. with TCP), multiple TCP segments are coalesced
> into one large packet. In Linux, this goes through the generic receive
> offload code, using a similar mechanism to GSO. These segments can be
> very large which presents as a very large MSS (maximum segment size)
> or gso_size.
>
> Normally, the large packet is simply passed to whatever network
> application on Linux is going to consume it, and everything is OK.
>
> However, in this case, the packets go through Open vSwitch, and are
> then passed to the bnx2x driver. The bnx2x driver/hardware supports
> TSO and GSO, but with a restriction: the maximum segment size is
> limited to around 9700 bytes. Normally this is more than adequate as
> jumbo frames are limited to 9000 bytes. However, if a large packet
> with large (>9700 byte) TCP segments arrives through ibmveth, and is
> passed to bnx2x, the hardware will panic.
>
> Turning off TSO prevents the crash as the kernel resegments the data
> and assembles the packets in software. This has a performance cost.
>
> Clearly at the very least, bnx2x should not crash in this case, and I
> am working towards a patch for that.
>
> However, this still leaves us with some issues. The only thing the
> bnx2x driver can sensibly do is drop the packet, which will prevent
> the crash. However, there will still be issues with large packets:
> when they are dropped, the other side will eventually realise that the
> data is missing and ask for a retransmit, but the retransmit might
> also be too big - there's no way of signalling back to the AIX LPAR
> that it should reduce the MSS. Even if the data eventually gets
> through there will be a latency/throughput/performance hit.
>
> ==========
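To make the size mismatch in the summary above concrete, here is a small userspace C sketch of the arithmetic. The 9700-byte per-segment limit and 64 KiB frame size are the approximate figures from the report, not exact hardware constants:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Approximate per-segment limit of the bnx2x TSO engine (from the report). */
#define BNX2X_MAX_SEG   9700u
/* Largest frame ibmveth can deliver between LPARs on one chassis. */
#define VETH_MAX_FRAME  65535u

/* A GSO packet is safe to hand to the card only if each resulting
 * wire segment fits under the hardware limit. */
static bool tso_seg_ok(size_t gso_size)
{
	return gso_size <= BNX2X_MAX_SEG;
}

/* Segment count if the kernel instead falls back to software GSO:
 * ceil(payload_len / mss). */
static size_t sw_gso_segments(size_t payload_len, size_t mss)
{
	return (payload_len + mss - 1) / mss;
}
```

A jumbo-frame gso_size of 9000 passes the check, while a GRO-coalesced veth packet with ~60000-byte segments does not and has to be resegmented in software, which is the performance cost mentioned above.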
>
> Seeing as IBM seems to be in active development in this area - indeed
> this code explicitly deals with ibmveth + ovs, could we gently push
> for this issue to be considered and fixed also?
>
> I'm happy to engage on launchpad or whatever the most appropriate channel is.
>
> Regards,
> Daniel
>
> On Sat, Aug 26, 2017 at 2:23 AM, Joseph Salisbury
> <[hidden email]> wrote:
>> From: Sivakumar Krishnasamy <[hidden email]>
>>
>> BugLink: http://bugs.launchpad.net/bugs/1692538
>>
>> Current largesend and checksum offload features in the ibmveth driver:
>>  - Source VM sends the TCP packets with ip_summed field set as
>>    CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
>>    checksum field
>>  - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
>>    "no checksum" and "checksum good" bits in transmit buffer descriptor
>>    before the packet is delivered to pseries PowerVM Hypervisor
>>  - If ibmveth has largesend capability enabled, transmit buffer descriptors
>>    are marked accordingly before the packet is delivered to the Hypervisor
>>    (along with mss value for packets with length > MSS)
>>  - Destination VM's ibmveth driver receives the packet with "checksum good"
>>    bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
>>  - If "largesend" bit was on, mss value is copied from receive descriptor
>>    into SKB's gso_size and other flags are appropriately set for
>>    packets > MSS size
>>  - The packet is now successfully delivered up the stack in destination VM
>>
>> The offloads described above work fine for TCP communication among VMs in
>> the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B )
>>
>> We are now enabling support for OVS in pseries PowerVM environment. One of
>> our requirements is to have ibmveth driver configured in "Trunk" mode, when
>> they are used with OVS. This is because the PowerVM Hypervisor will no longer
>> bridge the packets between VMs, instead the packets are delivered to
>> IO Server which hosts OVS to bridge them between VMs or to external
>> networks (flow shown below),
>>   VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
>>                                                                    <=> VM B
>> In "IO server" the packet is received by inbound Trunk ibmveth and then
>> delivered to OVS, which is then bridged to outbound Trunk ibmveth (shown
>> below),
>>         Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth
>>
>> In this model, we hit the following issues which impacted the VM
>> communication performance,
>>
>>  - Issue 1: ibmveth doesn't support largesend and checksum offload features
>>    when configured as "Trunk". Driver has explicit checks to prevent
>>    enabling these offloads.
>>
>>  - Issue 2: SYN packet drops seen at destination VM. When the packet
>>    originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
>>    IO server's inbound Trunk ibmveth, on validating "checksum good" bits
>>    in ibmveth receive routine, SKB's ip_summed field is set with
>>    CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
>>    Bridge) and delivered to outbound Trunk ibmveth. At this point the
>>    outbound ibmveth transmit routine will not set "no checksum" and
>>    "checksum good" bits in transmit buffer descriptor, as it does so only
>>    when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
>>    delivered to destination VM, TCP layer receives the packet with checksum
>>    value of 0 and with no checksum related flags in ip_summed field. This
>>    leads to packet drops, so TCP connections never go through.
>>
>>  - Issue 3: First packet of a TCP connection will be dropped, if there is
>>    no OVS flow cached in datapath. OVS while trying to identify the flow,
>>    computes the checksum. The computed checksum will be invalid at the
>>    receiving end, as ibmveth transmit routine zeroes out the pseudo
>>    checksum value in the packet. This leads to packet drop.
>>
>>  - Issue 4: ibmveth driver doesn't have support for SKBs with frag_list.
>>    When Physical NIC has GRO enabled and when OVS bridges these packets,
>>    OVS vport send code will end up calling dev_queue_xmit, which in turn
>>    calls validate_xmit_skb.
>>    In validate_xmit_skb routine, the larger packets will get segmented into
>>    MSS-sized segments if the SKB has a frag_list and the driver they are
>>    delivered to doesn't support the NETIF_F_FRAGLIST feature.
>>
>> This patch addresses the above four issues, thereby enabling end to end
>> largesend and checksum offload support for better performance.
>>
>>  - Fix for Issue 1 : Remove checks which prevent enabling TCP largesend and
>>    checksum offloads.
>>  - Fix for Issue 2 : When ibmveth receives a packet with "checksum good"
>>    bit set and if its configured in Trunk mode, set appropriate SKB fields
>>    using skb_partial_csum_set (ip_summed field is set with
>>    CHECKSUM_PARTIAL)
>>  - Fix for Issue 3: Recompute the pseudo header checksum before sending the
>>    SKB up the stack.
>>  - Fix for Issue 4: Linearize the SKBs with frag_list. Though we end up
>>    allocating buffers and copying data, this fix gives
>>    up to a 4X throughput increase.
>>
>> Note: All these fixes need to be dropped together as fixing just one of
>> them will lead to other issues immediately (especially for Issues 1,2 & 3).
>>
>> Signed-off-by: Sivakumar Krishnasamy <[hidden email]>
>> Signed-off-by: David S. Miller <[hidden email]>
>> (back ported from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
>> Signed-off-by: Joseph Salisbury <[hidden email]>
>> ---
>>  drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
>>  drivers/net/ethernet/ibm/ibmveth.h |   1 +
>>  2 files changed, 90 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> index 2f9b12c..0482491 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -46,6 +46,8 @@
>>  #include <asm/vio.h>
>>  #include <asm/iommu.h>
>>  #include <asm/firmware.h>
>> +#include <net/tcp.h>
>> +#include <net/ip6_checksum.h>
>>
>>  #include "ibmveth.h"
>>
>> @@ -804,8 +806,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
>>
>>         ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
>>
>> -       if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
>> -           !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
>> +       if (ret == H_SUCCESS &&
>>             (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
>>                 ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
>>                                          set_attr, &ret_attr);
>> @@ -1035,6 +1036,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>>         dma_addr_t dma_addr;
>>         unsigned long mss = 0;
>>
>> +       /* veth doesn't handle frag_list, so linearize the skb.
>> +        * When GRO is enabled SKB's can have frag_list.
>> +        */
>> +       if (adapter->is_active_trunk &&
>> +           skb_has_frag_list(skb) && __skb_linearize(skb)) {
>> +               netdev->stats.tx_dropped++;
>> +               goto out;
>> +       }
>> +
>>         /*
>>          * veth handles a maximum of 6 segments including the header, so
>>          * we have to linearize the skb if there are more than this.
>> @@ -1059,9 +1069,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>>
>>         desc_flags = IBMVETH_BUF_VALID;
>>
>> -       if (skb_is_gso(skb) && adapter->fw_large_send_support)
>> -               desc_flags |= IBMVETH_BUF_LRG_SND;
>> -
>>         if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>                 unsigned char *buf = skb_transport_header(skb) +
>>                                                 skb->csum_offset;
>> @@ -1071,6 +1078,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>>                 /* Need to zero out the checksum */
>>                 buf[0] = 0;
>>                 buf[1] = 0;
>> +
>> +               if (skb_is_gso(skb) && adapter->fw_large_send_support)
>> +                       desc_flags |= IBMVETH_BUF_LRG_SND;
>>         }
>>
>>  retry_bounce:
>> @@ -1123,7 +1133,7 @@ retry_bounce:
>>                 descs[i+1].fields.address = dma_addr;
>>         }
>>
>> -       if (skb_is_gso(skb)) {
>> +       if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
>>                 if (adapter->fw_large_send_support) {
>>                         mss = (unsigned long)skb_shinfo(skb)->gso_size;
>>                         adapter->tx_large_packets++;
>> @@ -1224,6 +1234,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
>>         }
>>  }
>>
>> +static void ibmveth_rx_csum_helper(struct sk_buff *skb,
>> +                                  struct ibmveth_adapter *adapter)
>> +{
>> +       struct iphdr *iph = NULL;
>> +       struct ipv6hdr *iph6 = NULL;
>> +       __be16 skb_proto = 0;
>> +       u16 iphlen = 0;
>> +       u16 iph_proto = 0;
>> +       u16 tcphdrlen = 0;
>> +
>> +       skb_proto = be16_to_cpu(skb->protocol);
>> +
>> +       if (skb_proto == ETH_P_IP) {
>> +               iph = (struct iphdr *)skb->data;
>> +
>> +               /* If the IP checksum is not offloaded and if the packet
>> +                *  is large send, the checksum must be rebuilt.
>> +                */
>> +               if (iph->check == 0xffff) {
>> +                       iph->check = 0;
>> +                       iph->check = ip_fast_csum((unsigned char *)iph,
>> +                                                 iph->ihl);
>> +               }
>> +
>> +               iphlen = iph->ihl * 4;
>> +               iph_proto = iph->protocol;
>> +       } else if (skb_proto == ETH_P_IPV6) {
>> +               iph6 = (struct ipv6hdr *)skb->data;
>> +               iphlen = sizeof(struct ipv6hdr);
>> +               iph_proto = iph6->nexthdr;
>> +       }
>> +
>> +       /* In OVS environment, when a flow is not cached, specifically for a
>> +        * new TCP connection, the first packet information is passed up
>> +        * the user space for finding a flow. During this process, OVS computes
>> +        * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
>> +        *
>> +        * Given that we zeroed out TCP checksum field in transmit path
>> +        * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
>> +        * OVS computed checksum will be incorrect w/o TCP pseudo checksum
>> +        * in the packet. This leads to OVS dropping the packet and hence
>> +        * TCP retransmissions are seen.
>> +        *
>> +        * So, re-compute TCP pseudo header checksum.
>> +        */
>> +       if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
>> +               struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
>> +
>> +               tcphdrlen = skb->len - iphlen;
>> +
>> +               /* Recompute TCP pseudo header checksum */
>> +               if (skb_proto == ETH_P_IP)
>> +                       tcph->check = ~csum_tcpudp_magic(iph->saddr,
>> +                                       iph->daddr, tcphdrlen, iph_proto, 0);
>> +               else if (skb_proto == ETH_P_IPV6)
>> +                       tcph->check = ~csum_ipv6_magic(&iph6->saddr,
>> +                                       &iph6->daddr, tcphdrlen, iph_proto, 0);
>> +
>> +               /* Setup SKB fields for checksum offload */
>> +               skb_partial_csum_set(skb, iphlen,
>> +                                    offsetof(struct tcphdr, check));
>> +               skb_reset_network_header(skb);
>> +       }
>> +}
>> +
>>  static int ibmveth_poll(struct napi_struct *napi, int budget)
>>  {
>>         struct ibmveth_adapter *adapter =
>> @@ -1231,7 +1306,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>         struct net_device *netdev = adapter->netdev;
>>         int frames_processed = 0;
>>         unsigned long lpar_rc;
>> -       struct iphdr *iph;
>>         u16 mss = 0;
>>
>>  restart_poll:
>> @@ -1289,17 +1363,7 @@ restart_poll:
>>
>>                         if (csum_good) {
>>                                 skb->ip_summed = CHECKSUM_UNNECESSARY;
>> -                               if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
>> -                                       iph = (struct iphdr *)skb->data;
>> -
>> -                                       /* If the IP checksum is not offloaded and if the packet
>> -                                        *  is large send, the checksum must be rebuilt.
>> -                                        */
>> -                                       if (iph->check == 0xffff) {
>> -                                               iph->check = 0;
>> -                                               iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>> -                                       }
>> -                               }
>> +                               ibmveth_rx_csum_helper(skb, adapter);
>>                         }
>>
>>                         if (length > netdev->mtu + ETH_HLEN) {
>> @@ -1621,6 +1685,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>>                 netdev->hw_features |= NETIF_F_TSO;
>>         }
>>
>> +       adapter->is_active_trunk = false;
>> +       if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
>> +               adapter->is_active_trunk = true;
>> +               netdev->hw_features |= NETIF_F_FRAGLIST;
>> +               netdev->features |= NETIF_F_FRAGLIST;
>> +       }
>> +
>>         memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
>>
>>         if (firmware_has_feature(FW_FEATURE_CMO))
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
>> index 7acda04..de6e381 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.h
>> +++ b/drivers/net/ethernet/ibm/ibmveth.h
>> @@ -157,6 +157,7 @@ struct ibmveth_adapter {
>>      int pool_config;
>>      int rx_csum;
>>      int large_send;
>> +    bool is_active_trunk;
>>      void *bounce_buffer;
>>      dma_addr_t bounce_buffer_dma;
>>
>> --
>> 2.7.4
>>
>>
Hi Daniel,

The SRU request for Xenial is on hold.  I forwarded the information you
provided to IBM.  Thanks for the detailed write-up!  I'll keep you up to
date on any feedback they provide.

You could also subscribe to the bug report, and feel free to comment as
needed:
http://pad.lv/1692538

Thanks,

Joe
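For readers following the Issue 3 fix quoted above (the csum_tcpudp_magic() call in ibmveth_rx_csum_helper), the IPv4 pseudo-header arithmetic can be sketched in plain userspace C. This is a simplified illustration of RFC 1071 ones'-complement folding, not the kernel's implementation, and it ignores byte-order handling that real skb code must do:

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit running sum into 16 bits (RFC 1071). */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement checksum of the IPv4 pseudo header: source and
 * destination addresses as 32-bit values split into 16-bit words,
 * a zero byte plus the protocol number, and the TCP segment length. */
static uint16_t pseudo_hdr_csum(uint32_t saddr, uint32_t daddr,
				uint8_t proto, uint16_t tcp_len)
{
	uint32_t sum = 0;

	sum += saddr >> 16;
	sum += saddr & 0xffff;
	sum += daddr >> 16;
	sum += daddr & 0xffff;
	sum += proto;	/* zero byte + protocol field */
	sum += tcp_len;

	return (uint16_t)~csum_fold(sum);
}
```

Seeding tcph->check with (the complement of) this value before sending the SKB up the stack is what lets OVS's software checksum pass on the first, uncached packet of a flow.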







Re: [SRU][Xenial][PATCH 1/1] ibmveth: Support to enable LSO/CSO for Trunk VEA.

Daniel Axtens
Hi Joe,

Thanks - subscribed and commenting now!

Regards,
Daniel

On Sat, Nov 18, 2017 at 12:23 AM, Joseph Salisbury <[hidden email]> wrote:
On 08/25/2017 12:42 PM, Daniel Axtens wrote:
> Hi,
>
> Sorry if I'm a little late to the party with this...
>
> In support we've seen a case where ibmveth + openvswitch + bnx2x has
> lead to some issues, which you should probably be aware of before
> turning on large segments in more places.
>
> Here's my summary from that support case, appropriately sanitised.
>
> ==========
>
> [Issue: we see a firmware assertion from an IBM branded bnx2x card.
> Decoding the dump with the help of upstream shows that the assert is
> caused by a packet with GSO on and gso_size > ~9700 bytes being passed
> to the card. We traced the packets through the system, and came up
> with this root cause. The system uses ibmveth to talk to AIX LPARs, a
> bnx2x network card to talk to the world, and Open vSwitch to tie them
> together. There is no VIOS involvement - the card is attached to the
> Linux partition.]
>
> The packets causing the issue come through the ibmveth interface -
> from the AIX LPAR. The veth protocol is 'special' - communication
> between LPARs on the same chassis can use very large (64k) frames to
> reduce overhead. Normal networks cannot handle such large packets, so
> traditionally, the VIOS partition would signal to the AIX partitions
> that it was 'special', and AIX would send regular, ethernet-sized
> packets to VIOS, which VIOS would then send out.
>
> This signalling between VIOS and AIX is done in a way that is not
> standards-compliant, and so was never made part of Linux. Instead, the
> Linux driver has always understood large frames and passed them up the
> network stack.
>
> In some cases (e.g. with TCP), multiple TCP segments are coalesced
> into one large packet. In Linux, this goes through the generic receive
> offload code, using a similar mechanism to GSO. These segments can be
> very large which presents as a very large MSS (maximum segment size)
> or gso_size.
>
> Normally, the large packet is simply passed to whatever network
> application on Linux is going to consume it, and everything is OK.
>
> However, in this case, the packets go through Open vSwitch, and are
> then passed to the bnx2x driver. The bnx2x driver/hardware supports
> TSO and GSO, but with a restriction: the maximum segment size is
> limited to around 9700 bytes. Normally this is more than adequate as
> jumbo frames are limited to 9000 bytes. However, if a large packet
> with large (>9700 byte) TCP segments arrives through ibmveth, and is
> passed to bnx2x, the hardware will panic.
>
> Turning off TSO prevents the crash as the kernel resegments the data
> and assembles the packets in software. This has a performance cost.
>
> Clearly at the very least, bnx2x should not crash in this case, and I
> am working towards a patch for that.
>
> However, this still leaves us with some issues. The only thing the
> bnx2x driver can sensibly do is drop the packet, which will prevent
> the crash. However, there will still be issues with large packets:
> when they are dropped, the other side will eventually realise that the
> data is missing and ask for a retransmit, but the retransmit might
> also be too big - there's no way of signalling back to the AIX LPAR
> that it should reduce the MSS. Even if the data eventually gets
> through there will be a latency/throughput/performance hit.
>
> ==========
>
> Seeing as IBM seems to be in active development in this area - indeed
> this code explicitly deals with ibmveth + ovs, could we gently push
> for this issue to be considered and fixed also?
>
> I'm happy to engage on launchpad or whatever the most appropriate channel is.
>
> Regards,
> Daniel
>
> On Sat, Aug 26, 2017 at 2:23 AM, Joseph Salisbury
> <[hidden email]> wrote:
>> From: Sivakumar Krishnasamy <[hidden email]>
>>
>> BugLink: http://bugs.launchpad.net/bugs/1692538
>>
>> Current largesend and checksum offload feature in ibmveth driver,
>>  - Source VM sends the TCP packets with ip_summed field set as
>>    CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
>>    checksum field
>>  - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
>>    "no checksum" and "checksum good" bits in transmit buffer descriptor
>>    before the packet is delivered to pseries PowerVM Hypervisor
>>  - If ibmveth has largesend capability enabled, transmit buffer descriptors
>>    are market accordingly before packet is delivered to Hypervisor
>>    (along with mss value for packets with length > MSS)
>>  - Destination VM's ibmveth driver receives the packet with "checksum good"
>>    bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
>>  - If "largesend" bit was on, mss value is copied from receive descriptor
>>    into SKB's gso_size and other flags are appropriately set for
>>    packets > MSS size
>>  - The packet is now successfully delivered up the stack in destination VM
>>
>> The offloads described above works fine for TCP communication among VMs in
>> the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B )
>>
>> We are now enabling support for OVS in pseries PowerVM environment. One of
>> our requirements is to have ibmveth driver configured in "Trunk" mode, when
>> they are used with OVS. This is because, PowerVM Hypervisor will no more
>> bridge the packets between VMs, instead the packets are delivered to
>> IO Server which hosts OVS to bridge them between VMs or to external
>> networks (flow shown below),
>>   VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
>>                                                                    <=> VM B
>> In "IO server" the packet is received by inbound Trunk ibmveth and then
>> delivered to OVS, which is then bridged to outbound Trunk ibmveth (shown
>> below),
>>         Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth
>>
>> In this model, we hit the following issues which impacted the VM
>> communication performance,
>>
>>  - Issue 1: ibmveth doesn't support largesend and checksum offload features
>>    when configured as "Trunk". Driver has explicit checks to prevent
>>    enabling these offloads.
>>
>>  - Issue 2: SYN packet drops seen at destination VM. When the packet
>>    originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
>>    IO server's inbound Trunk ibmveth, on validating "checksum good" bits
>>    in ibmveth receive routine, SKB's ip_summed field is set with
>>    CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
>>    Bridge) and delivered to outbound Trunk ibmveth. At this point the
>>    outbound ibmveth transmit routine will not set "no checksum" and
>>    "checksum good" bits in transmit buffer descriptor, as it does so only
>>    when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
>>    delivered to destination VM, TCP layer receives the packet with checksum
>>    value of 0 and with no checksum related flags in ip_summed field. This
>>    leads to packet drops. So, TCP connections never goes through fine.
>>
>>  - Issue 3: The first packet of a TCP connection will be dropped if there
>>    is no OVS flow cached in the datapath. While trying to identify the
>>    flow, OVS computes the checksum. The computed checksum will be invalid
>>    at the receiving end because the ibmveth transmit routine zeroes out
>>    the pseudo checksum value in the packet. This leads to the packet drop.
>>
>>  - Issue 4: The ibmveth driver doesn't support SKBs with a frag_list.
>>    When the physical NIC has GRO enabled and OVS bridges these packets,
>>    the OVS vport send code ends up calling dev_queue_xmit, which in turn
>>    calls validate_xmit_skb. There, larger packets are segmented into
>>    MSS-sized segments if the SKB has a frag_list and the driver they are
>>    delivered to doesn't advertise the NETIF_F_FRAGLIST feature.
>>
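To illustrate the pseudo checksum that Issue 3 is about: the TCP pseudo-header checksum is a ones'-complement sum over the source/destination addresses, protocol and TCP length. Below is a minimal userspace sketch for IPv4 (the helper name and test values are hypothetical; this is not the kernel's csum_tcpudp_magic, just the same arithmetic):

```c
#include <stdint.h>

/* Hypothetical userspace analog of the IPv4 TCP pseudo-header checksum:
 * sum the addresses, protocol and TCP length as 16-bit words in ones'
 * complement, fold the carries, and return the complement. */
static uint16_t pseudo_hdr_csum(uint32_t saddr, uint32_t daddr,
                                uint16_t tcp_len, uint8_t proto)
{
	uint32_t sum = 0;

	sum += (saddr >> 16) + (saddr & 0xffff);
	sum += (daddr >> 16) + (daddr & 0xffff);
	sum += proto;
	sum += tcp_len;

	/* Fold any carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Without this value seeded into tcph->check, any party that finishes the checksum over the TCP segment (here, OVS) produces a result the receiver will reject, which is exactly the Issue 3 drop.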
>> This patch addresses the above four issues, thereby enabling end-to-end
>> largesend and checksum offload support for better performance.
>>
>>  - Fix for Issue 1: Remove the checks which prevent enabling TCP
>>    largesend and checksum offloads.
>>  - Fix for Issue 2: When ibmveth receives a packet with the "checksum
>>    good" bit set and it is configured in Trunk mode, set the appropriate
>>    SKB fields using skb_partial_csum_set (the ip_summed field is set to
>>    CHECKSUM_PARTIAL).
>>  - Fix for Issue 3: Recompute the pseudo header checksum before sending
>>    the SKB up the stack.
>>  - Fix for Issue 4: Linearize the SKBs with a frag_list. Though we end up
>>    allocating buffers and copying data, this fix gives up to a 4X
>>    throughput increase.
>>
>> Note: All these fixes need to go in together, as fixing just one of them
>> will lead to other issues immediately (especially for Issues 1, 2 & 3).
>>
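As background for the receive-side iph->check rebuild the patch does with ip_fast_csum: the IPv4 header checksum is the standard RFC 1071 ones'-complement sum over the header taken as 16-bit words with the checksum field zeroed. A minimal userspace sketch (hypothetical function name and sample header, not the kernel implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical userspace analog of ip_fast_csum: RFC 1071 ones'
 * complement checksum over an IPv4 header of nwords 16-bit words,
 * with the checksum field zeroed before summing. */
static uint16_t ipv4_hdr_csum(const uint16_t *words, size_t nwords)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += words[i];

	/* Fold carries into 16 bits, then complement. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

A useful property: re-running the sum over a header with a correct checksum left in place folds to 0xffff, so the function returns 0, which is how such checksums are verified.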
>> Signed-off-by: Sivakumar Krishnasamy <[hidden email]>
>> Signed-off-by: David S. Miller <[hidden email]>
>> (back ported from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
>> Signed-off-by: Joseph Salisbury <[hidden email]>
>> ---
>>  drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
>>  drivers/net/ethernet/ibm/ibmveth.h |   1 +
>>  2 files changed, 90 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> index 2f9b12c..0482491 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -46,6 +46,8 @@
>>  #include <asm/vio.h>
>>  #include <asm/iommu.h>
>>  #include <asm/firmware.h>
>> +#include <net/tcp.h>
>> +#include <net/ip6_checksum.h>
>>
>>  #include "ibmveth.h"
>>
>> @@ -804,8 +806,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
>>
>>         ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
>>
>> -       if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
>> -           !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
>> +       if (ret == H_SUCCESS &&
>>             (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
>>                 ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
>>                                          set_attr, &ret_attr);
>> @@ -1035,6 +1036,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>>         dma_addr_t dma_addr;
>>         unsigned long mss = 0;
>>
>> +       /* veth doesn't handle frag_list, so linearize the skb.
>> +        * When GRO is enabled SKB's can have frag_list.
>> +        */
>> +       if (adapter->is_active_trunk &&
>> +           skb_has_frag_list(skb) && __skb_linearize(skb)) {
>> +               netdev->stats.tx_dropped++;
>> +               goto out;
>> +       }
>> +
>>         /*
>>          * veth handles a maximum of 6 segments including the header, so
>>          * we have to linearize the skb if there are more than this.
>> @@ -1059,9 +1069,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>>
>>         desc_flags = IBMVETH_BUF_VALID;
>>
>> -       if (skb_is_gso(skb) && adapter->fw_large_send_support)
>> -               desc_flags |= IBMVETH_BUF_LRG_SND;
>> -
>>         if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>                 unsigned char *buf = skb_transport_header(skb) +
>>                                                 skb->csum_offset;
>> @@ -1071,6 +1078,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
>>                 /* Need to zero out the checksum */
>>                 buf[0] = 0;
>>                 buf[1] = 0;
>> +
>> +               if (skb_is_gso(skb) && adapter->fw_large_send_support)
>> +                       desc_flags |= IBMVETH_BUF_LRG_SND;
>>         }
>>
>>  retry_bounce:
>> @@ -1123,7 +1133,7 @@ retry_bounce:
>>                 descs[i+1].fields.address = dma_addr;
>>         }
>>
>> -       if (skb_is_gso(skb)) {
>> +       if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
>>                 if (adapter->fw_large_send_support) {
>>                         mss = (unsigned long)skb_shinfo(skb)->gso_size;
>>                         adapter->tx_large_packets++;
>> @@ -1224,6 +1234,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
>>         }
>>  }
>>
>> +static void ibmveth_rx_csum_helper(struct sk_buff *skb,
>> +                                  struct ibmveth_adapter *adapter)
>> +{
>> +       struct iphdr *iph = NULL;
>> +       struct ipv6hdr *iph6 = NULL;
>> +       __be16 skb_proto = 0;
>> +       u16 iphlen = 0;
>> +       u16 iph_proto = 0;
>> +       u16 tcphdrlen = 0;
>> +
>> +       skb_proto = be16_to_cpu(skb->protocol);
>> +
>> +       if (skb_proto == ETH_P_IP) {
>> +               iph = (struct iphdr *)skb->data;
>> +
>> +               /* If the IP checksum is not offloaded and if the packet
>> +                *  is large send, the checksum must be rebuilt.
>> +                */
>> +               if (iph->check == 0xffff) {
>> +                       iph->check = 0;
>> +                       iph->check = ip_fast_csum((unsigned char *)iph,
>> +                                                 iph->ihl);
>> +               }
>> +
>> +               iphlen = iph->ihl * 4;
>> +               iph_proto = iph->protocol;
>> +       } else if (skb_proto == ETH_P_IPV6) {
>> +               iph6 = (struct ipv6hdr *)skb->data;
>> +               iphlen = sizeof(struct ipv6hdr);
>> +               iph_proto = iph6->nexthdr;
>> +       }
>> +
>> +       /* In OVS environment, when a flow is not cached, specifically for a
>> +        * new TCP connection, the first packet information is passed up
>> +        * the user space for finding a flow. During this process, OVS computes
>> +        * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
>> +        *
>> +        * Given that we zeroed out TCP checksum field in transmit path
>> +        * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
>> +        * OVS computed checksum will be incorrect w/o TCP pseudo checksum
>> +        * in the packet. This leads to OVS dropping the packet and hence
>> +        * TCP retransmissions are seen.
>> +        *
>> +        * So, re-compute TCP pseudo header checksum.
>> +        */
>> +       if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
>> +               struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
>> +
>> +               tcphdrlen = skb->len - iphlen;
>> +
>> +               /* Recompute TCP pseudo header checksum */
>> +               if (skb_proto == ETH_P_IP)
>> +                       tcph->check = ~csum_tcpudp_magic(iph->saddr,
>> +                                       iph->daddr, tcphdrlen, iph_proto, 0);
>> +               else if (skb_proto == ETH_P_IPV6)
>> +                       tcph->check = ~csum_ipv6_magic(&iph6->saddr,
>> +                                       &iph6->daddr, tcphdrlen, iph_proto, 0);
>> +
>> +               /* Setup SKB fields for checksum offload */
>> +               skb_partial_csum_set(skb, iphlen,
>> +                                    offsetof(struct tcphdr, check));
>> +               skb_reset_network_header(skb);
>> +       }
>> +}
>> +
>>  static int ibmveth_poll(struct napi_struct *napi, int budget)
>>  {
>>         struct ibmveth_adapter *adapter =
>> @@ -1231,7 +1306,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>         struct net_device *netdev = adapter->netdev;
>>         int frames_processed = 0;
>>         unsigned long lpar_rc;
>> -       struct iphdr *iph;
>>         u16 mss = 0;
>>
>>  restart_poll:
>> @@ -1289,17 +1363,7 @@ restart_poll:
>>
>>                         if (csum_good) {
>>                                 skb->ip_summed = CHECKSUM_UNNECESSARY;
>> -                               if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
>> -                                       iph = (struct iphdr *)skb->data;
>> -
>> -                                       /* If the IP checksum is not offloaded and if the packet
>> -                                        *  is large send, the checksum must be rebuilt.
>> -                                        */
>> -                                       if (iph->check == 0xffff) {
>> -                                               iph->check = 0;
>> -                                               iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>> -                                       }
>> -                               }
>> +                               ibmveth_rx_csum_helper(skb, adapter);
>>                         }
>>
>>                         if (length > netdev->mtu + ETH_HLEN) {
>> @@ -1621,6 +1685,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
>>                 netdev->hw_features |= NETIF_F_TSO;
>>         }
>>
>> +       adapter->is_active_trunk = false;
>> +       if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
>> +               adapter->is_active_trunk = true;
>> +               netdev->hw_features |= NETIF_F_FRAGLIST;
>> +               netdev->features |= NETIF_F_FRAGLIST;
>> +       }
>> +
>>         memcpy(netdev->dev_addr, mac_addr_p, ETH_ALEN);
>>
>>         if (firmware_has_feature(FW_FEATURE_CMO))
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
>> index 7acda04..de6e381 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.h
>> +++ b/drivers/net/ethernet/ibm/ibmveth.h
>> @@ -157,6 +157,7 @@ struct ibmveth_adapter {
>>      int pool_config;
>>      int rx_csum;
>>      int large_send;
>> +    bool is_active_trunk;
>>      void *bounce_buffer;
>>      dma_addr_t bounce_buffer_dma;
>>
>> --
>> 2.7.4
>>
>>
>> --
>> kernel-team mailing list
>> [hidden email]
>> https://lists.ubuntu.com/mailman/listinfo/kernel-team
Hi Daniel,

The SRU request for Xenial is on hold.  I forwarded the information you
provided to IBM.  Thanks for the detailed write-up!  I'll keep you up to
date on any feedback they provide.

You could also subscribe to the bug report, and feel free to comment as
needed:
http://pad.lv/1692538

Thanks,

Joe