i40iw Linux* Driver for Intel(R) Ethernet Connection X722
=========================================================

June 28, 2019


Contents
========

- Prerequisites
- Building and Installation
- Testing
- Interoperability
- RDMA Statistics
- Known Issues/Troubleshooting
- Unsupported and Discontinued Items


Prerequisites
=============

One of the following:
* Latest stable upstream kernel with the inbox InfiniBand* support installed
* Red Hat* Enterprise Linux* (RHEL) 7.4/7.5/7.6 with the inbox InfiniBand*
support installed
* SUSE* Linux Enterprise Server (SLES) 12 SP3/12 SP4/15 with the inbox
InfiniBand* support installed

- NOTE: The i40e driver must be built from source on your system before you
install i40iw.


Memory Requirements:
--------------------
The default i40iw load requires a minimum of 6 GB of memory for initialization.

For applications where the amount of memory is constrained, you can decrease
the required memory by lowering the resources available to the i40iw driver. To
do this, load the driver with the following profile setting:

Note: This can have performance and scaling impacts as the number of queue
pairs and other RDMA resources are decreased in order to lower memory usage to
approximately 1.2 GB.

  modprobe i40iw resource_profile=2
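To make the profile setting persist across reboots, it can instead be placed in
a modprobe configuration file (the path /etc/modprobe.d/i40iw.conf below is
only an example):

```
options i40iw resource_profile=2
```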


Scaling Limits
--------------
Intel(R) Ethernet Connection X722 has limited RDMA resources, including the
number of Queue Pairs (QPs), Completion Queues (CQs), and Memory Regions (MRs).
In highly scaled environments or highly interconnected HPC-style applications
such as all-to-all, users may experience QP failure errors once they reach the
RDMA resource limits.

Below are the per-physical-port limits for 4-port devices for the three
resources associated with the default i40iw driver load:
  QPs: 16384
  CQs: 32768
  MRs: 2453503

Other resource profiles allocate resources differently. If i40iw is loaded with
resource_profile=2, then resources will be more limited.

The example below shows the per-physical-port resource limits when you use
modprobe i40iw resource_profile=2. (Note that these limits may increase if you
load fewer than 32 VFs using the max_rdma_vfs module parameter.)
  QPs: 2048
  CQs: 3584
  MRs: 6143
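You can check the limits your device actually reports at run time with
ibv_devinfo from rdma-core. The filter below is a minimal sketch; the helper
name is illustrative, and it assumes ibv_devinfo's verbose output prints fields
such as "max_qp:", "max_cq:", and "max_mr:".

```shell
# Keep only the three resource limits discussed above from verbose
# ibv_devinfo output (read on stdin).
rdma_limits() {
    grep -E 'max_(qp|cq|mr):'
}

# Example usage against a real device (i40iw0 is an example name):
#   ibv_devinfo -v -d i40iw0 | rdma_limits
```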


Building and Installation
=========================

1. Untar i40iw-<version>.tar.gz.

2. Install the i40iw PF driver as follows:

   # cd i40iw-<version>
   # ./build.sh <absolute path to i40e driver directory> k

   For example: ./build.sh /opt/i40e-2.3.3 k

3. Please download the latest rdma_core user-space package from
   https://github.com/linux-rdma/rdma-core/releases and follow its
   installation procedure.

Note: There might be errors resulting from conflicting packages when upgrading
rdma-core to the latest version. If so, use the following procedure:

   A. Remove the existing packages using:
      # rpm -e <package.rpm>

   B. Install the newer version of the package:
      # rpm -ivh <package.rpm>


Adapter and Switch Flow Control Setting
---------------------------------------
We recommend enabling link-level flow control (both TX and RX) on the X722 and
on the connected switch.

For better performance, enable flow control on all the nodes and on the switch
they are connected to.

To enable flow control on X722, use the ethtool -A command. For example:

# ethtool -A p4p1 rx on tx on
  where p4p1 is the iwarp interface name.

Confirm the setting with the ethtool -a command. For example:

# ethtool -a p4p1

You should see this output:
  Pause parameters for p4p1:
  Autonegotiate: off
  RX: on
  TX: on

To enable link-level flow control on the switch, please consult your switch
vendor's documentation. Look for flow-control and make sure both TX and RX are
set. Here is an example for a generic switch to enable both TX and RX flow
control on port 45:

  enable flow-control tx-pause ports 45
  enable flow-control rx-pause ports 45


Recommended Settings for Intel MPI 2017.0.x
-------------------------------------------
Note: The following instructions assume that Intel MPI is installed using
default locations. Refer to Intel MPI documentation for further details on
parameters and general instructions.

1. Add or modify the following line in /etc/dat.conf, changing
<iwarp_interface> to match your interface name:

ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0
"<iwarp_interface> 0" ""

2. To select the iWARP device, add the following to mpiexec command:

  -genv I_MPI_FALLBACK_DEVICE disable
  -genv I_MPI_DEVICE rdma:ofa-v2-iwarp

Example
  mpiexec command line for uDAPL-2.0:
  mpiexec -machinefile <pathto>mpd.hosts_impi
  -genv I_MPI_FALLBACK_DEVICE disable
  -genv I_MPI_DEVICE rdma:ofa-v2-iwarp
  -ppn <number of processes per node> -n <total number of processes>
  <path to mpi application> <optional_parameters>

Note: mpd.hosts_impi is a text file with a list of the nodes' qualified
hostnames or IP addresses, one per line, in the MPI ring.

Note: Recommended optional_parameters when running the IMB-MPI1 benchmark:
  -time 1000000  (a benchmark will run at most that many seconds per
                  message size)
  -mem 2GB       (at most that many GBytes are allocated per process for
                  the message buffers)


Recommended Settings for Open MPI 3.x.x
---------------------------------------
Note: The following instructions assume that Open MPI is installed using
default locations. Refer to Open MPI documentation at open-mpi.org for further
details on parameters and general instructions.

Note: There is more than one way to specify MCA parameters in Open MPI.
Please visit this link and use the best method for your environment:
  http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

Necessary parameters for the mpirun command: -mca btl openib,self,vader
Use openib (the OpenFabrics device), send-to-self semantics, and shared memory.

-mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
Set the receive queue sizes. This is especially useful for interoperation
between iWARP RDMA vendors, because the queue sizes may differ per vendor in
the file "<path>openmpi/mca-btl-openib-device-params.ini".

-mca oob ^ud
Do not use UD QPs

Example mpirun command line:
  mpirun -np <total number of processes> -hostfile <pathto>mpd.hosts_ompi
  --map-by node --allow-run-as-root --display-map -v -tag-output
  -mca btl_openib_receive_queues P,128,256,192,128:P,65536,256,192,128
  -mca btl openib,self,vader
  -mca mpi_leave_pinned 0
  -mca oob ^ud
  <path>/openmpi_benchmarks/3.x.x/benchmark [optional_parameters]

Note: mpd.hosts_ompi is a text file listing the nodes in the MPI ring, one per
line, as qualified hostnames or IP addresses followed by "slots=<total number
of logical cores per node>". The slots parameter is required when <total number
of logical cores per node> is greater than 72. Refer to the Open MPI
documentation for more details.

Note: underscores are not allowed in hostnames.
  Example:
  QA0094-1-0 slots=72
  QA0096-1-0 slots=72

Recommended optional_parameters for IMB-MPI1 benchmark:
  -time 1000000 (specifies that a benchmark will run at most that
  many seconds per message size)


Testing
=======

Verify RDMA traffic
-------------------
The following rping test can be run to confirm RDMA functionality:
   Server side: rping -s -a <server_ip> -vVd
   Client side: rping -c -a <server_ip> -vVd
Execute the server side before the client side. rping runs continuously,
printing data to the console. First, run the rping server and client both on
machine A. After confirming that machine A operates correctly, run the rping
server and client both on machine B. After confirming that machine B operates
correctly, run rping from machine A to machine B.

* Make sure portmapper is running.

To check the status:

# systemctl status iwpmd

To start portmapper:

# systemctl start iwpmd


Latency/Bandwidth Test Example
------------------------------
Download the latest perftest package from
https://github.com/linux-rdma/perftest.
Intel recommends using ib_write_bw and ib_read_bw for bandwidth tests, and
ib_write_lat and ib_read_lat for latency tests.

*Note: Up to 48 bytes of inline data are supported.

More examples and information about these tests can be found here:
https://github.com/linux-rdma/perftest

You can find full command line information by running <perftest_name> -h
(example: ib_write_bw -h or ib_write_bw --help)
Each of the perftest benchmarks uses "-d <IB device name>" to select the
RDMA-enabled port to run on.
To determine which Ethernet interface corresponds to which IB device name, use
the following steps:
  1. Use ifconfig (or another utility, such as ip) to determine the MAC
     address of the port you want to test. For example:

     # ifconfig eno2

  eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
  inet 13.13.2.16 netmask 255.255.255.0 broadcast 13.13.2.255
  inet6 fe80::a6bf:1ff:fe26:9997 prefixlen 64 scopeid 0x20<link>
  ether a4:bf:01:26:99:97 txqueuelen 1000 (Ethernet)

  2. Run ibv_devices to determine the IB device name corresponding to that
     MAC address.

Ex:
  # ibv_devices | sort
      device                 node GUID
      ------              ----------------
      i40iw0              a4bf012699980000
      i40iw1              a4bf012699990000
      i40iw2              a4bf012699960000
      i40iw3              a4bf012699970000
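In the listing above, each node GUID is the port's MAC address with the colons
removed and "0000" appended. Assuming that pattern holds for your ports, a
small helper (the function name is hypothetical) can map a MAC address straight
to its IB device:

```shell
# Turn a MAC address such as a4:bf:01:26:99:97 into the node GUID format
# shown by ibv_devices (colons stripped, "0000" appended).
mac_to_guid() {
    printf '%s0000\n' "$(printf '%s' "$1" | tr -d ':')"
}

# Example usage:
#   ibv_devices | grep "$(mac_to_guid a4:bf:01:26:99:97)"
```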

Bandwidth test example:
(4K message size test on port 6001 for 60 seconds on device i40iw3)

On the Server:
ib_read_bw -I 48 -F -D 60 -R -p 6001 -s 4096 -d i40iw3

On the Client:
ib_read_bw -I 48 -F -D 60 -R -p 6001 -s 4096 -d i40iw3 <IP address of server port>

Latency test example:
(1 byte message size test on port 6002 for 60 seconds on device i40iw3):

On the Server:
ib_write_lat -I 48 -F -D 60 -R -p 6002 -s 1 -d i40iw3

On the Client:
ib_write_lat -I 48 -F -D 60 -R -p 6002 -s 1 -d i40iw3 <IP address of server port>


Interoperability
================

To interoperate with Chelsio iWARP devices:

Load Chelsio T4/T5 RDMA driver (iw_cxgb4) with parameter
dack_mode set to 0.

	modprobe iw_cxgb4 dack_mode=0

If iw_cxgb4 is loaded at system boot, create the file
/etc/modprobe.d/iw_cxgb4.conf with the following entry:

	options iw_cxgb4 dack_mode=0

Reload iw_cxgb4 for the new parameters to take effect.


RDMA Statistics
===============

Use the following command to read RDMA protocol statistics:
  for f in /sys/class/infiniband/i40iw0/proto_stats/*; do
      echo -n "$(basename "$f"): "; cat "$f"
  done

The following counters will increment when RDMA applications are transferring
data over the network:
  - ipInReceives
  - tcpInSegs
  - tcpOutSegs
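One way to confirm the counters are moving is to snapshot them before and after
a transfer and diff the two snapshots. The sketch below assumes a POSIX shell
and uses i40iw0 only as an example device name:

```shell
# Capture every counter in a proto_stats directory as "path:value" lines.
snapshot() {
    grep -H . "$1"/*
}

# Example usage:
#   snapshot /sys/class/infiniband/i40iw0/proto_stats > before.txt
#   ...run an RDMA workload (e.g. rping or ib_read_bw)...
#   snapshot /sys/class/infiniband/i40iw0/proto_stats > after.txt
#   diff before.txt after.txt
```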


Known Issues/Troubleshooting
============================

RDMA fails perftest
-------------------
When testing iWARP devices with perftest 4.4-0.5, most devices will fail. See
https://github.com/linux-rdma/perftest/issues/52 for details.


Incompatible Drivers in initramfs
---------------------------------
There may be incompatible drivers in the initramfs image. You can either
update the image or remove the drivers from initramfs.

Specifically look for i40e, ib_addr, ib_cm, ib_core, ib_mad, ib_sa, ib_ucm,
ib_uverbs, iw_cm, rdma_cm, rdma_ucm in the output of the following command:
  lsinitrd | less
If you see any of those modules, rebuild initramfs with the following
command and include the name of the module in the "" list. Below is an
example:
  dracut --force --omit-drivers "i40e ib_addr ib_cm ib_core ib_mad ib_sa
  ib_ucm ib_uverbs iw_cm rdma_cm rdma_ucm"
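To spot these modules quickly, you can filter the lsinitrd listing. This is a
sketch that assumes module paths in the listing end in .ko (possibly with a
compression suffix such as .ko.xz):

```shell
# Print any potentially conflicting module paths found on stdin.
conflicting_modules() {
    grep -E '/(i40e|ib_addr|ib_cm|ib_core|ib_mad|ib_sa|ib_ucm|ib_uverbs|iw_cm|rdma_cm|rdma_ucm)\.ko'
}

# Example usage:
#   lsinitrd | conflicting_modules
```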


Unsupported and Discontinued Items
==================================

Support for libi40iw has been discontinued.

i40iw does not support NFSoRDMA.


Intel(R) Ethernet Connection X722 iWARP RDMA VF driver discontinued
-------------------------------------------------------------------
Support for the Intel(R) Ethernet Connection X722 iWARP RDMA VF driver
(i40iwvf) has been discontinued. There is no change to the Linux iWARP RDMA PF
driver (i40iw).


Support
=======
For general information, go to the Intel support website at:
http://www.intel.com/support/

or the Intel Wired Networking project hosted by Sourceforge at:
http://sourceforge.net/projects/e1000

If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
to e1000-rdma@lists.sourceforge.net


License
-------

This software is available to you under a choice of one of two
licenses. You may choose to be licensed under the terms of the GNU
General Public License (GPL) Version 2, available from the file
COPYING in the main directory of this source tree, or the
OpenFabrics.org BSD license below:

  Redistribution and use in source and binary forms, with or
  without modification, are permitted provided that the following
  conditions are met:

  - Redistributions of source code must retain the above
    copyright notice, this list of conditions and the following
    disclaimer.

  - Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials
    provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


Copyright(c) 2016-2019 Intel Corporation.


Trademarks
==========
Intel is a trademark or registered trademark of Intel Corporation or its
subsidiaries in the United States and/or other countries.

* Other names and brands may be claimed as the property of others.


