

### **Nsight Product Family**

Nsight Systems - Analyze application algorithms system-wide

Nsight Compute - Analyze CUDA kernels

Nsight Graphics - Debug/analyze graphics workloads

#### **Workflow**



)



#### Overview



Interactive CUDA Kernel profiler

Targeted metric sections for various performance aspects

Customizable data collection and presentation (tables, charts, ...)

**UI** and Command Line

Python-based API for guided analysis and post-processing

Support for remote profiling across machines and platforms



## **Profiling Activities**

#### Interactive Profile



#### (Non-interactive) Profile



Cancel Reset Activity

#### Command Line









Detailed memory workload analysis chart and tables

Shows transferred data or throughputs

Tooltips provide metric names, calculation formulas and detailed background info



Comparison of results directly within the tool with "Baselines"

Supported across kernels, reports, and GPU architectures



Source/PTX/SASS analysis and correlation

Source metrics per instruction and aggregated (e.g. PC sampling data)

Metric heatmap

### **Occupancy Calculator**



## Command Line Collection and Analysis

Example 1: collecting the "full" sections set for 10 instances of specific kernels, writing results to a uniquely named report

```
$ ncu --set full -k "regex:device_tea_leaf_ppcg_solve_(calc|update).*" -c 10 -f -o
\ tea_leaf_%i ./tea_leaf
```

#### Example 2: printing selected metrics for two kernels in raw CSV format from a pre-collected report



### The Application

#### Get the source code

```
$ git clone --depth 1 https://github.com/UK-MAC/TeaLeaf CUDA
```

#### Get compiler, (Open)MPI and CUDA. Data collected on GA100 with:

```
CUDA 11.5, OpenMPI, Nsight-Compute/2022.1.1
```

Update Makefile for target architecture (e.g. AMPERE, SM 8.0) and compiler/libraries (as necessary)

#### Instrument with NVTX (as necessary)

```
nvtxRangePush("update_and_calc");
nvtxRangePush("update");
device_tea_leaf_ppcg_solve_update_r
<<<matrix_power_grid_dim, block_shape>>>
[...]
nvtxRangePop();

nvtxRangePush("calc");
device_tea_leaf_ppcg_solve_calc_sd_new
<<<matrix_power_grid_dim, block_shape>>>
[...]
nvtxRangePop();
nvtxRangePop();
nvtxRangePop();
```

### Overview with Nsight Systems

Profile with Nsight Systems to identify best CUDA kernel optimization targets

--> Focus on device\_tea\_leaf\_ppcg\_solve\_(calc|update).\* kernels



## **Profiling with Nsight Compute**

#### Interactive Profiling Example





## **Profiling with Nsight Compute**

Collected 10 instances of these kernels with ncu command line, "full" set of metrics. Inspect the resulting report in the Nsight Compute UI (ncu-ui).

Follow the rule links and guidance for best experience.

Summary page confirms that all instances of each respective kernel have similar performance characteristics - focus on a single instance for each



Switch to the *Details* page and select the second kernel (with slightly worse throughputs). Memory units are over-utilized.

Roofline shows that floating point performance is memory-bound (left of ridge point)



MWA table shows bandwidth 80% utilized, chart shows high Device-to-L2 utilization



Scheduler stats show low eligible/issued, need to check stall reasons

Good achieved occupancy, doesn't appear to be the issue

Stall reasons dominated by long scoreboard, locate using Source Counters section



MWA found sub-optimal cache access patterns, locate using Source Counters section

Source Counters show uncoalesced accesses and location of the stalls

Jump to Source page via this link



Stalled at DMUL instruction, waiting for LDG (load global) in line 43 (via register R10) LDG instructions are uncoalesced Lots of excessive (non-ideal) L2 sector accesses



Where is this in the code?

Need to add -lineinfo flag in Makefile during compilation (NV\_FLAGS) for CUDA-C/SASS correlation.

Re-compile, re-run

Consider using --import-source yes

```
View: Source and SASS ▼
Source: tea_leaf_ppcg.cuknl
                          ▼ 🖪 Find...
                                                                                                Source: device_tea_leaf_ppcg_solve_calc_sd_new >
Navigation: Warp Stall Sampling (All Cycles)
                                                                                                Navigation: Warp Stall Sampling (All Cycles)
                                                                          Warp Stall Sampling a
                                                                                                                                                                           Warp Stall Sampling
     # Source
                                                                                 (All Cycles) 1
                                                                                                     # Address
                                                                                                                                                                                   (All Cycles
                                        + beta[step]*rtemp[THARR2D(
                                                                                                                                IMAD RO, R4, UR4, R3
                                                                                                   37 00001476 e6fc4f40
                    utemp[THARR2D(0, 0, 0)] += sd[THARR2D(0, 0, 6
                                                                                                   38 00001476 e6fc4f50
                                                                                                                                IMAD.WIDE R2, R15, R2, c[0x0][0x1e0]
                                                                                                                                LDG.E.64.CONSTANT R8, [R8.64]
                                                                                                   39 00001476 e6fc4f60
                                                                                                   48 88881476 e6fc4f78
                                                                                                                                IMAD.WIDE R10, R0, R15, c[0x0][0x1a8]
           else if (PRECONDITIONER = TL_PREC_NONE)
                                                                                                                                LDG.E.64.CONSTANT R2, [R2.64]
                                                                                                   41 00001476 e6fc4f80
                                                                                                   42 00001476 e6fc4f90
                                                                                                                                IMAD.WIDE R4, R0, R15, c[0x0][0x198]
                                                                                       492
                                                                                                                                LDG.E.64.CONSTANT R10, [R10.64]
               if (WITHIN_BOUNDS)
                                                                                                   43 00001476 e6fc4fa0
                                                                                                   44 00001476 e6fc4fb0
                    sd[THARR2D(0, 0, 0)] = alpha[step]*sd[THARR2D
                                                                                                   45 00001476 e6fc4fc0
                                        + beta[step]*rtemp[THARR2D(
                                                                                                                                DFMA R6, R6, R2, R12
                                                                                                                                                                                        650
                                                                                                   46 00001476 e6fc4fd0
                    utemp[THARR2D(0, 0, 0)] += sd[THARR2D(0, 0, 6
                                                                                                                                IMAD.WIDE R12, R0, R15, c[0x0][0x1b0]
                                                                                                   47 00001476 e6fc4fe0
                                                                                                   48 00001476 e6fc4ff0
                                                                                                   49 00001476 e6fc5000
                                                                                                                                DADD R14, R6, R14
                                                                                                   50 00001476 e6fc5010
                                                                                                   51 00001476 e6fc5020
                                                                                                   52 00001476 e6fc5030
   247 /* New ppcg_store_r */
                                                                                                                                ULDC UR7, c[0x0][0x170]
                                                                                                                                                                                       1,409
                                                                                                   53 00001476 e6fc5040
                                                                                                                                ULDC UR4, c[0x0][0x16c]
                                                                                                   54 00001476 e6fc5050
   249 __global__ void device_tea_leaf_ppcg_store_r
                                                                                                   55 00001476 e6fc5060
                                                                                                                                UIADD3 UR4, UR7, UR4, URZ
                                                                                                                                ULDC UR6, c[0x0][0x18c]
   250 (kernel_info_t kernel_info,
                                                                                                   56 00001476 e6fc5070
```



## Collecting Data

By default, CLI results are printed to stdout

Use --export/-o to save results to a report file, use -f to force overwrite

\$ ncu -f -o \$HOME/my\_report <app>
\$ my report.ncu-rep

Use --log-file to pipe text output to a different stream (stdout/stderr/file)

Can use (env) variables available in your batch script or file macros to add report name placeholders Full parity with nvprof filename placeholders/file macros

https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#command-line-options-file-macros



#### What To Collect

#### Curated "sets" and "sections" with commonly-used, high-value metrics

```
$ ncu --list-sets
Identifier Sections
                                                                                        Estimated Metrics
default
                                                                                        35
           LaunchStats, Occupancy, SpeedOfLight
detailed
           ComputeWorkloadAnalysis, InstructionStats, LaunchStats, MemoryWorkloadAnaly 157
           sis, Occupancy, SchedulerStats, SourceCounters, SpeedOfLight, SpeedOfLight
           RooflineChart, WarpStateStats
full
           ComputeWorkloadAnalysis, InstructionStats, LaunchStats, MemoryWorkloadAnaly 162
           sis, MemoryWorkloadAnalysis Chart, MemoryWorkloadAnalysis Tables, Occupancy
           , SchedulerStats, SourceCounters, SpeedOfLight, SpeedOfLight RooflineChart,
            WarpStateStats
           SourceCounters
                                                                                        47
source
```

#### Use defaults, or combine as desired

```
$ ncu --set default --section SourceCounters --metrics sm inst executed pipe tensor.sum ./my-app
```

#### What To Collect

#### Query metrics for any targeted chip

```
$ ncu --query-metrics --chip ga100
smsp warps issue stalled not selected
                                                                            cumulative # of warps waiting
for the microscheduler to select the warp to issue
smsp warps issue stalled selected
                                                                            cumulative # of warps selected
by the microscheduler to issue an instruction
smsp warps issue stalled short scoreboard
                                                                            cumulative # of warps waiting
for a scoreboard dependency on MIO operation other than (local, global, surface, tex)
tpc cycles active
                                                                            # of cycles where TPC was active
tpc cycles elapsed
                                                                            # of cycles where TPC was active
==PROF== Note that these metrics must be appended with a valid suffix before profiling them. See --help for
more information on --query-metrics-mode.
```

#### Specify sub-metrics in section files, or on the command line

```
$ ncu --query-metrics-mode suffix --metrics sm__inst_executed_pipe_tensor ./my-app
sm__inst_executed_pipe_tensor.sum
sm__inst_executed_pipe_tensor.avg
sm__inst_executed_pipe_tensor.min
```

#### **Source Analysis**

SASS (assembly) is always available, embedded into the report CUDA-C (Source) and PTX availability depends on compilation flags Use -lineinfo to include source/SASS correlation data in the binary

Source is not embedded in the report by default, need local or remote access to the source file to resolve in the UI. Import source during collection to (--import-source yes) to solve this.

Compiler optimizations can prevent exact source/SASS correlation

## Replay Modes

https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#replay





Kernel Replay (interactive and non-interactive)

Range Replay (non-interactive)



Application Replay (non-interactive)



#### Conclusion

Nsight Compute enables detailed CUDA kernel analysis

Rules give guidance on optimization opportunities and help metric understanding

Limit metrics to what is required when overhead is a concern. Consider using application replay.

Still requires level of hardware understanding to fully utilize the tool - pay attention to rule results and refer to <a href="https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html">https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html</a>

Analyze results in the UI, or post-process with CSV output or python report interface

Check known issues: <a href="https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#known-issues">https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#known-issues</a>

## **Further Reading**

**Download** <a href="https://developer.nvidia.com/nsight-compute">https://developer.nvidia.com/nsight-compute</a> (can be newer than toolkit version)

**Documentation** <a href="https://docs.nvidia.com/nsight-compute">https://docs.nvidia.com/nsight-compute</a> (and local with the tool)

https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html

Forums <a href="https://devtalk.nvidia.com">https://devtalk.nvidia.com</a>

Further Reading <a href="https://developer.nvidia.com/nsight-compute-videos">https://developer.nvidia.com/nsight-compute-videos</a>

https://developer.nvidia.com/nsight-compute-blogs

https://github.com/NVIDIA/nsight-training

Repository with interactive training material for multiple Nsight tools, including

Systems and Compute.

https://gitlab.com/NERSC/roofline-on-nvidia-gpus





#### **Python Interfaces**





On a single-node submission, one Nsight Compute instance can profile all launched child processes

Data for all processes is stored in one report file

ncu --target-processes all -o <singlereport-name> <app> <args>



On multi-node submissions, one tool instance can be used per node

Ensure that instances don't write to the same report file on a shared disk

```
ncu -o report_%q{OMPI_COMM_WORLD_RANK}
<app> <arqs>
```



Multiple tool instances on the same node are supported, but...

All kernels across all GPUs will be serialized using system-wide file lock



# Consider profiling only a single rank, e.g. using a wrapper script

```
#!/bin/bash
if [[ "$OMPI_COMM_WORLD_RANK" == "3" ]] ; then
    /sw/cluster/cuda/11.1/ nsight-compute/ncu -
o report_${OMPI_COMM_WORLD_RANK} --target-
processes all $*
else
    $*
fi
```