# Programming for Hybrid: Untangling your Threads

Leigh Davies, Senior Game/Graphics Application Engineer

03/2023

# What is a Hybrid SOC (System On a Chip)?

### Combines Performance Cores and Efficient Cores

- Two core types with different power and performance characteristics
- Both core types have the same ISA support
  - No AVX512, TSX
  - New AVX-VNNI, UMWAIT/TPAUSE

#### Performance Cores

- Concentrate on single and limited threading scenarios
- Performance intensive

#### Efficient Cores

- Concentrate on MT throughput and power limited scenarios
- Efficiency focused

# Intel® Thread Director (HGS+)

- Hardware unit: intelligence built directly into the core
- Monitors the runtime instruction mix of each thread, as well as the state of each core, with nanosecond precision
- Provides runtime feedback to the OS to make the optimal scheduling decision for any workload or workflow, based on ISA and other inputs
- Dynamically adapts guidance based on the thermal design point, operating conditions, and power settings, without any user input

# CPU Specs Progression

| Features | Rocket Lake (i9-11900K) | Alder Lake (i9-12900K) | Raptor Lake (i9-13900K) |
|----------|-------------------------|------------------------|-------------------------|
| Lithography | 14 nm | Intel 7 | Intel 7 |
| P-cores | 8/16 | 8/16 | 8/16 |
| Base Freq | 3.5 GHz | 2.4 GHz | 3.0 GHz |
| Max Turbo Freq | 5.3 GHz | 5.2 GHz | 5.8 GHz |
| L1D\$/I\$ | 48/32 KB | 48/32 KB | 48/32 KB |
| L2U\$ | 512 KB | 1280 KB | 2048 KB |
| L3U\$ | 16 MB | 30 MB | 36 MB |
| E-cores | 0 | 8 | 16 |
| L2U\$ (4x E-core) | N/A | 2048 KB | 4096 KB |
| Max Mem Size | 128 GB | 128 GB | 128 GB |
| Mem Type | DDR4-3200 | DDR5-4800 | DDR5-5600 |
| Mem B/w | 50 GB/s | 76.8 GB/s | 89.6 GB/s |

# Hyper-Threading Recap

Important for later in the talk.

Hyper-Threading (Simultaneous Multi-Threading)

#### SMT and instruction throughput

- Improves Core CPI (Clockticks Per Instruction)
- Potentially degrades Thread CPI

"E-cores are designed to provide better performance than a logical P-core with both hardware sibling hyper-threads busy."

(Figure: each box represents a processor execution unit.)

# Changing Our Assumptions ...
#### All cores have the same performance profile

- Significant performance delta between cores
- Same ISA != same throughput

#### All cores have the same frequency

- There may be one, two, or more, faster cores
- The fastest core may move around the package

#### Hyper Threading doubles the physical core count

- Hyper-Threading may be available on only some cores in a package
- Logical core count may not equal 2x physical core count

#### Optimizing the CPU only matters if CPU Bound

- Power may be shared between GPU/CPU/Other -> frequency impact

# CPU Topology

All cores are exposed to the OS as individual Logical Processors.

### Preferred Enumeration method: GetLogicalProcessorInformationEx()

- struct \_PROCESSOR\_RELATIONSHIP:
  - Field: EfficiencyClass << Higher means more perf
  - Note: This is relative to other logical processors in the system.
  - For 12th/13th Gen Intel Core, EfficiencyClass=1 is P-Cores.
- struct \_CACHE\_RELATIONSHIP:
  - Field: Level << The cache level
  - Field: Type << The cache type (data, instruction, etc.)
  - Field: GroupMask.Mask << LPs connected to the cache
  - Note: Even cores with the same EfficiencyClass can have different cache configurations.

#### Example Code

```
#include <windows.h>
#include <cstdint>
#include <cstdio>
#include <vector>

using pSLPI_EX = SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*;

int main()
{
    // The first call fails by design and reports the required buffer size in bytes.
    DWORD size = 0;
    GetLogicalProcessorInformationEx(RelationAll, nullptr, &size);
    std::vector<uint8_t> buffer(size);

    if (!GetLogicalProcessorInformationEx(RelationAll, (pSLPI_EX)buffer.data(), &size))
        return 1;

    // Records are variable-sized, so advance by each record's Size field.
    for (size_t i = 0; i < size;)
    {
        pSLPI_EX procInfo = (pSLPI_EX)&buffer[i];
        switch (procInfo->Relationship)
        {
        case RelationProcessorCore:
            for (WORD g = 0; g < procInfo->Processor.GroupCount; ++g)
            {
                // Every set bit in the mask is one logical processor of this core.
                KAFFINITY mask = procInfo->Processor.GroupMask[g].Mask;
                for (unsigned lp = 0; mask != 0; ++lp, mask >>= 1)
                    if (mask & 1)
                        printf("Group %u LP %u: EfficiencyClass %u\n",
                               (unsigned)procInfo->Processor.GroupMask[g].Group, lp,
                               (unsigned)procInfo->Processor.EfficiencyClass);
            }
            break;
        case RelationCache:
            // The mask lists the logical processors that share this cache.
            printf("L%u cache, type %d, LP mask 0x%llx\n",
                   (unsigned)procInfo->Cache.Level, (int)procInfo->Cache.Type,
                   (unsigned long long)procInfo->Cache.GroupMask.Mask);
            break;
        default:
            break;
        }
        i += procInfo->Size;
    }
    return 0;
}
```

## Thread Scheduling

#### No central scheduler

- Scheduling routines are called whenever events occur that change the state of a thread
- Example scheduling events include:
  - A thread becomes ready to execute (newly created or released from wait state)
  - A thread enters a wait state or ends
  - Interval timer interrupts
  - Other hardware interrupts (for I/O wait completion)
  - Quantum end
  - A thread priority is changed
  - Thread QoS changes
  - System concurrency/utilization changes (causing parking)
  - Intel® Thread Director updates

# Scheduling Priorities

- A process has only a single base priority value
- Each thread has two priority values: current (dynamic) and base
- Scheduling decisions are made based on the current priority
- Under certain circumstances the system increases the priority of threads in the dynamic range (1 through 15) for brief periods

Base priority by process priority class (columns) and thread relative priority (rows):

| Relative Priority | Real time | High | Above Normal | Normal | Below Normal | Idle |
|-------------------------------|----|----|----|----|----|----|
| THREAD_PRIORITY_TIME_CRITICAL | 31 | 15 | 15 | 15 | 15 | 15 |
| THREAD_PRIORITY_HIGHEST | 26 | 15 | 12 | 10 | 8 | 6 |
| THREAD_PRIORITY_ABOVE_NORMAL | 25 | 14 | 11 | 9 | 7 | 5 |
| THREAD_PRIORITY_NORMAL | 24 | 13 | 10 | 8 | 6 | 4 |
| THREAD_PRIORITY_BELOW_NORMAL | 23 | 12 | 9 | 7 | 5 | 3 |
| THREAD_PRIORITY_LOWEST | 22 | 11 | 8 | 6 | 4 | 2 |
| THREAD_PRIORITY_IDLE | 16 | 1 | 1 | 1 | 1 | 1 |

# Thread Scheduling

To improve scalability, Windows 8+ added Shared Ready Queues, which reduce contention on the Ready Queue.
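As a rough illustration of the Scheduling Priorities table, the sketch below encodes it as a lookup and adds the clamping rule for boosts in the dynamic range. The names `PriorityClass`, `RelativePriority`, `basePriority` and `boostedPriority` are hypothetical and are not Windows API identifiers; a real application sets the class and relative priority through the OS and lets the scheduler derive the value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only; not Windows API types.
enum class PriorityClass { Realtime, High, AboveNormal, Normal, BelowNormal, Idle };
enum class RelativePriority { TimeCritical, Highest, AboveNormal, Normal, BelowNormal, Lowest, Idle };

// Rows follow RelativePriority, columns follow PriorityClass.
// Values are copied from the Scheduling Priorities table.
constexpr int kBasePriority[7][6] = {
    {31, 15, 15, 15, 15, 15},  // THREAD_PRIORITY_TIME_CRITICAL
    {26, 15, 12, 10, 8, 6},    // THREAD_PRIORITY_HIGHEST
    {25, 14, 11, 9, 7, 5},     // THREAD_PRIORITY_ABOVE_NORMAL
    {24, 13, 10, 8, 6, 4},     // THREAD_PRIORITY_NORMAL
    {23, 12, 9, 7, 5, 3},      // THREAD_PRIORITY_BELOW_NORMAL
    {22, 11, 8, 6, 4, 2},      // THREAD_PRIORITY_LOWEST
    {16, 1, 1, 1, 1, 1},       // THREAD_PRIORITY_IDLE
};

constexpr int basePriority(PriorityClass cls, RelativePriority rel) {
    return kBasePriority[static_cast<std::size_t>(rel)][static_cast<std::size_t>(cls)];
}

// Assumption for this sketch: a boost only applies in the dynamic range (1 through 15)
// and is clamped at 15; priorities outside that range are returned unchanged.
constexpr int boostedPriority(int current, int boost) {
    if (current < 1 || current > 15) return current;
    return current + boost > 15 ? 15 : current + boost;
}
```

For example, `basePriority(PriorityClass::Normal, RelativePriority::AboveNormal)` reads the Normal column of the ABOVE_NORMAL row, and a boost applied to a priority 14 thread stops at 15 rather than entering the real-time range.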
#### Priority driven, preemptive

- Ready Queue consists of:
  - 32 queues (FIFO lists) of "ready" threads
  - UP: Highest priority thread always runs
  - MP: One of the highest priority runnable threads will be running somewhere
- Threads run for an amount of time called a quantum
  - Can be cut short due to preemption by a higher priority thread
- The system treats all threads with the same priority as equal
  - No attempt to share processor(s) "fairly" among processes, only among threads

Current Hybrid systems have up to 8 LPs per group.

# Scheduling Scenarios: On wake

- If the newly-Ready thread is not of higher priority than the Running thread, it is put at the tail of the Ready queue for its current priority
- If priority >= 14, the quantum is reset
- If priority < 14 and the thread is about to be boosted and didn't already have a boost, the quantum is set to process quantum - 1

# Scheduling Scenarios: Preemption

- A thread becomes Ready at a higher priority than the running thread and all processors are busy
- Lower-priority Running thread is preempted
- Preempted thread goes back to the <u>head</u> of its Ready queue
- Action: pick lowest priority thread to preempt

# Scheduling Scenarios: Voluntary switch

- Waiting on a dispatcher object
- Termination
- Explicit lowering of priority
- Action: scan for next Ready thread (starting at your priority & down)

# Scheduling Scenarios: Quantum End

## Running thread experiences quantum end

- Priority is decremented unless already at thread base priority
- Thread goes to the <u>tail</u> of the Ready queue for its new priority
- May continue running if no equal or higher-priority threads are Ready
- May migrate to another LP's ready queue and start running there
- Action: pick next thread at same priority level

## Priority Boosts

Reference: Windows Internals, 7th Edition, Part 1: System architecture, processes, threads, memory management, and more.

### Windows periodically adjusts the current dynamic priority of threads; reasons include:

- Scheduler/dispatcher events:
  - An event is pulsed
  - A mutex/semaphore was released/abandoned
  - A timer was set
- Other hardware interrupts
- Has been in the ready queue a long time
- A thread was alerted/suspended/resumed...

#### Behavior of these boosts:

- Applied to the thread's current priority; will not take you above priority 15
- After a boost, you get one quantum, then the priority decays 1 level and the thread runs another quantum

# Intel® Thread Director Background

#### Thread Scheduling Overview

- Processors that support x86 hybrid architecture are categorized on their performance and efficiency.
- Intel Thread Director provides a hint to the OS as to the thread that will benefit most from placement on a specific LP.
- This hint is used within the same or lower QoS/Priority threads.
- HW periodically writes a feedback table (EHFI).

# Intel® Core™ Processor Windows Scheduling/Parking Background

#### Windows Core Parking Engine

- Makes global scalability decisions about the workload and determines the optimum set of compute cores for execution.
  - Max Turbo vs All core frequency
  - Enhance battery life
  - Prioritize shared resources
  - Etc.

Power Management Settings related to Scheduling / Parking (vary by power plan):

- CPMinCores: Specifies the minimum percentage of logical processors that can be in the unparked state at any given time.
- CPMaxCores: Specifies the maximum percentage of logical processors that can be in the unparked state at any given time.
- CPIncreaseTime: The minimum time that must elapse before additional logical processors can be transitioned from the parked to the unparked state.
- CPDecreaseTime: The minimum time that must elapse before additional logical processors can be transitioned from the unparked to the parked state.
- CPHeadroom: Specifies the additional utilization that would cause the core parking engine to unpark an additional parked logical processor.

# Single Thread Scenario

- The following example shows Windows leveraging an Intel core for single thread performance.
- This behavior is dynamically achieved when Logical Processor (LP) 0 has the highest performance capability.

### Limited Threaded Scenario

- The following example shows an example scheduling behavior in a limited software thread scenario.
- This behavior is dynamically achieved by the Windows scheduler/parking engine when P-Cores are more performant than the E-Cores, and E-Cores are more performant than the SMT sibling of a busy core.
- When the capabilities dynamically change, Windows automatically accounts for this for optimal scheduling.
- Favored core priority is given to focus application threads, high priority threads or long duration threads.

### Multi Threaded Scenario

- All cores are used by Windows in multithread scenarios.
- In power/thermal constraint scenarios, there may be times when not all cores are used, for optimal system performance/efficiency.
- The behavior is dynamically achieved by hardware providing feedback to Windows, and Windows automatically acting on that feedback.
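The ordering described in the limited-thread scenario (P-Cores first, then E-Cores, then the SMT siblings of busy P-Cores) can be sketched as a static sort. This is only an illustration under simplifying assumptions: the `LogicalProcessor` struct and `fillOrder` function are hypothetical, nothing is parked, and the ranking is fixed, whereas the real scheduler reacts to Thread Director feedback, priority, QoS and parking at run time.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one logical processor (LP).
struct LogicalProcessor {
    int id;               // LP index
    int efficiencyClass;  // higher means more performant (1 = P-Core on 12th/13th Gen)
    bool smtSibling;      // true for the second hardware thread of a P-Core
};

// Returns LP ids in the order a limited-thread workload would fill them,
// assuming a fixed ranking: first hardware thread of each P-Core,
// then E-Cores, then SMT siblings last.
std::vector<int> fillOrder(std::vector<LogicalProcessor> lps) {
    std::stable_sort(lps.begin(), lps.end(),
        [](const LogicalProcessor& a, const LogicalProcessor& b) {
            if (a.smtSibling != b.smtSibling) return !a.smtSibling;  // siblings go last
            return a.efficiencyClass > b.efficiencyClass;            // P-Cores before E-Cores
        });
    std::vector<int> order;
    for (const LogicalProcessor& lp : lps) order.push_back(lp.id);
    return order;
}
```

With two P-Cores (LPs 0 to 3, odd ids being the SMT siblings) and two E-Cores (LPs 4 and 5), the function returns the P-Core primaries, then the E-Cores, then the siblings. An engine could use the same idea to decide how many workers to create before SMT contention is likely.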
# Example combination of P-cores and E-cores

### Low Power Scenario

In certain scenarios, such as low power envelope SKUs or better battery life goals, it can be more efficient to run low utilization work on cores with higher efficiency capability at an efficient frequency.

# Simplified Processor/Ready Queue Selection

(Flowchart slide; figure not reproduced.)

# Software Enabling for Hybrid

The OS scheduler will move threads based on their priority, QoS and performance/efficiency HW metrics.

### OS scheduler will try to assign work based on:

- Most performant cores are used first for single-thread & multi-thread performance
- Spill-over multithreaded work uses additional physical cores for MT performance
- SMT siblings are used last to avoid any contention impacting performance

#### Core Parking

- OS will park inactive and lightly utilized logical processors.
  - Saves power, or allows higher frequencies for running processors.
  - Recent changes have made the OS more aggressive at parking for DC scenarios.
  - AC scenarios are different due to softparklatency tunings.

#### Avoid Hard Affinities

- Use OS hints for soft affinity.
- Any ISV code providing affinity could potentially see perf degrades, as the OS cannot override the decision.
- QoS and SetThreadIdealProcessor help determine where the OS will queue a thread to run.
- CPUSets API: this API takes HGS "do not use" hints into account and breaks affinity.

#### Workload Scalability

- Not all workloads scale with increased core count.
- Increased threads add overhead in context switches and synchronization APIs, and reduce the cache and shared hardware resources available to each thread.
- Scale thread count based on workload benefits.
- Use GetLogicalProcessorInformationEx to scale the application based on best fit to hardware.

# Profiling Hybrid Games

- **VTune 2022** (VTune download): CPI, Cache, Thread Director
- **WPA** (Windows SDK): SMT Usage, Concurrency, OS Behaviour
- **GPU/CPU Concurrency**: PresentMon (Github://Presentmon) and GPUView
  - https://learn.microsoft.com/en-us/windows-hardware/drivers/display/using-gpuview
  - https://developer.amd.com/wordpress/media/2012/10/Using%20GPUView%20to%20Understand%20your%20DirectX%2011%20Game.pps

# Thanks to IO Interactive

- Example data collected from Hitman 3.
- Reproduced with permission from IO Interactive.
- Used to show profiling data only; the title interacts well with the OS.

### CPU or GPU Bound?

### GPU bound will affect timings of CPU threads

- When idling waiting on the GPU, the CPU will drop into lower C-states, which lowers CPU performance
- Changes thread concurrency
- Frame latency hides scheduling issues

#### Capture timings with PresentMon:

< presentmon -track\_gpu -captureall -multi\_csv -timed 20 -terminate\_after\_timed >

#### Or

- Capture with
- View merged.etl in GPUView
- Post process with

< presentmon -track\_gpu -etl\_file merged.etl -multi\_csv >

#### PresentMon output

- msBetweenPresents = CPU timings
- msUntilDisplayed = Display latency
- msGPUActive = GPU timing

# Viewing GPU Data

#### Simple to plot timings from PresentMon

- GPU 1 ms faster than CPU
- Big CPU frame time variations
- Areas of interest: possible IO stalls, memory paging

# Understanding Concurrency

#### WPA: Timeline by Process, Thread

#### CPU Concurrency:

Number of simultaneously active threads at any point in a frame.

#### Xperf/WPA:

Precise view, tracks kernel events in ETL files. Provides a fine-grained view of individual threads.

Sampled concurrency views like VTune don't provide enough detail on concurrency at the OS event level.

# Understanding WPA (1/5)

A lot of customisable views into OS/hardware level data

- Computation
  - CPU Usage
    - Timeline by Process, Thread (Precise) ← Tracks events rather than using sampling

Thread-level timings

# Understanding WPA (2/5)

A lot of customisable views into OS/hardware level data

# Understanding WPA (3/5)

A lot of customisable views into OS/hardware level data

Logical processor state (for example, Park)

## Understanding WPA (4/5)

A lot of customisable views into OS/hardware level data

- Generic Events
  - Sorted by Service Provider
    - Microsoft-Windows-DirectD3D12
    - Microsoft-Windows-Kernel-Processor-Power
    - Microsoft-Windows-DXGI

Useful to track where GPU commands are issued

## Understanding WPA (5/5)

- Thread Timeline
- Core Parking
- DX12 Events

### Thread Execution

CPU: Timeline by Process, Thread

#### Add Cpu + CPU Usage(ms) Sum

#### Sort By CPU Usage (Sum)

- Break down thread execution time by core type
- Long duration threads should favour Performance cores
- WPA tables can be copied to Excel and graphed

https://devblogs.microsoft.com/performance-diagnostics/wpa-intro/

## Thread Ready/Wait times

Are threads efficiently scheduled? How long do they wait and is the OS able to schedule them?
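One way to answer these questions outside WPA, assuming the context-switch rows have been copied out of the tool, is to aggregate ready and wait time per thread. The sketch below is not a WPA API: the `SwitchRecord` layout, the `ThreadSummary` fields and the `summarize` function are hypothetical names chosen for illustration, with all times in microseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical exported row: one switch-in of a thread, times in microseconds.
struct SwitchRecord {
    int threadId;
    double readyUs;  // time spent Ready (runnable but not running) before this switch-in
    double waitUs;   // time spent waiting before the thread became Ready
};

// Per-thread totals: sum, max and number of switch-ins.
struct ThreadSummary {
    double readySumUs = 0.0;
    double waitSumUs = 0.0;
    double readyMaxUs = 0.0;
    double waitMaxUs = 0.0;
    int count = 0;
};

// Groups rows by thread id and accumulates the summary fields.
std::map<int, ThreadSummary> summarize(const std::vector<SwitchRecord>& rows) {
    std::map<int, ThreadSummary> out;
    for (const SwitchRecord& r : rows) {
        ThreadSummary& s = out[r.threadId];
        s.readySumUs += r.readyUs;
        s.waitSumUs += r.waitUs;
        s.readyMaxUs = std::max(s.readyMaxUs, r.readyUs);
        s.waitMaxUs = std::max(s.waitMaxUs, r.waitUs);
        ++s.count;
    }
    return out;
}
```

A large ready sum or ready max on a critical thread suggests the OS could not schedule it promptly, while a large wait sum points at dependencies inside the application.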
#### CPU Usage (Precise)

- Add CPU usage to table
- Move *Ready* & *Wait* up the table
- Reformat units

#### Hybrid system has:

- Smaller Wait time for main 2 threads
- Less Ready time on Render thread
- Fewer context switches for all main threads
- Smaller max. Ready time

Non-Hybrid System:

| Thread Job | New ThreadId | CPU Usage (ms) | Ready (ms) Sum | Waits (ms) Sum | Count | Ready (us) Max | Waits (us) Max | Count: Waits |
|----------|-------|----------|--------|----------|-------|-------|---------|-------|
| Render | 7444 | 20125.96 | 43.94 | 749.32 | 9226 | 381.3 | 50467.6 | 5044 |
| Game | 11256 | 19926.66 | 62.12 | 831.01 | 37873 | 426 | 3319.6 | 27374 |
| Worker 1 | 10376 | 8285.756 | 240.84 | 12276.28 | 75037 | 692.7 | 10014.7 | 71946 |
| Worker 2 | 10852 | 8233.223 | 237.80 | 12327.32 | 76566 | 762.7 | 10017.6 | 73448 |
| Worker 3 | 7976 | 8213.791 | 254.07 | 12326.72 | 75698 | 526.4 | 10013.5 | 72006 |

Hybrid System:

| Thread Job | New ThreadId | CPU Usage (ms) | Ready (ms) Sum | Waits (ms) Sum | Count | Ready (us) Max | Waits (us) Max | Count: Waits |
|----------|-------|----------|--------|----------|-------|-------|---------|-------|
| Render | 13968 | 20147.11 | 19.88 | 627.40 | 6731 | 136.2 | 6593.4 | 5225 |
| Game | 5048 | 20084.29 | 65.55 | 645.93 | 29891 | 125.1 | 2785.3 | 26094 |
| Worker 1 | 13324 | 7927.543 | 489.20 | 12406.89 | 67058 | 344.7 | 18072.9 | 63135 |
| Worker 2 | 14880 | 7589.991 | 714.03 | 12529.15 | 65649 | 334.2 | 17040.8 | 62649 |

The hybrid Worker 2 row is partly illegible in the source slide; its values are a best reading.

# Ideally compare against 2 systems

- i.e. Intel i9-12900K
- E-cores on vs off in BIOS

### Understanding Context switches

Context Switch is the process of changing the active thread on a processor.
- Overhead of changing architecture state
- Capture with log.cmd normal
- Load symbols (MSFT symbol server)
- Add NewThreadStack to table view

NewThreadStack: the stack of the new thread when it is switched in. Usually indicates what the thread was blocked or waiting on.

### Intel® VTune™ Profiler

Advanced sampling profiler allows you to quickly identify CPU bottlenecks causing slow frames and tasks.

- Hotspot Analysis: identifies functions consuming the most CPU time
- Thread Performance: visualizes thread behavior to quickly identify concurrency problems
- Instrumentation API: extensive API enables frame and task markup for better results

(Screenshot: VTune view grouped by Task Domain / Task Type / Function / Call Stack for a UE4 workload, with columns CPU Time, Instructions Retired, Microarchitecture Usage, CPI Rate and Task Time.)

### Finding Architectural Issues

Configure VTune™ for microarchitecture analysis:

- Small sampling interval.
- Can be run from the command line if preferred, with minimal overheads.
- Embed into application using a hotkey?
- Virtualization-based security limits VTune™ collection; disable it for collection of microarchitecture events.

### Useful VTune Metrics

| Metric | Description |
|--------|-------------|
| CPI Rate | Cycles per Instruction Retired (CPI): how much time each executed instruction took, in units of cycles. Modern superscalar processors issue up to four instructions per cycle, suggesting a theoretical best CPI of 0.25. |
| Cache Bound | Shows how often the machine was stalled on L1, L2 and L3 caches. This metric also includes coherence penalties for shared data. |
| Contested Accesses | Contested accesses occur when data written by one thread is read by another thread on a different core. Examples include synchronizations such as locks, true data sharing such as modified locked variables, and false sharing. |

### Microarchitecture by Core Type

Import results into the VTune UI and use a custom grouping to sort thread activity by core type.

### Relative Thread Perf. by Core Type

| Thread | Thread ID | P-Core CPI | E-Core CPI | P-Core Million Instructions/Second | E-Core Million Instructions/Second | Relative instructions per second |
|----------|-------|------|------|---------|---------|------|
| Render | 9496 | 0.93 | 2.18 | 5210.81 | 1664.37 | 0.32 |
| Game | 13964 | 1.67 | 2.68 | 2894.89 | 1350.75 | 0.47 |
| Worker 1 | 416 | 0.79 | 0.99 | 6140.13 | 3663.97 | 0.60 |
| Worker 2 | 13648 | 0.79 | 0.95 | 6093.55 | 3826.64 | 0.63 |
| Streamer | 14092 | 1.38 | 1.25 | 3487.70 | 2907.63 | 0.83 |
| Audio | 13640 | 1.43 | 1.23 | 3370.63 | 2935.93 | 0.87 |

Slide annotations (the E-Core column is marked "Frequency\*"):

- Render, Game: threads more efficient on P-Cores
- Worker 1, Worker 2: threads slightly more efficient on P-Cores
- Streamer, Audio: memory limited, core type doesn't matter

### Relative Thread Perf. by Core Type

Potentially unfair to compare P- and E-Cores: E-Cores are lowering sibling activity (see slide 8, Hyper-Threading Recap).

SMT statistics, percentage of time per core:

| Core ID | 8C/16T: Both Siblings Idle (%) | 8C/16T: Both Siblings Active (%) | Hybrid: Both Siblings Idle (%) | Hybrid: Both Siblings Active (%) |
|-------------|-------|-------|-------|---------|
| LP0 & LP1 | 44.94 | 29.75 | 45.44 | 20.34 |
| LP2 & LP3 | 43.19 | 34.73 | 31.47 | 23.08 |
| LP4 & LP5 | 42.18 | 36.61 | 43.44 | 22.66 |
| LP6 & LP7 | 43.11 | 36.66 | 45.92 | 23.2 |
| LP8 & LP9 | 1.99 | 45.28 | 11.46 | 28.22 |
| LP10 & LP11 | 44.01 | 37.17 | 46.6 | 23.71 |
| LP12 & LP13 | 2.14 | 45.93 | 13.15 | 27.92 |
| LP14 & LP15 | 43.63 | 35.79 | 46.38 | 23.17 |
| Average | | 37.74 | | 24.0375 |

| Thread | Thread ID | Hybrid P-Core CPI | Symmetric P-Core CPI | Hybrid vs Symmetric |
|----------|-------|-------|-------|------|
| Render | 9496 | 0.925 | 0.991 | 1.07 |
| Game | 13964 | 1.665 | 1.714 | 1.03 |
| Worker 1 | 416 | 0.785 | 0.888 | 1.13 |
| Worker 2 | 13648 | 0.791 | 0.904 | 1.14 |
| Streamer | 14092 | 1.382 | 1.415 | 1.02 |
| Audio | 13640 | 1.43 | 1.395 | 0.98 |

- 33% reduction in SMT work.
- 3-14% improvement in P-Core SMT.

### Thread Director Uncovered

- Analyse EHFI classes as part of a hotspot VTune collection.
- Most game code will be class 0 and will target the P-Cores by default.\*

\* When not power constrained

### Case Study Background

- Titles used are anonymous
- All data taken from titles un-optimised for Hybrid
- Data gathered during platform validation
- All titles give a good user experience on Hybrid
- Used purely to illustrate OS behaviour

### Case Study 1: Worker Threads on E-Cores

#### Problem statement:

Title scales on Hybrid but...

- Very high E-Core utilisation
- Critical threads on E-Cores

(Chart: per-thread time on P-Core (ms) and E-Core (ms).)

#### Disclaimer

- Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Configurations used for test and this perf data: Intel® i9-12900K + NVIDIA 3090. All testing was performed at Intel® Munich. Numbers may differ based on actual hardware used and/or based on how the benchmark is written. Intel® makes no guarantee on the specific numbers and it is intended for providing reference.
- The above is for reference and work in progress data and software.

### Case Study 1: Non-Hybrid Behaviour

#### Engine created three thread-pools

- Worker threads \* X
- Background threads \* X
- Physics threads \* X

Worker threads run serially with physics; background fills in idle time.

#### Managed by thread priority.
Thread concurrency of 16: 14 Worker Threads + 2 Main threads

### Case Study 1: Over reliance on Priority

| NewThreadId | CPU Usage (ms) | Ready (ms) | Priority |
|-------|----------|----------|----|
| 13164 | 26262.94 | 170.62 | 15 |
| 2672 | 21733.46 | 8170.25 | 9 |
| 11952 | 18384.48 | 52.02 | 15 |
| | | | |
| 13204 | 10747.57 | 445.80 | 11 |
| 1112 | 10723.61 | 445.45 | 11 |
| 10308 | 10716.05 | 604.76 | 11 |
| 12416 | 10691.85 | 514.57 | 11 |
| 2816 | 10687.33 | 519.11 | 11 |
| 2936 | 10521.04 | 739.96 | 11 |
| 12820 | 9885.51 | 1284.05 | 11 |
| 5172 | 9859.60 | 1322.32 | 11 |
| 10204 | 9853.41 | 1363.99 | 11 |
| 360 | 9842.28 | 1402.56 | 11 |
| 2112 | 9831.45 | 1436.90 | 11 |
| 12824 | 9830.03 | 1324.76 | 11 |
| 6840 | 9815.88 | 1403.71 | 11 |
| 6744 | 9718.37 | 1439.86 | 11 |
| | | | |
| 12408 | 9993.18 | 9532.86 | 9 |
| 11832 | 9956.15 | 9536.22 | 9 |
| 5252 | 9947.14 | 9480.11 | 9 |
| 4352 | 9943.25 | 9548.69 | 9 |
| 12256 | 9939.59 | 9580.09 | 9 |
| 13120 | 9933.24 | 9538.27 | 9 |
| 12564 | 9601.82 | 9952.32 | 9 |
| 12352 | 9550.92 | 10004.38 | 9 |
| 6300 | 9492.72 | 10081.98 | 9 |
| 11480 | 9482.79 | 10129.44 | 9 |
| 5980 | 9407.38 | 10187.29 | 9 |
| 8576 | 9398.60 | 10206.77 | 9 |
| 2376 | 9344.28 | 10255.47 | 9 |
| 12840 | 9334.23 | 10271.34 | 9 |
| | | | |
| 232 | 2122.71 | 4278.37 | |
| 352 | 2026.23 | 4438.73 | |
| 1548 | 1780.68 | 132.23 | |

- Thread 5,252 sits in a ready state while 12,416 is running.
- On the symmetric system, priority 9 threads spend 50% of their time in the Ready Queue.

# Case Study 1: Priority Does Not Block Background Thread on Hybrid

- Priority 9 Ready time drops 10x.
- Low priority threads don't have to wait.
| NewThreadId | CPU Usage (ms) | Ready (ms) | Waits (ms) |
|-------------|----------------|------------|------------|
| 9404 | 29561.60 | 407.07 | 19.29 |
| 16832 | 26209.49 | 15.69 | 6116.78 |
| 1948 | 18290.85 | 152.10 | 11550.94 |
| 472 | 11923.43 | 376.37 | 17822.46 |
| ... | ... | ... | ... |
| 6196 | 11894.83 | 390.61 | 17877.16 |
| 17784 | 11625.97 | 1006.57 | 17534.10 |
| 15592 | 11622.22 | 1006.30 | 17515.31 |
| 11824 | 11615.71 | 1020.29 | 17517.20 |

### Case Study 1: Summary

- Low priority work runs in parallel with high priority work.
- High priority, long running threads run on E-Cores when previous lower priority work is already in flight on P-Cores.
- Could defer scheduling of priority 10 threads until after the higher priority threads have started running.
- Could move background threads on to EcoQoS.

(Figure annotation: thread 16,832 running on an E-Core.)

# Case Study 2: Unclear Critical Path / Poor Multi-Threaded Scaling

Thread creation based on logical processor count

## Case Study 2: Poor Multi-Threaded Scaling

Amdahl's Law, ideal case with no serial fraction: a 2x increase in cores should halve wall time.

But: 22 worker threads are slower than 10 worker threads.

#### Hybrid System

Additional threads show high CPI on hybrid. L3 boundedness increased by 2x.
| Metric | Hybrid | Non Hybrid |
|-------------------------------------|---------|------------|
| metric_CPU operating frequency(GHz) | 5.2460 | 5.3889 |
| metric_CPI | 2.3365 | 1.5610 |
| metric_TMA_Backend_Bound(%) | 77.0914 | 54.9688 |
| metric_TMAMemory_Bound(%) | 66.0355 | 42.5052 |
| metric_TMAL3_Bound(%) | 48.6328 | 23.5164 |

### Case Study 2: Unclear Critical Path

- Thread 5,212 is high priority and stays on P-Cores.
- Thread 13,112 looks the same as the worker threads from the OS level.
  - Same priority 11 as workers.
  - 80% short run time on thread wake up.

### Case Study 2: Summary

- Thread 5,212 waits on thread 13,112.
- Two long running threads with a hard dependency between them context switch while the system is highly subscribed → high chance to schedule on an E-Core.
- Time spent on the E-Core is part of the critical path.
- Increase priority of critical path thread.
- Reduce number of worker threads to reduce memory contention.

### Case Study 3: Erratic Behaviour Over Time

When processor selection goes bad...

- Application stutters during gameplay
- AI called on separate threads decoupled from primary task system
- Threads doing AI change behaviour over time
- Stutter coincides with higher core parking and long running AI threads

### Case Study 3: OS Forced Serialization

- Game is affinitizing to E-Cores for AI
- Most E-Cores are parked
- Only core 16 is fully unparked

### Case Study 3: Summary

#### Remember this??

- Frame rate stutter linked to core parking
- Thread's Ideal Processor was outside the thread affinity mask, so the scheduler fell back to the last used core (16)
- Data contention resulted in blocked thread progress until quantum end
- Fixed with SetThreadIdealProcessor
- Removed data contention

### Hybrid CPU Best Practices

- Profile your workload
  - Use QueryPerformanceCounter() for micro-benchmarking
  - Use Intel® VTune™ Profiler for in-depth CPU performance analysis
- Don't oversubscribe your thread pool
  - Don't use hyperthread cores if your workload can't benefit from hyper-threading
  - Avoid unnecessary context switches and cache flushes
- Use Quality of Service APIs for OS and Intel® Thread Director optimizations
  - QoS APIs can be used in combination with Static Partitioning APIs based on application architecture
- Avoid static partitioning; allow cores to steal work from other cores
  - Work stealing allows idle threads to take tasks from cores that may be overworked, increasing throughput
- Avoid pinning threads to a single logical processor
- Avoid scheduling lower priority tasks on the same cores as your critical path
- Understand how your middleware uses threads

# Thank you

### Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at <a href="https://www.Intel.com/PerformanceIndex">www.Intel.com/PerformanceIndex</a> (graphics and accelerators).

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel technologies may require enabled hardware, software or service activation.

All product plans and roadmaps are subject to change without notice.

Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not "commercial" names and not intended to function as trademarks.

Statements that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. For more information on the factors that could cause actual results to differ materially, see our most recent earnings release and SEC filings at www.intc.com.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

#### Optimizing Software for x86 Hybrid Architecture

White Paper, October 2021, Revision 1.0, Document Number: 348851-001US