Graphics (Intel, AMD, Google, Xilinx)

05:28PM EDT – Welcome to Hot Chips! This is the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned through Monday and Tuesday for our regular AnandTech Live Blogs.

05:31PM EDT – Stream is starting! We have Intel, AMD, Google, Xilinx

05:32PM EDT – One of the most complicated tasks at Intel

05:33PM EDT – Aiming for 500x over Intel's previous best GPU

05:33PM EDT – Scale is important

05:33PM EDT – 4 variants of Xe

05:34PM EDT – Exascale market needs scale

05:34PM EDT – broad set of datatypes

05:34PM EDT – Xe-Core

05:34PM EDT – Not EUs – Xe Cores now

05:35PM EDT – Each core in HPC has 8x 512-bit vectors, 8x 4096-bit matrix engines, 8-deep systolic array

05:35PM EDT – Huge 512 KB L1 cache for Xe Core

05:35PM EDT – Software configurable scratchpad shared memory

05:36PM EDT – 8192 x INT8 for Xe-Core
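
As a sanity check on that 8192 figure, here is one way it could decompose from the engine widths quoted above. The lane split and the MAC-counts-as-two-ops convention are my assumptions, not something Intel stated:

```python
# Hedged back-of-envelope: how 8192 INT8 ops/clock per Xe-Core could decompose.
# Assumes each 4096-bit matrix engine holds 4096/8 = 512 INT8 lanes, and a
# multiply-accumulate counts as 2 ops -- Intel did not confirm this split.
engines_per_core = 8
lanes_per_engine = 4096 // 8   # 512 INT8 values fit in a 4096-bit engine
ops_per_mac = 2                # multiply + accumulate

int8_ops_per_clock = engines_per_core * lanes_per_engine * ops_per_mac
print(int8_ops_per_clock)  # 8192, matching the quoted per-core number
```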

05:36PM EDT – One slice has 16 Xe Cores, 16 RT units, 1 hardware context

05:36PM EDT – ProVis and content creation

05:37PM EDT – Stack is 4 Slices

05:37PM EDT – 64 Xe Cores, 64 RT units, 4 hardware contexts, L2 cache, 4 HBM2e controllers

05:37PM EDT – 8 Xe Links

05:37PM EDT – Supports 2 stacks

05:38PM EDT – connected directly through packaging

05:38PM EDT – GPU to GPU communication

05:38PM EDT – 8 fully connected GPUs through embedded switch

05:38PM EDT – not for CPU-to-GPU

05:39PM EDT – 8 GPUs in OAM

05:39PM EDT – OCP Accelerator Module

05:39PM EDT – 1 million INT8/clock in a single system
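
The "million" figure checks out against the per-core numbers given earlier. This is my arithmetic, not a slide; whether the slide counts per two-stack package or per 8-GPU OAM system isn't clear from the live notes:

```python
# 8192 INT8 ops/clock per Xe-Core (quoted), 64 Xe-Cores per stack,
# 2 stacks per GPU package, 8 GPUs in the OAM system.
per_core = 8192
per_stack = per_core * 64        # 524,288 per stack
per_package = per_stack * 2      # 1,048,576 -- already ~1M per GPU package
per_system = per_package * 8     # 8,388,608 across the 8-GPU OAM system
print(per_package, per_system)
```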

05:40PM EDT – Advanced packaging

05:41PM EDT – Lots of new stuff

05:41PM EDT – EMIB + Foveros

05:41PM EDT – 5 different process nodes

05:42PM EDT – MDFI interconnect traffic

05:42PM EDT – lots of challenges

05:42PM EDT – Learned a lot

05:43PM EDT – Floorplan locked very early

05:43PM EDT – Running Foveros at 1.5x the initially planned frequency to minimize Foveros connections

05:43PM EDT – booted a few days after first silicon came back

05:44PM EDT – Order of magnitude more Foveros connections than other previous designs

05:44PM EDT – Compute tiles built on TSMC N5

05:45PM EDT – 640mm2 per base tile, built on Intel 7

05:46PM EDT – Xe Link Tile built in less than a year

05:47PM EDT – OneAPI support

05:47PM EDT – 45 TFLOPs of sustained perf

05:48PM EDT – Customers early next year

05:48PM EDT – Q&A

05:50PM EDT – Q: PV has 45TF FP32 compute – also 45 TF of FP64? A: Yes

05:51PM EDT – Q: More insights into hardware contexts – is 8x PV monolithic or 800 instances? A: Looks like a single logical device, independent applications can run in isolation at the context level

05:53PM EDT – Q: Does Xe Link support CXL, and if so, which revision? A: nothing to do with CXL

05:54PM EDT – Q: Does the GPU connect to the CPU by PCIe or CXL? A: PCIe

05:54PM EDT – Q: Xe Link bandwidth? A: 90G SerDes

05:55PM EDT – Q: Peak power/TDP? A: Not disclosing – no product specific numbers

05:55PM EDT – Next talk up is AMD – RDNA2

05:57PM EDT – CDNA for compute vs RDNA for gaming

05:57PM EDT – Both are focused on compute, each in its own direction

05:58PM EDT – Versatile and adaptable design

05:58PM EDT – 18 months after first RDNA product

05:59PM EDT – 128 MB of Infinity cache

05:59PM EDT – increase frequency

05:59PM EDT – RDNA unshackled the design from certain underpinnings of GCN

05:59PM EDT – Perf/W is a key metric

05:59PM EDT – reduce wasted power

06:00PM EDT – DX12 Ultimate support, support for DirectStorage

06:00PM EDT – Next gen consoles helped with development of the featureset

06:01PM EDT – +30% frequency at iso-power, or under half power at iso-frequency

06:02PM EDT – All achieved without a change in process node

06:03PM EDT – RX5000 – RDNA1 – high bandwidth but low hit rates

06:04PM EDT – Trying to avoid GDDR use to reduce power – so increase caches!

06:04PM EDT – GPU cache hit rates

06:04PM EDT – graphics was one-pass compute

06:05PM EDT – Huge L3 caches

06:07PM EDT – lower energy per bit – just 1.3 pJ/bit on cache vs 7-8 pJ/bit for GDDR6
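
To see what that pJ/bit gap means in watts, a rough conversion at a fixed bandwidth (the 512 GB/s figure is illustrative on my part, not a number from the talk):

```python
# Power cost of moving data: P = bandwidth * energy_per_bit.
# 512 GB/s is an illustrative bandwidth, not a number from AMD's talk.
def transfer_watts(gbytes_per_s: float, pj_per_bit: float) -> float:
    bits_per_s = gbytes_per_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12  # picojoules -> joules

cache = transfer_watts(512, 1.3)   # on-die cache at 1.3 pJ/bit
gddr6 = transfer_watts(512, 7.5)   # GDDR6 at the 7-8 pJ/bit midpoint
print(round(cache, 1), round(gddr6, 1))  # ~5.3 W vs ~30.7 W
```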

06:08PM EDT – Average memory latency on RX6800 is 34% lower than RX5700

06:10PM EDT – Ray tracing in RDNA2

06:10PM EDT – Variable Rate Shading

06:10PM EDT – Sampler Feedback

06:10PM EDT – Mesh Shaders

06:11PM EDT – RT aimed to be efficient without adding overhead

06:12PM EDT – tightly integrated into the shader architecture

06:12PM EDT – Streamlined implementation

06:13PM EDT – VRS uses fine grained rate selection per 8×8 pixels

06:13PM EDT – VRS up to 2×2 in an 8×8 grid
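
To put a number on the savings (my arithmetic, not a slide): at a 2×2 coarse rate each pixel-shader invocation covers four pixels, so an 8×8 tile needs 16 invocations instead of 64:

```python
# Pixel shader invocations for an 8x8 VRS tile at a given coarse shading rate.
def invocations(tile: int = 8, rate: int = 1) -> int:
    # rate=1 shades every pixel; rate=2 shades one sample per 2x2 block.
    per_axis = -(-tile // rate)   # ceiling division
    return per_axis * per_axis

print(invocations(rate=1))  # 64 -- full rate, one invocation per pixel
print(invocations(rate=2))  # 16 -- 4x fewer invocations for the tile
```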

06:16PM EDT – Q&A

06:19PM EDT – Q: Why Infinity Cache vs stacked V-Cache? A: Not discussing, were only evaluating on-die cache

06:22PM EDT – Q: TDP % in CUs vs other blocks? A: High workload, most power in CUs – exact breakdown depends on workload – CUs are largest, can exceed 50%, 2nd is GPU common blocks, 3rd DRAM interface. Infinity Cache changes 2nd/3rd place. 4th is Infinity Cache

06:23PM EDT – Q: Steam Deck Van Gogh? A: No comment

06:29PM EDT – Google VCU talk

06:30PM EDT – video is >60% of global internet traffic

06:30PM EDT – need better algorithms

06:30PM EDT – Video is getting hardware to compress

06:31PM EDT – AV1 is 200x more time consuming to encode in software vs H.264

06:31PM EDT – Pixels/second has increased 8000x since H.264

06:32PM EDT – most consumer hardware is optimized for cost, not performance or efficiency

06:32PM EDT – Couldn't find everything in an off the shelf offering

06:32PM EDT – Encode 10s of versions with one input

06:33PM EDT – Need full access to configuration tools

06:34PM EDT – Dedicated VP9 encode and decode

06:36PM EDT – enabling sw/hw codesign

06:38PM EDT – HLS allowed them to test many architecture variations for features and performance

06:39PM EDT – Accelerators must scale to warehouse scale computing

06:40PM EDT – tolerate errors at the chip and core level – reliability is handled at the next level up

06:40PM EDT – Have to support 48 encodes per decode (MOT)

06:40PM EDT – Upload one video, encode multiple versions

06:41PM EDT – chip level cache was ineffective

06:41PM EDT – core counts to support large MOTs

06:41PM EDT – LPDDR4 for bandwidth per byte

06:41PM EDT – on chip memories use ECC

06:42PM EDT – conservative NoC design

06:43PM EDT – A single decoded frame can be used multiple times – one decode for multiple encodes

06:43PM EDT – parallel queues for high utilization

06:44PM EDT – 2 ASICs per board bifurcated, 5 boards per chassis, 2 chassis per host

06:44PM EDT – As many machines per rack as space and power allow

06:44PM EDT – Performance vs dual socket Skylake

06:44PM EDT – 100x VP9 vs H.264

06:45PM EDT – one 20 VCU machine replaces racks of CPUs

06:46PM EDT – Post-deployment improved quality

06:47PM EDT – Q&A time

06:49PM EDT – Q: Can VCU work in tandem with another ASIC? A: Not possible, no intermediate tap in/tap out. Tightly coupled design

06:50PM EDT – Q: What is the PCIe card profile – lanes/TDP? A: In-house bifurcation format, otherwise FHFL dual-slot, post silicon under 100W

06:50PM EDT – Q: VCU available through GCP? A: Always looking at opportunities for GCP, but no announcements.

06:52PM EDT – Q: Can HLS achieve parity with RTL? A: Yes

06:54PM EDT – Q: SECDED ECC caches? A: where possible SECDED, some SRAMs in the codec encoder only do detection – if an error does occur, we can reset the job

06:54PM EDT – Q: 8K60 – can a single VCU do that? A: Throughput, yes. But no VP9 profile.

06:55PM EDT – Q: Other codecs? A: No comment on future codecs. Closely involved in AV1 and next gen AV2

06:55PM EDT – Q: Audio streams? A: Stream is split between video/audio, can be split up and processed elsewhere. VCU doesn't touch audio

06:58PM EDT – Final talk is Xilinx

06:59PM EDT – Xilinx Versal AI Edge

06:59PM EDT – 7nm

06:59PM EDT – AIE-ML architecture optimized for inference

07:00PM EDT – What’s ML used for

07:00PM EDT – All applications require lots of AI at low latency and low power

07:02PM EDT – low end and high end devices highlighted today

07:03PM EDT – 10s of ML tiles to 100s of tiles

07:04PM EDT – lots of form factors for inference

07:05PM EDT – Details on the architecture

07:05PM EDT – memory tiles, optimized compute core

07:06PM EDT – Native support for INT4 and BF16

07:07PM EDT – SRAM is ECC protected

07:07PM EDT – Memory tile count depends on device – mid-range is about 38 megabytes of memory tiles

07:10PM EDT – New ML focused tile on these mid-range products

07:10PM EDT – high end still uses AIE because 5G needs it

07:10PM EDT – VLIW vector processor

07:10PM EDT – non-blocking interconnect

07:10PM EDT – micro-DMA

07:15PM EDT – device level data movement

07:15PM EDT – Tiles can read direct from DDR, no need for intermediate levels

07:16PM EDT – DDR supports live compression

07:20PM EDT – memory is distributed, no data replication, no cache misses

07:23PM EDT – Coming soon to the software stack

07:23PM EDT – Needn’t program at C++ – pytorch, Tensorflow, Caffe, tvm

07:24PM EDT – use cases

07:25PM EDT – How to use a full Versal AI Edge processor for a single use

07:31PM EDT – That's a wrap
