05:28PM EDT – Welcome to Hot Chips! That's the annual conference all about the latest, greatest, and upcoming big silicon that gets us all excited. Stay tuned through Monday and Tuesday for our regular AnandTech Live Blogs.
05:31PM EDT – Stream is starting! We have Intel, AMD, Google, Xilinx
05:32PM EDT – One of the most complicated tasks at Intel
05:33PM EDT – Aiming for 500x over Intel's previous best GPU
05:33PM EDT – Scale is important
05:33PM EDT – 4 variants of Xe
05:34PM EDT – Exascale market needs scale
05:34PM EDT – broad set of datatypes
05:34PM EDT – Xe-Core
05:34PM EDT – Not EUs – Xe Cores now
05:35PM EDT – Each core in HPC has 8x 512-bit vectors, 8x 4096-bit matrix engines, 8-deep systolic array
05:35PM EDT – Huge 512 KB L1 cache for Xe Core
05:35PM EDT – Software-configurable scratchpad shared memory
05:36PM EDT – 8192 x INT8 for Xe-Core
05:36PM EDT – One slice has 16 Xe Cores, 16 RT units, 1 hardware context
05:36PM EDT – ProVis and content creation
05:37PM EDT – Stack is 4 Slices
05:37PM EDT – 64 Xe Cores, 64 RT units, 4 hardware contexts, L2 cache, 4 HBM2e controllers
05:37PM EDT – 8 Xe Links
05:37PM EDT – Support for 2 stacks
05:38PM EDT – connected directly through packaging
05:38PM EDT – GPU to GPU communication
05:38PM EDT – 8 fully connected GPUs through an embedded switch
05:38PM EDT – not for CPU-to-GPU
05:39PM EDT – 8 GPUs in OAM
05:39PM EDT – OCP Accelerator Module
05:39PM EDT – 1 million INT8/clock in a single system
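(A quick back-of-the-envelope check on those INT8 figures, assuming the 8192 number is ops per clock per Xe-Core and the "1 million" claim refers to one two-stack GPU:)

```python
# Sanity check of the INT8 throughput figures quoted above (assumptions: the
# 8192 figure is ops/clock per Xe-Core, and the 1-million claim is per
# two-stack GPU).
int8_per_xe_core = 8192      # per-clock INT8 ops quoted per Xe-Core
xe_cores_per_slice = 16      # one slice = 16 Xe Cores
slices_per_stack = 4         # one stack = 4 slices
stacks_per_gpu = 2           # up to 2 stacks per GPU

per_gpu = int8_per_xe_core * xe_cores_per_slice * slices_per_stack * stacks_per_gpu
print(per_gpu)               # 1,048,576 -> roughly the "1 million INT8/clock" claim
```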
05:40PM EDT – Advanced packaging
05:41PM EDT – Lots of new stuff
05:41PM EDT – EMIB + Foveros
05:41PM EDT – 5 different process nodes
05:42PM EDT – MDFI interconnect traffic
05:42PM EDT – lots of challenges
05:42PM EDT – Learned a lot
05:43PM EDT – Floorplan locked very early
05:43PM EDT – Run Foveros at 1.5x the frequency originally planned to minimize Foveros connections
05:43PM EDT – booted a few days after first silicon came back
05:44PM EDT – Order of magnitude more Foveros connections than other previous designs
05:44PM EDT – Compute tiles built on TSMC N5
05:45PM EDT – 640 mm2 per base tile, built on Intel 7
05:46PM EDT – Xe Link tile built in less than a year
05:47PM EDT – oneAPI support
05:47PM EDT – 45 TFLOPs of sustained perf
05:48PM EDT – Customers early next year
05:48PM EDT – Q&A
05:50PM EDT – Q: PV has 45 TF of FP32 compute – is it also 45 TF of FP64? A: Yes
05:51PM EDT – Q: More insight into hardware contexts – is 8x PV monolithic or 800 instances? A: Looks like a single logical device; independent applications can run in isolation at the context level
05:53PM EDT – Q: Does Xe Link support CXL, and if so, which revision? A: Nothing to do with CXL
05:54PM EDT – Q: Does the GPU connect to the CPU via PCIe or CXL? A: PCIe
05:54PM EDT – Q: Xe Link bandwidth? A: 90G SerDes
05:55PM EDT – Q: Peak power/TDP? A: Not disclosing – no product-specific numbers
05:55PM EDT – Next talk up is AMD – RDNA2
05:57PM EDT – CDNA for compute vs RDNA for gaming
05:57PM EDT – Both are focused on compute in their respective directions
05:58PM EDT – Versatile and adaptable design
05:58PM EDT – 18 months after first RDNA product
05:59PM EDT – 128 MB of Infinity cache
05:59PM EDT – increase frequency
05:59PM EDT – RDNA unshackled the design from certain underpinnings of GCN
05:59PM EDT – Perf/W is a key metric
05:59PM EDT – reduce wasted power
06:00PM EDT – DX12 Ultimate support, support for DirectStorage
06:00PM EDT – Next-gen consoles helped with development of the feature set
06:01PM EDT – +30% frequency at iso-power, or under half the power at iso-frequency
06:02PM EDT – All achieved without a change in process node
06:03PM EDT – RX 5000 – RDNA1 – high bandwidth but low hit rates
06:04PM EDT – Trying to avoid GDDR use to reduce power – so increase caches!
06:04PM EDT – GPU cache hit rates
06:04PM EDT – graphics was one-pass compute
06:05PM EDT – Huge L3 caches
06:07PM EDT – lower energy per bit – just 1.3 pJ/bit on cache vs 7-8 pJ/bit for GDDR6
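(To put those pJ/bit numbers in context, a rough illustration; the 1 TB/s traffic figure is an assumption for the example, not from the talk:)

```python
# Rough energy comparison using the quoted per-bit figures. The 1 TB/s of
# served bandwidth is a hypothetical workload, not a number from the talk.
cache_pj_per_bit = 1.3       # quoted for on-die cache
gddr6_pj_per_bit = 7.5       # midpoint of the quoted 7-8 pJ/bit for GDDR6

bits_per_second = 1e12 * 8   # assume 1 TB/s of traffic served
cache_watts = bits_per_second * cache_pj_per_bit * 1e-12
gddr6_watts = bits_per_second * gddr6_pj_per_bit * 1e-12
print(f"{cache_watts:.1f} W (cache) vs {gddr6_watts:.1f} W (GDDR6)")  # ~10.4 W vs ~60.0 W
```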
06:08PM EDT – Average memory latency on the RX 6800 is 34% lower than the RX 5700
06:10PM EDT – Ray tracing in RDNA2
06:10PM EDT – Variable Rate Shading
06:10PM EDT – Sampler Feedback
06:10PM EDT – Mesh Shaders
06:11PM EDT – RT aimed to be efficient without adding overhead
06:12PM EDT – tightly integrated into the shader architecture
06:12PM EDT – Streamlined implementation
06:13PM EDT – VRS uses fine-grained rate selection per 8×8 pixels
06:13PM EDT – VRS up to 2×2 in an 8×8 grid
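(A minimal illustration of the shading-work saving this offers, assuming the coarse 2×2 rate were applied across an entire 8×8 tile:)

```python
# Hypothetical shading-work saving from VRS, assuming a whole 8x8 tile is
# shaded at the coarse 2x2 rate.
pixels_in_tile = 8 * 8                          # the 8x8 grid mentioned above
full_rate_shades = pixels_in_tile               # 1x1 rate: one shader invocation per pixel
coarse_rate_shades = pixels_in_tile // (2 * 2)  # 2x2 rate: one invocation per 2x2 quad
print(full_rate_shades, coarse_rate_shades)     # 64 vs 16 -> up to 4x fewer shades
```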
06:16PM EDT – Q&A
06:19PM EDT – Q: Why Infinity Cache vs stacked V-Cache? A: Not discussing; we were only evaluating on-die cache
06:22PM EDT – Q: What % of TDP goes to CUs vs everything else? A: In a heavy workload, most power is in the CUs – the exact breakdown depends on the workload – CUs are largest and can exceed 50%, 2nd is GPU common blocks, 3rd the DRAM interface. Infinity Cache changes 2nd/3rd place. 4th is Infinity Cache
06:23PM EDT – Q: SteamDeck Van Gogh? A: No comment
06:29PM EDT – Google VCU talk
06:30PM EDT – video is >60% of worldwide internet traffic
06:30PM EDT – need better algorithms
06:30PM EDT – Video is getting hardware to compress it
06:31PM EDT – AV1 is 200x more time-consuming to encode in software vs H.264
06:31PM EDT – Pixels/second has increased 8000x since H.264
06:32PM EDT – most consumer hardware is optimized for cost, not performance or efficiency
06:32PM EDT – Couldn't find everything needed in an off-the-shelf offering
06:32PM EDT – Encode 10s of versions from one input
06:33PM EDT – Need full access to configuration tools
06:34PM EDT – Dedicated VP9 encode and decode
06:36PM EDT – enabling sw/hw codesign
06:38PM EDT – HLS allowed them to test lots of architecture variations for features and performance
06:39PM EDT – Accelerators must scale to warehouse scale computing
06:40PM EDT – tolerate errors at the chip and core level – reliability is a next-level function
06:40PM EDT – Have to support 48 encodes per decode (MOT)
06:40PM EDT – Upload one video, encode multiple versions
06:41PM EDT – chip-level cache was ineffective
06:41PM EDT – core counts to support large MOTs
06:41PM EDT – LPDDR4 for bandwidth per byte
06:41PM EDT – on-chip memories use ECC
06:42PM EDT – conservative NOC design
06:43PM EDT – A single decoded frame can be used multiple times – one decode for multiple encodes
06:43PM EDT – parallel queues for high utilization
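(A minimal sketch of that decode-once, encode-many pattern; the function and profile names here are illustrative stand-ins, not the VCU's actual API:)

```python
# Illustrative decode-once / encode-many pipeline; names are hypothetical
# stand-ins, not Google's VCU API.
OUTPUT_PROFILES = ["1080p", "720p", "480p", "360p"]   # "10s of versions" in practice

def decode_stream(source):
    """Stub decoder: yields decoded frames from a source stream."""
    for frame in source:
        yield frame

def encode(frame, profile):
    """Stub encoder: stands in for one VP9 encode at a given output profile."""
    return (profile, frame)

def transcode(source):
    # Each frame is decoded exactly once, then fanned out to multiple encodes.
    for frame in decode_stream(source):
        yield [encode(frame, p) for p in OUTPUT_PROFILES]

# Example: 3 input frames produce 3 x 4 encoded outputs.
outputs = list(transcode(range(3)))
```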
06:44PM EDT – 2 ASICs per board bifurcated, 5 boards per chassis, 2 chassis per host
06:44PM EDT – As many machines per rack as space and power allow
06:44PM EDT – Performance vs dual-socket Skylake
06:44PM EDT – 100x VP9 vs H.264
06:45PM EDT – one 20 VCU machine replaces racks of CPUs
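(The board/chassis/host numbers above multiply out to that 20-VCU machine:)

```python
# Sanity check of the quoted topology: 2 ASICs per board, 5 boards per chassis,
# 2 chassis per host.
vcus_per_board = 2
boards_per_chassis = 5
chassis_per_host = 2
print(vcus_per_board * boards_per_chassis * chassis_per_host)  # 20 VCUs per host
```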
06:46PM EDT – Post-deployment quality improvements
06:47PM EDT – Q&A time
06:49PM EDT – Q: Can the VCU be used in tandem with another ASIC? A: Not possible, no intermediate tap-in/tap-out. Tightly coupled design
06:50PM EDT – Q: What is the PCIe card profile – lanes/TDP? A: In-house bifurcation format, otherwise FHFL dual-slot, post-silicon under 100W
06:50PM EDT – Q: Will the VCU be available through GCP? A: Always looking for unique GCP offerings, but no announcements.
06:52PM EDT – Q: Can HLS reach parity with RTL? A: Yes
06:54PM EDT – Q: SECDED ECC caches? A: Where possible SECDED; some SRAMs in the codec encoder only do detection – if an error does occur, we can reset the job
06:54PM EDT – Q: 8K60 – can a single VCU do that? A: Throughput-wise, yes. But no VP9 profile.
06:55PM EDT – Q: Other codecs? A: No comment on future codecs. Closely involved in AV1's next gen, AV2
06:55PM EDT – Q: Audio streams? A: The stream is split between video/audio; audio can be broken out and processed elsewhere. The VCU doesn't touch audio
06:58PM EDT – Final talk is Xilinx
06:59PM EDT – Xilinx Versal AI Edge
06:59PM EDT – 7nm
06:59PM EDT – AIE-ML architecture optimized for inference
07:00PM EDT – What’s ML used for
07:00PM EDT – All applications require lots of AI at low latency and low power
07:02PM EDT – low-end and high-end devices highlighted today
07:03PM EDT – 10s of ML tiles to 100s of Tiles
07:04PM EDT – lots of form factors for inference
07:05PM EDT – Details on the architecture
07:05PM EDT – memory tiles, optimized compute core
07:06PM EDT – Native support for INT4 and BF16
07:07PM EDT – SRAM is ECC protected
07:07PM EDT – Memory tile count depends on the device – mid-range is about 38 megabytes of memory tiles
07:10PM EDT – New ML-focused tile on these mid-range products
07:10PM EDT – high end still uses AIE because 5G needs it
07:10PM EDT – VLIW vector processor
07:10PM EDT – non-blocking interconnect
07:10PM EDT – micro-DMA
07:15PM EDT – device-level data movement
07:15PM EDT – Tiles can read directly from DDR, no need for intermediate levels
07:16PM EDT – DDR supports live compression
07:20PM EDT – memory is distributed, no data replication, no cache misses
07:23PM EDT – Coming soon to the software stack
07:23PM EDT – No need to program in C++ – PyTorch, TensorFlow, Caffe, TVM
07:24PM EDT – use cases
07:25PM EDT – How to use a full Versal AI Edge processor for a single use case
07:31PM EDT – That's a wrap