https://ieeexplore.ieee.org/document/9340376
The authors describe the implementation of the ThunderX3 CPU. Here are the key points that stood out for me.
ThunderX2 was derived from an earlier MIPS-based architecture (which I was privileged to work on for a short while). ARM instructions that did not map directly to MIPS were expanded into micro-ops. Reducing this micro-op expansion yielded a 6% gain in SPECint.
The paper presents details on the arbitration policies for hardware threads inside the OOO pipeline. Arbitration decisions are made, for example, based on the number of micro-ops a thread has in flight further down the pipe, or on which thread has the most micro-ops to retire.
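To make the two arbitration criteria concrete, here is a minimal sketch of how such policies could look in software. This is my own illustration, not the ThunderX3 logic: the names ThreadState, pick_fetch_thread and pick_retire_thread are hypothetical, and favoring the thread with the fewest micro-ops in flight (an ICOUNT-style choice) and breaking ties by lowest thread id are assumptions on my part.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    """Hypothetical per-thread counters an arbiter might consult."""
    tid: int
    uops_in_flight: int        # micro-ops further down the pipe
    uops_ready_to_retire: int  # micro-ops that could retire this cycle


def pick_fetch_thread(threads):
    """Assumed policy: favor the thread with the fewest micro-ops in flight.

    Ties are broken by lowest thread id (also an assumption).
    """
    return min(threads, key=lambda t: (t.uops_in_flight, t.tid)).tid


def pick_retire_thread(threads):
    """Assumed policy: favor the thread with the most micro-ops to retire.

    Ties are broken by lowest thread id (also an assumption).
    """
    return max(threads, key=lambda t: (t.uops_ready_to_retire, -t.tid)).tid


# Four hardware threads with made-up counter values.
threads = [
    ThreadState(tid=0, uops_in_flight=12, uops_ready_to_retire=3),
    ThreadState(tid=1, uops_in_flight=4, uops_ready_to_retire=1),
    ThreadState(tid=2, uops_in_flight=9, uops_ready_to_retire=6),
    ThreadState(tid=3, uops_in_flight=4, uops_ready_to_retire=0),
]

fetch_choice = pick_fetch_thread(threads)    # thread 1 (ties with 3, lower id wins)
retire_choice = pick_retire_thread(threads)  # thread 2 (most ready to retire)
```

Real hardware would make these choices per pipeline stage every cycle with fixed-width counters; the sketch only shows the selection criteria.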
The paper describes the rationale for choosing a ring topology over a mesh: memory bandwidth is not expected to increase dramatically in the next generation.