camel-cdr's Hacker News comments

The website says RVA23 if you click on "read more". It still seems quite vague, but to give them some credit, they did spend 17K € to sponsor RISC-V Summit Europe 2026. We'll probably hear more from them there.

bitfield insert/extract was also looked at by the scalar efficiency SIG: https://lists.riscv.org/g/sig-scalar-efficiency/topic/115060...

IIRC it didn't go anywhere, because it wasn't worth the encoding space.

But an rlwimi sounds like a good candidate for a >32-bit encoding.


Both the PowerPC and Arm64 instructions do grab a lot of encoding space.

rlwimi uses 26 bits of opcode space (i.e. 2^26 = 64M code points). In a RISC-V context you can drop the Rc (set status flags) bit, but for RV64 you need to expand the shift/start/end fields from 5 to 6 bits, so you end up needing 28 bits of encoding space: 18 for the field spec and 5 each for Rs1 and Rd/Rs2.

A RISC-V major opcode, such as OP-IMM (which this effectively is, but with an R/W Rd/Rs2), only has 2^25 code points of encoding space for all instructions in total!

PPC64's rldimi expands shift and size to 6 bits each but drops the ability to take the source field from an arbitrary position (it comes only from the LSBs), and so uses 23 encoding bits, i.e. exactly my proposed RISC-V instruction (except for the set-flags bit, so 22 bits).

Arm64's BFM/SBFM effectively uses 24 bits to provide both 32 bit and 64 bit operations — there are 25 bits but `sf` and `N` must be the same, potentially allowing the other half of the code points (plus the ones for 32 bit with the MSBs of `immr` and `imms` set) to be used for something else in future. Note that BFM leaves all other bits in the dst unchanged, while SBFM both sign-extends into the higher bits of dst AND zeros the lower bits of dst.

So BFM/SBFM *could* be fit into RISC-V, taking up half of a major opcode, of which there aren't many left. That is a pretty huge amount — the enormous V extension takes 1 1/2 major opcodes, for far more functionality. It would free up various immediate shifts and sign/zero extension instructions, but those don't take much encoding space, no more than 16 bits each.

As nice as they are, it's hard to avoid the conclusion that both (32 bit) PowerPC and Arm64 spend too much opcode space on these.

I think PPC64's `rldimi` and M88K's `mak` (extended to 64 bits) and my last RISC-V suggestion — which are all effectively the same thing — hit the right tradeoff, not using excessive encoding space but allowing a 2-instruction sequence for that bit field move:

    srli   a3, a1, 21
    maki   a2, a3, (1<<6) | 10   # decoder expands to `maki a2, a2, a3, (1<<6) | 10`
That's 22 bits of opcode space, the same as any one of `addi`, `andi`, `ori`, `xori`, `slti`, `sltiu` (OP-IMM) or `addiw` (OP-IMM-32).
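The semantics of the hypothetical `maki` could be sketched in plain C; the `(pos<<6)|size` split of the immediate is my reading of the example above, not a ratified encoding:

```c
#include <stdint.h>

/* Sketch of the proposed (hypothetical) `maki rd, rs, (pos<<6)|size`:
   insert the low `size` bits of rs into rd at bit position `pos`,
   leaving the other bits of rd unchanged (rd is both source and dest). */
uint64_t maki(uint64_t rd, uint64_t rs, unsigned pos, unsigned size) {
    uint64_t mask = (size < 64 ? (1ull << size) : 0) - 1; /* low `size` ones */
    return (rd & ~(mask << pos)) | ((rs & mask) << pos);
}
```

The two-instruction sequence above is then `a2 = maki(a2, a1 >> 21, 1, 10)`.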

The original RV64GC has 5/8 funct3 encodings in OP-IMM-32 unused, which `maki` (or call it `bfi` or whatever) could have used one of. It has a combined `Rd`/`Rs2` field which is unusual in full size 4-byte RISC-V instructions, but not unprecedented: the V extension does that for multiply-add instructions.

I don't immediately see any ratified or currently-proposed extension using this space.


What would justify using this significant space for them these days? Video encoding/decoding in software seems like the most likely candidate, since there's a lot of bitfield packing and high data volume.

(Thanks for your elaboration on various architectures. It's an interesting glimpse into what goes on in allocating opcode space on fixed-length instruction machines.)


My example is applicable to compilers / assemblers / JITs / emulators.

The performance of conventional compilers and assemblers is not important to anyone but developers, but everyone uses JavaScript / WebAssembly all the time. And QEMU can be important too (e.g. in Docker for non-native ISAs, using binfmt_misc).

I guess I should point out in the proposed RISC-V example, it's 6 bytes of code as the initial shift can be a 2-byte "C" extension instruction. So that's slightly smaller code than everything except 32 bit PowerPC, which is another important aspect. Arm64 and M68k use 8 bytes of code.

Oh! I just realised standard RISC-V can be improved in this case (but not by so much in the general case).

    srli   x12, x10, 20          # shift field down to correct position
    andi   x12, x12, 0x7FE       # mask to 10 bits
    andi   x11, x11, ~0x7FE      # clear space in the destination
    or     x11, x11, x12         # insert the field
That's just 12 bytes of code.

In the more general case you need a `lui` or `lui;andi` pair to load the mask into a register, and then register to register ops, for 14 bytes total.

Note that x86_64 needs four instructions and 14 bytes of code, so no better than RISC-V.
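As a sanity check, the four-instruction sequence corresponds to this C; the field is taken from bits 21..30 of the source and inserted at bits 1..10 of the destination, matching the 0x7FE mask:

```c
#include <stdint.h>

/* C equivalent of the srli/andi/andi/or sequence above: move the
   10-bit field at bits 21..30 of src into bits 1..10 of dst. */
uint64_t bitfield_move(uint64_t dst, uint64_t src) {
    uint64_t t = src >> 20;       /* srli: field now at bits 1..10      */
    t &= 0x7FE;                   /* andi: isolate the field            */
    dst &= ~(uint64_t)0x7FE;      /* andi: clear space in destination   */
    return dst | t;               /* or:   insert the field             */
}
```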


pext/pdep are incredible, I'm hoping to see them in more SIMD ISAs in the future.

But my favorite is the 8x8 bit matrix transpose SIMD instruction (gf2p8affine, which does a bit more, but I care about the transpose). Combined with SIMD byte permutes it allows you to do things like: arbitrarily permute bits in SIMD elements, find the inverse of a permutation, and do very fast histogramming/binning.
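For reference, what the per-64-bit-lane transpose computes can be written in plain C with the classic delta-swap routine (Hacker's Delight style); this is a scalar sketch of the operation, not the gf2p8affine encoding itself:

```c
#include <stdint.h>

/* 8x8 bit-matrix transpose of a 64-bit word via three delta swaps:
   byte r holds row r, bit c within a byte is column c. */
uint64_t transpose8x8(uint64_t x) {
    uint64_t t;
    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAull; x ^= t ^ (t << 7);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCull; x ^= t ^ (t << 14);
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ull; x ^= t ^ (t << 28);
    return x;
}
```

E.g. an all-ones row 0 (0xFF) transposes to bit 0 set in every byte, i.e. an all-ones column 0.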


Thanks, I've been doing dumb `sudo sh -c ...` stuff before.


The lines

    __m512i vx  = _mm512_set1_epi64(static_cast<long long>(x));
    __m512i vk  = _mm512_load_si512(reinterpret_cast<const __m512i*>(base));
    __mmask8 m  = _mm512_cmp_epu64_mask(vx, vk, _MM_CMPINT_GE);
    return static_cast<std::uint32_t>(__builtin_popcount(m));
would be replaced with:

    return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, FANOUT), x, FANOUT), FANOUT);
and you set FANOUT to __riscv_vsetvlmax_e64m1() at runtime.

Alternatively, if you don't want a dynamic FANOUT you keep the FANOUT=8 (or another constant) and do a stripmining loop

    size_t cnt = 0;
    for (size_t vl, n = 8; n > 0; n -= vl, base += vl) {
     vl = __riscv_vsetvl_e64m1(n);
     cnt += __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, vl), x, vl), vl);
    }
    return cnt;
This will take about FANOUT/(VLEN/64) iterations and the branches will be almost perfectly predicted.
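For clarity, both versions compute the same thing as this scalar reference (counting keys that compare >= x, as the vmsgeu above does; FANOUT fixed at 8 here):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the strip-mined loop: count how many of the
   8 keys at `base` are >= x. */
size_t count_ge(const uint64_t *base, uint64_t x) {
    size_t cnt = 0;
    for (size_t i = 0; i < 8; i++)
        cnt += base[i] >= x;
    return cnt;
}
```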

If you know FANOUT is always 8 and you'll never want to change it, you could alternatively select the optimal LMUL:

    size_t vl = __riscv_vsetvlmax_e64m1();
    if (vl == 2) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m4(base, 8), x, 8), 8);
    if (vl == 4) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m2(base, 8), x, 8), 8);
    return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, 8), x, 8), 8);


> good luck parsing through 100 different "performance optimization manuals" from 100 different companies

This would be a problem for any ISA with multiple/many vendors.


Idk, it seems to me like the Rivos people are still doing their RISC-V CPU work.


Sadly still on quite old hardware, with no RVV. Hopefully Scaleway will have some newer servers in the future and this can be simply updated to the new devices.


You can get RVV instances from Scaleway.


Oh, cool, I didn't see them on the website. (https://labs.scaleway.com/en/em-rv1/)


K&R syntax is -1 char, if you are in C:

    double solve(double a,double b,double c,double d){return a+b+c+d;}
    double solve(double a...){return a+1[&a]+2[&a]+3[&a];}
    double solve(a,b,c,d)double a,c,b,d;{return a+b+c+d;}


> For example, should we use vrgather (with what LMUL), or interesting workarounds such as widening+slide1, to implement a basic operation such as interleaving two vectors?

Use Zvzip; in the meantime:

zip: vwmaccu.vx(vwaddu.vv(a, b), -1, b), or segmented load/store when you are touching memory anyways

unzip: vnsrl

trn1/trn2: masked vslide1up/vslide1down with even/odd mask

The only thing base RVV does badly in those is register-to-register zip, which takes twice as many instructions as on other ISAs. Zvzip gives you dedicated instructions for all of the above.
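The vwmaccu zip trick works because, per element pair (taking EEW=32 as an example), the widening add followed by a multiply-accumulate with -1 reconstructs a | (b << 32); a scalar sketch:

```c
#include <stdint.h>

/* Why vwmaccu.vx(vwaddu.vv(a, b), -1, b) zips, per 32-bit element:
   (a + b) + (2^32 - 1)*b == a + (b << 32)  (mod 2^64),
   so the low half of each widened element is a[i] and the high
   half is b[i], i.e. an interleave of the two vectors. */
uint64_t zip_elem(uint32_t a, uint32_t b) {
    uint64_t w = (uint64_t)a + b;         /* vwaddu.vv: widening add      */
    w += 0xFFFFFFFFull * (uint64_t)b;     /* vwmaccu.vx with scalar -1    */
    return w;                             /* == a | ((uint64_t)b << 32)   */
}
```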


Looks like the ratification plan for Zvzip is November. So maybe 3 years until HW is actually usable? That's a neat trick with vwmaccu, congrats. But still, half the speed for quite a fundamental operation that has been heavily used in other ISAs for 20+ years :(

Great that you did a gap analysis [1]. I'm curious if one of the inputs for that was the list of Highway ops [2]?

[1]: https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3... [2]: https://github.com/google/highway/blob/master/g3doc/quick_re...

