It's been a while since I've done game engine work, but is this impressive? The first thing that comes to mind is that they're using instanced rendering. This allows the CPU to deal with only one sprite, while telling the GPU to render multiple instances of it, using a GPU buffer to look up each sprite's transformation matrix. All the CPU has to do is update that mmap'ed buffer with new position information (or do something more clever to derive transformations).
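To make the CPU-side half of that concrete, here's a minimal C sketch (names and the 3x3 matrix layout are my own choices, not anything from the video): the CPU just refills one tightly packed array of per-instance transforms each frame, which would then be written into the GPU buffer (e.g. via a mapped pointer) that the instanced draw call reads.

```c
#include <assert.h>
#include <string.h>

typedef struct { float x, y; } Vec2;

// Build a column-major 3x3 translation matrix for each sprite instance.
// 'out' must hold count * 9 floats; this is the buffer you'd copy (or
// write directly, if mapped) into the GPU-side instance buffer.
static void build_instance_transforms(const Vec2 *positions, int count,
                                      float *out) {
    for (int i = 0; i < count; i++) {
        float *m = out + i * 9;
        memset(m, 0, 9 * sizeof(float));
        m[0] = 1.0f; m[4] = 1.0f; m[8] = 1.0f; // identity diagonal
        m[6] = positions[i].x;                 // translation column
        m[7] = positions[i].y;
    }
}
```

The GPU then renders every instance from the same 6-vertex quad, indexing into this buffer by instance ID.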
Am I missing something that makes the video novel?
you're not missing anything, it's not impressive. i was just checking how fast computers are and sharing the results. my original title was "an optimized 2d game engine can render 200k sprites at 200fps" but the mods changed it to match my youtube title (which made it a lot more popular). and the fact it's written in jai isn't relevant, it's just what i happened to use
I figured it wasn't, given that you were showcasing a GL project. But it's nonetheless disappointing, as someone curious about whether the language helped in indirect ways with how you structured your project, and whether you feel you could scale it up to something closer to production-ready. That did seem to be the goal of Jai when I last looked into its development some 4 years ago.
jai's irrelevant to the performance here, but it's very relevant to how easy this was to make. i'm not a systems programmer. i've tried writing hardware accelerated things like this in C++ but have failed to get anything to compile for years. the only reason i was able to get this working is because of jai. this is my first time successfully using openGL directly, outside of someone else's game engine
Nope, I don't think so. The point I see in it relates to a thought I have that games should run on Intel graphics chipsets. Diablo 3 doesn't, for example. I wonder if D2 Resurrected does...
You really can achieve amazing stuff with just plain e.g. OpenGL optimized for your rendering needs.
With today's GPU acceleration capabilities we could have town-building games with huge map resolutions and millions of entities. Instead it's mostly only used to make fancy graphics.
Actually I am currently trying to build something like that [1]. A big big world with hundreds of millions of sprites is achievable and runs smoothly; video RAM is the limit. Admittedly it is not optimized to display those hundreds of millions of sprites all at once, maybe just a few million. Would be a bit too chaotic for a game anyway I guess.
I recently took it upon myself to see just how far I can push modern hardware with some very tight constraints. I've been playing around with a 100% custom 3D rasterizer which operates purely on the CPU. For reasonable scenes (<10k triangles) and resolutions (720~1080p), I have been able to push over 30fps with a single thread. On a 5950X, I was able to support over 10 clients simultaneously without any issues. The GPU in my workstation is just moving the final content to the display device via whatever means necessary. The machine generating the frames doesn't even need a graphics device installed at all...
To be clear, this is exceptionally primitive graphics capability, but there are many styles of interactive experience that do not demand 4k textures, global illumination, etc. I am also not fully extracting the capabilities of my CPU. There are many optimizations (e.g. SIMD) that could be applied to get even more uplift.
One fun thing I discovered is just how low latency a pure CPU rasterizer can be compared to a full CPU-GPU pipeline. I have CPU-only user-interactive experiences that can go from input event to final output frame in under 2 milliseconds. I don't think even games like Overwatch can react to user input that quickly.
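For anyone curious what the core of a software rasterizer like this looks like (the commenter's project is C#; this is just an illustrative C sketch of the classic half-space / edge-function inner loop, not their code):

```c
#include <assert.h>
#include <string.h>

// Edge function: >= 0 when point p is on or to the left of edge a->b,
// assuming counter-clockwise winding.
static int edge(int ax, int ay, int bx, int by, int px, int py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

// Fill 'buf' (w*h, row-major) with 1 for every integer point inside the
// CCW triangle (x0,y0)-(x1,y1)-(x2,y2), boundary included.
// Returns the number of pixels covered.
static int rasterize(unsigned char *buf, int w, int h,
                     int x0, int y0, int x1, int y1, int x2, int y2) {
    int count = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (edge(x0, y0, x1, y1, x, y) >= 0 &&
                edge(x1, y1, x2, y2, x, y) >= 0 &&
                edge(x2, y2, x0, y0, x, y) >= 0) {
                buf[y * w + x] = 1;
                count++;
            }
    return count;
}
```

A real renderer would interpolate depth and attributes from the same three edge values and only scan the triangle's bounding box, but the structure is the same.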
Just to be clear - you're writing a "software-based" 3D renderer, right? This is the sort of thing I excelled at back in the late 80s, early 90s, before the first 3D accelerators turned up around 1995 I think.
What features does your renderer support in terms of shading and texturing? Are you writing this all in a high-level language, e.g. C, or assembler? If assembler, what CPUs and features are you targeting?
> you're writing a "software-based" 3D renderer, right?
Yes. This is 100% what you are familiar with.
> What features does your renderer support in terms of shading and texturing?
I have a software-defined pixel shading approach that allows for some degree of flexibility throughout. Each object in the scene currently defines a function that describes how to shade its final pixels based on a few parameters.
> Are you writing this all in a high-level language, e.g. C, or assembler?
I am writing this in C#/.NET 6. I do have unsafe turned on for pointer access over low-level bitmap operations, but otherwise it's all fully-managed runtime.
> And of course, why?
Because I want to see if I can actually build an effective gaming experience without a GPU in 2022. Secondary objective is simply to learn some new stuff that isn't boring banking CRUD apps.
That's awesome. I think the advantage of a software renderer is that you can adapt your inner loops to do things that a GPU can't do. You can create some new form of polygon-fill that isn't supported by Direct3D or OpenGL etc.
Plus, of course it will run on anything.
I hope you'll be willing to open the code at some point...
Unrelated but wrt. modern rendering versus 90s rendering I'd imagine that a lot of the performance shims used in the 90s might not apply because the critical problem is different.
Performance-based development these days isn't so much about maximizing usage of the machine's cycles (I mean, ok, fundamentally it's still about that, but-), rather it's about getting the microcode to do the right thing. E.g. LUTs being extremely bad for cache performance. Branch prediction being a much more important predictor of performance than anything else. Huge amounts of RAM make a lot of old tips around RAM usage invalid. SIMD / vector operations and threading are a boon but require a very different way of working.
Even if your mental model is as simple as "CPU processing + L1 cache is infinitely fast, having to fetch data from anywhere else is dog slow" you'll be able to optimize code pretty well given the characteristics of modern processors.
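A classic demo of that mental model (my own illustrative C sketch): both loops below do the same arithmetic, but one walks memory sequentially while the other strides by a whole row. Time them with a large N and the cache-friendly version wins by a wide margin on typical hardware.

```c
#include <assert.h>
#include <stdlib.h>

enum { N = 512 };

// Row order: contiguous accesses, cache lines fully used, prefetcher happy.
static long sum_rows(const int *a) {
    long s = 0;
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            s += a[y * N + x];
    return s;
}

// Column order: stride-N accesses, a cache miss on nearly every load once
// the array no longer fits in cache.
static long sum_cols(const int *a) {
    long s = 0;
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            s += a[y * N + x];
    return s;
}
```

Same result, very different speed: that gap is entirely "fetching data from anywhere but cache is dog slow".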
If modern high performance code relies on making the microcode do "the right thing", and making sure the right data is in cache then why don't CPU manufacturers allow control over such things?
I think it can reduce input delay enough to change streaming gaming economics, but the current state of cloud economy makes it difficult to scale in practice.
i'm just starting learning directx and noticed it can render a triangle at 12,000 fps! i had no clue this was possible. i don't think there's any room for input delay there, but i'll find out
Did you consider using an existing software rasterizer, like Mesa llvmpipe? Or part of the challenge was writing one yourself (nothing wrong with that)?
The upper rendering limit generally isn't explored deeply by games because as soon as you add simulation behaviors, it imposes new bottlenecks. And the design space of "large scale" is often restricted by what is necessary to implement it; many of Minecraft's bugs, for example, are edge cases of streaming in the world data in chunks.
Thus games that ship to a schedule are hugely incentivized to favor making smaller play spaces with more authored detail, since that controls all the outcomes and reduces the technical dependencies of how scenes are authored.
There is a more philosophical reason to go in that direction too: Simulation building is essentially the art of building Plato's cave, and spending all your time on making the cave very large and the puppets extremely elaborate is a rather dubious idea.
Is this not done because of technical limitations, or is it just not done because a town building game with millions of entities would not be fun/manageable for the player?
Although, there's a few space 4x games that try this "everything is simulated" kind of approach and succeed. Allowing AI control of everything the player doesn't want to manage themselves is one nice way of dealing with it. See: https://store.steampowered.com/app/261470/Distant_Worlds_Uni...
Neither of those numbers are particularly huge for modern GPUs.
I'd wager that a compute shader + mesh shader based version of this could hit 2M sprites at 200 fps, though at some point we'd have to argue about what counts as "cheating" - if I do a clustered occlusion query that results in my pipeline discarding an invisible batch of 128 sprites, does that still count as "rendering" them?
I've been able to reach 5M particles at 60 fps on a very naive (as a GPU noob) implementation that uses Qt's RHI, which has some unnecessary copying and safeties, with compute + vertex + fragment.
writing 100% of the code on the gpu can render 10,000,000 triangles per frame at 60fps ... even in the web browser! (because there's no javascript running) https://www.youtube.com/watch?v=UNX4PR92BpI
but yes, that's cheating, since it's impractical to work with
Using goroutines, I also made 10k 2D rabbits wander on a map for 5% of my laptop CPU (they'd sleep a lot, admittedly). One goroutine per rabbit; how amazing when you think about it. That's when Go really got me.
edit: oh they do rabbits in the video as well what a bunny coincidence
edit2: the goroutines weren't drawcalling btw, they were just moving the rabbits. The draw calls were still made using a regular for loop, in case you were wondering.
it's a lot of fun! jai is my intro to systems programming.
so i haven't tried this in C++ (actually i have tried a few times over the past few years, but never successfully).
this is just a test of opengl. C++ should give the same exact performance, considering my cpu usage is only 7% while gpu usage is 80%.
but the process of writing it is infinitely better than C++, since i never got C++ to compile a hardware accelerated bunnymark.
That only applies if you are a known name (probably being known among his fans works too), or have somebody in his circle vouch for you.
Regular people don't get in.
over a year ago. I explained that I worked on game engines in college and they were terrible and overengineered and wildly inefficient and I wanted to do things better going forward.
Neat. Isn't this akin to 400k triangles on a GPU? So as long as you do instancing it doesn't seem too difficult (performance-wise) in itself. Even if there are many sprites, texture mapping should handle getting the pixels to the screen.
My guess is that the rendering is not the hardest part, although it's kinda cool.
Rendering only one large triangle can be faster than two. First, one triangle needs less memory, less vertex processing, etc.
Second, modern GPUs render pixels in groups of 2x2 up to 8x8 "tiles". If only one pixel from a group is part of a triangle, the entire group will be rendered. When two triangles form a quad, the entire area along the diagonal "seam" will be rendered twice. The smaller your quads, the more overhead.
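You can model the seam overhead with a few lines of C (a toy model I wrote for illustration, using a simplified boundary-inclusive coverage rule so the shared diagonal is charged to both triangles): count every 2x2 block a triangle touches as 4 shader invocations, then compare the total for a square split into two triangles against its actual pixel count.

```c
#include <assert.h>

static int edge(int ax, int ay, int bx, int by, int px, int py) {
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

static int inside(int x0, int y0, int x1, int y1, int x2, int y2,
                  int px, int py) {
    return edge(x0, y0, x1, y1, px, py) >= 0 &&
           edge(x1, y1, x2, y2, px, py) >= 0 &&
           edge(x2, y2, x0, y0, px, py) >= 0;
}

// Count pixel shader invocations for one CCW triangle on a w*h target:
// any 2x2 block the triangle touches launches all 4 of its threads.
static int quad_invocations(int w, int h,
                            int x0, int y0, int x1, int y1, int x2, int y2) {
    int n = 0;
    for (int by = 0; by < h; by += 2)
        for (int bx = 0; bx < w; bx += 2) {
            int touched = 0;
            for (int dy = 0; dy < 2; dy++)
                for (int dx = 0; dx < 2; dx++)
                    touched |= inside(x0, y0, x1, y1, x2, y2,
                                      bx + dx, by + dy);
            if (touched) n += 4;
        }
    return n;
}
```

For a 16x16 quad split into two triangles, the two halves together launch noticeably more invocations than the 256 pixels actually covered, which is exactly the seam overhead being described.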
I disagree, with the exception of the case you link to where half the pixels are outside the viewport or maybe where a sufficient percentage are outside the viewport.
> When two triangles form a quad, the entire area along the diagonal "seam" will be rendered twice
This may be true, but I'm pretty sure that this is more than made up for by the additional pixels in the single triangle circumscribing the quad. In fact, I'm willing to bet that it's a mathematical certainty for any rectangle, although I didn't do enough of the math to prove it.
Instead, I would say that most rendering, especially of hundreds of thousands of 2D shapes, is going to be pixel-limited. So trading pixels for vertices is a poor trade.
It depends on the size of the sprites in this case. Small sprites will benefit from being drawn as single triangles.
These "shadow" pixel shader invocations are a very real pain when it comes to rendering highly detailed models. The hardware rasterization pipeline can't cope well with huge amounts of really tiny triangles. That's the reason why UE5 Nanite uses a software GPU rasterizer for the high geometry density sections of a model - it's faster! Large area primitives will be rendered normally AFAIK.
Pretty sure overdraw / fill rate bottlenecks before vertex processing. Also, you could draw that quad using strips, which would then amount to only one more processed vertex compared to a triangle.
Edit: okay, surely with modern architectures there is no pixel write because of some early alpha cut, but you still have to fetch the texture to make it so, meaning the texture fetch (memory) will bottleneck first. I guess.
You shouldn't use strips, they're slower than triangle lists on most GPUs.
If by alpha cut you mean "discard", that's going to be much slower than two triangles. Two triangles will have a tiny bit of quad overshading on the seam, compared to a full extra triangle's worth in the alpha cut case.
Yeah, discard used to be slow because it flushes pipelines or messes with branch prediction, I don't remember which. I just assumed they'd "fixed" that by now.
No, it's not either of those, it's just launching useless threads, plus all the down-stream effects of launching useless threads, e.g. if you have blending on, that will block the ROP unit which needs to wait for the threads for a given pixel in-order. If you have depth write on, that will move the write to late-Z.
More vertices is not a big problem, doubling your vertex count is not a big deal, since most GPUs process vertices in groups of 32 or more, and whether multiple instances get packed in the same group depends on the GPU vendor.
Oh, let me clear that up for you. The trick discussed here is that you can draw a sprite (a quad) using one large triangle. The sprite is just inside it, but the triangle has quite a bit of "wasted" surface.
Right, so if you're rendering a sphere, it'd make sense to use lots of triangles to get a smoother surface and less overdraw in order to render more quickly, but that doesn't seem to be the case in reality as people use lower-poly meshes (with more overdraw) if they want it to be quicker.
Well there is a specific tradeoff involved in this particular case, i.e. one more processed vertex versus something like +100% added surface, if my triangle maths are still what they were.
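The "+100% added surface" checks out, and it's quick to verify (a sketch I wrote to check the arithmetic, not anyone's actual renderer): the smallest right triangle circumscribing a w*h rectangle has legs 2w and 2h, so its area is (2w * 2h) / 2 = 2wh, exactly double the rectangle, leaving wh wasted pixels outside the sprite.

```c
#include <assert.h>

// Extra shaded pixels when a w*h sprite is drawn as one big circumscribing
// right triangle (legs 2w and 2h) instead of as the exact rectangle.
static int extra_pixels_single_tri(int w, int h) {
    int tri_area = (2 * w) * (2 * h) / 2; // area of the big triangle: 2wh
    return tri_area - w * h;              // waste outside the sprite: wh
}
```

Compare that wh of waste against the seam of a two-triangle quad, which only costs redundant 2x2-quad work along the diagonal, and the single-triangle trick mostly makes sense for small sprites.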
I don't think that at 200k or 400k level will matter much. Math is probably easier on humans if you think about the sprites as rectangular (so two triangles), but you could in principle make each sprite a triangle, and texture map in a shader a rectangular area of the triangle.
Bit of a tangent and a useless thought experiment, but I think you could render an infinite number of such bunnies, or as many as you can fit in RAM/simulate. On the CPU, for each frame, iterate over all bunnies: do your simulation for that bunny, and at the pixel corresponding to its position, store its information in a texture, if it is positioned over the bunny currently stored there (just its logical position, don't put it in all the pixels of its texture!). Then on the GPU have a pixel shader look up (in surrounding pixels) the topmost bunny for the current pixel and draw it (or just draw all the overlaps using the z-buffer). For your source texture, use 0 for no bunny, and other values to indicate the bunny's z-position.
The CPU work would be O(n) and the rendering/GPU work O(m*k), where n is the number of bunnies, m is the display resolution and k is the size of our bunny sprite.
The advantage of this (in real applications utterly useless[1]) method is that CPU work only increases linearly with the number of bunnies, you get to discard bunnies you don't care about really early in the process, and GPU work is constant regardless of how many bunnies you add.
It's conceptually similar to rendering voxels, except you're not tracing rays deep, but instead sweeping wide.
As long as your GPU is fine with sampling that many surrounding pixels, you're exploiting the capabilities of both your CPU and GPU quite well. Also the CPU work can be parallelized: Each thread operates on a subset of the bunnies and on its own texture, and only in the final step the textures are combined into one (which can also be done in parallel!). I wouldn't be surprised if modern CPUs could handle millions of bunnies while modern GPUs would just shrug as long as the sprite is small.
[1] In reality you don't have sprites at constant sizes and also this method can't properly deal with transparency of any kind. The size of your sprites will be directly limited by how many surrounding pixels your shader looks up during rendering, even if you add support for multiple sprites/sprite sizes using other channels on your textures.
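The CPU pass of this thought experiment is short enough to sketch (my own illustrative C, with made-up types; the GPU lookup pass is omitted): each bunny writes only its z value into the single texel under its logical position, keeping the topmost bunny per texel, so the cost is O(n) in bunnies regardless of sprite size.

```c
#include <assert.h>
#include <string.h>

typedef struct { int x, y; unsigned char z; } Bunny;

// One frame of the CPU pass: 0 means "no bunny here", larger z means
// closer to the camera. Each bunny touches exactly one texel, so the
// work is linear in the number of bunnies.
static void build_bunny_texture(const Bunny *b, int n,
                                unsigned char *tex, int w, int h) {
    memset(tex, 0, (size_t)w * h);        // clear: every texel starts empty
    for (int i = 0; i < n; i++) {
        unsigned char *t = &tex[b[i].y * w + b[i].x];
        if (b[i].z > *t)                  // keep only the topmost bunny
            *t = b[i].z;
    }
}
```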
Does this work with large semi-transparent objects? (My 10-year-old experience with 2D game engines was that 10k objects wasn't really a problem, unless you were trying to make clouds or fog from ~200x100px sized, half-transparent images. Have a 100 of those, and you'd run at 5 FPS.)
You can do it in SO many ways! You can have one vertex buffer or double buffer it, or you can run the entire simulation on the GPU too. In general, uploading data to the GPU can be the slowest part. OpenGL and more modern graphics APIs have evolved in the direction of minimizing the communication between CPU and GPU since it is almost always a big bottleneck. Modern GPUs are designed to manage themselves with work queues, local data and sometimes even local storage to avoid the need to interact with the CPU.
you can write 100% of the code on the gpu. but that's impractical to work with. i did that here to see how fast webgl can go, since javascript is so slow https://www.youtube.com/watch?v=UNX4PR92BpI
for this bunnymark i have 1 VBO containing my 200k bunnies array (just positions). and 1 VBO containing just the 6 verts required to render a quad. turns out the VAO can just read from both of them like that. the processing is all on the CPU and just overwrites the bunnies VBO each frame
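A sketch of that per-frame CPU work in C (the actual project is in Jai; the struct, velocities and bounds here are made up for illustration): one array mirroring the per-instance positions VBO gets updated in place, then re-uploaded, e.g. with glBufferSubData. The 6-vertex quad VBO never changes; glVertexAttribDivisor(attr, 1) is what makes the VAO advance the position attribute once per instance instead of once per vertex.

```c
#include <assert.h>

typedef struct { float x, y, vx, vy; } Sprite;

// Classic bunnymark step: move every sprite by its velocity and bounce
// off the [0,w] x [0,h] walls. After this runs, the whole array would be
// copied into the per-instance VBO for the next draw.
static void step_sprites(Sprite *s, int n, float w, float h) {
    for (int i = 0; i < n; i++) {
        s[i].x += s[i].vx;
        s[i].y += s[i].vy;
        if (s[i].x < 0 || s[i].x > w) s[i].vx = -s[i].vx;
        if (s[i].y < 0 || s[i].y > h) s[i].vy = -s[i].vy;
    }
}
```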
How much time is spent in Jai? How much time is spent presenting the graphics? Unfortunately, graphics benchmarks like this are hard because they don't tell us much. You have to profile these two parts separately.
Anecdote: In Unity, using DrawMeshInstancedIndirect, you can get >100k sprites _in motion_ and still maintain >100 FPS.
Using some slight shader/buffer trickery, and depending on what you're trying to do (as is always the case with games & rendering at this scale), you can easily get multiples of that -- and still stay >100FPS.
I agree, more of this approach is great. And I am totally flabbergasted at how abysmally poor the performance is with SpriteRenderer, Unity's built-in sprite rendering technique.
That said, it's doable to get relatively high-performance with existing engines -- and the benefits they come with -- even if you can definitely, easily even, do better by "going direct".