I had trouble connecting the music to the transitions and morphs. I perceived it just as music overlaid on fun-to-see GAN-generated images, though clearly that’s not the aim. The music is beautiful and the imagery is intriguing, however.
The glasses always seem to be present when the bass/808s are hitting, so is there something that maps the sound to the images?
What is it about the algorithms that makes the images 'dance' so quickly between the 3.5 beat and the 1? Is it because there are static risers that move so quickly through the wave spectrum?
Wait... is light skin mapped to when highs dominate and dark skin to lows?
I'm glad you like it! Actually, compared to the linked post, I don't do any manual latent-space representation selection; it's just a bit of "smart" signal processing. I've written a framework that makes it really simple to do these visualizations (not open-source yet). Here's one more example: https://www.youtube.com/watch?v=X4r4njUjE2M
It could just be my brain, but it seems like there is a loose correlation between the mouths in the video and the lyrics.
In Phantom Part II they mostly have their mouths closed. In La La Land it varies but the mouths are mostly open. If you focus on the mouth you'll get little mental radar blips where the mouth could be tracking what is being said.
You could follow me on twitter @tsmcalister. I'll post there once it's released. Depends a lot on how much time I have to work on it. Hopefully by the end of the month!
For some reason this made me yearn for a GAN that generates motorcycles riding through landscapes. Love the storytelling potential of your work, great stuff!
Yeah, when I read the title I was hoping the GAN visualizations would reveal the underlying structure of the music somehow, or that the images would have a more compelling link to the music. EDIT: just realized it might be much more interesting to train GANs on movie soundtracks, thus forming a link between music and image
You may notice that the visual changes occur at the same frequency as the music's tempo. But you're right, there is no influence on the content itself.
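For illustration, here is a minimal sketch (not the author's unreleased framework) of one way to lock latent-space transitions to the tempo so the imagery appears to pulse on the beat. The latent vectors, the easing exponent, and the ping-pong scheme are all made-up stand-ins:

```python
import numpy as np

def beat_synced_latents(z_a, z_b, bpm, fps, seconds, ease=3.0):
    """Interpolate between two latent vectors so each transition
    completes exactly on a beat. ease > 1 concentrates motion near
    the beat, giving a pulsing, 'dancing' feel."""
    beat_period = 60.0 / bpm                    # seconds per beat
    t = np.arange(int(seconds * fps)) / fps     # frame timestamps
    phase = (t % beat_period) / beat_period     # 0..1 within each beat
    eased = phase ** ease                       # most movement lands on the beat
    beat_idx = (t // beat_period).astype(int)   # which beat we're on
    frames = []
    for p, b in zip(eased, beat_idx):
        # Ping-pong between the two latents on alternating beats.
        start, end = (z_a, z_b) if b % 2 == 0 else (z_b, z_a)
        frames.append((1 - p) * start + p * end)
    return np.stack(frames)

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=512), rng.normal(size=512)
frames = beat_synced_latents(z_a, z_b, bpm=120, fps=30, seconds=2)
```

Each frame's latent would then be fed to the generator; at 120 bpm every half-second lands exactly on one of the two endpoints.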
GANs are an interesting frontier. Videos are much more engaging than photos. The next logical step is to make a real-time GAN, like a videogame you can walk around in.
Imagine using a Vive to explore a GAN interactively. You'd be able to control the GAN using vive controllers and by walking around your room.
Right now it takes 163ms to render a 1024x1024 frame on a K80 GPU. That's 6 FPS, which is within an order of magnitude of 60FPS.
I haven't timed a 256x256 GAN, but presumably it would be 16x faster to generate. If so, then you'd be able to achieve 98FPS.
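As a quick sanity check, the arithmetic behind those figures looks like this. Note the 16x factor assumes runtime scales linearly with output pixel count, which is only a rough approximation for a real GAN:

```python
# Frame-time arithmetic from the numbers quoted above.
frame_ms_1024 = 163.0                 # measured: one 1024x1024 frame on a K80
fps_1024 = 1000.0 / frame_ms_1024     # milliseconds per second / ms per frame

# A 256x256 frame has 16x fewer pixels; IF generation time scaled
# linearly with pixel count (a big assumption), the estimate would be:
pixel_ratio = (1024 * 1024) / (256 * 256)   # = 16
fps_256_estimate = fps_1024 * pixel_ratio

print(round(fps_1024, 1), round(fps_256_estimate, 1))
```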
Someone should train a 256x256 FFHQ and make a 90FPS interactive renderer for it.
Unfortunately it's not possible to take a large GAN like 1024x1024 FFHQ and only generate a 256x256 image. Each GAN is trained for a specific size, so you're stuck with 6 FPS at 1024x1024. I wish the FFHQ authors had saved a 256x256 checkpoint during training.
Training a 256x256 GAN from scratch costs somewhere in the range of $150 GCE credits. But you might be able to bootstrap a 256x256 FFHQ using the weights from the 1024x1024 FFHQ (aka transfer learning). That might train a lot faster.
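A generic sketch of what that warm start could look like: copy tensors from the big checkpoint wherever names and shapes match, and leave the rest at their fresh initialization. The checkpoint layout and layer names below are invented for illustration; real StyleGAN checkpoints are structured differently.

```python
import numpy as np

def warm_start(small_params, big_params):
    """Copy every tensor from a pretrained (larger) checkpoint into a
    smaller model wherever both the name and the shape match, leaving
    everything else untouched. Returns the names that were copied."""
    copied = []
    for name, value in small_params.items():
        if name in big_params and big_params[name].shape == value.shape:
            small_params[name] = big_params[name].copy()
            copied.append(name)
    return copied

# Toy stand-ins: the shared low-resolution layer matches in name and
# shape; the high-resolution layer exists only in the big model.
big = {"g/4x4/conv": np.ones((512, 512)),
       "g/1024x1024/conv": np.ones((16, 32))}
small = {"g/4x4/conv": np.zeros((512, 512)),
         "g/256x256/to_rgb": np.zeros((3, 64))}
copied = warm_start(small, big)
```

The same idea is what `strict=False` partial state-dict loading does in most deep learning frameworks.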
There is also the recent NoGAN technique, which skips progressive growing by pretraining the generator: https://github.com/jantic/DeOldify/#what-is-nogan Supposedly it speeds up GAN training by a huge amount.
This is so incredible and well executed. It's crazy to see how quickly GANs are moving (e.g. check out this tweet [1] by Ian Goodfellow on 4.5 years of GAN progress) ... excited to see what they can do a few years from now!
Is there shareable / repeatable code behind it? It's cool, but without seeing how it was produced, it could just as well be a fancy output of video editing software.
This may be GANs as the author stated, but the end result looks surprisingly similar to a bunch of pixel shaders making transitions between source and target images, with those transitions driven either by pure algorithms and/or derived from blurred versions of the images themselves.
I implemented a music visualizer ages ago using similar concepts (pure algorithms though, no real images). It happened when nVidia released the first affordable consumer video card with decent shader support; I think it was the 6600 GT. My animation part, which made the video dance to the music, was a bit more sophisticated though.
Regarding the music synchronization, OK, this is ancient stuff.
However, in terms of graphics, this strikes me as different from anything that was possible before the recent advances in GANs. During the era you're talking about, the art of shader-based music visualizers was being pushed by projects like Milkdrop 2, and nowadays a lot of similar research still happens on Shadertoy, and the demoscene, of course, hasn't stopped blowing people's minds.
But this is on another level entirely. It's as if the content, and seemingly human concepts themselves, are being smoothly animated.
this strikes me as different from anything that was possible before the recent advances in GANs
Well, that's because you didn't see my vis. It looked just like the one in the linked GAN video, with similar transitions, except that all the imagery was generated by math formulas running in pixel shaders instead of pre-made bitmaps/videos.
Actually, I also played with actual music video clips as the source of the imagery, and the results were really cool, but beyond experimenting at home I couldn't really do anything with that part due to copyright, etc.
I guess one way to make these effects dance to music would be to make a Mel spectrogram of the audio, then somehow use the shapes in the spectrogram to apply deltas to the rendered frames.
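That idea could be sketched roughly as follows. For simplicity this uses log-spaced bands over a plain STFT magnitude as a stand-in for a true mel spectrogram, and the FFT/hop/band sizes are arbitrary choices:

```python
import numpy as np

def band_deltas(samples, sr=44100, n_fft=2048, hop=512, n_bands=8):
    """Frame-to-frame energy changes in log-spaced frequency bands of
    an STFT magnitude. The deltas could drive per-frame distortion of
    the rendered video."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    mags = np.stack([
        np.abs(np.fft.rfft(window * samples[i * hop:i * hop + n_fft]))
        for i in range(n_frames)
    ])                                            # (frames, n_fft//2 + 1)
    # Group FFT bins into log-spaced bands, roughly mel-like.
    edges = np.unique(
        np.geomspace(1, mags.shape[1] - 1, n_bands + 1).astype(int))
    bands = np.stack([mags[:, a:b].mean(axis=1)
                      for a, b in zip(edges[:-1], edges[1:])], axis=1)
    return np.diff(bands, axis=0)                 # (frames - 1, n_bands)

sr = 44100
t = np.arange(sr) / sr                            # one second of audio
tone = np.sin(2 * np.pi * 440 * t)                # a 440 Hz test tone
deltas = band_deltas(tone, sr=sr)
```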
The music animation code worked something like this:
1) Each of my pixel shaders was driven by, let's say, 32 parameters (I don't remember the exact value).
2) The code would generate a first and a second set of said parameters (with random values) and start transitioning (lerp) between the two, with a transition length of about 60 seconds.
3) Upon completion of the transition, the first set would be replaced by the second, and the second by a freshly generated third set, ad infinitum.
4) Steps 2 and 3 allowed for non-stop fluid motion.
5) The lerp value (the degree of transition between sets) for each parameter would be modulated by sound: one FFT band per parameter, also passed through a synth-like attack/decay envelope.
6) Finally, there was a beat-detection part which, upon detecting a beat, would invert the lerp direction.
There were more steps and various tricks to make it more interesting and non-repetitive, but I'm not writing an article here ;)
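The steps above could be sketched roughly like this, in Python rather than shaders, with random numbers standing in for the real FFT bands and beat detector:

```python
import numpy as np

rng = np.random.default_rng(1)
N_PARAMS = 32                      # shader parameters per set (step 1)
FPS = 60
TRANSITION_FRAMES = 60 * FPS       # ~60 s per transition (step 2)

set_a = rng.random(N_PARAMS)       # current parameter set
set_b = rng.random(N_PARAMS)       # target parameter set

def envelope(band_energy, prev, attack=0.5, decay=0.05):
    """Synth-like attack/decay follower for the FFT bands (step 5)."""
    rate = np.where(band_energy > prev, attack, decay)
    return prev + rate * (band_energy - prev)

env = np.zeros(N_PARAMS)
direction = 1.0                    # flipped on each detected beat (step 6)
frames = []
for frame in range(3 * TRANSITION_FRAMES):
    # Stand-ins for real audio analysis: one FFT band per parameter,
    # plus a naive once-per-second beat flag.
    band_energy = rng.random(N_PARAMS)
    if frame % FPS == 0:
        direction = -direction
    env = envelope(band_energy, env)
    base_t = (frame % TRANSITION_FRAMES) / TRANSITION_FRAMES
    # Per-parameter lerp value: base progress modulated by sound.
    t = np.clip(base_t + direction * 0.2 * env, 0.0, 1.0)
    frames.append((1 - t) * set_a + t * set_b)
    # Step 3: on completion, roll the sets forward, ad infinitum.
    if frame % TRANSITION_FRAMES == TRANSITION_FRAMES - 1:
        set_a, set_b = set_b, rng.random(N_PARAMS)
```

Each element of `frames` would then be handed to a pixel shader as its parameter vector for that frame.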
The end result was quite artistic. The visualizer was part of a much bigger enterprise-grade media playback / management / delivery / scheduling platform I developed for the hospitality industry.