Computing 10,000x more efficiently (2010) [pdf]

jules · on Sept 11, 2015

Unfortunately this solves the wrong problem. The bottleneck isn't arithmetic, it's data movement. The number of transistors doing arithmetic is already a very tiny fraction of a modern chip. Reducing that tiny fraction to an even tinier fraction by making arithmetic inaccurate isn't a good trade-off.

_delirium · on Sept 11, 2015

They seem to be focusing on specific applications where arithmetic might be (or so they hypothesize) a bottleneck. His LinkedIn says, "we recently showed a 6400x improvement in speed/power ratio tracking objects in video for the U.S. Navy".

alephnil · on Sept 12, 2015

What have happened the last 20 years is that the total amount of arithmetic that can be done has increased tremendously, while communication has not to the same extent. That means that more problems than before is communication bound (even inside a processor) than before. Since problems range from extremly IO bound to entirely CPU-bound, more problems are bound by communication than before.

A friend of mine that worked with supercomputers back in the 90s used to say that a supercomputer was a device that converted a computational problem to a communication problem. Today this is true for mainstream computing as well. There are still problems that can take advantage of the massive amount of computing available (Some mentioned Convolutional Neural Networks), but many tasks that used to be CPU bound is now effectively IO bound.

modeless · on Sept 11, 2015

There's a new application area where arithmetic is the bottleneck, and accuracy is less important: Convolutional neural networks. I would love to see a convnet chip built using these techniques.

varelse · on Sept 11, 2015

There's a clear and massive movement of data forward for prediction in a CNN, and all that plus saving intermediate state and then moving the same amount of data backwards when performing backpropagation.

But don't believe me, don't believe a word I type. Go prove the above wrong by building a killer AlexNet implementation out of this chip. See slide 9 of this presentation for why I don't think you can do this (too much data transfer):

http://www.slideshare.net/embeddedvision/tradeoffs-in-implem...

raverbashing · on Sept 12, 2015

Slide 9 makes it clear that, for most layers computation takes more time than data transfers

If you can reduce the size of the computation units you can have one (or several) for each layer, making it possible to hardwire the transfer between layers

anon4 · on Sept 12, 2015

What if we take the arithmetic and move it next to the data?

Yes, I'm still waiting for memristors.

karmakaze · on Sept 22, 2015

Which reminds me of the theoretical/thermodynamic perspective of specifically copying data rather than movement. State reduction. https://en.wikipedia.org/wiki/Reversible_computing

minthd · on Sept 11, 2015

The bottleneck actually is arithmetic. "GPUs have much higher ALU throughput since the GPU chip area is almost entirely ALU"

http://devblogs.nvidia.com/parallelforall/bidmach-machine-le...

Also on the horizon there is 3d chip manufacturing technology(3d-monolithic) ,with extremely large bandwidth between the two different layers of the chip,possibly being gpu + dram.

exascale1 · on Sept 11, 2015

The bottleneck is not arithmetic for a long time, it's data movement. Arithmetic is practically free nowadays. See presentation by Horst Simon (Deputy Director of Lawrence Berkeley National Laboratory) "No exascale for you!" [0]

The energy cost of transferring a single data word to a distance of 5mm on-chip is higher than the cost of a single FLOP (20 pico-Joules/bit). 5mm =~ the distance to L2 cache or another CPU core. The cost of transferring data off-chip (3D chip and/or RAM) is orders-of-magnitude higher, see graph.

[0] http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pd...

foobar2020 · on Sept 11, 2015

The bottleneck is often RAM. This is especially clear when writing performance-oriented code in CUDA, where the amount of cores (threads) per one shared memory controller is in the order of thousands.

cbsmith · on Sept 12, 2015

...but since GPU's already exist, they kind of are the "already at large scale production" solution to the problem. For very little money you can get some pretty insane single precision throughput for SIMD calculations.

What you run in to are problems feeding the beast data fast enough.

Quanticles · on Sept 11, 2015

Our company works on this kind of stuff for those who are interested (http://isosemi.com)

We're seeing more like 10-100x improvements in energy efficiency and performance, not 10000x, unless the comparison point is a full blown CPU/GPU.

nickpsecurity · on Sept 12, 2015

I was recently digging into analog papers to try to figure out how to apply it more general-purpose or at least tap into it for special purpose functions. I'm not hardware trained so much as a systems guy who knows enough to give others tips on what to look into. Here's some links I discovered:

https://www.schneier.com/blog/archives/2015/07/friday_squid_...

Accidentally running into another group using analog selectively for acceleration is pretty neat. The coprocessor was a believable improvement showing analog power. What's your thoughts on the computing with free space and no transistors stuff? Do those other links come off as bogus to a pro or plausible enough to encourage local college students to try something with it? I think there's vast untapped potential in shifting certain functions back to analog and improving the integration of the two. Maybe in general-purpose, too. Almost certainly in INFOSEC w/ analog supporting obfuscation and tamper-detection.

Quanticles · on Sept 12, 2015

There is a lot of interesting research out there on analog computing, analog neural networks, and far-out stuff like transistor-free computing. Going from research project to product on Digikey is a really huge leap for most research though. Designing a chip is very expensive, so the product better be a slam dunk. Most of these analog neural network projects can do some sort of learning with small black and white patterns, which does not approach the accuracy or scale of software neural networks.

What we're working on is an accelerator for the convolutional neural networks that are winning competitions like ILSVRC. Even that by itself is insufficient for a business case, though. You also have to have end application in mind too, and that end application better be power intensive or performance constrained enough that software cannot accomplish what you need it to do. Because, if software is good enough, then why take a risk on a fancy new hardware component?

p1esk · on Sept 13, 2015

Right now, software (GPU based) implementations of neural networks are acceptable because the models are constantly changing. Whatever you build in hardware today will be obsolete in a year (unless your hw is flexible enough, but then it loses a lot of its efficiency, and GPUs will probably catch up with you soon).

However, as we discover more algorithms for general intelligence, we will reach a point where the model can learn on its own - just like a human baby does. That will be the point where we will need size, speed, and power efficiency, rather than flexibility. That will be a good moment to offer a hardware solution, and that's when an analog chip will suddenly become more attractive than a digital one.

Quanticles · on Sept 13, 2015

The products that we are creating are reprogrammable and reconfigurable, just like a GPU or FPGA. Updates are like a firmware update. Our hardware would be no more obsolete over time than a GPU or CPU running in its place, and given the huge improvements over CPU/GPU, it would be many years before CPU/GPU would catch up to any particular product anyway.

They are not able learn on chip - that is a non-starter and not particularly useful anyway. Customers dont want self-driving cars that need to learn how to drive, they want self-driving cars that already know how to drive.

p1esk · on Sept 13, 2015

it would be many years before CPU/GPU would catch up to any particular product anyway

Can you back up your claims with actual performance numbers? I looked at your website, and I don't see any products - do they exist? What is the flops/W for you best CNN implementation? How many ImageNet images can it process per second? What is the accuracy (assuming you can only do 8 bit precision)?

Also, how much does your chip cost?

Quanticles · on Sept 15, 2015

We're in the process of fabricating a prototype and are not publicly releasing detailed estimates at this time.

I can say that the cost depends on what you want to do - systems can range from less than 1mm^2 to the entire reticle depending how much performance you want.

p1esk · on Sept 15, 2015

Wait, you haven't even built a prototype? How can you possibly know if your chip will even work, let alone be better than any existing GPU?

I'm sure you're aware that since Mead's retina chip there have been dozens of attempts to build NN chips, both analog and digital, and very few of them got further than the simulation stage (ETANN or ANNA chips come to mind), and no one managed to produce a commercially successful product.

Nvidia Tegra X1 claims to have 1Tops @10W for 16 bit precision, and the cost is probably under $100. They can probably double that performance if they drop precision to 8 bit. That's what they ship today, and next year they will release the Pascal version, which will undoubtedly will be bigger, faster, and more efficient. What makes you sure you can compete with them?

Quanticles · on Sept 15, 2015

Actually, I can give you a better answer...

An ASIC is always going to be at least 10x better than a CPU/GPU for performing the same algorithm. The question isn't whether or not an ASIC can beat NVIDIA, the question is whether the target market is large enough to support an ASIC company.

At Isocline we assume that this market IS big enough to support an ASIC. Our competition is not NVIDIA, it's the future all-digital ASIC company that can do the same thing, but without all of the whiz-bang technology. If we have to, we could probably fall-back to be that all-digital company, but I'd prefer to maintain our technology advantage.

NVIDIA's advantage is flexibility, there's always going to be a lot of demand for that.

p1esk · on Sept 15, 2015

An ASIC is always going to be at least 10x better than a CPU/GPU for performing the same algorithm.

In theory, this has always been the case. Yet every single neural net ASIC built in the last 25 years has failed in the marketplace, for the same reason - "silicon steamroller". Invariably, when the ASIC was ready to ship, which was almost always much later than was hoped for, general purpose chips have caught up in performance.

I'm not attacking your startup in particular. I'm just pointing out the history behind the field of specialized neural hardware.

p.s. Your competition is Nvidia (or Intel, or Xilinx, etc), because they are well known, big players, who produce reliable products, with huge development infrastructure and expertise. Nvidia specifically has been focusing on deep learning applications, they are already targeting computer vision for cars with their mobile GPUs. If I'm Ford or Toyota, who would I consider for partnership when I need chips potentially making life or death decisions on the road? If your technology really works (big "if", because you haven't built anything yet), then your best hope is one of those big players acquires you.

Quanticles · on Sept 15, 2015

These are all good questions/points

History is something we need to contend with, not just for neural networks, but also for analog computing which has a similarly troubled past.

For NN history, there has not actually been a market for NN accelerators until recently. You can see this because:

1. No NN algorithm was worth accelerating until AlexNet came along in 2012

2. What commercial products even use NN now? Currently it is mostly just voice recognition which is processed server-side.

Right now we are not attempting to go after any markets that a GPU would be sufficient for the reasons you mention; we're sticking to products that can only work with our technology. By the time we went after an overlapping market our credibility would be established and that wouldn't be an issue.

p1esk · on Sept 15, 2015

Yes, the market for NN based products is still in its infancy. It can explode if Apple or Samsung decide to do image or voice processing locally on a smartphone, by using a coprocessor/accelerator chip alongside with CPU/GPU. It could make sense considering the expense (power, time, bandwidth costs) of sending every image off to a datacenter for processing.

I'm curious, have you considered using analog weights (e.g. floating gate transistors, or DRAM capacitors)? This could reduce multiplication from 32 transistors to just one!

Quanticles · on Sept 16, 2015

Analog weights can save a lot of delay/power/cost if you can implement them right, easier said than done

Quanticles · on Sept 15, 2015

I guess you'll have to wait and see

nickpsecurity · on Sept 13, 2015

"They are not able learn on chip - that is a non-starter and not particularly useful anyway. Customers dont want self-driving cars that need to learn how to drive, they want self-driving cars that already know how to drive."

That's a good point to not overlook. Plus, you mentioning this just gave me an idea for a Triad Semiconductor-style, via/metal-programmable, CNN chip tied to a specific FPGA architecture for easy prototyping and conversion. Could be some promise in there. Brain hasn't gotten further than that sentence so don't ask for details haha.

Not clear to me how you will do analog and reprogrammable at the same time unless your reprogramming is building things around the analog components that still perform pretty much the same function(s). I could see the weights, connections, location on chip, etc being configured while connected to analog, signal processing blocks scattered throughout chip kind of like FPGA's do with MAC's. My guess as a non-HW guy with a little research into these things. Am I anywhere close?

Quanticles · on Sept 15, 2015

We make use of non-volatile memories throughout the chip, which stores the configuration and weights

nickpsecurity · on Sept 15, 2015

I figured. I was talking more on how you mix analog and digital parts. Do you reconfigure analog like field programmable analog arrays do? Or do you use the same analog circuits while modifying digital part go just put different things through them?

Trying to see if there's a consensus emerging in how people accelerate w/ mixed-signal chips. Might help academics figure out better place to start on next project.

Quanticles · on Sept 15, 2015

Sorry, I'm going to keep those details secret for now :)

nickpsecurity · on Sept 16, 2015

Yeah, yeah... trade secrets... Hopefully we'll get to see what cool stuff you came up with once it's patent protected. ;)

p1esk · on Sept 13, 2015

They are not able learn on chip - that is a non-starter and not particularly useful anyway. Customers dont want self-driving cars that need to learn how to drive, they want self-driving cars that already know how to drive.

1. Learning does not have to happen inside a car. But it does have to happen somewhere, and that's where the learning in hardware will be much more efficient/faster than learning on a GPU.

2. There are scenarios where local learning would be necessary/preferable to remote learning (e.g. one shot learning or continuous online learning).

Quanticles · on Sept 15, 2015

We do the learning on GPU and efficiency is not a concern because the learning result goes out as a firmware update. Let's say it takes 2 weeks to train a neural network - the customer never sees that, they just get the firmware update. Similarly, we can use a lot of GPUs to train one network because that training result goes out to many chips.

nickpsecurity · on Sept 12, 2015

That sounds like a practical application. Good to see a company using analog for what's mostly an analog architecture (neural). Certain parts are easily modelled with digital circuits. Certain parts could benefit from continuous, simple, parallel processing. An analog domain. I'm sure it's tricky to find the right split and integration scheme esp if you're targeting CNN's like I read about here. Good luck on that as I'm sure it will make similarly interesting reading and potentially a useful product if I need CNN's. :)

"Because, if software is good enough, then why take a risk on a fancy new hardware component?"

Good point. Something that's done in many. Gotta have a clear benefit esp in price/performance/energy. This market has almost as many shut-downs as start-ups.

minthd · on Sept 12, 2015

The 10000x improvement is improvement in power * size. So you might be on the same ballpark.

What abut large scale neural networks - they aren't mentioned in your site . No plans for that ?

Quanticles · on Sept 13, 2015

Target applications like self-driving cars would require deep convolutional neural networks like NVIDIA's Drive PX

msandford · on Sept 11, 2015

I might also ask why only go for a 1% number? It seems like it'd be pretty doable to get a 0.1% approximation as that's only a 30dB SNR versus a 20dB SNR. Maybe I'm super naive but it doesn't seem like it should be tremendously difficult even if it does cut your core count by 20-50%.

Part of the reason I argue for this is that there are tons of sensors which are 0.1% sensors and if you can offer the rest of the computational pipeline at 0.1% then (so long as your errors don't accumulate)_you don't lose any accuracy processing your information this way.

It also seems like this would be pretty great for graphics cards, no? I mean it'd take a lot of work to make OpenGL run on it, but once you did you could have either very inexpensive cards, very powerful cards, or both.

oh_sigh · on Sept 11, 2015

> so long as your errors don't accumulate

I think that may end up being the hardest part of all this.

msandford · on Sept 11, 2015

Totally agree.

sklogic · on Sept 11, 2015

Many GPUs used to have a tiny, fast and very imprecise SFU - it's ok for GLES but useless for anything else.

boxfire · on Sept 11, 2015

Any problem where the data is imprecise or the model itself is very approximate. Machine learning type problems, machine vision, as the paper shows interferometry, lossy image processing. I can think of a few more data processing problems, but I think you might get the point. The use of a precision processor even in single precision when your input data is already 5%+ noise is a huge waste. This has immense application.

sklogic · on Sept 11, 2015

Yes, makes sense... These units are accessible via (undocumented) intrinsics in some OpenCL implementations (or in GLSL), but, unfortunately, there is no portable solution. And the FP precision requirements in the OpenCL standard are way too high, even for the FP16 extension.

asgard1024 · on Sept 12, 2015

I have wondered why don't we use only exponent in floating point numbers as a representation. If the exponent is close to 1 then you can represent mantissa of any size with it. It seems that the article is suggesting something similar.

bra-ket · on Sept 11, 2015

what happened to this project since 2010?

psilence · on Sept 11, 2015

Ongoing, by the look of the founder's LinkedIn:

https://www.linkedin.com/pub/joseph-bates/2/853/aa3

And his company's patents,

http://patents.justia.com/assignee/singular-computing-llc

Ah, I get the name now. "The Singularity Is Near" came out in 2005 https://en.wikipedia.org/wiki/The_Singularity_Is_Near

bjd2385 · on Sept 12, 2015

I read that book recently. It was alright, seemed a bit far-fetched to me, however. I mean, obviously we're moving along at a fast pace, but by the time just _some_ of us (let alone _all_) experience the author's projections we'll probably be in the real-life Star Trek era.

karmakaze · on Sept 12, 2015

Very interesting. Wondering if the Apple ISA will have anything to do with this.