Friday, June 25, 2021

Jim Keller talks about processor design

Dr. Ian Cutress, An AnandTech Interview with Jim Keller: 'The Laziest Person at Tesla',
6.16.21.

I've spoken about Jim Keller many times on AnandTech. In the world of semiconductor design, his name draws attention, simply by the number of large successful projects he has worked on, or led, that have created billions of dollars of revenue for those respective companies. His career spans DEC, AMD, SiByte, Broadcom, PA Semi, Apple, AMD (again), Tesla, Intel, and now he is at Tenstorrent as CTO, developing the next generation of scalable AI hardware. Jim's work ethic has often been described as 'enjoying a challenge', and over the years when I've spoken to him, he always wants to make sure that what he is doing is both that challenge, but also important for who he is working for. More recently that means working on the most exciting semiconductor direction of the day, either high-performance compute, self-driving, or AI.

Note: This interview is intended for an audience with technical expertise in chip design. If, like me, you lack such expertise, you just have to let if flow and be content with a mere flavor for what's going on.

Matrices, graphs, and vectors

IC: I think you said before that going beyond the sort of matrix, you end up with massive graph structures, especially for AI and ML, and the whole point about Tenstorrent, it’s a graph compiler and a graph compute engine, not just a simple matrix multiply.

JK: From old math, and I'm not a mathematician, so mathematicians are going to cringe a little bit, but there was scalar math, like A = B + C x D. When you had a small number of transistors, that's the math you could do. Now we have more transistors you could say ‘I can do a vector of those’, like an equation properly in a step. Then we got more transistors, we could do a matrix multiply. Then as we got more transistors, you wanted to take those big operations and break them up, because if you make your matrix multiplier too big, the power of just getting across the unit is a waste of energy.

So you find you want to build this optimal size block that’s not too small, like a thread in a GPU, but it's not too big, like covering the whole chip with one matrix multiplier. That would be a really dumb idea from a power perspective. So then you get this array of medium size processors, where medium is something like four TOPs. That is still hilarious to me, because I remember when that was a really big number. Once you break that up, now you have to take the big operations and map them to the array of processors and AI looks like a graph of very big operations. It’s still a graph, and then the big operations are factored down into smaller graphs. Now you have to lay that out on a chip with lots of processors, and have the data flow around it.

This is a very different kind of computing than running a vector or a matrix program. So we sometimes call it a scalar vector matrix. Raja used to call it spatial compute, which would probably be a better word.

IC: Alongside the Tensix cores, Tenstorrent is also adding in vector engines into your cores for the next generation? How does that fit in?

JK: Remember the general-purpose CPUs that have vector engines on them – it turns out that when you're running AI programs, there is some general-purpose computing you just want to have. There are also some times in the graph where you want to run a C program on the result of an AI operation, and so having that compute be tightly coupled is nice. [By keeping] it on the same chip, the latency is super low, and the power to get back and forth is reasonable. So yeah, we're working on an interesting roadmap for that. That's a little computer architectural research area, like, what's the right mix with accelerated computing and total purpose computing and how are people using it. Then how do you build it in a way programmers can actually use it? That's the trick, which we're working on. [...]

CPU Instruction Sets: Arm vs x86 vs RISC-V

IC: You’ve spoken about CPU instruction sets in the past, and one of the biggest requests for this interview I got was around your opinion about CPU instruction sets. Specifically questions came in about how we should deal with fundamental limits on them, how we pivot to better ones, and what your skin in the game is in terms of ARM versus x86 versus RISC V. I think at one point, you said most compute happens on a couple of dozen op-codes. Am I remembering that correctly?

JK: [Arguing about instruction sets] is a very sad story. It's not even a couple of dozen [op-codes] - 80% of core execution is only six instructions - you know, load, store, add, subtract, compare and branch. With those you have pretty much covered it. If you're writing in Perl or something, maybe call and return are more important than compare and branch. But instruction sets only matter a little bit - you can lose 10%, or 20%, [of performance] because you're missing instructions.

For a while we thought variable-length instructions were really hard to decode. But we keep figuring out how to do that. You basically predict where all the instructions are in tables, and once you have good predictors, you can predict that stuff well enough. So fixed-length instructions seem really nice when you're building little baby computers, but if you're building a really big computer, to predict or to figure out where all the instructions are, it isn't dominating the die. So it doesn't matter that much.

When RISC first came out, x86 was half microcode. So if you look at the die, half the chip is a ROM, or maybe a third or something. And the RISC guys could say that there is no ROM on a RISC chip, so we get more performance. But now the ROM is so small, you can't find it. Actually, the adder is so small, you can hardly find it? What limits computer performance today is predictability, and the two big ones are instruction/branch predictability, and data locality.

Now the new predictors are really good at that. They're big - two predictors are way bigger than the adder. That's where you get into the CPU versus GPU (or AI engine) debate. The GPU guys will say ‘look there's no branch predictor because we do everything in parallel’. So the chip has way more adders and subtractors, and that's true if that's the problem you have. But they're crap at running C programs.

GPUs were built to run shader programs on pixels, so if you're given 8 million pixels, and the big GPUs now have 6000 threads, you can cover all the pixels with each one of them running 1000 programs per frame. But it's sort of like an army of ants carrying around grains of sand, whereas big AI computers, they have really big matrix multipliers. They like a much smaller number of threads that do a lot more math because the problem is inherently big. Whereas the shader problem was that the problems were inherently small because there are so many pixels.

There are genuinely three different kinds of computers: CPUs, GPUs, and AI. NVIDIA is kind of doing the ‘inbetweener’ thing where they're using a GPU to run AI, and they're trying to enhance it. Some of that is obviously working pretty well, and some of it is obviously fairly complicated. What's interesting, and this happens a lot, is that general-purpose CPUs when they saw the vector performance of GPUs, added vector units. Sometimes that was great, because you only had a little bit of vector computing to do, but if you had a lot, a GPU might be a better solution. [...]

Thoughts on Moore's Law

IC: You've said on stage, and in interviews in the past, that you're not worried about Moore's Law. You’re not worried on the process node side, about the evolution of semiconductors, and it will eventually get worked out by someone, somewhere. Would you say your attitude towards Moore's law is apathetic?

JK: I’m super proactive. That’s not apathetic at all. Like, I know a lot of details about it. People conflate a few things, like when Intel's 10-nanometer slipped. People said that Moore's law is dead, but TSMC’s roadmap didn’t slip at all.

Some of that is because TSMC’s roadmap aligned to the EUV machine availability. So when they went from 16nm, to 10nm, to 7nm, they did something that TSMC has been really good at - doing these half steps. So they did 7nm without EUV, and that 7nm with EUV, then 5nm without, and 5+nm with EUV, and they tweaked stuff. Then with the EUV machines, for a while people weren't sure if they're going to work. But now ASML’s market cap is twice that of Intel's (it’s actually about even now, on 21st June).

Then there's a funny thing - I realized that at the locus of innovation, we tend to think of TSMC, Samsung, and Intel as the process leaders. But a lot of the leadership is actually in the equipment manufacturers like ASML, and in materials. If you look at who is building the innovative stuff, and the EUV worldwide sales, the number is something like TSMC is going to buy like 150 EUV machines by 2023 or something like that. The numbers are phenomenal because even a few years ago not many people were even sure that EUV was going to work. But now there's X-ray lithography coming up, and again, you can say it's impossible, but bloody everything has been impossible! The fine print, this what Richard Feynman said - he's kind of smart. He said ‘there's lots of room at the bottom’, and I personally can count, and if you look at how many atoms are across transistors, there's a lot. If you look at how many transistors you actually need to make a junction, without too many quantum effects, there are only 10. So there is room there.

There's also this funny thing - there's a belief system when everybody believes technology is moving at this pace and the whole world is oriented towards it. But technology isn't one thing. There are people who figure out how to build transistors, like what the process designers do at like Intel, or TSMC, or Samsung. They use equipment which can do features, but then the features actually interact, and then there's a really interesting trade-off between, like, how should this be deposited and etched, how tall should it be, how wide, in what space. They are the craftsman using the tools, so the tools have to be super sharp, and the craftsmen have to be super knowledgeable. That's a complicated play. There's lots of interaction and at some level, because the machines themselves are complicated, you have this little complexity combination where the machine manufacturers are doing different pieces, but they don't always coordinate perfectly, or they coordinate through the machine integration guys who designed the process, and that's complicated. It can slow things down. But it's not due to physics fundamentals - we're making good progress on physics fundamentals.[...]

Chips Made by AI, and Beyond Silicon

IC: In terms of processor design, currently with EDA tools there is some amount of automation in there. Advances in AI and Machine Learning are being expanded into processor design - do you ever envision a time where an AI model can design a purposeful multi-million device or chip that will be unfathomable to human engineers? Would that occur in our lifetime, do you think?

JK: Yeah, and it’s coming pretty fast. So already the complexity of a high-end AMD, Intel, or Apple chip, is almost unfathomable that any one person. But if you actually go down into details today, you can mostly read the RTL or look at the cell libraries and say, ‘I know what they do’, right? But if you go look inside a neural network that's been trained and say, why is this weight 0.015843? Nobody knows.

IC: Isn’t that more data than design, though?

JK: Well, somebody told me this. Scientists, traditionally, do a bunch of observations and they go, ‘hey, when I drop a rock, it accelerates like this’. They then calculate how fast it accelerated and then they curve fit, and they realize ‘holy crap, there's this equation’. Physicists for years have come up with all these equations, and then when they got to relativity, they had to bend space and quantum mechanics, and they had to introduce probability. But still there are mostly understandable equations.

There's a phenomenon now that a machine learning thing can learn, and predict. Physics is some equation, put inputs, equation outputs, or function output, right? But if there's a black box there, where the AI networks as inputs, a black box of AI outputs, and you if you looked in the box, you can't tell what it means. There's no equation. So now you could say that the design of the neurons is obvious, you know - the little processors, little four teraflop computers, but the design of the weights is not obvious. That's where the thing is. Now, let’s go use an AI computer to go build an AI calculator, what if you go look inside the AI calculator? You can't tell why it's getting a value, and you don't understand the weight. You don't understand the math or the circuits underneath them. That's possible. So now you have two levels of things you don't understand. But what result do you desire? You might still be designed in the human experience.

Computer designers used to design things with transistors, and now we design things with high-level languages. So those AI things will be building blocks in the future. But it's pretty weird that there's going to be parts of science where the function is not intelligible. There used to be physics by explanation, such as if I was Aristotle, 1500 years ago - he was wrong about a whole bunch of stuff. Then there was physics by equation, like Newton, Copernicus, and people like that. Stephen Wolfram says there’s now going to be physics by, by program. There are very few programs that you can write in one equation. Theorems are complicated, and he says, why isn’t physics like that? Well, protein folding in the computing world now we have programmed by AI, which has no intelligible equations, or statements, so why isn’t physics going to do the same thing?

IC: It's going to be those abstraction layers, down to the transistor. Eventually, each of those layers will be replaced by AI, by some unintelligible black box.

JK: The thing that assembles the transistors will make things that we don’t even understand as devices. It’s like people have been staring at the brain for how many years, they still can't tell you exactly why the brain does anything.

IC: It’s 20 Watts of fat and salt.

JK: Yeah and they see chemicals go back and forth, and electrical signals move around, and, you know, they're finding more stuff, but, it's fairly sophisticated.

IC: I wanted to ask you about going beyond silicon. We've been working on silicon now for 50+ years, and the silicon paradigm has been continually optimized. Do you ever think about what’s going to happen beyond silicon, if we ever reach a theoretical limit within our lifetime? Or will anything get there, because it won’t have 50 years of catch-up optimization?

JK: Oh yeah. Computers started, you know, with Abacuses, right? Then mechanical relays. Then vacuum tubes, transistors, and integrated circuits. Now the way we build transistors, it's like a 12th generation transistor. They're amazing, and there's more to do. The optical guys have been actually making some progress, because they can direct light through polysilicon, and do some really interesting switching things. But that's sort of been 10 years away for 20 years. But they actually seem to be making progress.

It’s like the economics of biology. It’s 100 million times cheaper to make a complicated molecule than it is to make a transistor. The economics are amazing. Once you have something that can replicate proteins - I know a company that makes proteins for a living, and we did the math, and it was literally 100 million times less capital per molecule than we spent on transistors. So when you print transistors it’s something interesting because they're organized and connected in very sophisticated ways and in arrays. But our bodies are self-organizing - they get the proteins exactly where they need to be. So there's something amazing about that. There's so much room, as Feynman said, at the bottom, of how chemicals are made and organized, and how they’re convinced to go a certain way.

I was talking to some guys who were looking at doing a quantum computing startup, and they were using lasers to quiet down atoms, and hold them in 3D grids. It was super cool. So I think we've barely scratched the surface on what's possible. Physics is so complicated and apparently arbitrary that who the hell knows what we're going to build out of it. So yeah, I think about it. It could be that we need an AI kind of computation in order to organize the atoms in ways that takes us to that next level. But the possibilities are so unbelievable, it's literally crazy. Yeah I think about that.

There's more at the link.

H/t Tyler Cowen.

No comments:

Post a Comment