Moore’s Law and 40nm Yield
To follow up on one of my most popular blogs, TSMC 40nm Yield Explained!, here is a closer look at the 40nm yield issues that currently plague the semiconductor industry. They are a direct result of Moore’s Law: the climbing transistor count and shrinking geometries. It’s a process AND design issue, and the interaction is at the transistor level.
Transistor-level designs, which include Mixed Signal, Analog/RF, Embedded Memory, Standard Cell, and I/O, are the most susceptible to parametric yield issues caused by process variation.
Process variation may occur for many reasons during manufacturing, such as minor changes in humidity or temperature in the clean-room when wafers are transported, or non-uniformities introduced during process steps that result in variation in gate oxide, doping, and lithography. Bottom line: it changes the performance of the transistors.
The most commonly used technique for estimating the effects of process variation is to run SPICE simulations using digital process corners provided by the foundry as part of the SPICE models in the process design kit (PDK). This concept is universally familiar to transistor level designers, and digital corners are generally run for most analog designs as part of the design process.
Digital process corners are provided by the foundry and are typically determined by Idsat characterization data for N and P channel transistors. Plus and minus three sigma points may be selected to represent Fast and Slow corners for these devices. These corners are provided to represent process variation that the designer must account for in their designs. This variation can cause significant changes in the duty cycle and slew rate of digital signals, and can sometimes result in catastrophic failure of the entire system.
However, digital corners have three important characteristics that limit their use as accurate indicators of variation bounds especially for analog designs:
- Digital corners account for global variation, are developed for a digital design context and are represented as “slow” and “fast” which is irrelevant in analog design.
- Digital corners do not include local variation effects which is critical in analog design.
- Digital corners are not design-specific which is necessary to determine the impact of variation on varying analog circuit and topology types.
These characteristics limit the accuracy of the digital corners, and analog designers are left with considerable guesswork or heuristics as to the true effects of variation on their designs. The industry standard workaround for this limitation has been to include ample design margins (over-design) to compensate for the unknown effects of process variation. However, this comes at a cost of larger than necessary design area, as well as higher than necessary power consumption, which increases manufacturing costs and makes products less competitive. The other option is to guess at how much to tighten design margins, which can put design yield at risk (under-design). In some cases under and over-design can co-exist for different output parameters for a circuit as shown below. The figure shows simulation results for digital corners as well as Monte Carlo simulations which are representative of the actual variation distribution.
To estimate device mismatch effects and other local process variation effects, the designer may apply a suite of ad-hoc design methods which typically only very broadly estimate whether mismatch is likely to be a problem or not. These methods often require modification of the schematic and are imprecise estimators. For example, a designer may add a voltage source for one device in a current mirror to simulate the effects of a voltage offset.
The most reliable and commonly used method for measuring the effects of process variation is Monte Carlo analysis, which simulates a set of random statistical samples based on statistical process models. Since SPICE simulations take time to run (seconds to hours) and the number of design variables is typically high (1000s or more), it is commonly the case that the sample size is too small to make reliable statistical conclusions about design yield. Rather, Monte Carlo analysis is used as a statistical test to suggest that it is likely that the design will not result in catastrophic yield loss. Monte Carlo analysis typically takes hours to days to run, which prohibits its use in a fast, iterative statistical design flow, where the designer tunes the design, then verifies with Monte Carlo analysis, and repeats. For this reason, it is common practice to over-margin in anticipation of local process variation effects rather than to carefully tune the design to consider the actual process variation effects. Monte Carlo is therefore best suited as a rough verification tool that is typically run once at the end of the design cycle.
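To make the sample-size concern concrete, here is a minimal Python sketch of the confidence interval around a Monte Carlo yield estimate. The Wilson score interval is one common choice, not something this article prescribes, and the 99% pass rate and the function name `wilson_interval` are assumptions for illustration only.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass fraction (yield)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = passes / runs
    denom = 1.0 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical case: 99% of the Monte Carlo samples meet spec.
for runs in (100, 1000, 10000):
    passes = round(0.99 * runs)
    lo, hi = wilson_interval(passes, runs)
    print(f"{runs:>6} runs: yield likely in [{lo:.4f}, {hi:.4f}]")
```

The interval narrows only slowly as the run count grows, which is why a small Monte Carlo sample can suggest that yield is not catastrophic without pinning down what it actually is.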
The solution is a fast, iterative statistical design flow that captures all relevant variation effects into a design-specific corner based flow which represents process variation (global and local) as well as environmental variation (temperature and voltage).
Graphical data provided by Solido Design Automation’s Variation Designer.




I remember reading about spin transistors, and reading this made me wonder: how close is anyone to manufacturing spin transistors?
I couldn’t view the graphs other than the Moore’s law graph which I really don’t need to see in detail anyway. Clicking the small graphs only opens advertisements. Is that what you intended?
The diagrams are linked to the company that created them. I’m sure they would be happy to provide more detail.
For Monte Carlo results to make sense and correlate with reality the parameters used must be correlated to each other. In other words the Monte Carlo runs are not 100% random because certain parameters do track with each other. Getting the fabs to supply correlated parameters would be just wonderful and allow circuit designers to run Monte Carlo with more realistic results.
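A minimal sketch of what correlated sampling could look like, assuming three normalized process parameters and a made-up correlation matrix. The names `vth_n`, `vth_p`, `tox` and every number here are hypothetical, not foundry data; the Cholesky factor is one standard way to impose a given correlation on independent normal draws.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter names and correlation matrix (not from any PDK).
names = ["vth_n", "vth_p", "tox"]
sigma = np.array([1.0, 1.0, 1.0])          # normalized standard deviations
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.3],
                 [0.3, 0.3, 1.0]])
cov = corr * np.outer(sigma, sigma)

# Independent standard normal draws, then mix them with the Cholesky factor.
chol = np.linalg.cholesky(cov)
z = rng.standard_normal((20000, len(names)))
samples = z @ chol.T                        # each row is one correlated sample

est_corr = np.corrcoef(samples, rowvar=False)
print(np.round(est_corr, 2))
```

Each row of `samples` would then be one set of parameter offsets for a single SPICE run, so that parameters that track each other in the fab also track each other in simulation.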
You can also speed up Monte Carlo runs by using a Fast SPICE circuit simulator instead of a Classic SPICE circuit simulator.
Perhaps one day we’ll see dedicated hardware to accelerate Monte Carlo SPICE simulations. I recall that companies like Nascentric were able to show a 4X speed-up on their now defunct Fast SPICE simulator using a GPU.
Not bad for 20 years ago — when those were state of the art.
Not adequate for today.
The BEST way to design today is using analytic surrogates. They are much faster and much more accurate than sequential DOE’s or Monte Carlo.
Ed,
Please educate me a little bit here. I just did a Google search for:
“analytic surrogates” circuit simulation
and came back empty.
Daniel
Google probably offered you “shopping for analytic surrogate SPICE models.” Try Bing instead and you will find http://blackhawkanalytics.com/ near the top.
FYI: This is the 4th time I tried to reply. If it doesn’t take, I’m not trying again.
Ed, thanks for the link to http://blackhawkanalytics.com/. I’ll reread it until it starts making sense to me.
Daniel
Think of it like this: a designer can wander around in the equivalent of the desert either based on experience (most common), or according to a plan (the DOE approach). Or they can access the equivalent of Google Maps complete with satellite views. That’s analytic surrogates.
MC techniques are like randomly poking a stick in the dirt and seeing if ants swarm out. With analytic surrogates, you know where the design meets requirements and where it doesn’t.
Ed,
With analytic surrogates do I still simulate my netlist with a SPICE tool and model cards?
Yes — when you build the surrogate models, but it can be pretty sparse.
Afterwards, because the error is so small (0.1%) it’s not necessary unless you are trying to extrapolate outside the original hyperspace where the models were developed (the support).
Ed,
So does the foundry build these surrogate models or does the design engineer or is it a collaboration?
The design engineer does.
The number of SPICE runs made to build the surrogate models depends on the specific circuit being designed and how many design and process parameters are being varied. Once the runs are made, the surrogate models for each of the responses (performance goals) can usually all be built in less than an hour.
The SPICE runs used to build the surrogates vary the design and process parameters simultaneously so that the interactions between the design parameters and process parameters are captured. The surrogates can be built with the process variation acting as noise or as part of the surrogate. This provides surrogates which answer different types of questions the design engineer might be interested in investigating.
If you have a specific design already (parameters are chosen), then you can build the surrogates for that specific design by varying the process only. This can be a good thing to do if you are, for example, designing a 6T cell for a memory and need highly accurate yield information across an extended process window (say 5 or 6 sigma) because there are millions of these cells on a uP.
It’s also usually easier to build surrogates for a specific design for which you want to explore device – process sensitivity to temperature, voltage, or current (or deltas in these between cells) for the customer’s usage window. I don’t know of any other good way to find schmoo holes in a design before silicon other than pure luck.
Ed,
What is the difference between running sensitivity analysis in SPICE and building surrogate models? Doesn’t sensitivity analysis show me how each process parameter can affect some net in my design?
Sensitivity analysis, as usually performed, is essentially a waste of time. On the website, you can see a simple example of sensitivities of a design in planes across an entire window. If we did a sensitivity study the usual way, all you would know is what happens where the vertical and horizontal dotted lines cross, +/- a small distance. You cannot, from those points of intersection, even begin to understand what the process does to the performance.
With the surrogates, you can answer meaningful questions in virtually real time like: 1) is it possible for this design to fail anywhere in the process + user condition window? 2) what is the worst case process for this design goal(s) even if the process does not cause failure? 3) is there a “hole” in the process for which this design does not perform? 4) which design provides the highest bin 1 yield in the worst case corner? 5) which design provides the highest overall yield at a (chosen) process point? 6) which design will yield equally in these 2 product bins?
But if you really like sensitivity even though it doesn’t really tell you anything useful, the derivative of the analytic surrogate is the sensitivity — and you have all the sensitivities across the entire design and process space.
No additional simulations are required; just use the surrogate models.
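As an illustration of the derivative point only, here is a minimal sketch that fits an ordinary polynomial to a toy response and differentiates the fit. This is not the surrogate method described in these comments, whose details are not given; `toy_response` is a hypothetical stand-in for a SPICE-measured quantity as a function of one normalized process parameter.

```python
import numpy as np

def toy_response(x):
    # Hypothetical stand-in for a SPICE result vs. one normalized process parameter.
    return 1.0 + 0.5 * x - 0.2 * x**2

# A handful of "simulations" across the window of interest.
x_train = np.linspace(-3.0, 3.0, 9)
y_train = toy_response(x_train)

# Fit a simple polynomial model and take its analytic derivative.
surrogate = np.poly1d(np.polyfit(x_train, y_train, deg=2))
sensitivity = surrogate.deriv()

for x in (-3.0, 0.0, 3.0):
    print(f"x={x:+.1f}  response={surrogate(x):.3f}  sensitivity={sensitivity(x):+.3f}")
```

Once a fitted model exists, its derivative is available at every point in the window without further simulation, which is the contrast being drawn with a sensitivity study done at a single operating point.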
According to my sources: there’s a time and place for response surface methods (what Ed calls analytic surrogates), and a time and place for Monte Carlo methods.
Response surface methods typically need (2-10) x (number of variables) simulations in order to get a decent model. For circuits with say 100-1000 devices, and 10 process variables per device, there are 1000-10,000 input variables. This means 2000-100,000 simulations for a decent RSM model. In contrast, MC methods are dimensionality-independent, so in many practical cases 30-100 simulations are all that is needed to get useful design information. (And of course there are variants of MC methods to get more speed, for certain types of problems, e.g. importance sampling, quasi MC sampling, control variates, etc.) However, for small circuits like the 6T bitcell and 10 process variables per device, a decent RSM model can be built in 120-600 simulations.
Well, your sources are a bit off in their advice.
Analytic surrogates as outlined above (they are not response surface models in the usual sense) are often fit with only a couple hundred simulations, far fewer than a response surface design needs. For a 6T cell, there should be on the order of 60 to 100 parameters to vary (every transistor has about 10 and you might want to add parasitics), plus the design parameters. The minimum number of runs required equals the number of parameters being varied, but then you sometimes don’t meet the 0.1% error rule for a good surrogate. So you do a little more than the minimum, but not a huge amount. You do need to make the right set of SPICE runs.
Traditional RSM designs are wasteful in comparison once you are looking at 8 or more things to vary. Designers are often provided with bad advice to use a tiny number of parameters and bad advice to use a simple quadratic response surface model because the techniques usually talked about can’t handle bigger tasks. It’s BAD ADVICE. We can do better.
On MC:
MC really is worthless because MC provides information mostly near the center of the process and doesn’t provide much info about corners when there are more than a couple of dimensions, even if you do thousands of MC runs. If you develop the surrogates the right way (make the right set of runs), you can calculate yield from the same runs, and can recalculate the yield if the process center is shifted without making additional SPICE runs.
100 MC simulations even on a single parameter don’t do a lot. Yeah, you are possibly going to see a 3% yield loss. But if you have 40M cells on a device, you should be concerned about 1 fail per million cells. Gee, that’s +/- SIX sigma and beyond simulations?!?!? How do you guarantee looking out that far in a typical 100 MC runs? You don’t. 100 MC runs is nothing. The truth is, you need at least 3 to 5 times as many MC runs as 1/fail-rate you are interested in finding to be reasonably certain you will see a single failure. That means 3 to 5 million MC runs for a 1ppm failure rate (40 failed cells). But then, what about the corners? With 5 million MC runs, you will maybe get to peek into 1 or 2 pairwise “3 sigma corners.” But those “corners” aren’t actually corners because they are still near the center in most of the other parameters. What about the real corners? Impossible in our combined lifetimes. MC is worthless.
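The 3-to-5x rule of thumb above can be checked with a short calculation. This is a minimal sketch assuming plain Monte Carlo with independent samples and a constant per-sample failure probability; the function name `runs_to_see_failure` is made up for illustration.

```python
import math

def runs_to_see_failure(fail_rate: float, confidence: float) -> int:
    """Smallest n with P(at least one failure in n runs) >= confidence."""
    # P(no failure in n runs) = (1 - fail_rate)**n, so solve for n.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-fail_rate))

for confidence in (0.95, 0.99):
    n = runs_to_see_failure(1e-6, confidence)
    print(f"1 ppm failure rate, {confidence:.0%} confidence: {n:,} runs")
```

For a 1 ppm failure rate this gives roughly 3 million runs at 95% confidence and roughly 4.6 million at 99%, consistent with the 3 to 5 million quoted for unweighted sampling.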
>Well, your sources are a bit off …
Ed, I am one of Dan’s “sources.” I will be happy to have a technical dialogue with you in this forum. I do request that the tone be professional and courteous.
Before I respond more thoroughly, I would like some clarification, please. (Note that references are at the bottom.)
1. In the “analytical surrogates” method, your comments / blackhawk site / LinkedIn profile make reference to “local linear models”, “taylor sensitivities”, and “geostatistics”. This implies that you may be using kriging (gaussian nets), as in [1]; e.g. as applied to circuits in [2][3]. Is this the case? If not kriging, then what? (Do you have a reference?) If kriging, do you ignore the interaction terms between process variables as in [3]?
2. What do you perceive as the difference between an “analytic surrogate” and a “response surface”?
3. When you say “make the right set of runs”, do you mean an active learning scheme, as in [1]; e.g. as applied in circuits in [3][4][5]? If yes, what is your objective function when you choose your next sample points? (E.g. do you take into account model uncertainty? Model optimality? Do you combine them with “expected improvement” criterion [1], “least constrained bounds” criterion [6], or something else?) Or perhaps you are using a PWL-specific formulation, such [7] or [8]?
4. What is the range of circuit sizes that you are concerned with, in terms of devices? What is the range of times to simulate across all testbenches?
I look forward to your response.
Kind regards,
Trent McConaghy
References:
[1] D.R. Jones, M. Schonlau, and W.J. Welch, Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, Vol. 13, 1998, pp. 455-492.
[2] T. McConaghy, G. G. E. Gielen, Analysis of simulation-driven numerical performance modeling techniques for application to analog circuit optimization, in Proc. IEEE Intern. Symposium on Circuits and Systems (ISCAS), May 23-26, 2005.
[3] Guo Yu, Peng Li: Yield-aware analog integrated circuit optimization using geostatistics motivated performance modeling. ICCAD 2007: 464-469
[4] M. Ding and R. Vemuri, A Two-Level Modeling Approach to Analog Circuit Performance Macromodeling. Proceedings of the Design, Automation and Test in Europe Conference (DATE), pp. 1088–1089.
[5] T. McConaghy and G.G.E. Gielen, Globally reliable variation-aware sizing of analog integrated circuits via response surfaces and structural homotopy, IEEE Transactions on Computer-Aided Design 28(11), Nov. 2009, pp. 1627-1640.
[6] M.J. Sasena, Flexibility and efficiency enhancements for constrained global design optimization with Kriging approximations. PhD thesis, University of Michigan, 2002
[7] J. Wang, X. Li and L. Pileggi, Parameterized Macromodeling for Analog System-Level Design Exploration, Proceedings of ACM/IEEE Design Automation Conference, June 2007.
[8] C. Gu and J. Roychowdhury, An efficient, fully nonlinear, variability-aware non-Monte-Carlo yield estimation procedure with applications to SRAM cells and ring oscillators, in Proc.2008 Asia and South Pacific Design Automation Conf. (ASP-DAC ‘08), Piscataway, NJ: IEEE Press, 2008, pp. 754-761.
Hi Trent,
I believe we have met before. Solido is one of the better companies working in this area in circuit design. You guys have done pretty well and deserve compliments.
I’m not so interested in long technical discussions as I’ve already gone over a lot. I also don’t want to provide anyone with a commercial advantage by getting too specific on questions which may be more of interest to someone moving their specific methodology forward, especially when some of their direction has synergy. I will answer some of your questions (especially dead-ends), but there are technical issues I won’t discuss.
I haven’t applied kriging in chip design. It could be interesting, in some settings, with across-chip variation but would need to be anisotropic. Most designers would not be able to apply it properly and I can’t see a way to take the statistical knowledge required out to get a good anisotropic modeling process automated. On the other hand, semivariograms do provide useful information regarding whether or not you are dealing with nugget (essentially local) vs global variation and how that changes with distance. Could be useful in matching, not sure it is worth the effort.
The primary difference between an analytic surrogate, as I use them, and a classical response surface is that the surrogate, for all intents and purposes, IS a local replacement for SPICE for the circuit being designed. It gets the same results as SPICE with no more than 0.1% error anywhere across the support of the model. To be that accurate, it’s usually much more complex than a classical response surface and less parsimonious.
One key recognition is that all error is model specification error, except in those cases where the error you see from SPICE is roundoff and algorithm splices. That happens, typically appearing as hexagons or +/- ridges in residual patterns. Noting it’s all model specification error, no hypothesis test is meaningful. They do provide some guidance as to what can be dropped from the model, but not very much. It’s better to think in terms of error in truncated infinite series where you would truncate some portions of the series but not others.
Some sampling schemes allow you to do much more than simply build a model. These properties are not commonly recognized in the statistical community at large but are recognized by some. One example, Satterthwaite’s random balance designs, were heavily criticized and are avoided. The criticism was correct as made, but the baby was tossed out with the bath water. Random balance designs do have uses well beyond those envisioned by Satterthwaite as some are beginning to finally discover.
No, I’m not talking about well published sequential sampling schemes like how to choose the next sample to maximally reduce uncertainty. I did fund a research project for software to do exactly that in the 80’s; it’s pretty cool to do when you are building knowledge about physical properties of new materials. Sequential methods simply take too long and provide little information other than optimization of specific properties. Given a SPICE deck, many digital circuit designs and some analog designs can be settled in a single day, so why drag it out with inefficient techniques and end up with less information than you need? It’s better, day to day, to compare circuit alternatives than to work on the same one when that’s possible. Designers should be doing design, not spending weeks optimizing performance vs yield when that can be done rapidly.
I don’t use learning techniques for surrogates, although I have tried on the side as it would be easier to promote black boxes. Learning techniques tend to fail spectacularly, apparently because they attempt to “capture” the data locally with some smoothing and don’t really make use of the underlying physics to interpolate well enough in directions where the data is sparse. In 100 dimensional spaces, there are 2^100 corners. You need to make use of the physics; sampling will NOT provide enough information alone. I definitely would not use Gaussian processes or other kernel methods in high dimensional spaces with lots of meaningful high-order interactions.
All “improvements” over the initial sampling are geared towards resolving model specification error. That is totally driven by residual analysis and sampling sparseness. No optimization search schemes (that can already be done but is really slow) or single samples. If your coverage isn’t good enough, it needs improvement across the entire support — or the support needs to be reduced and completely resampled.
Surrogates like this have been in use since at least the 50’s / 60’s. Not much has been published, apparently for very good reasons once you start understanding what is going on. You can figure some of what is unsaid in the little that has been published by considering the problem and some of the “useless” comments in the papers.
Consider the following application properties for which the surrogate techniques were originally developed: complex finite difference and finite element simulation codes with extremely long simulation times exceeding the compute capability of commercial computers, the physics is only partly understood, and there are hundreds of possibly significant design variables you cannot logically eliminate because no one has ever done what you are doing. You are going to build ONE incredibly expensive prototype which must have a high probability of working. How do you optimize, make it reasonably robust for what is unknown, and KNOW it will work? And how can you tell if the simulations have inconsistencies at the same time?
Hi Ed,
Thank you for the details. It seems that you, like me, are happy to use the techniques that _work_, no matter their age or current popularity.
For this dialogue to go forward, we need to improve the technical accuracy. You seem to have narrowed the definition of “Monte Carlo methods” and “response surfaces” to the cases where they have obvious weaknesses for some applications. Let’s give “Monte Carlo methods” and “response surfaces” their fair shake. After that, I will describe where MC methods vs. response methods fit.
Monte Carlo methods:
--------------------
-I’m going straight to the wikipedia definition, because it’s as good as any — “Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results.”
-This includes pseudo-random sampling from the true pdf, which is what many assume the label “Monte Carlo” means.
-But it also includes importance sampling techniques, which do not sample from the true pdf, but instead from an altered pdf specific to the task at hand (e.g. integration). For example, the pdf can be altered towards regions of failure, in order to simulate low-probability-of-occurrence events.
-MC methods also include “quasi Monte Carlo” sampling, which gives better spread (lower discrepancy), to avoid the “clumping” behavior observed in pseudo-random sampling. These can get far better convergence in the number of samples, theoretically and in practice. These include latin hypercube sampling, Halton sampling, Niederreiter sampling, lattice rules, digital nets, etc.
-With MC methods (and non-MC too!), confidence intervals are key. For the information presented to be meaningful, you have to know how good your estimates are.
If you review the previous posts, you will see that “importance sampling” and “quasi Monte Carlo” were explicitly included in the list of “Monte Carlo methods”. This was not by accident.
Using importance sampling on high-sigma designs (eg “1 fail per million” cell), one can make a reasonable yield estimate in 100 or so observations, not “our combined lifetimes” as you stated.
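To show the mechanics of importance sampling in the simplest possible setting, here is a toy sketch that estimates a roughly 1 ppm tail probability of a single standard normal variable by sampling from a shifted distribution and reweighting. It assumes a one-dimensional problem where the failure direction is known in advance, which a real circuit does not guarantee; the function name, threshold, and sample count are illustrative assumptions, not taken from any tool mentioned here.

```python
import math
import numpy as np

def tail_prob_importance(threshold: float, n_samples: int, seed: int = 0) -> float:
    """Estimate P(X > threshold) for X ~ N(0, 1) using a mean-shifted proposal."""
    rng = np.random.default_rng(seed)
    shift = threshold                               # center the proposal on the failure boundary
    x = shift + rng.standard_normal(n_samples)      # draws from N(shift, 1)
    # Likelihood ratio of the true pdf to the proposal pdf at each draw.
    weights = np.exp(-shift * x + 0.5 * shift**2)
    return float(np.mean(weights * (x > threshold)))

threshold = 4.75                                    # roughly a 1-per-million tail
exact = 0.5 * math.erfc(threshold / math.sqrt(2))
estimate = tail_prob_importance(threshold, 10_000)
print(f"exact    = {exact:.3e}")
print(f"estimate = {estimate:.3e}")
```

The reweighting step is what lets a modest sample probe an event that unweighted sampling would almost never hit; how well this carries over to many dimensions depends on how well the shifted distribution is chosen.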
On lower-sigma designs, one can get useful estimates of yield, distributions, impacts, etc in just 30-100 observations; but it does take thoughtful use of quasi monte carlo methods, confidence intervals, and (sometimes) density estimation. I will gladly provide you with the benchmark data demonstrating this.
Response surfaces:
------------------
You have been using the labels “quadratic response surface model” and “classical response surfaces” (implying quadratics). Obviously, quadratic models have major limitations. My comments did not imply such limited models.
In all the literature I follow, the term “response surface model” is used synonymously with “regression model”, meaning a model constructed from data, doing a static mapping from a set of input variables to (typically) one output variable. Model types include quadratic models, but also linear models, posynomials, polynomials, piecewise linear models, piecewise polynomials / splines, lookup tables, kriging / gaussian nets, CARTs, FFNNs, RBFs, SVMs, symbolic models, etc, and boosted / bagged versions of each.
Some of these models are very complex, and not parsimonious. E.g. RBFs, kriging, and lookup tables store all their training data, which could even have tens of thousands of input dimensions and millions of samples. E.g. bagged CARTs (i.e. random forests) can have thousands of replicated deep CART trees, taking up 10’s of MBs of memory.
And of course one can use such models as replacement for SPICE simulations. I’ve done so, you’ve done so, and so have scores of other researchers, as the literature shows. I’ve built models within .1% of SPICE (in testing data); of course it’s trivial to build models with 0% error on training data.
And of course one wouldn’t do dumb sampling techniques, like taking every corner in a 2^n space. One would use low-discrepancy sampling followed by active learning (this isn’t slow if you have the right approach), or perhaps your technique (if we knew what it was precisely:)
So it seems to me that “analytic surrogates” are simply response surfaces. No need to invent a new label.
Where to use:
-------------
I will repeat: there is a time and place for response surface methods, and a time and place for Monte Carlo methods.
That choice should be made based on what gives the user, the analog designer, the most benefit. Let’s consider a few typical design scenarios.
1. Consider a circuit with 1000 devices, and 10 process parameters per device. It’s not replicated a bunch of times in the circuit, i.e. it’s low/medium sigma. Let’s say there are 2 testbenches, and 3 env. points per testbench. I’ll ignore global process variables and design variables, since local process variables swamp the variable count.
(a) To verify it with a _competent_ Monte Carlo approach, one needs about (100 observations) * (2 testbenches) * (3 env points) = 600 simulations. One can get corners from this data too, for fast design iterations.
(b) To verify it using a _competent_ response surface technique (let’s say ‘n’ samples needed for ‘n’ input variables, but even 2n is fine). One would need (1000 devices) * (10 process parameters per device) * (2 testbenches) = 20,000 simulations.
2. Consider a (high-sigma) bitcell with 6 devices, 2 testbenches, and 3 env. points per testbench.
(a) To verify with MC, one needs about (100 observations) * (2 testbenches) * (3 env points) = 600 simulations.
(b) To verify with response surfaces, one needs about (6 devices) * (10 process parameters per device) * 2 testbenches = 120 simulations.
3. Consider a (high-sigma) sense amp with 50 devices, 2 testbenches, and 3 env points per testbench
(a) To verify with MC, one needs about (100 observations) * (2 testbenches) * (3 env points) = 600 simulations.
(b) To verify with response surfaces, one needs about (50 devices) * (10 process parameters per device) * 2 testbenches = 1000 simulations.
In the first case, MC needs 600 simulations vs. response surface 20,000. In the second case, 600 vs. 120. In the third, 600 vs. 1000. With MC, one may need 1-3 design iterations for as much as 3x more simulations; but response surface approaches may need 3x more simulations too (e.g. the design change is in the topology, or the problem setup changes). In short, MC methods may be a little more expensive than response surfaces on tiny circuits; the techniques are roughly on par for moderate-sized circuits; and for anything larger, MC methods are _dramatically_ faster. The key reason for their speed in high dimensions is their independence from dimensionality, unlike response surface approaches. The other challenge for response surface methods is for the designer to trust the model.
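The arithmetic behind the three scenarios can be reproduced with a few lines of Python. The formulas are the ones stated in the scenarios (100 MC observations per testbench and environment point; one response surface sample per local process variable per testbench); the function and variable names are made up for this sketch.

```python
def mc_sims(observations: int, testbenches: int, env_points: int) -> int:
    # Monte Carlo cost does not depend on the number of process variables.
    return observations * testbenches * env_points

def rsm_sims(devices: int, params_per_device: int, testbenches: int,
             samples_per_variable: int = 1) -> int:
    # Response surface cost scales with the number of input variables.
    return devices * params_per_device * testbenches * samples_per_variable

scenarios = {"1000-device circuit": 1000, "6T bitcell": 6, "50-device sense amp": 50}

counts = {}
for name, devices in scenarios.items():
    counts[name] = (mc_sims(100, 2, 3), rsm_sims(devices, 10, 2))
    print(f"{name:>20}: MC {counts[name][0]:>6,}  vs  response surface {counts[name][1]:>6,}")
```

Setting `samples_per_variable=2` doubles the response surface counts, which corresponds to the "even 2n is fine" case.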
BTW (tiny bit of disclosure): Solido uses a variety of MC methods and response surface methods; and of course the user is spared from the complexities of the math. The user gets to focus on his design, even if some heavy lifting is done under the hood.
Kind regards,
Trent
Hi Trent,
Most people do consider MC to be pseudo random sampling using the PDF of the distribution. True. Importance sampling is sometimes (very loosely) called a MC technique but is generally called importance sampling because it requires a truncation of the pdf. Unfortunately, importance sampling requires you know where to look for failures. If you knew that much about your circuit, you wouldn’t be doing MC, you would be fixing the design.
I don’t mind you noting different types of automated simulations using sampling is similar to MC, but do call importance sampling importance sampling, because THAT is what importance sampling is.
The same is true for the term “response surfaces.” Every model of a response over a support is a response surface model of some type. But that conveys no information — it’s like saying MC is a correct description any oddball method of sampling through automation using quasi-random number generation. MC is understood to be more specific.
Classical response surfaces are what is commonly thought of when the term response surfaces is mentioned. People do not generally think of neural nets, decision trees, gaussian processes, and kriging (to mention just a few) as response surface methods although these all fit a broad (and useless) definition of a response surface method. We call them what they are.
So, let’s not go and use terms with common technical meaning atypically according to some loose interpretation of what might be found somewhere on the Web. I do, and will, use more specific technical language with typical technical meaning.
Anyway, for 11,000 independent variables (1000 elements x 10 process parameters; the test benches don’t count because the response is multivariate), you feel it is adequate to do 200 samples for 3 “environmental points.” Now stated this way, I’m at a bit of a loss to know what an “environmental point” is. Are you indicating that you feel it is adequate to examine 3 process points? Or 3 temperature / voltage / current settings?
If you mean 3 process settings (like TT, FF, SS), you are so far from trying to do what I’m doing that it isn’t even funny, because I DO want to work with all 11,000 dimensions. So, your claim is, I take it, that 200 MC simulations (importance or not) are all that is required to examine an 11,000 dimensional space at a single VTC setting for a circuit.
And you will be able to, with this tiny MC sample, tell me all of this:
1) where the optimum response is for all 3 test benches in combination
2) where the circuit doesn’t work in that entire 11,000 dimensional space
3) the yield with the process centered anywhere in that same 11,000 dimensional space?
Because that’s what I can do with surrogates.
And you will do this even though the entire 200 run MC sample (importance or not) will fit within at most a 199 dimensional subspace which is infinitely thin when considered with the addition of any single one of the 10,801 OTHER dimensions you didn’t bother to sample? The process can and will go into these other dimensions because that’s what processes do. Your tiny sample there only managed to cover about 2% of the dimensionality of the process.
Now, if your focus is on failure rates at a particular process point, it is true that, as a long-term average, you can be reasonably certain the failure rate with a 200-piece sample at a particular VTC setting is under 1.46% (assuming 0 fails are observed). But any given sample, probably most of them in this case, will give an estimate of 0 failures.
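As a rough check on the size of that bound, here is a minimal sketch of the one-sided upper confidence bound on the failure rate after a run of clean samples. It assumes independent runs and a 0.95 confidence level; the function name is illustrative and not from the discussion. The exact figure and the "rule of three" shortcut (3/n) both land near 1.5%, close to the approximate number quoted.

```python
def zero_fail_upper_bound(n_runs, confidence=0.95):
    """Upper bound on the true failure rate p after n_runs with 0 failures.

    Solves (1 - p)**n_runs = 1 - confidence for p, assuming independent runs.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


bound_200 = zero_fail_upper_bound(200)   # exact one-sided bound
rule_of_three = 3.0 / 200                # common shortcut for 0 failures

print(f"exact 95% bound : {bound_200:.3%}")
print(f"rule of three   : {rule_of_three:.3%}")
```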
So, 0 fails in 200 MC runs might leave you feeling warm and cozy, not me. I know the fab process space wasn’t reasonably examined. You really need more runs than dimensions or you can get into serious trouble.
You can readily see why 0 fails is more likely if you examine what happens in 3 dimensions with an MC sample of size 2 where 1/8 of the space contains failures (say, failures occur when all 3 dimensions are positive and nowhere else). About 77% of such MC samples of size 2 will have 0 failures, about 22% will have 50% failures, and under 2% will have 100% failures. The real failure rate in this example is 12.5%, yet you are more likely to see none.
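The arithmetic for this toy case can be checked with the binomial distribution. The sketch below assumes the two samples are independent and that each lands in the failing octant with probability 1/8; the helper name is illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k failures in n independent runs."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_fail = 1 / 8          # one octant of the 3-D space fails
sample_size = 2
probs = [binom_pmf(k, sample_size, p_fail) for k in range(sample_size + 1)]

for k, pr in enumerate(probs):
    print(f"{k} failures out of {sample_size}: {pr:.4f}")
```

The zero-failure outcome comes out at 49/64, so a clean sample is by far the most common result even though one eighth of the space fails.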
But what does this kind of MC sampling represent? If you only varied the global variation about TT and got 0 fails, this implies that up to maybe 1.46% of your process space has 0 yield. Which means that if your process drifts a bit your yield may get really bad, really quickly. Seeing 0 fails isn’t comforting because it only places a possible upper confidence bound on what percentage of your sample space has 0% parametric yield.
If you varied only local variation around TT and got 0 fails, there is maybe up to 1.46% yield loss when the wafers are exactly at TT. If the process center drifts even a little in the wrong direction, the failure rate may skyrocket. You don’t have a clue in this case as to how quickly it will go bad if the local variation is small compared to global variation.
What’s important in design is the maximum yield loss across the valid process space. A simple MC study can’t tell you this because it’s tied to particular process points, commonly called corners. Unfortunately, real circuits have real performance corners which do not sit on top of any of the “corners.” And fabs typically aren’t anywhere near TT; typically many ET parameters are 1 or more sigma off from TT. It would be amazing to find a single wafer in a year at TT for all ET.
So, the reason MC studies aren’t worth much is they don’t tell you anything important other than if you have failures in MC you will likely have a disaster if you tape out. But, if you have no failures, you still might have a yield disaster. MC can’t tell you, especially with small samples, because it doesn’t cover the process space.
I definitely won’t back off of the MC is worthless claim. I’ve seen enough cases where even large sample MC (50,000 runs) was clean and the circuits had parametric yield disasters. There is a better way to go — and that’s using the surrogates. You can “ask” the analytic surrogates to tell you IF there is any possibility of parametric yield failure anywhere in the entire process space. And if a failure is possible, it will tell you exactly where in the process space this can happen.
Here’s my view of MC: Is today your lucky day? You might do as well asking a fortune teller!
Ed,
Re dimensionality: the error associated with MC integration scales at least as well as N^(-1/2), where N is the number of samples. Note that there is no dependence on the dimension. This is why Monte Carlo methods do so well on high-dimensional problems, e.g. why they don’t need 20,000 simulations to handle a 1000-device circuit. Note that quasi-MC methods scale even better.
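For a yield estimate specifically, the N^(-1/2) behaviour can be seen in the standard error of a pass/fail proportion. The sketch below assumes independent samples and a hypothetical true failure rate of 1%; note that the number of process variables does not appear anywhere in the formula.

```python
import math


def fail_rate_std_error(p_fail, n_samples):
    """Standard error of an MC failure-rate estimate from n_samples runs."""
    return math.sqrt(p_fail * (1.0 - p_fail) / n_samples)


p_true = 0.01  # hypothetical true failure rate, for illustration only
for n in (200, 800, 3200):
    print(f"N = {n:5d}: standard error = {fail_rate_std_error(p_true, n):.4%}")

# Quadrupling the sample count halves the standard error.
ratio = fail_rate_std_error(p_true, 800) / fail_rate_std_error(p_true, 200)
```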
Re lucky day: this is what confidence bounds on statistical estimates are for. No luck needed.
Re FF/SS: you assumed wrong about my use of “environmental points”. I meant temp, vdd, etc.
Re “importance sampling” label: claim that this is not a MC method if you wish, that’s your prerogative. I’m going with what the literature says, e.g. “Importance sampling is one of the classical variance reduction techniques for increasing the efficiency of Monte Carlo algorithms for estimating integrals.” (PW Glynn, DL Iglehart, Importance sampling for stochastic simulations, Management Science, 1989). You don’t have to take my word; simply open up any book or technical paper on Monte Carlo methods.
Similarly re “response surfaces” label: claim that this is restricted to quadratics if you wish, that’s your prerogative. Covering a broad set of models does not dilute its definition; response surfaces are all still approximations of the mapping y = f(x), where x is a set of input variables. You don’t have to take my word; simply open up any book or technical paper on DOE or machine learning.
Trent
Hi Trent,
I don’t think you followed my point. Likely that’s my fault. And you didn’t tell me how you plan to answer my 3 questions with a tiny MC sample. If you recall:
1) where the optimum response is for all 3 test benches in combination
2) where the circuit doesn’t work in that entire 11,000 dimensional space
3) the yield with the process centered anywhere in that same 11,000 dimensional space?
I’ll go slower and leave less to the imagination.
Yes, MC long term uncertainty does scale with n^-0.5! And yes, it’s true that the rate of improvement does not depend on dimension. I know that, as does just about any statistician (which is my background). That’s why I provided an approximate 0.95 upper confidence bound assuming 0 fails. What we don’t know at sample #200 with 0 fails in any particular case is what the failure rate should have been — and we have taken a sample which only occupies a small fraction of the total dimensionality of the space. While that can happen with any sample, it is more worrisome in this case as the sample DOES have structure — it all lies on an infinitely thin hyperplane in the full space. We can construct theoretical confidence bounds based on 0 fails, but do they really apply? Kinda maybe.
So, in this particular case with 200 MC runs and 0 fails: is the real fail rate less than 0.5% or was it really about 1.5% or was it closer to 2.3% and we just didn’t see any fails yet? We would certainly feel a whole lot better about the confidence bound if we had seen a handful of failures, although that would surely indicate a design disaster.
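To put numbers on that question, here is a small sketch of how likely a clean 200-run sample is under each of the candidate failure rates mentioned. It assumes independent runs; the function name is illustrative.

```python
def prob_zero_fails(n_runs, true_rate):
    """Chance of seeing no failures at all in n_runs independent runs."""
    return (1.0 - true_rate) ** n_runs


n_runs = 200
candidate_rates = (0.005, 0.015, 0.023)
clean_probs = {rate: prob_zero_fails(n_runs, rate) for rate in candidate_rates}

for rate, prob in clean_probs.items():
    print(f"true fail rate {rate:.1%}: P(0 fails in {n_runs}) = {prob:.3f}")
```

Under these assumptions a true rate of 1.5% still yields a clean sample roughly 5% of the time, and 2.3% roughly 1% of the time, which is why those rates sit near the 0.95 and 0.99 upper bounds for a 200-run sample with 0 fails.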
If the real failure rate is 1%, doing 200 MC runs and seeing 0 fails is just as reasonable as seeing 6 fails — and both outcomes are totally plausible. So is anything in-between. The right way to pick the number of MC runs for 0 fails is to specify the maximum failure rate you can tolerate and then pick the number of runs where the worst tolerable failure rate is at the .99 or .997 (better) upper one-sided bound.
But how do we pick a maximum tolerable failure rate? To start that conversation, we need to be clear about what a failure implies. So, what does a MC failure represent? And how does that depend on the way the simulation is performed?
Again, I’ll repeat that (assuming uniform density sampling) if the variation is considered to be global variation, a failure rate represents the percentage of the process window which produces 100% failures should the process create a wafer there.
The consideration in this case is that the process will drift around, sometimes extending into a failure-producing region, and usually this is on a lot basis, because lot variation tends to dominate wafer variation. The real process in the fab is virtually never on nominal and may drift for extended periods well away from nominal. So if global variation failures exist in the process window, large percentages of the product (for some lots, not others) may begin failing as the process begins to drift over towards the failure-producing region. A 1% fail rate in the simulations may easily cause a 100% fail rate in unfortunate lots which happen to reside in the failing region. Unlikely? That’s up to Murphy and his Laws.
If the variation for the simulations is considered to be local variation, then we are assuming the process is centered at a particular point for the lot (wafer) simulated and the failure rate provides an estimate of the yield. Presumably the simulations focus on yield at global nominal or some “corner point.” It is possible that local variation failures could increase dramatically if the process center drifts even a little from that nominal or “corner” in the wrong direction, and we may again see extremely high fallout in one or more lots.
So, this brings us back to what failure rate can we tolerate in a production lot while the lot is “in spec.” Remembering that lots “near” a global process point where there will be 100% failures will also have a high percentage of failures, you can bet the tolerance level is not 1% or even 0.5%. It’s going to be lower. It would be reasonable to ask that less than 1 in 1000 lots should have 100% parametric failures. So, let’s go with that and see what sample size is needed. Gee, it’s about 6000 with 0 fails permitted. Not 200. Not 600.
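The sample-size figure follows from the zero-failure bound turned around: pick the worst tolerable failure rate and the one-sided confidence level, then solve for the number of clean runs. A minimal sketch, assuming independent runs and using the 0.99 and 0.997 levels mentioned earlier (the function name is illustrative):

```python
import math


def runs_for_zero_fail(max_rate, confidence):
    """Smallest n such that 0 fails in n runs bounds the rate below max_rate.

    Solves (1 - max_rate)**n <= 1 - confidence for n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_rate))


n_99 = runs_for_zero_fail(0.001, 0.99)     # 1 in 1000, 0.99 one-sided bound
n_997 = runs_for_zero_fail(0.001, 0.997)   # 1 in 1000, 0.997 one-sided bound

print(f"runs needed at 0.99 : {n_99}")
print(f"runs needed at 0.997: {n_997}")
```

At the 0.997 level this gives a little under 6000 runs, in line with the figure above; the 0.99 level needs roughly 4600.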
Unfortunately, with modern processes (like 40nm), the local variation is getting kinda big. Which implies failure rates start increasing to uncomfortable levels long before the process center drifts out to where the failures occur. Which means 0.1% fails while still in spec is too high. You might want 0.1% fails at 4 sigma instead of 3 sigma. Unfortunately, sample size scales as n^2, so you could end up needing 60,000 simulations or more to do a decent yield analysis with MC.
Funny, you can’t even use Importance Sampling to answer the sample size question UNTIL you see the failures and thus have some idea of where they are. Yep — stuck with old fashioned MC still.
But the WORST part about MC alone is that even with 60000 simulations you still can’t answer my 3 questions.
1) where the optimum response is for all 3 test benches in combination
2) where the circuit doesn’t work in that entire 11,000 dimensional space
3) the yield with the process centered anywhere else in that same 11,000 dimensional space?
You need a surrogate model to answer these definitively.
I’ll say it again: small sample MC is worthless. Large sample MC is an INCREDIBLE waste of resources; essentially worthless as it tells you very little — especially if done with gaussian sampling instead of uniform. You can do a lot more with a lot less effort and know the answers to the 3 questions — and more!
The rest of my comments were oriented to the other assumption which seemed more likely — and that was 3 particular Temp, Voltage, and Current settings. When I build surrogates across TV (usually not I) I cover all possible use conditions with the surrogate, not just 3 special ones. It would be a shame to restrict oneself to just 3 settings as the design may have a “hole” in the performance as TV is varied. How would you know with a plan to look at just 3? Ask Carnac?
Importance sampling is highlighted as an enhancement for MC, which means it isn’t common garden variety MC. It’s still not useful for optimization because you would have to know which direction to focus the restriction on the PDFs. If you knew that much, you wouldn’t need simulations to optimize but you could (maybe) fine-tune.
Re: response surfaces… When I mention classical response surfaces, those are restricted to full quadratics (including all first order interactions). I don’t know anyone who would consider them cubics or anything else. But, they could be called equations or analytic surrogates by someone. Gee, Monte Carlo is just a use of equations. So is SPICE — it’s just equations forming a model. Does that mean they are all the same? No — that’s why they have different names.
Now, I could see there might be an objection to the analytic surrogates if they weren’t easy to do and weren’t so darned efficient. You make them ONCE with a relatively sparse sample, then you do everything with the surrogate. Usually you are off and running with the surrogates in under an hour after the simulations are complete.
Want to optimize? 10 seconds or less. Want to see if there is any chance of a yield failure anywhere in the ENTIRE process window for 12 test benches? 10 seconds or less. Want yield with MC on the surrogate — 8 MILLION simulations a second on a 7 year old laptop is possible on some circuits — with a process center anywhere in the process window. Want to back off from the performance optimum to improve bin yield to maximum for a particular product bin? A few minutes to an hour. Want to constrain the footprint and simultaneously optimize performance? A few minutes.
Now, I’ll concede there may be some design problems for which analytic surrogates are difficult and MC + sequentially using response surfaces may be better, but I haven’t run into one yet in either analog or CMOS design.
Hi Ed,
So you now agree that “importance sampling” is a Monte Carlo method: good. It would have been nice to agree to this straight away.
Re response surface label: “Gee, Monte Carlo is just a use of equations. So is SPICE: it’s just equations forming a model. Does that mean they are all the same? No, that’s why they have different names.”
Of course SPICE etc. use equations; that doesn’t mean they’re the same thing as response surfaces. I will repeat what I said: “Covering a broad set of models does not dilute its definition, response surfaces are all still approximations of the mapping y = f(x), where x is a set of input variables” and “it is used synonymously with ‘regression model’; meaning a model constructed from data, doing a static mapping from a set of input variables to (typically) one output variable.” In contrast, SPICE solves ODEs with the form f(x,x’,t)=0.
Here’s an example from the literature: “We have observed that it can be useful to fit a response curve to the levels of a quantitative factor so that the experimenter has an equation that relates the response to the factor. This equation might be used for interpolation, that is, for predicting the response at factor levels between those actually used in the experiment. When at least two factors are quantitative, we can fit a response surface for predicting y at the various values of design factors” [D.C. Montgomery, Design of Experiments 5th Ed., 2001, p. 201].
My main point is: There is no need to invent a new label for “response surface” or “regression model”.
Re: “We can construct theoretical confidence bounds based on 0 fails, but do they really apply? Kinda maybe.” Yes; it’s called “Wilson Score” for a binomial confidence interval; they apply 100% in theory and in practice (been using it industrially for years).
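For reference, a minimal sketch of the Wilson score interval for a binomial proportion, applied to 0 fails in 200 runs. The function name and the two-sided z = 1.96 are illustrative choices, not taken from the discussion.

```python
import math


def wilson_interval(fails, runs, z=1.96):
    """Wilson score interval for a failure rate from `fails` out of `runs`."""
    p_hat = fails / runs
    denom = 1.0 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the edges.
    return max(0.0, center - half), min(1.0, center + half)


lower, upper = wilson_interval(0, 200)
print(f"0 fails in 200 runs: failure rate in [{lower:.4%}, {upper:.4%}]")
```

With 0 observed failures the expression reduces to an upper limit of z^2 / (n + z^2), a bit under 1.9% for n = 200 at this z.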
Re: “10 seconds or less. … 10 seconds or less. …”
I fully agree with these numbers… once you have the model. But you can’t ignore the 20,000 simulations that it takes to build the model.
BTW, to estimate yield on an ultra-high-sigma circuit using response surfaces, sampling on the true pdf can run into 10^9+ samples, which could take days, not 10 s (e.g. see [Wang et al, DAC 2009]). To have reasonable runtime, you need to do importance sampling on the model (or integrate analytically).
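To illustrate why importance sampling helps at high sigma, here is a toy sketch. It assumes a hypothetical one-variable "model" that fails whenever a standard normal process variable exceeds 4 sigma, and it samples from a normal shifted to the failure boundary, reweighting each hit by the ratio of the true density to the sampling density. The names, the threshold, and the shift are all assumptions for illustration; a real surrogate has many variables and an unknown failure region.

```python
import math
import random


def tail_prob_importance(threshold, n_samples, seed=0):
    """Estimate P(x > threshold) for x ~ N(0, 1) via mean-shifted sampling."""
    rng = random.Random(seed)
    shift = threshold  # center the sampling density on the failure boundary
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(shift, 1.0)
        if x > threshold:  # toy pass/fail "model"
            # weight = N(0,1) density / N(shift,1) density at x
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n_samples


exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # true 4-sigma tail probability
estimate = tail_prob_importance(4.0, 20000)

print(f"exact   : {exact:.3e}")
print(f"estimate: {estimate:.3e} from 20000 samples")
```

Plain sampling on the true pdf would expect fewer than one failure in 20,000 draws at this rate, so it could not resolve the tail at all with the same budget.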
Re the 3 questions and “direction for importance sampling”: one can learn a lot more than you may realize using just 200 samples. The key is to do the right style of post-processing on the MC sample data. The traditional (least-squares) view is that one needs >= n observations just for a linear model having n input variables, because with < n observations there are an infinite number of solutions to the least-squares problem “minimize sum_i((fhat(x_i) - f(x_i))^2)” (minimize the sum of squared errors). But with _regularization_, a term is added to the objective which aims to minimize the expected squared error on unseen observations. E.g. “minimize sum_i((fhat(x_i) - f(x_i))^2) + alpha * sum_j(|w_j|)” (the lasso formulation), where the w_j are the model weights. This is a convex optimization problem. The extra term minimizes the confidence bounds across the expected set of unseen observations.
This has big implications in practice: with very little data one can get a surprisingly decent estimate of the response and especially the important variables. This technique has become pervasive in the machine learning community, with extensive applications in image analysis and more. It has even been recently demonstrated in the CAD literature for linear functions [X. Li, DAC 2009], and nonlinear functions [T. McConaghy and G. Gielen, IEEE TCAD, Aug. 2009][T. McConaghy, GPTP 2009]. One can alter the least-squares objective in ways other than regularization too, e.g. explicit margin-maximization (such as in SVMs), or implicit margin-maximization (such as in random forests).
There are plenty of _other_ approaches to determine regions of interest for importance sampling on the cheap too: adaptively increasing each random variable’s stddev until approx 1/2 the samples are infeasible; uniform sampling in the 6-sigma hypercube or hypersphere; adding extra Gaussian components to the pdf, centered where infeasible samples were found; and cross-entropy variations of the above.
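A small sketch of the lasso idea on synthetic data, with fewer observations (40) than input variables (100). Everything here is an assumption made for illustration: the data are random, only variables 5 and 17 actually drive the response, the objective is scaled as (1/2n) * sum of squared errors + alpha * sum |w_j|, and the solver is plain coordinate descent rather than any particular tool’s implementation.

```python
import random


def soft_threshold(value, thresh):
    """Shrink value toward zero by thresh; the core of the lasso update."""
    if value > thresh:
        return value - thresh
    if value < -thresh:
        return value + thresh
    return 0.0


def lasso_cd(X, y, alpha, n_sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*sum|w_j|."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, starting from w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # correlation of column j with the residual, excluding w[j]'s own effect
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                w[j] = new_w
    return w


rng = random.Random(1)
n_obs, n_vars = 40, 100  # fewer observations than variables
X = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_obs)]
# Hypothetical response: only variables 5 and 17 matter, plus small noise.
y = [3.0 * row[5] - 2.0 * row[17] + 0.1 * rng.gauss(0.0, 1.0) for row in X]

w = lasso_cd(X, y, alpha=0.1)
top_two = sorted(range(n_vars), key=lambda j: abs(w[j]), reverse=True)[:2]
print("largest weights at variables:", sorted(top_two))
print("nonzero weights:", sum(1 for wj in w if wj != 0.0), "of", n_vars)
```

Ordinary least squares has no unique solution here, while the penalized fit concentrates weight on a few variables. How well this carries over to a real circuit depends on how sparse the true response actually is.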
If you're an engineer with a tight schedule and limited simulation budget, you don't need perfect answers to those questions, you need to get your design out the door quickly and with confidence. MC sampling with the right post-processing gives you this.
I will state again: there is a time and place for using MC methods, and a time and place for using active learning + response surfaces. If the design is tiny, active learning + response surfaces will work nicely (and MC too). If the design is moderate or large and you have 20,000 simulations to blow, then use active learning + response surfaces. If the design is moderate or large and time is important to you, then use MC methods.