DFT: the chip designer's new best friend

In his keynote at the recent Design, Automation and Test in Europe (DATE) conference, ARM CTO Mike Muller spent a little time saying thank you for some of the electronic design automation (EDA) tools out there. One perhaps unexpected beneficiary - because it usually gets no respect - was design for test (DFT).

Indeed, when DFT engineer John Ford spotted on Twitter how Mentor Graphics' Joe Sawicki had referred to the technique as instrumental to the future of chipmaking, John Blyler of Chip Design magazine, and probably others, thought it a little strange. As the man who has headed up Mentor's design for manufacturing (DFM) strategy for more than ten years, surely Sawicki meant that, and not DFT?

For Muller, DFT or test automation makes it possible to test chips. "We couldn't have hand-stitched all those scan chains," he said. But lots of EDA is about automation. The other aspect of DFT that has emerged in the past few years is its role in yield prediction and planning, and Muller very briefly alluded to this use for DFT. As yield more or less equals profit (assuming you're not in dumping territory), it is easy to believe that DFT has a crucial role in chipmaking.

I'd link to an interview I did with Mentor's chief scientist and test expert Janusz Rajski several years ago, but it has effectively disappeared offline. So, the rest of this post is that interview. I have done a more recent interview with him but it hasn't been published as yet, so I'll do a follow-up with changes. Hopefully, this should go some way to explaining why some people in the chip-design business suddenly love DFT. This is the pre-edit copy, so there might be some typos in it - I just took one out.

Interview

As chief scientist and director of design for test engineering at Mentor Graphics, Janusz Rajski has worked on many of the issues that face chip companies when trying to work out if their products will work in the field based on just seconds of time on an expensive tester. His work led to the development of test compression in a bid to keep a lid on the time it takes to run millions of test vectors on a chip. His work on test led to the president of his native Poland awarding him the title of professor of sciences in 2003.

It took many years for design engineers to get to grips with the idea of planning for production test. But now, for him, test is moving to the point where it underpins not just the means to ensure that chip-level products work, but also the effort to improve yields and make designs more manufacturable.

“In the past, people focused on cleanroom technology. But we now see a lot of systematic and parametric issues. Things have changed since we entered the age of subwavelength lithography,” said Rajski. It is not just the tiny distortions made to masks at the smaller geometries that are leading to differences in the way that devices fail. New materials, such as the copper used to form on-chip connections between gates, have had a dramatic effect on the way that chips fail.

“Every time a new material is introduced with a new process, it causes more problems than you would see with just a shrinking of geometry. Foundries differ in the effects that you see with copper but there tend to be a lot of open-type defects,” claimed Rajski. “There are two types of interconnect defect: bridges and opens. There are resistive lines where the lines are not fully open, so you need at-speed tests to see them, and some will be fully open and will show a memory-like behaviour.”

Those interconnect defects tend to be closely related to the physical layout. Two lines that run close to each other have a higher probability of having a conductive bridge form between them than two that are widely spaced. But if a solitary line has no metal around it, there is a good chance that it will suffer a break and become an open line, or a high-resistance one if a tiny strip of metal remains in place. This is caused by dishing: a consequence of the chemical-mechanical polishing (CMP) used to flatten layers before the next one is deposited.

Parametric problems were identified by Intel in a 2003 paper. As well as traditional defects, essentially caused by grit on the wafer, chips are now suffering from small variations in process parameters across the wafer. This may be caused by some parts of the wafer being polished slightly more aggressively at the edges than at the centre, or by implantation processes not being consistent across the wafer.

According to a report prepared by International Business Strategies, parametric problems can account for 20% of failed devices at 90nm, with another 20% due to systematic defects and just 10% failing because of classical, random defects.

Rajski explained: “In a functional node, you get a certain temperature distribution and the impact of a parametric change could be significant. If everything gets slow by a certain amount, the longer paths may exit the system timing. Some parts of the design will be fast and on the edge; others will be so fast that you are not concerned about changes.”

As parametric changes can be highly temperature dependent, there is a knock-on effect on test quality and its ability to pick out defective devices. “The temperature gradient may change dramatically based on the load. During test or operation, some parts of the device may be turned off. Therefore, the performance will be different.”

The additional, and possibly difficult to detect, yield loss from parametric variation would not be so bad if it were not for a counter-effect that is placing a greater focus on yield. “One of the big careabouts for semiconductor companies is the quality of shipped parts. Companies who are known for their quality have programmes to reduce the number of defective parts shipped by three to ten times.

“This push for higher quality is driven by consumer products. In the past, a customer return to Fry's was not a big deal. Not anymore. The retailers are very concerned about quality and semiconductor companies can lose business if they don’t have good quality.”

The connection between quality and yield is well established and means that, without extra attention, both will move in the wrong direction. “The quality of shipped product is directly related to yield. The escape rate on devices from wafers with poor yield is higher. So, people want to understand how to apply more demanding tests to control the quality of outgoing product,” said Rajski. “Today, they may just decide to scrap the entire wafer if it has poor yield.”

The information they need is there, he added: “At foundries, if wafers have very low yield, they may not send them to customers because they contain a lot of information they can use to diagnose problems. Test chips only have a fab-monitoring role. They don’t test features. Production chips are a goldmine of information, which, if properly extracted, can give us a lot of insight into why chips fail.”

The problem is that indirect means are needed to identify many of the new problems. But Rajski said he believes test has a central role: “Design for manufacture and yield management are very closely tied to design for test.”

The problem that faces chipmakers is that more types of defect mean more tests, and the amount of time that chips spend on the tester is already a problem. Can the industry work out ways to detect failures without watching the number of test patterns soar? Rajski said the number of tests is going up but there are ways to limit the effects.

The classical fault model was built around stuck-at tests and focused on the logic gates. The assumption of the stuck-at model was that a test pattern could quickly reveal whether a logic gate was broken by observing its output behaviour as its inputs were switched. As long as the gate was accessible from something like a scan chain, it could be tested. With a bit more work, gates that were not directly controllable could be tested by focusing tests on the logic in front of them.
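To make the stuck-at idea concrete, here is a small sketch in Python. The netlist, gate set and fault names are invented for illustration and are not taken from any real tool; the point is simply that a pattern detects a stuck-at fault when the faulty circuit's observed output differs from the fault-free response.

```python
# Toy stuck-at fault detection: the netlist, gates and fault names below are
# invented purely for illustration.
from typing import Dict, Optional, Tuple

# Each internal net is driven by one gate; nets are listed in topological order.
NETLIST = {
    "n1":  ("AND", ["a", "b"]),
    "n2":  ("OR",  ["n1", "c"]),
    "out": ("NOT", ["n2"]),
}

GATES = {
    "AND": lambda ins: int(all(ins)),
    "OR":  lambda ins: int(any(ins)),
    "NOT": lambda ins: int(not ins[0]),
}

def simulate(pattern: Dict[str, int],
             fault: Optional[Tuple[str, int]] = None) -> Dict[str, int]:
    """Evaluate the circuit for one input pattern, optionally forcing one
    net to a stuck-at value, e.g. fault=("n1", 0) for n1 stuck-at-0."""
    values = dict(pattern)
    if fault and fault[0] in values:              # stuck-at on a primary input
        values[fault[0]] = fault[1]
    for net, (gate, fanin) in NETLIST.items():
        values[net] = GATES[gate]([values[f] for f in fanin])
        if fault and fault[0] == net:             # stuck-at on an internal net
            values[net] = fault[1]
    return values

def detects(pattern: Dict[str, int], fault: Tuple[str, int],
            outputs=("out",)) -> bool:
    """A pattern detects a fault when any observed output differs from the
    fault-free response - the essence of the stuck-at model."""
    good, bad = simulate(pattern), simulate(pattern, fault)
    return any(good[o] != bad[o] for o in outputs)

# Does the pattern a=1, b=1, c=0 expose n1 stuck-at-0? Yes: out flips 0 -> 1.
print(detects({"a": 1, "b": 1, "c": 0}, ("n1", 0)))   # True
```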

“Traditionally, if you had a fault you would try to make sure it is detected once,” said Rajski. “This is moving to where you try to detect a fault multiple times and monitor what happens to the signal in other places.

“Previously, test was based on the netlist. Now, timing information has to be incorporated into test.”

The new fault models are built on top of the traditional stuck-at models but apply probabilities to identify faults that tend to cause delays rather than outright failures. Different test patterns will activate different paths through the node you are trying to test. If the path is long enough and has little slack, a delay that would be harmless on a shorter path will result in the wrong level appearing at the input of the flip-flop at the end of the long path. “If you are lucky, and you detect the wrong value at this point, you are done,” said Rajski. However, the tests do not rely on trying to probe a fault just once; they do it two or three times in different ways. By trying to detect the same fault using different patterns, you improve the probability of picking up a delay fault.
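The pay-off from multiple detection can be shown with some back-of-the-envelope arithmetic. Treating each detection attempt as independent, with the same per-attempt probability, is my simplifying assumption for illustration; real small-delay behaviour is messier.

```python
# Back-of-the-envelope view of multiple detection. Independence of detection
# attempts, and the per-attempt probability, are assumptions for illustration.

def multi_detect_probability(p_single: float, n_detections: int) -> float:
    """Chance that at least one of n detections through different paths
    catches a small-delay defect, if each attempt succeeds with p_single."""
    return 1.0 - (1.0 - p_single) ** n_detections

for n in (1, 2, 3, 5):
    print(n, round(multi_detect_probability(0.3, n), 3))
# 1 0.3, 2 0.51, 3 0.657, 5 0.832 - each extra detection helps,
# with diminishing returns.
```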

To help with these multiple-test techniques, automatic test pattern generation (ATPG) can use the results of static timing analysis to improve its chances of finding long, susceptible nets. However, that introduces some issues with the reliability of tests because static timing analysis brings with it the false-path problem. False paths are routes through the logic that can never be taken by the working chip, but the analyser flags them because it cannot determine, using static techniques alone, that they are unreachable. A related issue is that of multicycle paths, where the logic along a path may take two or more cycles to complete, but the designer has allowed for that situation.

“False paths may be captured on a tester and people can spend a long time trying to debug these issues,” said Rajski, adding that the number of false paths is generally just a few per cent, although he came across one design that had 60% false paths. The technique for dealing with false paths is similar to that used to improve the results of static timing analysis: make sure that known false and multicycle paths are flagged to the test tool. “At the moment, we treat false and multicycle paths the same way but we want to get to the point where we analyse multicycle paths and test their timing,” said Rajski, noting that it is possible for parametric effects and other delay problems to cause the timing of a multicycle path to stretch beyond its allotted number of clock cycles. Multicycle paths should not be a source of test escapes.
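As a rough illustration of what flagging paths to the test tool might look like, here is a hypothetical sketch: declared false paths are excluded from at-speed capture altogether, while multicycle paths keep a wider, but still checked, capture budget. The data structures and names are invented and do not represent any particular ATPG tool's interface.

```python
# Hypothetical sketch: feeding known false and multicycle paths into an
# at-speed test flow. Names and data structures are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimingPath:
    start: str          # launching flip-flop
    end: str            # capturing flip-flop
    delay_ns: float     # path delay reported by static timing analysis

FALSE_PATHS = {("cfg_reg", "dbg_mux")}         # never exercised functionally
MULTICYCLE = {("acc_reg", "result_reg"): 2}    # allowed this many clock cycles

def capture_budget(path: TimingPath, clock_ns: float) -> Optional[float]:
    """Return the capture window for a path, or None for a declared false
    path that the test should not capture at all."""
    key = (path.start, path.end)
    if key in FALSE_PATHS:
        return None
    cycles = MULTICYCLE.get(key, 1)
    return cycles * clock_ns    # multicycle paths get a wider budget,
                                # but their timing is still checked

print(capture_budget(TimingPath("acc_reg", "result_reg", 3.1), clock_ns=2.0))  # 4.0
print(capture_budget(TimingPath("cfg_reg", "dbg_mux", 1.2), clock_ns=2.0))     # None
```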

“There is a limit to multiple detect because in test we typically don’t have layout information,” said Rajski. There is not a lot of point trying to work out whether two lines are bridged, for example, if they never get close to each other. Traditionally, ATPG tools only needed to understand the netlist because defects were understood to be random. Who needed layout data when a fault could occur anywhere? With systematic and parametric failures, the defects now depend on the relative and absolute positions of on-chip features, respectively.

One source of that layout information lies in the physical extraction performed at the design-rule check (DRC) stage just before tapeout. These rules are gradually turning into design for manufacturing (DFM) rules, indicating where problems might occur if guidance on spacing and other parameters is not obeyed. One example is in identifying possible bridging sites. “The DFM rules show where you should focus effort.

“In the past we would take the netlist and run ATPG. Now we do bridge extraction and feed the results to ATPG,” explained Rajski. He cautioned that DFM data depends on the target process, which means chipmakers will end up needing different test programs if they move processes. “You have to set the rule-deck parameters for a specific process. If you outsource, you may have to change the deck as the external foundry’s defectivity rates may be very different from yours. That is not just the case for logic, but also for memory.”
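A toy sketch of what bridge extraction might involve: pick pairs of nets on the same metal layer that run in parallel within some critical spacing, and hand those pairs to ATPG as bridge-fault targets. The geometry model, layer names and thresholds below are invented for illustration; as Rajski notes, real rule decks are tuned per process.

```python
# Toy layout-driven bridge-site extraction: the geometry model and thresholds
# are invented for illustration only.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Segment:
    net: str
    layer: str
    y: float     # track position of a horizontal wire, in microns
    x0: float    # segment start
    x1: float    # segment end

def bridge_candidates(segments, max_spacing=0.10, min_parallel=1.0):
    """Return net pairs whose segments on the same layer run side by side
    closely enough, and for long enough, to be plausible bridge sites."""
    sites = set()
    for a, b in combinations(segments, 2):
        if a.net == b.net or a.layer != b.layer:
            continue
        spacing = abs(a.y - b.y)
        overlap = min(a.x1, b.x1) - max(a.x0, b.x0)   # parallel run length
        if spacing <= max_spacing and overlap >= min_parallel:
            sites.add(tuple(sorted((a.net, b.net))))
    return sites

segments = [
    Segment("data0", "M2", 0.00, 0.0, 5.0),
    Segment("data1", "M2", 0.08, 1.0, 6.0),   # close, long parallel run
    Segment("clk",   "M2", 3.00, 0.0, 6.0),   # far away, so ignored
]
print(bridge_candidates(segments))   # {('data0', 'data1')} -> hand to ATPG
```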

The rule deck also needs to be tuned to catch on-chip features that are most likely to result in faults, he said: “As you do yield learning, you will want the process to become more deterministic based on the extraction from layout. If a DFM rule is not significant, you won’t use it for targeting. But if it becomes a priority, you will want to use it in your rule deck.”

The result of this trend is that test is becoming more of a configurable process, rather than one where the ATPG software is run once the design is complete and then left as it is. Rather than scrapping a low-yield wafer, a chipmaker might decide to increase the number of tests to make sure at least some good chips can be salvaged, but without risking bad chips making it to the warehouse because they happen to escape the regular tests. “It does not require that you change the test dynamically for every wafer. For wafers with low yield, you may want to generate some additional test patterns and run those. You might decide to package the chips that pass the wafer sort process. A small number might fail but the package test could be enhanced,” said Rajski.
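A minimal sketch of that kind of yield-adaptive flow, with an invented threshold and pattern-set names, might look like this:

```python
# Minimal sketch of yield-adaptive test selection; the threshold and the
# pattern-set names are invented for illustration.

BASE_PATTERNS = ["stuck_at", "transition"]
EXTENDED_PATTERNS = BASE_PATTERNS + ["bridge_targeted", "extra_at_speed"]

def test_program_for_wafer(wafer_sort_yield: float,
                           low_yield_threshold: float = 0.80):
    """Healthy wafers keep the faster baseline program; low-yield wafers
    get extra patterns rather than being scrapped outright."""
    if wafer_sort_yield < low_yield_threshold:
        return EXTENDED_PATTERNS
    return BASE_PATTERNS

print(test_program_for_wafer(0.62))   # extended tests before deciding what to keep
print(test_program_for_wafer(0.93))   # baseline program
```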

Rajski said the concept of configurable test has been embraced by chipmakers for use with embedded memories, particularly with built-in self-test (BIST) cores. “Memory BIST has gone through several evolutionary phases. First they used fixed algorithms, then user-definable algorithms were introduced. But people noticed they didn’t know what defects to expect before a chip went to production and wanted to change the tests as they got more information. Then they started outsourcing and noticed different fabs had different sensitivities.

“So, they implemented field-selectable BIST. Usually the tests are designed to be fast, but sometimes you want more aggressive tests. And you tune the tests for each foundry.”

On-chip cores are also being used to speed up test programs and overcome the problems caused by having to use multiple-detect techniques. Test compression uses on-chip cores to decompress patterns supplied by an external tester and run tests in parallel. “With all the things happening in the fault models, the industry will rely on deterministic test with compression. It is an area that is still evolving. When we launched TestKompress, I envisaged compression of 100:1. Some work now approaches compression of three orders of magnitude and we have some techniques that can go higher than that.

“In the past, people thought logic BIST was the next-generation solution to manufacturing test. Our market research says not. People say they want to handle advanced fault models and that the volume of test is killing them. So, they realised they needed compression.”
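The headline ratios quoted above make more sense with some simple arithmetic: in a typical deterministic pattern only a small fraction of scan-cell bits are actually specified, and the compressed stimulus needs at least roughly that many bits. The calculation below is a best-case bound under that simplification, not a description of how TestKompress itself works.

```python
# Best-case compression arithmetic under a simplifying assumption: the
# compressed stimulus needs at least as many bits as there are specified
# "care" bits per pattern. Not a model of any real decompressor.

def compression_bound(scan_cells: int, care_bit_fraction: float) -> float:
    """Approximate upper bound on compression: total scan bits / care bits."""
    return scan_cells / (scan_cells * care_bit_fraction)

# With ~1% of bits specified, the bound is about 100:1; at 0.1% it reaches
# three orders of magnitude, in line with the figures quoted above.
print(round(compression_bound(100_000, 0.01)))    # 100
print(round(compression_bound(100_000, 0.001)))   # 1000
```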

When yields are bad, the other task chipmakers have to undertake is to work out why the devices are failing. The problem, Rajski noted, is that some of the defects that chips suffer from today are not easily detectable by eye. One common issue with leading-edge chips is that of vias not being filled properly, leading to opens or high-resistance connections that show up as delay or transition faults. These voids are undetectable until the chip has been sawn into cross-sections. However, electrical tests can narrow down the possible culprits that lead to a failed chip, if the information can be related to the layout.

“Existing fault models report behaviour with respect to nets. They give very long names, saying the output of this gate gave some issue in this way. What is needed is the next step, to go deeper. You do not want to see just that two nets can be bridged, but how they can be bridged.

“Once you have physical information added, you can see what can be bridged and what cannot. Lines may be running too far in parallel, or you are getting corner-to-corner interactions. You may find that there are a lot of bridges on metal two, and they are systematic. Once you have that information, you can use it to improve yield,” explained Rajski.

“You can do the same for transition faults. If you can zero in and see that they are caused by resistance between layers, then you can address the problem. The harder part is failure analysis. There you need to be very precise with the tests you use to identify the source of the problem. One customer generated new patterns to make sure they were looking at the right fault. Once you have destroyed the chip looking in the wrong place for a fault, there is nothing you can do.”

As quality and yield issues continue to take centre stage in optimising design, engineers will need to focus more on what test can uncover for them.