Power Conversion Reliability
The Need for Power Conversion Reliability:
Electronic equipment is often fundamental to business and commerce, where loss of function, service or data results in large monetary penalties and/or compromised safety. Quite often, systems are designed with redundancy so that no single failure event shuts the system down. However, even with redundancy, it is crucial that system operation not be compromised through propagation of a single point failure to other equipment via safety, fire, smoke, noise, or other issues. Power conversion equipment reliability is vital to realizing this goal.
Premature failure or wear out of power conversion equipment continues to be a major concern in various industries and applications. By its nature, power conversion equipment is subject to unique reliability challenges. Energy density, operating voltages, size and weight of components, thermal management, size and routing of conductors, and relatively low production volumes on unique designs are just a few of the issues that must be dealt with in providing a reliable and dependable power conversion system.
Historic Approaches to Power Conversion Reliability:
Historically, power conversion reliability was pursued through an "after-the-fact" approach of inspection, unit-by-unit parametric testing and static burn-in. In military products, the use of JAN-TX approved parts was often required, in the hope that parts manufactured in a highly controlled and traceable manner would result in a more reliable end product.
Eventually it became apparent that these approaches were not effective. No amount of inspection or unit-level testing was enough to overcome a design that was either overly complex or had inherent flaws that caused components to be used beyond their practical capabilities. To address these limitations, more focus was placed on understanding the underlying problems that cause unreliable operation and addressing these at the design stage of the product.
Numerous study groups were formed to determine best practices toward achieving power conversion reliability, such as the Naval Ad-hoc Committee for Power Supply Reliability, formed in the early 1980s. The result of these efforts was the identification of a series of best practices, such as those set out in the document "NAVMAT P4855A," which was eventually superseded by the current document "NAVSO P-3641A."
Total Quality Management:
Total Quality Management, or TQM, is the overriding philosophy toward achieving reliable products. A process built around TQM looks at all facets of product realization and optimizes these with respect to a balanced agenda of performance, cost and time. Crucial to TQM is a cross-functional approach to product design and manufacturing. This demands that all stakeholders in the product, such as Engineering, Manufacturing, Quality, Materials and the end-customer, be involved at the design phase so that the end product will adequately incorporate their concerns. In this way, issues such as design for manufacturability or design for test can properly be incorporated.
Conservative Design Margins:
Experience has shown that commercial parts can provide superb reliability if adequate margin is provided between their worst-case operating point and their rated capabilities. Maximum part ratings are often influenced by the competitive nature of the marketplace, where manufacturers will push the maximum ratings of their parts to the point where reliability begins to suffer. Providing adequate design margin lowers failure rates enough to support extremely high fielded reliability. The US Navy's document NAVSO P-3641A presents what many consider best-in-class recommendations for component de-rating guidelines.
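The margin check described above can be sketched in a few lines of Python. This is only an illustration: the part names, operating values and allowed stress ratios below are hypothetical and are not taken from NAVSO P-3641A or any other guideline. In practice the allowed ratio for each part type would come from whichever de-rating document the design follows.

```python
from dataclasses import dataclass


@dataclass
class PartStress:
    name: str
    worst_case: float  # worst-case operating value (e.g. volts), from design analysis
    rated: float       # manufacturer's maximum rating, in the same unit
    max_ratio: float   # allowed stress ratio (assumed value, set by the chosen guideline)


def stress_ratio(part):
    """Fraction of the rated capability used at the worst-case operating point."""
    return part.worst_case / part.rated


def violations(parts):
    """Names of parts whose stress ratio exceeds the allowed ratio."""
    return [p.name for p in parts if stress_ratio(p) > p.max_ratio]


if __name__ == "__main__":
    # Hypothetical parts and limits, for illustration only.
    parts = [
        PartStress("C12 voltage", worst_case=32.0, rated=50.0, max_ratio=0.70),
        PartStress("Q3 drain voltage", worst_case=480.0, rated=600.0, max_ratio=0.75),
    ]
    for p in parts:
        print(f"{p.name}: stress ratio {stress_ratio(p):.2f} (limit {p.max_ratio:.2f})")
    print("Violations:", violations(parts))
```

A check like this is most useful when it runs against worst-case values rather than nominal ones, since the margin of interest is the one that remains under the most stressful operating condition.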
A general rule of thumb in most systems is that as temperature increases, reliability decreases. The Arrhenius Model is often accepted as an accurate predictor of the reliability of semiconductors and other devices. This model covers many of the non-mechanical (or non-material fatigue) failure modes that cause electronic equipment failure. It is particularly useful in describing failure mechanisms that depend on chemical reactions, diffusion or migration processes. The model suggests that the rate at which a reaction occurs is given by the following equation: $R(T) = A\,e^{-E_A/(kT)}$, where $A$ is a scaling constant, $E_A$ is the activation energy of the failure mechanism, $k$ is Boltzmann's constant and $T$ is the absolute temperature.
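To make the temperature dependence concrete, the following Python sketch evaluates the Arrhenius relationship and uses it to compare reaction rates at two temperatures. The activation energy (0.7 eV) and the two temperatures are assumed values chosen purely for illustration; real activation energies depend on the specific failure mechanism. The function names are likewise hypothetical.

```python
import math

# Boltzmann's constant in eV/K, so activation energies can be given in eV.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_rate(scale, activation_energy_ev, temp_c):
    """Reaction rate R = A * exp(-E_A / (k * T)), with T converted to kelvin."""
    temp_k = temp_c + 273.15
    return scale * math.exp(-activation_energy_ev / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Ratio of the rate at the stress temperature to the rate at the use temperature.

    The scaling constant A cancels in the ratio, so it is set to 1.0 here.
    """
    return (arrhenius_rate(1.0, activation_energy_ev, stress_temp_c)
            / arrhenius_rate(1.0, activation_energy_ev, use_temp_c))


if __name__ == "__main__":
    # Assumed values: 0.7 eV activation energy, 55 degC use, 85 degC stress.
    af = acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=85.0)
    print(f"Acceleration factor: {af:.2f}")
```

Under these assumed inputs, a 30 °C rise speeds the modeled reaction up by a factor of roughly 8, which illustrates why modest reductions in operating temperature can have a large effect on failure mechanisms of this kind.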
Effective Design Tools:
Once a decision has been made to employ conservative design margins, it's important that this intention be effectively carried into the design. Modern computer-aided Electronic Design Automation (EDA) tools provide the means to implement effective layout design rules within complex designs. Likewise, there are many design analysis tools on the market that aid the designer in assuring components are operating within reliable limits.
Latent Failure Modes:
The ingredients that produce latent failure modes are in place at the time of the product's manufacture, but require the effects of time, temperature, humidity, vibration and other environmental factors before they result in compromised reliability. Typical latent failure modes in power supplies include:
- Compromised insulation due to dendrite growth or metallic migration
- Compromised insulation due to environmental effects or corona
- Semiconductor die cracking due to mechanical stress
- Semiconductor degraded performance due to humidity infusion
- Semiconductor degraded performance due to ESD exposure
- Electrolytic capacitor wear out
One of the more prevalent failure modes observed in power conversion equipment is compromised voltage insulation spacing. Product safety specifications such as EN-60950 or UL-1950 provide mandated spacing requirements from energized conductors to earth ground, along with recommended in-circuit spacing requirements for functional insulation. (Safety agency specifications normally allow in-circuit spacing to be violated if it can be demonstrated that compromising the spacing does not result in an unsafe condition.) Experience has shown that meeting this criterion is not necessarily a formula for extended reliability. Many environments are prone to airborne contaminants, and infusion of these contaminants is one of the leading causes of premature unit failure.
Beyond conductive particle infusion, gradual infusion of normally non-conductive dust, along with pre-existing sources of ionic contaminants, humidity and the presence of significant electrical fields within power conversion equipment can lead to conductive dendrite growth.
*Dendrites are microscopic conductive paths that form when ionic materials, in the presence of moisture and an electric field, dissociate into negatively and positively charged ions.
Latent Failure Modes in Semiconductors:
Semiconductor components have shown themselves to be particularly susceptible to latent failures, such as the well-documented effects of Electrostatic Discharge (ESD). Modern power semiconductors can be very cost-effective and reliable; however, over time the infusion of moisture through plastic package over-mold materials can be problematic. Care must be taken that the internal construction of the device is adequately protected so that infused moisture does not result in dendrite growth or corrosion.
Best-in-class supplies assure semiconductor reliability through a part qualification process that includes long-term testing under conditions of high voltage, temperature and humidity. Destructive physical analysis of the part's internal design features is also critical. Oftentimes a design weakness can be spotted through a critical review of parameters such as mask alignment, guard ring structures, clearances from conductors, etc.
Magnetic Component Reliability:
Magnetic components, such as transformers and inductors, are often found in greater numbers in power supplies than in other types of equipment. Safety agency requirements provide a good basis for reliable power design regarding winding insulation and spacing. However, there are a number of other areas that should also be considered, including the following:
- Corona inception and deterioration of thin sheet insulating material used to separate high frequency switching windings. This can become a problem at voltages as low as 200 V RMS, a situation quite often found in switching power supplies. Corona is a partial breakdown of air due to high electric field intensity. Microscopic air bubbles in thin sheet insulation can provide locations for corona inception. Corona discharges can begin to eat away at the insulation, leading to premature failure of the insulator.
- The effects of simultaneous aging and high switching flux densities on certain powdered iron core types. This can cause the binder used in the core material to degrade, ultimately leading to increased core and winding losses and potentially catastrophic component failure.
- Imperfections of wire or foil terminations causing mechanical abrasion of internal insulators, ultimately causing insulation punch-through.
- Core loss characteristics that transition from a negative temperature coefficient (power dissipated versus core temperature) to a positive coefficient. This can cause core permeability to drop off under extreme operating conditions, with corresponding unit failure.
Crucial to reliability assurance is testing of the design to determine its actual limits with regard to ambient temperature, vibration stress, input voltage (both surges and steady state), output current overload, and any other stressful parameters pertinent to the application. Tests of this kind are generally referred to as "Highly Accelerated Life Testing" or HALT. HALT is a destructive test that determines the margin between the product's intended environment and the point where it will fail. Crucial to effective HALT testing is a well-structured plan and the proper equipment to carry it out.
HALT testing should be conducted at a point during the product's development late enough that the sample being tested is a reasonable representation of the final design, but not so late that any design improvements uncovered cannot find their way into the final design.
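The idea of stepping a stress upward until the unit fails, and then comparing the failure point with the specified limit, can be sketched as follows. This is a simplified illustration under stated assumptions, not a HALT procedure: the `survives` callback stands in for whatever functional test the lab actually runs at each stress level, and the temperatures are made-up numbers.

```python
def first_failure_level(start, step, ceiling, survives):
    """Step the stress level upward and return the first level at which the unit fails.

    `survives(level)` is a hypothetical callback that returns True if the unit
    still operates correctly at that level. Returns None if no failure is
    observed up to and including `ceiling`.
    """
    level = start
    while level <= ceiling:
        if not survives(level):
            return level
        level += step
    return None


def design_margin(spec_limit, failure_level):
    """Margin between the specified operating limit and the observed failure level."""
    return failure_level - spec_limit


if __name__ == "__main__":
    # Assumed example: unit specified to 70 degC ambient, stepped in 10 degC increments.
    # The lambda simulates a unit that stops working at 112 degC and above.
    failed_at = first_failure_level(70, 10, 150, survives=lambda temp_c: temp_c < 112)
    if failed_at is None:
        print("No failure found within the chamber's range")
    else:
        print(f"First failure at {failed_at} degC, "
              f"margin {design_margin(70, failed_at)} degC")
```

The step size bounds how precisely the failure point is located: in this simulated run the unit is only known to fail somewhere above the last level it survived, so smaller steps give a tighter estimate at the cost of a longer test.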
Production Reliability Screening Testing:
Once HALT testing has identified the actual capacity limits of the design, this information is used to devise production reliability screening tests. NAVMAT suggested Environmental Stress Screening (or "ESS") as a means of screening production product. In ESS, the subject unit is exposed to alternating high and low operating temperatures, with modest transitional temperature ramps between tests.
While ESS provides a better screen than burn-in, it was ultimately determined that a more aggressive test that exposes the subject unit to faster temperature transitions, along with other stresses, was better at assuring unit performance. This test is generally referred to as "Highly Accelerated Stress Screening" or HASS.
A number of mathematical models are utilized to ascertain the effects of stresses imposed during HASS. Utilizing these, the HASS profile is generally designed to present the equivalent of 40 to 60 days of operation in the intended real-life environment. In this way, the unit is subjected to the infant mortality period of operation while it's still in the factory.
As it only takes a handful of field failures to corrupt a product line's average reliability performance in moderate production volumes (i.e., less than 10,000 units per year), it is crucial that HASS be applied to each and every production unit that is manufactured. Otherwise, crucial reliability information will be impossible to trend.
Best practice is achieved when reliability trends can be mapped back to the particular production lot where the trend surfaced. This requires a high degree of production control, along with traceability down to the component lot level. Production procedures that follow specific protocols in the event of uncovered trends must be put into place. These include stopping the production line, quarantine of specific units, elevation of the situation to the proper authorities, etc.
With adequate systems in place comes the opportunity to quarantine portions of produced product for corrective actions, as appropriate. In power supplies this capability is especially productive, as on occasion component suppliers (especially power semiconductor manufacturers) can lose their recipe on a lot-to-lot basis. By maintaining traceability, appropriate actions can be targeted at those units where they are required.
Adequate controls of the manufacturing process are crucial to effective reliability realization. Without these in place, processes such as HASS will only serve to screen out an unacceptable number of failed units, resulting in an untenable position from both the cost and time aspects.
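As a rough illustration of how lot-level traceability supports targeted quarantine, the sketch below groups screening results by component lot and flags lots whose failure rate exceeds a threshold. The record layout, lot identifiers and the 25% threshold are all hypothetical; a real system would draw these from the manufacturer's own traceability database and trending rules.

```python
from collections import defaultdict


def failure_rate_by_lot(records):
    """Fraction of screened units that failed, per component lot.

    `records` is an iterable of (serial_number, lot_id, passed) tuples,
    an assumed layout chosen for this example.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for _serial, lot_id, passed in records:
        totals[lot_id] += 1
        if not passed:
            failures[lot_id] += 1
    return {lot_id: failures[lot_id] / totals[lot_id] for lot_id in totals}


def lots_to_quarantine(records, threshold):
    """Lots whose screening failure rate exceeds `threshold`, sorted by lot id."""
    rates = failure_rate_by_lot(records)
    return sorted(lot_id for lot_id, rate in rates.items() if rate > threshold)


def units_in_lots(records, lot_ids):
    """Serial numbers of every unit built with one of the given lots, failed or not."""
    wanted = set(lot_ids)
    return sorted(serial for serial, lot_id, _passed in records if lot_id in wanted)


if __name__ == "__main__":
    # Hypothetical screening results.
    records = [
        ("SN1", "LOT-A", True), ("SN2", "LOT-A", True),
        ("SN3", "LOT-B", False), ("SN4", "LOT-B", True), ("SN5", "LOT-B", False),
    ]
    suspect = lots_to_quarantine(records, threshold=0.25)
    print("Suspect lots:", suspect)
    print("Units to quarantine:", units_in_lots(records, suspect))
```

Note that the quarantine list includes units that passed screening but share a suspect lot, which is the point of maintaining traceability: corrective action can be aimed at every affected unit rather than only at those that have already failed.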
Best-in-class reliability assurance practices, based on a Total Quality Management philosophy and including HALT and HASS testing, provide power system reliability levels well above the traditional expectations of military standards. Achieving these results requires a significant commitment by the supplier to provide the necessary infrastructure and expertise to support the comprehensive systems required.