#### **MSU / AAPS Reconfigurable Computing Demonstration**

Marshall Space Flight Center 1/7/10

### **Research Statement:**

### **"Exploit Reconfigurable Hardware to Create Resilient Computing Systems for Military & Aerospace Applications"**

**Presenters:** 

Dr. Brock J. LaMeres, Assistant Professor

Clint Gauer, MSEE Candidate (5/10)

Department of Electrical and Computer Engineering Montana State University Bozeman, MT







## **Overview of Project Funding**

• This work has been funded through a variety of NASA & Space Grant Programs:





- Space Grant Funding Requires NASA Mentor/Collaboration:
- Special thanks to our project mentor from NASA's Advanced Avionics & Processor Systems (AAPS) Project

**Dr. Andrew S. Keys** Marshall Space Flight Center AAPS Project Manager

• And also to the AAPS Individual Task Leaders

**Dr. Robert E. Ray** Marshall Space Flight Center Reconfigurable Computing Task

**Michael A. Johnson** Goddard Space Flight Center High Performance Processor Task



### **Overview of Work to Date**

- 1) Fall 2008 Capstone: "TMR Soft Processor System on an FPGA"
  - Tony Thomason, Colin Tilleman
  - EE & CpE Undergraduate Students
- 2) Spring 2009 Capstone: "64 Processor Computing System with Spatial Fault Avoidance"
  - Patrick Kujawa, Dan Dunbar, Dave Racek
  - CpE Undergraduate Students
- 3) Fall 2009 Capstone: "Dynamic Recovery of IO Faults using Spare Lines"
  - Sam Harkness, Devin Mikes, Jeff Bahr
  - CpE Undergraduate Students
- 4) Graduate Research Project: "Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery"
  - Clint Gauer
  - EE Graduate Student
- 5) Graduate Research Project: "Spatial Radiation Sensor"
  - Brian Peterson, Eric Gowens
  - EE Graduate Students



## **Overview of Work to Date (Project #1)**

1) Fall 2008 Capstone: "TMR Soft Processor System on an FPGA" Anthony Thomason & Colin Tilleman

**Summary:** Develop an FPGA-based computer system that can recover from emulated radiation-induced faults using triple modular redundancy of soft processors.

The system will continually service basic peripherals (keyboard & LCD) in the presence of faults. Upon a fault in a processor, the system will finish its current operation, reset & resynchronize the three processors, and continue operation.
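The voting step behind triple modular redundancy can be sketched in a few lines. This is a minimal illustration in Python, not the project's FPGA logic: the function names `majority_vote` and `find_faulted` are hypothetical, and the sketch assumes at most one of the three processors is faulted at a time.

```python
from typing import List, Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three processor outputs."""
    return (a & b) | (a & c) | (b & c)


def find_faulted(outputs: List[int]) -> Optional[int]:
    """Return the index of the processor that disagrees with the majority, or None."""
    voted = majority_vote(*outputs)
    for index, value in enumerate(outputs):
        if value != voted:
            return index
    return None


# Hypothetical example: processor 2 has one flipped bit in its output.
outputs = [0x3C, 0x3C, 0x34]
print(hex(majority_vote(*outputs)))  # 0x3c
print(find_faulted(outputs))         # 2
```

The voted value keeps the system output correct while the disagreeing processor is identified for reset and resynchronization.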







Lab Setup (Xilinx V5 FPGA)



**Block Diagram** 

## **Overview of Work to Date (Project #1)**

#### 1) Fall 2008 Capstone:

"TMR Soft Processor System on an FPGA" Anthony Thomason & Colin Tilleman

**Highlights:**

- successfully demonstrated to Robert Ray and Clint Patrick at Fall-08 Design Fair
- won 2<sup>nd</sup> place (\$500) in the IEEE NW Section Student Paper Contest (April 2009)





System Operation Measured by ChipScope Logic Analyzer

Fall 2008 Senior Design Fair



## **Overview of Work to Date (Project #2)**

- 2) Spring 2009 Capstone: "64 Processor Computing System with Spatial Fault Avoidance" *Pat Kujawa, Dan Dunbar, & David Racek* 
  - Summary: Develop an FPGA-based computer system that can recover from emulated radiation-induced faults using spare processors. The system should contain 64 soft processors. 3 of the processors will be active at any given time and be running in TMR. Upon a fault, the system will bring a spare processor online to replace the faulted processor. A GUI should be developed to induce faults and display the active, faulted, and spare processors.



NASA TMR

#### Initial Operation

- Processors 0, 1, and 2 are active (blue) and operating in TMR
- Processors **3-63** provide 61 spare *picoBlaze* processors (gray)



(showing address lines between uP and memory for all 64 processors)



NASA TMR

### Soft Fault Recovery

- Processors 0, 1, and 2 are active (blue) operating in TMR
- Processor **0** undergoes a soft fault and then recovers and resynchronizes





NASA TMR

### Hard Fault Recovery

- Processor 1 undergoes a hard fault (induced by GUI, red)
- The system shuts down uP #1 and brings on spare processor uP #3 into TMR





NASA TMR

### Multiple Hard Faults

- Multiple hard faults are present
- uPs 1, 6, and 12 form TMR
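The spare-replacement behavior shown in these slides can be sketched as simple bookkeeping. This is an illustrative Python model only: the class name `TmrPool` and the policy of always picking the lowest-numbered spare are assumptions (the slides show uP #3 replacing uP #1, but do not state the selection rule).

```python
class TmrPool:
    """Tracks which soft processors are active, spare, or faulted."""

    def __init__(self, total: int = 64, active_count: int = 3):
        self.active = list(range(active_count))          # uPs 0, 1, 2 run in TMR
        self.spares = list(range(active_count, total))   # uPs 3-63 wait as spares
        self.faulted = []

    def hard_fault(self, processor: int) -> int:
        """Retire a faulted active processor and bring the next spare online."""
        if processor not in self.active:
            raise ValueError(f"uP {processor} is not active")
        if not self.spares:
            raise RuntimeError("no spare processors left")
        replacement = self.spares.pop(0)  # assumed policy: lowest-numbered spare
        self.active[self.active.index(processor)] = replacement
        self.faulted.append(processor)
        return replacement


pool = TmrPool()
print(len(pool.spares))    # 61
print(pool.hard_fault(1))  # 3
print(pool.active)         # [0, 3, 2]
```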





# **Timing/Area Impact**

• **Soft Fault Recovery** (reset, reload variable information)

#### **Timing Overhead**

- TMR interrupt: 2 clocks
- Reset: 2 clocks
- Read variable data from good processors: 128 clocks (2 clks/inst, 64 bytes of RAM)
- Write variable data to reset processor: 128 clocks (2 clks/inst, 64 bytes of RAM)

(100 MHz V5 Clock)
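As a worked check of the soft fault recovery overhead, the sketch below adds up the four distinct steps listed on this slide and converts the result to time at the 100 MHz clock. The total is not stated on the slide; it is derived here under the assumptions that each step is counted once, that the steps run back to back, and that one instruction moves one byte.

```python
CLOCK_HZ = 100_000_000   # 100 MHz Virtex-5 clock, from the slide
CLOCKS_PER_INST = 2      # from the slide
RAM_BYTES = 64           # from the slide

# Assumption: one instruction per byte, giving 2 * 64 = 128 clocks per transfer.
transfer_clocks = CLOCKS_PER_INST * RAM_BYTES

steps = {
    "TMR interrupt": 2,
    "Reset": 2,
    "Read variable data from good processors": transfer_clocks,
    "Write variable data to reset processor": transfer_clocks,
}

total_clocks = sum(steps.values())
total_us = total_clocks / CLOCK_HZ * 1e6

print(total_clocks)        # 260
print(round(total_us, 2))  # 2.6
```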



## **Overview of Work to Date (Project #2)**

2) Spring 2009 Capstone: "64 Processor Computing System with Spatial Fault Avoidance" Pat Kujawa, Dan Dunbar, & David Racek

**Highlights:**

- successfully demonstrated to Robert Ray at Spring-09 Design Fair
- demonstrated at 09 Europa Jupiter Systems Mission (EJSM) Instrument Workshop
- published at 2009 MAPLD Conference



**GUI & System Operation Measured by ChipScope Logic Analyzer** 

Workshop. (Robert Ray shown here in front of demo)



## **Overview of Work to Date (Project #3)**

- 3) Fall 2009 Capstone: "Dynamic Recovery of IO Faults using Spare Lines" Sam Harkness, Devin Mikes, & Jeff Bahr
  - **Summary:** Develop an IO system that can continue to operate when a fault occurs on the physical lines of the bus (due to radiation strikes or broken conductors). The system should be able to detect faults and switch the active signals to spare lines on the bus. A GUI should be developed to monitor which lines of the IO system have been faulted.









Devin

Sam

Prototype System

IO Bus Implemented with Wires between two Virtex-5 FPGAs

GUI (Green=active, Red=faulted, gray=spare)



## **Overview of Work to Date (Project #3)**

**3)** Fall 2009 Capstone: "Dynamic Recovery of IO Faults using Spare Lines" *Sam Harkness, Devin Mikes, & Jeff Bahr* 

#### Theory of Operation:

1) Spare lines are included on the bus to be used in case of a line failure

2) A Hamming code, transmitted on the bus, is used to check for errors on the bus

- 3) When an error is detected, the system begins a detection & recovery process
  - Agent A sends all 1's
  - Agent B looks for all 1's, logs failures
  - Agent A sends all 0's
  - Agent B looks for all 0's, logs failures
  - Agent B sends all 1's
  - Agent A looks for all 1's, logs failures
  - Agent B sends all 0's
  - Agent A looks for all 0's, logs failures
  - The bus lines are remapped onto good lines

Total Time = $10 + n + \log(n) + \text{spare lines}$ clocks, where $n$ = # of lines on bus

Our system = 10 + 18 + 6 + 6 = 40 clocks
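The detection and recovery handshake above can be sketched as a small simulation. This is a hedged illustration, not the students' implementation: the names `probe_lines` and `remap` are hypothetical, a broken conductor is modeled as a line stuck at 0, and the bus is assumed to carry 18 signal lines plus 6 spares (the slide gives the numbers 18 and 6 but does not say whether the 18 includes the spares).

```python
from typing import Dict, List, Set


def probe_lines(stuck: Dict[int, int], n_lines: int) -> Set[int]:
    """Drive all 1's, then all 0's, and log every line that reads back wrong."""
    failed = set()
    for pattern in (1, 0):
        for line in range(n_lines):
            received = stuck.get(line, pattern)  # a healthy line passes the pattern through
            if received != pattern:
                failed.add(line)
    return failed


def remap(n_signals: int, n_spares: int, failed: Set[int]) -> List[int]:
    """Move each signal on a failed line to the next good spare line."""
    mapping = list(range(n_signals))
    spares = [s for s in range(n_signals, n_signals + n_spares) if s not in failed]
    for signal, line in enumerate(mapping):
        if line in failed:
            if not spares:
                raise RuntimeError("no good spare lines left")
            mapping[signal] = spares.pop(0)
    return mapping


N_SIGNALS, N_SPARES = 18, 6   # assumed split of the bus
faults = {15: 0}              # hypothetical fault: line 15 stuck at 0 (e.g. a pulled wire)

failed_a_to_b = probe_lines(faults, N_SIGNALS + N_SPARES)  # Agent A sends, Agent B checks
failed_b_to_a = probe_lines(faults, N_SIGNALS + N_SPARES)  # Agent B sends, Agent A checks
failed = failed_a_to_b | failed_b_to_a

mapping = remap(N_SIGNALS, N_SPARES, failed)
print(sorted(failed))  # [15]
print(mapping[15])     # 18
```

Testing both patterns in both directions catches lines stuck at either value and faults that affect only one driver.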

## **Overview of Work to Date (Project #3)**

3) Fall 2009 Capstone:

"Dynamic Recovery of IO Faults using Spare Lines" Sam Harkness, Devin Mikes, & Jeff Bahr

**Highlights:**

- successfully demonstrated to Robert Ray & Leigh Smith at Fall-09 Design Fair
- currently filing an invention disclosure with MSU (first time for the students)







System Demonstration at MSU Fall-2009 Design Fair (Sam Harkness giving Leigh Smith Demo)

IO bus intact, GUI indicates all lines good



Wire pull on line 15, GUI indicates fault and that a spare has been brought online

## **Overview of Work to Date (Project #4)**

- 4) Graduate Research Project: "Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery" *Clint Gauer*
  - **Summary:** Develop an FPGA-based computer system which can recover from, avoid, and repair radiation induced faults in both the circuit fabric and configuration SRAM. The system uses a many-core architecture where three soft processors run in TMR with *n* spares. Each processor resides in a partially reconfigurable *tile* on the FPGA. Upon a fault, the system brings on a spare processor to replace the faulted processor (SEU/TID recovery & avoidance). The faulted tile is then partially reconfigured to repair and re-introduce it as a spare (SEFI recovery).
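The tile lifecycle described in the summary (active, faulted, repaired by partial reconfiguration, returned to the spare pool) can be sketched as a small state model. This is an assumed illustration in Python: `TileManager`, its method names, and the lowest-numbered-spare policy are hypothetical, and the `repaired` flag stands in for whatever check the real system uses after reconfiguring a tile.

```python
from enum import Enum


class TileState(Enum):
    ACTIVE = "active"
    SPARE = "spare"
    FAULTED = "faulted"


class TileManager:
    """Bookkeeping for partially reconfigurable processor tiles (3 active + n spares)."""

    def __init__(self, n_tiles: int = 16, n_active: int = 3):
        self.state = {
            t: TileState.ACTIVE if t < n_active else TileState.SPARE
            for t in range(n_tiles)
        }

    def on_fault(self, tile: int) -> int:
        """Swap a spare in for a faulted active tile and return the spare's index."""
        if self.state[tile] is not TileState.ACTIVE:
            raise ValueError(f"tile {tile} is not active")
        spare = next((t for t, s in self.state.items() if s is TileState.SPARE), None)
        if spare is None:
            raise RuntimeError("no spare tiles left")
        self.state[tile] = TileState.FAULTED
        self.state[spare] = TileState.ACTIVE
        return spare

    def on_reconfigured(self, tile: int, repaired: bool) -> None:
        """After partial reconfiguration, a repaired tile rejoins the spare pool."""
        if self.state[tile] is not TileState.FAULTED:
            raise ValueError(f"tile {tile} is not faulted")
        if repaired:
            self.state[tile] = TileState.SPARE  # SEFI repaired by reconfiguration
        # otherwise the tile stays FAULTED and is avoided (e.g. TID damage)


tiles = TileManager()
print(tiles.on_fault(0))                # 3
tiles.on_reconfigured(0, repaired=True)
print(tiles.state[0].value)             # spare
```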







Lab Setup (Virtex-5 FPGA)



Block Diagram (3+13 soft processors)

FPGA Floor plan (16 picoBlaze processors)



### **Overview of Work to Date (Project #4a)**

4) Graduate Research Project: (2007-present)

"Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery" *Clint Gauer* 

#### System Operation of 3+13 <u>picoBlaze</u> Architecture: Recovery from SEU in Circuit Fabric (Processor 0)





## **Overview of Work to Date (Project #4a)**

4) Graduate Research Project: (2007-present)

"Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery" Clint Gauer

#### System Operation of 3+13 picoBlaze Architecture: Spatial Avoidance of Faulted Tile/uP



**SEFI or TID** on **Processors** 2, 4, 6, and 7





## **Overview of Work to Date (Project #4a)**

4) Graduate Research Project: (2007-present)

"Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery" *Clint Gauer* 

#### System Operation of 3+13 <u>picoBlaze</u> Architecture: SEFI Repair using Partial reconfiguration of faulted tile





## **Overview of Work to Date (Project #4b)**

- 4) Graduate Research Project: (2007-present)
- "Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery" Clint Gauer

#### System Operation of 3+1 microBlaze Architecture: Spatial Avoidance of Faulted Tile/uP



SEFI or TID on uP 1; uP 3 is brought online





### **Overview of Work to Date (Project #4)**

- 4) Graduate Research Project: (2007-present) "Many-Core Computing System using Partial Reconfiguration for fault detection, avoidance, and recovery" *Clint Gauer* 
  - **Highlights:**
    - published work twice at *Military & Aerospace Programmable Logic Devices* (*MAPLD*) Conference (08 & 09)
    - published work twice at *IEEE Aerospace* Conference (09 & 10 accepted)
    - this work will be submitted as Clint's Master's thesis in May 2010.





System Demonstration in MSU Research Lab 12/14/09 (Clint Gauer giving Robert Ray & Brock LaMeres Demo)



## **Overview of Work to Date (Project #5)**

- 5) Graduate Research Project: "Spatial Radiation Sensor" Brian Peterson, Eric Gowens
  - **Summary:** Develop a sensor which can give the location and trajectory of incoming radiation strikes. This sensor is designed to be used in conjunction with a many-core computing system. The computer system can use the spatial radiation information to more effectively avoid faults in the circuit fabric and repair faults in the configuration SRAM.







LaMeres, Smith, Gowens, and Kaiser At MSU 12/14/09

Sensor & Packaging Prototype

**Prototype System** 



## **Overview of Work to Date (Project #5)**

5) Graduate Research Project: "Spatial Radiation Sensor" Brian Peterson, Eric Gowens

**Highlights:**

- idea presented at 09 Europa Jupiter Systems Mission (EJSM) Instrument Workshop
- prototype demonstrated to Robert Ray & Leigh Smith at MSU on 12/14/09





• Background



# Motivation

- Radiation has a detrimental effect on electronics in space environments.
- The root cause is the creation of electron/hole pairs as the radiation strikes the semiconductor portion of the device and ionizes the material.





#### **Types**

- *Alpha particles* (Terrestrial, from packaging/doping)
- *Neutrons* (Terrestrial, secondary effect from Galactic Cosmic Rays entering atmosphere)
- *Heavy ions* (Aerospace, direct ionization)
- *Protons* (Aerospace, secondary effect)



# Motivation

- Two types of failure mechanisms are induced by radiation
  - 1) Total Ionizing Dose (TID)
    - The cumulative, long-term ionizing damage to the device materials
    - Caused by low-energy protons & electrons
  - 2) Single Event Effects (SEE)
    - Transient spikes caused by heavy ions and protons
    - Can be both destructive & non-destructive



# **Motivation (TID)**

#### 1) Total Ionizing Dose (TID)

- As the electron/hole pairs try to recombine, they experience different mobility rates  $(\mu_n > \mu_p)$
- Over time, the ionized particles can get trapped in the oxide or substrate of the device prior to recombination
- This can lead to:
  - Threshold Shifting
  - Leakage Current
  - Timing Skew





# **Motivation (SEEs)**

### 2) Single Event Effects (SEEs)

- Transient voltage/current induced in devices
- This can lead to both Non-Destructive and **Destructive effects**



#### **Non-Destructive**

- **Single Event Transient (SET)**
- **Single Event Upset (SEU)**
- **Single Event Func. Interrupt (SEFI):** A fault that cannot be recovered from using a reset.
- **Multi-Bit Upsets (MBU):** Multiple, simultaneous SEUs

#### **Destructive**

- **Single Event Latchup (SEL):** Transient biases the parasitic bipolar SCR in CMOS, causing latchup
- **Single Event Burnout (SEB):** Transient causes the device to draw high current, which damages the part
- **Single Event Gate Rupture (SEGR):** The energy is enough to damage the gate oxide





# **Mitigation of TIDs**

### 1) Current Mitigation Techniques (TID)

- Parts can be "hardened" to TID through:
  - layout techniques (sizing of Q<sub>crit</sub>, enclosed layout)
  - guard rings
  - substrate doping
  - redundant circuitry
- Parts are specified in terms of:
  - "the amount of energy that can be tolerated by ionizing particles before the part performance is out of spec"
  - units are given in krad (Si), typically 300krad+
- Shielding <u>Does</u> Help
  - low energy protons/electrons can be stopped at the expense of weight



# **Mitigation of SEEs**

### 2) Current Mitigation Techniques (SEEs)

- Triple Modular Redundancy (TMR)



- Reboot/Recovery Sequences

- Shielding Does NOT eliminate all SEEs
  - impractical to shield against high energy particles and Heavy Ions due to necessary mass



# **Drawback of Mitigation**

- Radiation Hardening = Slower Performance
  - All TID mitigation techniques lead to slower performance



- TID mitigation **DOES NOT** prevent SEEs



# **FPGAs & Radiation**

### Radiation Mitigation in FPGAs

- RAM based FPGAs are traditionally *soft* to radiation
- Fuse-based FPGAs provide some hardness, but give up the flexibility of real-time programmability



- Exploiting Reconfiguration
  - The flexibility of FPGAs enables novel approaches to radiation-tolerant computing (ex: Dynamic TMR, Spatial Avoidance of TID failures)
  - The flexibility of FPGAs is attractive to weight-constrained aerospace applications (ex: reduction of flight spares, internal spare circuitry)



## **FPGAs as a Solution?**

Field Programmable Gate Arrays





# **Many-Core Architecture**

### Radiation Tolerance Through Architecture





## **Many-Core Architecture**

Types of Radiation Faults Seen in FPGAs



#### 1) Soft (SEU, SET)

- SEUs that can be recovered from using a reset

#### 2) Medium (SEFI)

- SEUs in configuration memory, can only be recovered using reconfiguration

#### 3) Hard (TID / Displacement Damage)

- Damage to part of the chip due to TID or Displacement Damage
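The three fault classes above are defined by which recovery action clears them, so the classification can be written as a short decision function. This is a sketch under assumptions: `classify_fault` and its two callables are hypothetical stand-ins for "reset the processor and check it" and "partially reconfigure the tile and check it", and trying them in this order is an illustrative policy rather than something the slides specify.

```python
from typing import Callable


def classify_fault(
    reset_recovers: Callable[[], bool],
    reconfig_recovers: Callable[[], bool],
) -> str:
    """Classify a fault by the cheapest recovery action that clears it."""
    if reset_recovers():
        return "soft"    # SEU/SET: cleared by a reset
    if reconfig_recovers():
        return "medium"  # SEFI: cleared only by reconfiguration
    return "hard"        # TID / displacement damage: tile must be avoided


print(classify_fault(lambda: True, lambda: True))    # soft
print(classify_fault(lambda: False, lambda: True))   # medium
print(classify_fault(lambda: False, lambda: False))  # hard
```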



## **Potential Flight Computer**

• microBlaze Soft Processor





# **Final Acknowledgements**

• This work was supported by:



Montana Space Grant Consortium (NASA EPSCoR) http://spacegrant.montana.edu



NASA Exploration Systems Mission Directorate "Higher Education Program"

http://education.ksc.nasa.gov/esmdspacegrant/

• Special thanks to our project mentors from NASA's Advanced Avionics & Processor Systems (AAPS) Project

**Dr. Robert E. Ray** Marshall Space Flight Center Reconfigurable Computing Task

**Dr. Andrew S. Keys** Marshall Space Flight Center AAPS Project Manager

**Dr. Michael A. Johnson** Goddard Space Flight Center High Performance Processor Task





# **Questions?**







