Firmware

The project’s firmware is split into three parts: the bootloader, the main firmware and the test firmware.

PMBus command infrastructure

A common command handling infrastructure has been put in place, such that both the main firmware and the bootloader can easily implement different subsets of PMBus commands. The basic construct of this implementation is the cmd_t structure:

struct cmd_t

Public Members

const uint8_t addr

CMD code.

int8_t *const data_len

transaction length for this command

uint8_t *const data_pnt

pointer to data

const fp_t a_callback

invoked when accessing the command, before any data transfer

const fp_t w_callback

invoked after writing data

const fp_t r_callback

invoked after reading data

const uint8_t query_byte

data for the query command

const uint8_t wr_pec_disabled

always disable PEC for this command if non-zero

An array of these structs makes up a command space:

struct cmd_space_t

Public Members

const uint8_t n_cmds

holds number of commands implemented

cmd_t *const cmds

where the command structure list is stored

From the user’s point of view, these structures are defined and used just once, in the function

void setup_I2C_slave(cmd_space_t *impl_cmds)

This function configures the interrupt handlers listed below with the command space defined in the specific user implementation (main firmware or bootloader). From that point on, the only interaction is through the user-defined callbacks.

static void __xMR I2C_rx_complete(const struct i2c_s_async_descriptor *const descr)
static void __xMR I2C_tx_pending(const struct i2c_s_async_descriptor *const descr)
static void __xMR I2C_tx_complete(const struct i2c_s_async_descriptor *const descr)
static void __xMR I2C_error(const struct i2c_s_async_descriptor *const descr)
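As a rough illustration of how a command space might be declared and searched, the sketch below re-declares cmd_t and cmd_space_t from the member lists above. The fp_t signature, the example buffers, the zero placeholder for query_byte and the find_cmd helper are assumptions made for this example; they are not taken from the MoniMod sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed callback signature; the real fp_t may differ. */
typedef void (*fp_t)(void);

typedef struct {
    const uint8_t addr;            /* CMD code */
    int8_t *const data_len;        /* transaction length for this command */
    uint8_t *const data_pnt;       /* pointer to data */
    const fp_t a_callback;         /* invoked on access, before any transfer */
    const fp_t w_callback;         /* invoked after writing data */
    const fp_t r_callback;         /* invoked after reading data */
    const uint8_t query_byte;      /* data for the QUERY command */
    const uint8_t wr_pec_disabled; /* non-zero: always disable PEC */
} cmd_t;

typedef struct {
    const uint8_t n_cmds;          /* number of commands implemented */
    cmd_t *const cmds;             /* command structure list */
} cmd_space_t;

/* Hypothetical backing storage for two commands. */
static int8_t  page_len = 1;
static uint8_t page_data[1];
static int8_t  vout_len = 2;
static uint8_t vout_data[2];

/* query_byte is left at 0 here as a placeholder, not a real value. */
static cmd_t example_cmds[] = {
    { 0x00, &page_len, page_data, NULL, NULL, NULL, 0, 0 }, /* PAGE */
    { 0x8B, &vout_len, vout_data, NULL, NULL, NULL, 0, 0 }, /* READ_VOUT */
};

static cmd_space_t example_space = {
    sizeof example_cmds / sizeof example_cmds[0],
    example_cmds,
};

/* Hypothetical helper: linear search for a command code, NULL if absent. */
static cmd_t *find_cmd(const cmd_space_t *space, uint8_t code)
{
    for (uint8_t i = 0; i < space->n_cmds; i++) {
        if (space->cmds[i].addr == code) {
            return &space->cmds[i];
        }
    }
    return NULL;
}
```

In an actual implementation, a pointer to such a command space would be passed once to setup_I2C_slave, and unsupported command codes would simply not appear in the array.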

Bootloader

The bootloader, after bringing up the device, will check for the special word 0xBEC0ABCD in the flash storage (see struct below) and, depending on the value, will either hand control to the main FW, or enter remote programming (bootloader) mode.

struct user_flash_t

This struct defines 256 bytes of user data, stored in non-volatile memory, including a special 4-byte word which is used to turn on remote programming.

Public Members

uint32_t copy_fw

check if we want to enable the remote programming functionality

uint16_t setfrpms[3]

store fan configuration and speeds

uint8_t user_data[228]

provide some (optional) user data storage
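The boot-time decision described above can be sketched as follows. The struct repeats the documented members only; the helper name and the assumption that the presence of the special word selects bootloader mode are illustrative, and the actual flash placement and padding are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define COPY_FW_MAGIC 0xBEC0ABCDu  /* special word named in the text */

/* Members as listed above; placement in flash is not modelled here. */
typedef struct {
    uint32_t copy_fw;        /* remote programming flag word */
    uint16_t setfrpms[3];    /* fan configuration and speeds */
    uint8_t  user_data[228]; /* optional user data storage */
} user_flash_t;

/* Hypothetical helper: true when the bootloader should stay in
 * remote programming mode instead of starting the main firmware. */
static bool stay_in_bootloader(const user_flash_t *flash)
{
    return flash->copy_fw == COPY_FW_MAGIC;
}
```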

PMBus commands overview

The full list of PMBus commands implemented by the MoniMod can be found in Table 9 and Table 10.

All physical quantities (except output voltage, which is discussed below) are expressed in the 16-bit PMBus Linear data format (LINEAR11, Fig. 6), instead of the (somewhat more complex) Direct format PMBus also supports. An 11-bit mantissa (Y) and a 5-bit exponent (N), expressed in 2’s complement, form a floating-point number X according to \(X = Y \cdot 2^N\).

_images/linear.png

Fig. 6 The PMBus Linear data format
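For a host reading these values, decoding LINEAR11 might look like the sketch below. It assumes the standard PMBus bit layout (exponent N in bits 15:11, mantissa Y in bits 10:0); the function name is made up for this example.

```c
#include <stdint.h>

/* Decode a 16-bit LINEAR11 word into X = Y * 2^N. */
static double linear11_decode(uint16_t raw)
{
    int n = (raw >> 11) & 0x1F; /* 5-bit exponent field */
    int y = raw & 0x7FF;        /* 11-bit mantissa field */

    if (n > 15) {
        n -= 32;                /* sign-extend the exponent */
    }
    if (y > 1023) {
        y -= 2048;              /* sign-extend the mantissa */
    }

    /* Scale by a power of two without needing libm. */
    return (n >= 0) ? (double)y * (double)(1 << n)
                    : (double)y / (double)(1 << -n);
}
```

For example, the word 0xF804 carries N = -1 and Y = 4, which decodes to 2.0.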

The output voltages use a different, 21-bit format called LINEAR16 (in contrast to the 16-bit LINEAR11; both names are derived from the mantissa width). This format comprises a 5-bit 2’s complement exponent, reported by the 5 LSBs of the VOUT_MODE command, and a 16-bit unsigned mantissa, reported by READ_VOUT. Many COTS PSUs use LINEAR11 for everything, and this behavior can also be selected with a compile-time switch. In the MoniMod, the exponent for the LINEAR16 format is fixed to -10, so the reported voltage is obtained with \(X = Y \cdot 2^{-10}\), where Y is the 16-bit value returned by READ_VOUT.
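A host-side conversion of a READ_VOUT value could be sketched as below. The exponent is taken from bits 4:0 of the VOUT_MODE byte, which is where the PMBus specification places it; the function name is hypothetical.

```c
#include <stdint.h>

/* Convert a LINEAR16 mantissa (READ_VOUT) to volts, using the
 * 5-bit two's complement exponent from bits 4:0 of VOUT_MODE. */
static double linear16_decode(uint16_t mantissa, uint8_t vout_mode)
{
    int n = vout_mode & 0x1F;
    if (n > 15) {
        n -= 32; /* sign-extend: 0x16 becomes -10 */
    }
    return (n >= 0) ? (double)mantissa * (double)(1 << n)
                    : (double)mantissa / (double)(1 << -n);
}
```

With the MoniMod's fixed exponent (VOUT_MODE = 0x16), a READ_VOUT value of 0x3000 corresponds to 12288 / 1024 = 12.0 V.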

Table 9 PMBus commands implemented by the MoniMod

Cmd code  Command name        Transaction type        Data len  Description
00 +      PAGE                Byte write / read       1         set / get page
19 +      CAPABILITY          Byte read               1         capabilities of the device
1A +*     QUERY               Block w / r proc. call  1         query cmd props
20 +      VOUT_MODE           Byte read               1         read voltage format
3A +      FAN_CONFIG_1_2      Byte write / read       1         config fans 1&2
3B +      FAN_COMMAND_1       Word write / read       2         set fan 1 speed
3C +      FAN_COMMAND_2       Word write / read       2         set fan 2 speed
3D +      FAN_CONFIG_3_4      Byte write / read       1         config fan 3
3E +      FAN_COMMAND_3       Word write / read       2         set fan 3 speed
78 +      STATUS_BYTE         Byte read               1         status byte
7E +      STATUS_CML          Byte write / read       1         status CML
8B +      READ_VOUT           Word read               2         read voltage
8C +      READ_IOUT           Word read               2         read current
8D +      READ_TEMPERATURE_1  Word read               2         read temp. sensor 1
8E +      READ_TEMPERATURE_2  Word read               2         read temp. sensor 2
8F +      READ_TEMPERATURE_3  Word read               2         read temp. sensor 3
90 +      READ_FAN_SPEED_1    Word read               2         read fan 1 speed
91 +      READ_FAN_SPEED_2    Word read               2         read fan 2 speed
92 +      READ_FAN_SPEED_3    Word read               2         read fan 3 speed
96 +      READ_POUT           Word read               2         read power
99 +*     MFR_ID              Block read              var       manufacturer ID
9A +*     MFR_MODEL           Block read              var       model
9B +*     MFR_REVISION        Block read              var       revision
9C +*     MFR_LOCATION        Block read              var       location
9D +*     MFR_DATE            Block read              var       date
9E +*     MFR_SERIAL          Block read              var       serial number
AE +*     IC_DEVICE_REV       Block read              var       git commit id

Commands marked with + or * are:

+ - supported by main fw

* - supported by bootloader

Table 10 Manufacturer specific PMBus commands implemented by the MoniMod

Cmd code  Command name       Transaction type        Data len   Description
D1 *      WRITTEN_FW_SIZE    Word write / read       2          size of the FW to be written
D2 *      WRITTEN_FW_BLOCK   MultiByte write / read  8          FW block to be written
D3 *      WRITTEN_FW_CHKSUM  Word write / read       2          checksum of the written FW
D4 *      LOCAL_FW_CHKSUM    Word read               2          calculated checksum of the written FW
D5 +*     BOOT_NEW_FW        Byte write / read       1          on write turn on btldr pgm mode, reset; on read get which fw is running
D6 +*     UC_RESET           Byte write              1          reset the uC
D7 +      UPTIME_SECS        Block read              var(1+4)   get the uptime in seconds
D8 +      TMR_ERROR_CNT      Block write / read      var(1+4)   clear / get TMR error count
D9 +      USE_PEC            Byte write / read       1          turn PEC on / off
E0 +      TEMP_CURVE_POINTS  Block write / read      var(1+13)  set / get temp. curve points
E1 +      TEMP_MATRIX_ROW    Block write / read      var(1+7)   set / get temp. matrix points
E2 +      TC_ONOFF           Byte write / read       1          turn temp. control on / off

Commands marked with + or * are:

+ - supported by main fw

* - supported by bootloader

Block registers hold, in their first byte, the size of the data that follows. This also applies to registers with a fixed size, such as UPTIME_SECS.

Detailed list of PMBus commands

PAGE

Command code: 00
Transaction type: Byte write / read
Data length: 1

The PAGE command is used to select a power rail for the READ_VOUT, READ_IOUT and READ_POUT commands. Allowed values for the page parameter are \(0 \leq N \leq 3\) for the DI/OT Rad-Tol System Board revision and \(0 \leq N \leq 2\) for the fan-tray, RaToPUS and generic prototype revisions.

CAPABILITY

Command code: 19
Transaction type: Byte read
Data length: 1

The CAPABILITY command returns one byte of information with some key capabilities of a PMBus device.

Table 11 CAPABILITY Data byte format

Bit(s)  Meaning
7       Packet Error Checking is supported
6:0     Unused

Bit 7 of the CAPABILITY register reflects the use of PEC set by the USE_PEC register.

QUERY

Command code: 1A
Transaction type: Block w / r proc. call
Data length: 1

The QUERY command takes a command code as an argument and replies with information on the command: whether it is supported, if read or write is supported, and what data format it works with.

VOUT_MODE

Command code: 20
Transaction type: Byte read
Data length: 1

The VOUT_MODE command reports the format the device uses for data related to measured voltages. The 3 MSBs indicate whether that is Linear (0b000), VID (0b001) or Direct (0b010); in the MoniMod’s case they are always 0b000, for Linear format. The 5 LSBs return either 0x16 (-10 in 5-bit 2’s complement), when the LINEAR16 format is used (fully PMBus-compliant operation, fixed \(2^{-10}\) exponent); or 0x00, when the LINEAR11 format is used (common for COTS PSUs). See Linear for more details on the specifics of these formats.

FAN_CONFIG_n_m

Command codes: 3A, 3D
Transaction type: Byte write / read
Data length: 1

The FAN_CONFIG_1_2 and FAN_CONFIG_3_4 commands are used to configure the fans at positions 1, 2, and 3. The format of the configuration byte can be seen in Table 12. The two bits that set the tachometer pulses / revolution, which take the values 0–3, correspond to 1–4 pulses per revolution.

Table 12 FAN_CONFIG_1_2 and FAN_CONFIG_3_4 data byte format

Bit(s)  Value  Meaning
7       1      Fan 1 / 3 installed
        0      Fan 1 / 3 not installed
6       1      Fan 1 / 3 commanded in RPM
        0      Fan 1 / 3 commanded in duty cycle
5:4     0–3    Fan 1 / 3 tachometer pulses / rev
3       1      Fan 2 installed
        0      Fan 2 not installed
2       1      Fan 2 commanded in RPM
        0      Fan 2 commanded in duty cycle
1:0     0–3    Fan 2 tachometer pulses / rev
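Decoding one half of a FAN_CONFIG byte according to Table 12 might be sketched as follows. The struct and function names are invented for this example; the upper nibble describes fan 1 (or fan 3) and the lower nibble describes fan 2.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    installed;      /* fan present */
    bool    rpm_mode;       /* true: commanded in RPM, false: duty cycle */
    uint8_t pulses_per_rev; /* 1..4, from the 2-bit field value 0..3 */
} fan_cfg_t;

/* Decode the upper or lower nibble of a FAN_CONFIG_n_m byte. */
static fan_cfg_t fan_cfg_decode(uint8_t config, bool upper)
{
    uint8_t nib = upper ? (uint8_t)(config >> 4) : (uint8_t)(config & 0x0F);
    fan_cfg_t cfg;

    cfg.installed      = (nib & 0x8) != 0;
    cfg.rpm_mode       = (nib & 0x4) != 0;
    cfg.pulses_per_rev = (uint8_t)((nib & 0x3) + 1);
    return cfg;
}
```

For instance, a FAN_CONFIG_1_2 value of 0xD0 decodes to fan 1 installed, commanded in RPM, with 2 tachometer pulses per revolution, and fan 2 not installed.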

FAN_COMMAND_n

Command code: 3B, 3C, 3E
Transaction type: Word write / read
Data length: 2

The FAN_COMMAND_n commands set the desired speed of the attached fans. The value is given either in RPM (when the fan is configured to be commanded in RPM) or as a duty cycle, in the range 0–1000.

STATUS_BYTE

Command code: 78
Transaction type: Byte read
Data length: 1

The STATUS_BYTE command returns one byte of information with a summary of the most critical faults.

Table 13 STATUS_BYTE Data byte format

Bit(s)  Meaning
7:2     Unused
1       A communications, memory or logic fault has occurred
0       Unused

The value of STATUS_BYTE is derived from the other status registers (for now only STATUS_CML). To clear STATUS_BYTE, clear the corresponding information in those registers.

STATUS_CML

Command code: 7E
Transaction type: Byte write / read
Data length: 1

The STATUS_CML command returns one data byte with contents as follows:

Table 14 STATUS_CML Data byte format

Bit(s)  Meaning
7:6     Unused
5       Packet Error Check Failed
4       TMR Error
3:2     Unused
1       A communication fault
0       Unused

Write 1’s to the desired bit positions to clear the corresponding errors in the register.

Clearing the TMR Error flag sets the TMR_ERROR_CNT register to 0.
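The write-1-to-clear behavior can be sketched as below. The register struct and function names are hypothetical, and the derivation of STATUS_BYTE bit 1 from any set STATUS_CML bit is an assumption based on the description above, not the actual firmware logic.

```c
#include <stdint.h>

#define CML_PEC_FAILED 0x20u /* bit 5 */
#define CML_TMR_ERROR  0x10u /* bit 4 */
#define CML_COMM_FAULT 0x02u /* bit 1 */

/* Hypothetical container for the two registers involved. */
typedef struct {
    uint8_t  status_cml;
    uint32_t tmr_error_cnt;
} status_regs_t;

/* Handle a write to STATUS_CML: every 1 bit clears that flag. */
static void status_cml_write(status_regs_t *regs, uint8_t written)
{
    regs->status_cml &= (uint8_t)~written;
    if (written & CML_TMR_ERROR) {
        regs->tmr_error_cnt = 0; /* clearing the flag resets the counter */
    }
}

/* Assumed summary rule: bit 1 is set while any CML flag is set. */
static uint8_t status_byte(const status_regs_t *regs)
{
    return regs->status_cml ? 0x02u : 0x00u;
}
```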

READ_VOUT

Command code: 8B
Transaction type: Word read
Data length: 2

The READ_VOUT command is used to get the measured voltage of the rail indicated by the last PAGE command (by default that would be the first one).

READ_IOUT

Command code: 8C
Transaction type: Word read
Data length: 2

The READ_IOUT command is used to get the measured current of the rail indicated by the last PAGE command (by default that would be the first one).

READ_TEMPERATURE_n

Command code: 8D, 8E, 8F
Transaction type: Word read
Data length: 2

The READ_TEMPERATURE_n commands return the measured temperature from the three installed temperature sensors.

READ_FAN_SPEED_n

Command code: 90, 91, 92
Transaction type: Word read
Data length: 2

The READ_FAN_SPEED_n commands return the speed of an installed fan, or 0 if no fan is installed in the corresponding location.

READ_POUT

Command code: 96
Transaction type: Word read
Data length: 2

The READ_POUT command is used to get the measured power of the rail indicated by the last PAGE command (by default that would be the first one).

MFR_ID

Command code: 99
Transaction type: Block read
Data length: var

This returns the manufacturer ID string, “CERN (BE/CO)”.

MFR_MODEL

Command code: 9A
Transaction type: Block read
Data length: var

This returns the manufacturer model string, “DI/OT MoniMod”.

MFR_REVISION

Command code: 9B
Transaction type: Block read
Data length: var

This returns the manufacturer revision string.

MFR_LOCATION

Command code: 9C
Transaction type: Block read
Data length: var

This returns the manufacturer location string, “Geneva”.

MFR_DATE

Command code: 9D
Transaction type: Block read
Data length: var

This returns the manufacturer date string, which currently corresponds to the date of the last release (and not the build used, for example).

MFR_SERIAL

Command code: 9E
Transaction type: Block read
Data length: var

This returns a manufacturer serial string (currently unused, returns “123456789”).

IC_DEVICE_REV

Command code: AE
Transaction type: Block read
Data length: var

This returns the git commit ID of the firmware in use.

WRITTEN_FW_SIZE

Command code: D1
Transaction type: Word write / read
Data length: 2

Before writing a new FW binary through the bootloader, its size in bytes divided by 8 has to be given using this command. The written value can be read.

WRITTEN_FW_BLOCK

Command code: D2
Transaction type: MultiByte write / read
Data length: 8

A new binary is written to the bootloader in consecutive chunks of 8 bytes, using this command. The written data can be read.

WRITTEN_FW_CHKSUM

Command code: D3
Transaction type: Word write / read
Data length: 2

After setting the size of the FW binary with WRITTEN_FW_SIZE and writing it with the WRITTEN_FW_BLOCK command, its SYS-V checksum should be written with this command. This command also resets the write pointers. The written checksum can be read back. Please note that this is not the calculated checksum; the calculated checksum is available through LOCAL_FW_CHKSUM.

LOCAL_FW_CHKSUM

Command code: D4
Transaction type: Word read
Data length: 2

After writing the new firmware, this command can be used to get the checksum calculated by the MoniMod. Please note that it is up to the writer to verify (compare) the checksum.
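On the writer's side, the checksum to send with WRITTEN_FW_CHKSUM and to compare against LOCAL_FW_CHKSUM could be computed as in the sketch below. It follows the classic System V `sum` algorithm (byte sum folded to 16 bits); that the MoniMod uses exactly this folding is an assumption, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* System V style checksum: sum all bytes, then fold the 32-bit
 * sum into 16 bits twice so that carries are added back in. */
static uint16_t sysv_checksum(const uint8_t *data, size_t len)
{
    uint32_t s = 0;
    for (size_t i = 0; i < len; i++) {
        s += data[i];
    }
    uint32_t r = (s & 0xFFFFu) + (s >> 16);
    return (uint16_t)((r & 0xFFFFu) + (r >> 16));
}
```

After the upload, the writer would read LOCAL_FW_CHKSUM and compare it with its own result, since the MoniMod does not perform this comparison itself.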

BOOT_NEW_FW

Command code: D5
Transaction type: Byte write / read
Data length: 1

The BOOT_NEW_FW command changes the execution mode of the firmware. When the byte 0xAD is written to this register, the running firmware is switched between bootloader and main mode: a special code is written to or removed from the flash memory, which makes the bootloader either stay in bootloader mode or proceed to the main firmware. More information is available in the section Bootloader.

A read returns which firmware is currently running.

Value  Meaning
1      Bootloader
2      Main firmware

UC_RESET

Command code: D6
Transaction type: Byte write
Data length: 1

Writing the byte 0x5A to this register triggers a uC reset. Other writes are silently ignored.

UPTIME_SECS

Command code: D7
Transaction type: Block read
Data length: var(1+4)

Get the uptime of the MoniMod (in seconds).

The first byte of the register contains the number of bytes of data. For this register, its value is fixed to 4.
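A host could unpack the returned block as sketched below. The text does not state the byte order of the 4 data bytes; least-significant byte first is assumed here purely for illustration, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unpack an UPTIME_SECS block: 1 length byte followed by 4 data bytes.
 * ASSUMPTION: data bytes arrive least-significant byte first. */
static bool uptime_decode(const uint8_t block[5], uint32_t *secs)
{
    if (block[0] != 4) {
        return false; /* unexpected length byte */
    }
    *secs = (uint32_t)block[1]
          | ((uint32_t)block[2] << 8)
          | ((uint32_t)block[3] << 16)
          | ((uint32_t)block[4] << 24);
    return true;
}
```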

TMR_ERROR_CNT

Command code: D8
Transaction type: Block write / read
Data length: var(1+4)

When software mitigation through COAST is enabled (see TMR using COAST), the TMR_ERROR_CNT counter can be accessed using this command. Writing a 1 to the TMR Error bit in STATUS_CML also clears this register.

The TMR Error counter can be cleared by writing 0’s as data to this register.

The first byte of the register contains the number of bytes of data. For this register, its value is fixed to 4.

USE_PEC

Command code: D9
Transaction type: Byte write / read
Data length: 1

The SMBus specification indicates that a device’s PEC support could be enabled or disabled at will.

A read returns whether PEC is enabled.

Value  Meaning
0x00   PEC disabled
0x01   PEC enabled

To change the state of the PEC, the proper magic value has to be written to this register. All other values are silently ignored.

Value  Meaning
0x0F   Disable PEC
0x37   Enable PEC

The command itself is used without a PEC byte appended, no matter whether the function is enabled or not.
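The read and write behavior of this register can be summarized in a short sketch. The function names and the use of a plain boolean for the PEC state are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define USE_PEC_DISABLE 0x0Fu /* magic value: disable PEC */
#define USE_PEC_ENABLE  0x37u /* magic value: enable PEC */

/* Return the new PEC state after a write; unknown values change nothing. */
static bool use_pec_write(bool pec_enabled, uint8_t value)
{
    if (value == USE_PEC_ENABLE) {
        return true;
    }
    if (value == USE_PEC_DISABLE) {
        return false;
    }
    return pec_enabled; /* silently ignored */
}

/* Value reported on a read of USE_PEC. */
static uint8_t use_pec_read(bool pec_enabled)
{
    return pec_enabled ? 0x01u : 0x00u;
}
```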

TEMP_CURVE_POINTS

Command code: E0
Transaction type: Block write / read
Data length: var(1+13)

As described in the Temperature control section, the temperature curve can be set separately for each fan. To do this, the format in Fig. 7 has to be used.

_images/temp_curve_format.png

Fig. 7 Temperature curve data frame

The first byte of the data frame specifies the fan number (0-2); the following bytes contain the temperature curve data.

The first read after a write to this register returns the data just stored for that particular fan. Subsequent reads iterate over all fans and return their data.

The first byte of the register contains the number of bytes of data. For this register, its value is fixed to 13.

TEMP_MATRIX_ROW

Command code: E1
Transaction type: Block write / read
Data length: var(1+7)

As described in the Temperature control section, the temperature matrix can be set separately for each fan. The data format for the operation is illustrated in Fig. 8.

_images/temp_matrix_format.png

Fig. 8 Temperature matrix data frame

The first byte of the data frame specifies the fan number (0-2); the following bytes contain the temperature matrix data.

The first read after a write to this register returns the data just stored for that particular fan. Subsequent reads iterate over all fans and return their data.

The first byte of the register contains the number of bytes of data. For this register, its value is fixed to 7.

TC_ONOFF

Command code: E2
Transaction type: Byte write / read
Data length: 1

Using the TC_ONOFF command with a zero argument disables Temperature Control, while any non-zero value enables it.

Fan control PID

When the fans connected provide a tachometer output, fan speed control can be enabled. This is implemented using PID controllers, with each fan having its own instance. The main data structure of the PID implementation is

struct pid_cntrl_t

Public Members

float setpoint

controller setpoint

float last_input

the input of the last timestep

float output_sum

storage for integration

uint16_t id_cnt

timestep counter

This is used by the main software to set the PID setpoint, and by the PID controller to hold integration data. The main function that has to be called every timestep is described below:

float pid_compute(pid_cntrl_t *pid_inst, float input)

Compute the PID output for the next timestep.

Use this function with a PID structure and an input to calculate the output for each timestep.

Return

the PID controller output

Parameters
  • pid_inst: struct that holds the PID controller’s configuration

  • input: the current input to the PID controller
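To make the roles of the struct members concrete, the sketch below shows one common way such a controller can be written (integral accumulated in output_sum, derivative taken on the measurement). It is not the MoniMod implementation: the gains, the output limits, the function name and the way id_cnt is incremented are all assumptions.

```c
#include <stdint.h>

typedef struct {
    float    setpoint;   /* controller setpoint */
    float    last_input; /* input of the last timestep */
    float    output_sum; /* storage for integration */
    uint16_t id_cnt;     /* timestep counter */
} pid_cntrl_t;

/* Hypothetical gains and output limits, chosen only for illustration. */
#define PID_KP      1.0f
#define PID_KI      0.5f
#define PID_KD      0.25f
#define PID_OUT_MIN 0.0f
#define PID_OUT_MAX 1000.0f

static float clampf(float v, float lo, float hi)
{
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

/* One PID step; named differently from pid_compute on purpose. */
static float pid_compute_sketch(pid_cntrl_t *pid, float input)
{
    float error   = pid->setpoint - input;
    float d_input = input - pid->last_input;

    /* Integrate, clamping the sum to limit wind-up. */
    pid->output_sum = clampf(pid->output_sum + PID_KI * error,
                             PID_OUT_MIN, PID_OUT_MAX);
    pid->last_input = input;
    pid->id_cnt++; /* assumed: counts executed timesteps */

    return clampf(PID_KP * error + pid->output_sum - PID_KD * d_input,
                  PID_OUT_MIN, PID_OUT_MAX);
}
```

In such a scheme the caller would set setpoint from the commanded fan speed, pass the measured tachometer speed as input on every timestep, and apply the returned value as the new fan drive level.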

Temperature control

The MoniMod implements a very flexible temperature control scheme. Each fan can be assigned its own 3-point temperature–speed curve, as in Fig. 9. Temperatures below or above the set range adopt the speed of the minimum or maximum temperature point, respectively.

_images/temp_curve.png

Fig. 9 Temperature curve
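Evaluating such a curve could look like the sketch below. It assumes linear interpolation between the three points and strictly increasing temperatures; the type and function names are hypothetical, and the units are whatever the curve points are expressed in.

```c
/* One point of the curve: a temperature and the fan speed to apply. */
typedef struct {
    float temp;
    float speed;
} curve_point_t;

/* Evaluate a 3-point curve with clamping outside the set range.
 * Assumes p[0].temp < p[1].temp < p[2].temp. */
static float curve_speed(const curve_point_t p[3], float t)
{
    if (t <= p[0].temp) {
        return p[0].speed; /* below range: speed of the minimum point */
    }
    if (t >= p[2].temp) {
        return p[2].speed; /* above range: speed of the maximum point */
    }

    /* Pick the segment containing t and interpolate linearly. */
    const curve_point_t *a = (t < p[1].temp) ? &p[0] : &p[1];
    const curve_point_t *b = a + 1;
    return a->speed + (b->speed - a->speed) * (t - a->temp) / (b->temp - a->temp);
}
```

With the points (20, 200), (40, 500) and (60, 1000), a temperature of 30 yields a speed of 350, and any temperature above 60 yields 1000.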

Moreover, the temperature each fan considers for its curve is a weighted sum of all three monitored temperatures, as in Fig. 10. This allows one to easily configure the MoniMod to match a wide variety of fan / sensor setups, e.g.:

  • each fan is assigned its own temperature sensor

  • all three temperatures are averaged to give a more precise system temperature

  • one fan blows directly on a sensitive component which is monitored, the other two fans handle the rest of the system

_images/temp_matrix.png

Fig. 10 Temperature mixing matrix
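One row of the mixing matrix can be applied as in the sketch below, which treats the mixing as a weighted sum of the three sensor readings (consistent with the averaging use case in the list above). The function name and the use of floating-point weights are assumptions.

```c
/* Combine the three measured temperatures with one fan's weights. */
static float mix_temperature(const float weights[3], const float temps[3])
{
    float mixed = 0.0f;
    for (int i = 0; i < 3; i++) {
        mixed += weights[i] * temps[i];
    }
    return mixed;
}
```

A row of {1, 0, 0} ties a fan to the first sensor only, while a row with equal weights summing to 1 gives the average of all three sensors.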

Test firmware

To help with development, a test firmware has been written for a Feather M0 Basic minimal development board.

Mitigation measures

The MoniMod will be used in radiation environments. Although its function is not critical and it can be remotely reset upon loss of communication, some measures have been taken to minimize interruptions and data corruption, leading to an improved QoS.

TMR using COAST

The COAST LLVM passes can be optionally used to automatically implement TMR (Triple Modular Redundancy) in important and long-lived variables. This can particularly benefit the integrity of dynamic configuration data that gets set in the memory once and then gets read periodically, or state machines such as the one in the I2C interrupt handlers which is critical for stable communications.
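The principle behind TMR can be illustrated with a hand-written voter. COAST applies the redundancy automatically at the compiler level, so the sketch below only shows the idea of triplicating a value and taking a bitwise majority; it is not what COAST generates, and all names are hypothetical.

```c
#include <stdint.h>

/* Three redundant copies of one 32-bit value. */
typedef struct {
    uint32_t a, b, c;
} tmr_u32_t;

static void tmr_write(tmr_u32_t *v, uint32_t value)
{
    v->a = value;
    v->b = value;
    v->c = value;
}

/* Bitwise majority vote; also repairs a copy that was corrupted. */
static uint32_t tmr_read(tmr_u32_t *v)
{
    uint32_t voted = (v->a & v->b) | (v->a & v->c) | (v->b & v->c);
    v->a = voted;
    v->b = voted;
    v->c = voted;
    return voted;
}
```

A single bit-flip in any one copy is outvoted by the other two, which is why long-lived configuration data benefits from this scheme.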

NOPs and trampolines

The Program Counter is also sensitive to SEUs; in fact, execution can sometimes jump to an invalid address. To help mitigate failures caused by this mechanism, every region of unused memory space has been filled with NOP instructions, followed by a small trampoline function as an epilogue that resets the stack pointer and jumps to the device initialization code. Furthermore, the instruction that comprises the main loop has been placed at a “strategic” location, aligned to 0x8000: that way, a bit-flip in any of the lower bits will send execution to the upper memory region, which is filled with NOPs and ends at the trampoline.

Watchdog

The uC integrates a watchdog peripheral, which is fed every time the main timer callback runs, i.e. every 10 ms. The watchdog is set to trigger if it is not fed for 20 ms, that is, as soon as the main loop skips a beat. This ensures a quick revival of the uC and should lead to minimal downtime.

Blind scrubbing

Blindly scrubbing the configuration of peripherals can be used to reduce gradual corruption of their configuration during operation. The frequency has to be carefully selected to minimize downtime.

Note

This has not been implemented yet; this note is a reminder to do it.

Stack protection

The compilers’ stack protection feature is enabled to catch the corner case in which a loop goes awry because of an SEU and corrupts the stack. If that happens, the uC is quickly reset.

Toolchains

The project can be built with GCC and Clang / LLVM compilers; one can switch between the two simply by setting a Makefile variable. Note, however, that TMR only works with Clang.