Note that chip numbers reported by the MDSP run from 0 to 5 and 8 to 13.
How are error events marked by the ROD?
As event data arrives on the ROD, the data receivers route the data to eight Formatter FPGAs, which convert the serial data streams to a parallel format. As part of this process a number of error conditions are detected and flagged in the header word for each link (input data stream). The resulting partial event fragments are stored in a buffer pending their transmission to the EFB, which performs further error checking. The resulting data sits in a buffer until the router elects to move the event, either to a slave DSP for further processing or to the S-Link and hence off the ROD.
If an event trapped by a slave DSP is found to contain errors, it is not passed to the usual histogramming routine; instead it is passed to an error event decoder. This has two outputs: a stream of ASCII text which is passed to the slave's text buffer, and a special data block which is subsequently read out by the master DSP. (This data block can also be read out by the crate controller over VME; however, SctApi does not yet support this feature.)
During calibration, how will I know if I had any error events?
The present version of the MDSP code will abort the scan if a single event contains any errors. (This may change in the future, perhaps to allow a greater, but still small, number of errors to occur before the scan is aborted.) So your first observation will probably be that the scan has ended prematurely.
What if I want to know more?
The next place to look is the SctApiCrateServer? log file. On most SctRodDaq installations at CERN you will be able to open the latest log file with:

~/scripts/vitail (to open in vi)
~/scripts/katetail (to open in Kate)

Go straight to the bottom of the file, then do an upward search for
EVENT_ERROR: SDSP

This occurs at the start of a block of text generated by the MDSP which gives an indication of which module(s) associated with the quoted slave reported errors in that slave's last event, namely the event which caused the scan to abort. You should observe information as described below, although the first line (and probably some others) will be prefixed by a header message which shows the slot number of the affected ROD:
EVENT_ERROR: SDSP 1 count 1
sdsp 1 nErrEvents 1 nErrors 2
sdsp 1 link 2 (module 1 stream 0 sn 20220330200999) nErrEvents 1
sdsp 1 link 3 (module 1 stream 1 sn 20220330200999) nErrEvents 1

This particular example shows us that there has been one event for which errors were recorded, and that both sides of module 20220330200999 reported an error. In this context, link numbers run from 0 to 95, module numbers from 0 to 47 and stream numbers from 0 to 1. Note that, of course, there may be errors from more than one slave, or even more than one ROD, so you might have to repeat the upward search a few times to spot all the errors.
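The link-to-module mapping quoted above is simple arithmetic (each module occupies two consecutive links, which is consistent with the example), and the per-link lines can be picked apart mechanically. A minimal Python sketch, assuming the line layout shown in the example; the regular expression and function names are illustrative and not part of SctRodDaq:

```python
import re

# Pattern keyed to the per-link lines in the EVENT_ERROR block quoted above.
# The exact log layout may vary between versions; adjust as needed.
LINK_LINE = re.compile(
    r"sdsp (?P<sdsp>\d+) link (?P<link>\d+) "
    r"\(module (?P<module>\d+) stream (?P<stream>\d+) sn (?P<sn>\d+)\) "
    r"nErrEvents (?P<nerr>\d+)"
)


def decode_link(link):
    """Map a link number (0-95) to (module, stream), as in the text above."""
    return link // 2, link % 2


line = "sdsp 1 link 3 (module 1 stream 1 sn 20220330200999) nErrEvents 1"
m = LINK_LINE.search(line)
print(m.group("sn"), decode_link(int(m.group("link"))))
```

Running this on the example line recovers the serial number and the (module, stream) pair without hunting through the log by eye.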
That's not good enough. I need more detail!
If you are lucky, there will also be some text buffer output from the slaves. Unfortunately this does not always make it into the log file, for reasons which we do not fully understand. If it does appear, it should help to pinpoint the error. Example messages are shown below: note that a search for EVENT_ERROR will find all such entries.
EVENT_ERROR: header OK L1id %d BCid %d

Not an error message as such, but confirmation that the event header is valid. The L1id and BCid are also shown.
EVENT_ERROR: header bad %08x %08x %08x %08x: skipping

This is really bad news: something went wrong with the ROD's flow control procedures and the event got mangled. Call the DSP code experts!
EVENT_ERROR (0x%04x) Link %d Chip %d ABCD_ERROR: code %d

The event was found to contain a stream which was interpreted as an ABCD chip error for the quoted link and chip. There are four possible error codes; strictly, three of them are error codes and the fourth has a special meaning:
- 001 (1) - no data available, the chip did not receive L1A
- 010 (2) - Buffer Overflow (too many triggers?)
- 100 (4) - "Buffer Error (Soft Reset is needed)"
- 111 (7) - configuration readback mode, the chip did not receive the command to enter data taking mode
Unless you did something to send a bunch of closely spaced triggers to a really noisy module, chances are that if you see any of these error codes there is a problem with the integrity of the TX signal.
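The four codes above lend themselves to a small lookup table. A sketch in Python; the table and function name are mine, not part of any SctRodDaq tool:

```python
# ABCD error codes as listed above: decimal value -> (binary form, meaning).
ABCD_ERROR_CODES = {
    1: ("001", "no data available, the chip did not receive L1A"),
    2: ("010", "buffer overflow (too many triggers?)"),
    4: ("100", "buffer error (Soft Reset is needed)"),
    7: ("111", "configuration readback mode, the chip did not receive "
               "the command to enter data taking mode"),
}


def describe_abcd_error(code):
    """Return a human-readable description of an ABCD error code."""
    bits, meaning = ABCD_ERROR_CODES.get(code, ("???", "unknown code"))
    return f"{bits} ({code}): {meaning}"


print(describe_abcd_error(4))
```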
EVENT_ERROR (0x%04x) Link %d HEADER_ERROR:

This tag will be followed by one or more of the following:
- preamble
  - the event did not begin with 11101 (?)
  - event corruption, try opto tuning
- timeout
  - the link did not return any data
  - look for PS trips, failed opto links or fibre mapping errors
- L1error
- BCerror
EVENT_ERROR (0x%04x) Link %d TRAILER_ERROR:

This tag will be followed by one or more of the following:
- trailer error
  - the event did not end with 100000000000000 (?)
  - event corruption, try opto tuning
- header trailer limit error
  - the data rate at the input has exceeded the available bandwidth
  - look for noisy modules and/or check trigger conditions and flow control
- data overflow error
  - the data rate at the input has greatly exceeded the available bandwidth
  - look for noisy modules and/or check trigger conditions and flow control
EVENT_ERROR (0x%04x) Link %d Chip %d Chan %d RAW_ERROR

The data was so mangled that the ROD gave up. If this is the first error to be logged, it is probably due to event corruption and may be cured by opto tuning. However, if this follows a HIT_ERROR, the ROD gave up decoding the data as a result of the previous error. Again, you may find that opto tuning solves the problem, but you may also need to mask off the channel(s) associated with the HIT_ERROR.
EVENT_ERROR (0x%04x) Link %d Chip %d Chan %d HIT_ERROR

This tag will be followed by one or more of the following:
- first hit
  - once communication issues have been excluded, mask off channel n as listed above
- second hit
  - once communication issues have been excluded, mask off channel n+1
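The masking rule above (first hit: mask channel n; second hit: mask channel n+1) can be captured in a trivial helper. This is purely illustrative and not part of any SctRodDaq tool:

```python
def channel_to_mask(chan, which):
    """Given the Chan field n of a HIT_ERROR message and which hit was
    flagged ("first" or "second"), return the channel to mask off,
    following the rule quoted above."""
    if which == "first":
        return chan        # mask channel n
    if which == "second":
        return chan + 1    # mask channel n+1
    raise ValueError(f"unexpected hit flag: {which!r}")


print(channel_to_mask(5, "second"))
```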
I wish I'd never asked. Is there an easier way?
Trevor has recently contributed a script
~/scripts/errortail
which aims to strip out all the lines related to error events from the latest CrateServer? log file and stream them to the console. This may be tweaked a little over the coming weeks, but it's already a great help.
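If errortail is not available on your installation, the same idea can be sketched in a few lines of Python. Note that the log directory, file name pattern and tag below are assumptions for illustration; the real script's logic may differ, so adjust them to your installation:

```python
import glob
import os

# Hypothetical location of the CrateServer log files; adjust to taste.
LOG_GLOB = os.path.expanduser("~/logs/SctApiCrateServer*.log")


def latest_error_lines(pattern=LOG_GLOB, tag="EVENT_ERROR"):
    """Yield the error-event lines from the most recently modified log file
    matching pattern; yields nothing if no log file is found."""
    logs = glob.glob(pattern)
    if not logs:
        return
    newest = max(logs, key=os.path.getmtime)
    with open(newest, errors="replace") as fh:
        for line in fh:
            if tag in line:
                yield line.rstrip()


for line in latest_error_lines():
    print(line)
```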
I'm old-fashioned. Can I still look at ScanErrors.txt?
Yes, of course you can. At CERN try ~/scripts/logtail.
NB it's now called SctApi.ScanErrors?.log. This means it will be archived with the rest of the logs and will only contain errors from the current DAQ session.