
ErrorEventDecoder

This page describes the output of the Error Event Decoder which has now been incorporated into the DSP code running on the SCT ROD.

Note that the chip numbers reported by the MDSP run from 0 to 5 and from 8 to 13.

How are error events marked by the ROD?

As event data arrives on the ROD, the data receivers route the data to 8 Formatter FPGAs, which convert the serial data streams to a parallel format. As part of this process, a number of error conditions are detected and flagged in the header word for each link (input data stream). The resulting partial event fragments are stored in a buffer pending their transmission to the EFB, which performs further error checking. The resulting data sits in a buffer until the router elects to move the event, either to a slave DSP for further processing or to the slink and hence off the ROD.

If an event trapped by a slave DSP is found to contain errors, it is not passed to the usual histogramming routine; instead it is passed to an error event decoder. This has two outputs: a stream of ASCII text which is passed to the slave's text buffer, and a special data block which is subsequently read out by the master DSP. (This data block can also be read out by the crate controller over VME; however, SctApi does not yet support this feature.)

During calibration, how will I know if I had any error events?

The present version of the MDSP code will abort the scan if even a single event contains errors. (This may change in the future, perhaps to allow a greater but still small number of errors to occur before the scan is aborted.) So your first observation will probably be that the scan has ended prematurely.

What if I want to know more?

The next place to look is the SctApiCrateServer log file. On most SctRodDaq installations at CERN you will be able to open the latest log file with:

~/scripts/vitail   (to open in VI)
~/scripts/katetail (to open in Kate)
Go straight to the bottom of the file, then do an upward search for
EVENT_ERROR: SDSP
This occurs at the start of a block of text generated by the MDSP which gives an indication of which module(s) associated with the quoted slave reported errors in that slave's last event, namely the event which caused the scan to abort. You should observe information as described below, although the first line and probably some others will be prefixed by a header message which shows the slot number of the affected ROD:
EVENT_ERROR: SDSP 1 count 1
sdsp 1 nErrEvents 1 nErrors 2
sdsp 1 link 2 (module 1 stream 0 sn 20220330200999) nErrEvents 1
sdsp 1 link 3 (module 1 stream 1 sn 20220330200999) nErrEvents 1
This particular example shows us that there has been one event for which errors were recorded, and that both sides of module 20220330200999 reported an error. In this context, link numbers run from 0 to 95, module numbers from 0 to 47 and stream numbers from 0 to 1. Note that there may be errors from more than one slave, or even from more than one ROD, so you might have to repeat the upward search a few times to spot all the errors.
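The per-link summary lines quoted above follow a regular pattern, so they can be pulled out of the log mechanically. The sketch below is a minimal Python example and is not part of SctRodDaq: the function names are made up for illustration, and the link-to-module mapping (module = link // 2, stream = link % 2) is an assumption inferred from the example and the quoted ranges, not a documented rule.

```python
import re

# Matches the per-link lines of the MDSP summary block, e.g.
#   sdsp 1 link 2 (module 1 stream 0 sn 20220330200999) nErrEvents 1
# search() is used because lines may carry a header prefix (ROD slot number).
LINK_LINE = re.compile(
    r"sdsp (?P<sdsp>\d+) link (?P<link>\d+) "
    r"\(module (?P<module>\d+) stream (?P<stream>\d+) sn (?P<sn>\d+)\) "
    r"nErrEvents (?P<n_err_events>\d+)"
)


def parse_link_errors(lines):
    """Return one dict per per-link summary line found in `lines`."""
    records = []
    for line in lines:
        m = LINK_LINE.search(line)
        if m is None:
            continue
        rec = {}
        for key, value in m.groupdict().items():
            # Keep the serial number as a string; everything else is an integer.
            rec[key] = value if key == "sn" else int(value)
        records.append(rec)
    return records


def split_link(link):
    """ASSUMED mapping: two consecutive links per module (inferred, not documented)."""
    if not 0 <= link <= 95:
        raise ValueError("link number out of range 0..95: %d" % link)
    return divmod(link, 2)  # (module, stream)
```

Applied to the example block above, `parse_link_errors` yields two records, both with serial number 20220330200999, and `split_link(2)` and `split_link(3)` agree with the module and stream numbers printed in the log.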

That's not good enough. I need more detail!

If you are lucky, there will also be some text buffer output from the slaves. Unfortunately this does not always make it into the log file, for reasons which we do not fully understand. If it does appear, it should help to pinpoint the error. Example messages are shown below: note that a search for EVENT_ERROR will find all such entries.

EVENT_ERROR: header OK L1id %d BCid %d
Not an error message as such, but confirmation that the event header is valid. The L1id and BCid are also shown.

EVENT_ERROR: header bad %08x %08x %08x %08x: skipping
This is really bad news: something went wrong with the ROD's flow control procedures and the event got mangled. Call the DSP code experts!

EVENT_ERROR (0x%04x) Link %d Chip %d ABCD_ERROR: code %d
The event was found to contain a stream which was interpreted as an ABCD chip error for the quoted link and chip. There are four possible codes; strictly speaking, three of them are error codes and the fourth has a special meaning.

Unless you did something to send a bunch of closely spaced triggers to a really noisy module, chances are that if you see any of these error codes there is a problem with the integrity of the TX signal.
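If you want to tabulate these messages, the printf-style format quoted above translates directly into a regular expression. The following is a minimal Python sketch: the function name and the sample line in the comment are invented for illustration, and the hexadecimal word in parentheses is simply captured as a number, since its meaning is not described here.

```python
import re

# Regex equivalent of the format
#   EVENT_ERROR (0x%04x) Link %d Chip %d ABCD_ERROR: code %d
# %04x prints at least four hex digits; %d may in principle be negative.
ABCD_ERROR = re.compile(
    r"EVENT_ERROR \(0x(?P<word>[0-9a-fA-F]{4,})\) "
    r"Link (?P<link>-?\d+) Chip (?P<chip>-?\d+) "
    r"ABCD_ERROR: code (?P<code>-?\d+)"
)


def parse_abcd_error(line):
    """Return the fields of an ABCD_ERROR line as a dict, or None if no match."""
    m = ABCD_ERROR.search(line)
    if m is None:
        return None
    return {
        "word": int(m.group("word"), 16),  # hex word, meaning not documented here
        "link": int(m.group("link")),
        "chip": int(m.group("chip")),
        "code": int(m.group("code")),
    }


# Invented sample line, for illustration only:
#   parse_abcd_error("EVENT_ERROR (0x2041) Link 2 Chip 3 ABCD_ERROR: code 1")
```

Counting the returned records per link and chip is one way to see whether the errors cluster on a single module, which would be consistent with a TX signal problem on that module.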

EVENT_ERROR (0x%04x) Link %d HEADER_ERROR:
This tag will be followed by one or more of the following:

EVENT_ERROR (0x%04x) Link %d TRAILER_ERROR:
This tag will be followed by one or more of the following:

EVENT_ERROR (0x%04x) Link %d Chip %d Chan %d RAW_ERROR
The data was so mangled that the ROD gave up. If this is the first error to be logged, it is probably due to event corruption and may be cured by opto tuning. However, if this follows a HIT_ERROR, the ROD gave up decoding the data as a result of the previous error. Again, you may find that opto tuning solves the problem, but you may also need to mask off the channel(s) associated with the HIT_ERROR.

EVENT_ERROR (0x%04x) Link %d Chip %d Chan %d HIT_ERROR
This tag will be followed by one or more of the following:

I wish I'd never asked. Is there an easier way?

Trevor has recently contributed a script

~/scripts/errortail

which aims to strip out all the lines related to error events from the latest CrateServer log file and stream them to the console. This may be tweaked a little over the coming weeks, but it's already a great help.
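If the script is not available on your installation, the same idea is easy to approximate. The sketch below is not the errortail script and makes no claim about how that script works; it is a hypothetical Python filter that keeps lines containing EVENT_ERROR together with the MDSP summary lines described earlier on this page. The function names and the choice of patterns are assumptions.

```python
import re

# Lines of interest: anything tagged EVENT_ERROR, plus the MDSP summary
# lines ("sdsp N nErrEvents ..." and "sdsp N link M ...").
ERROR_LINE = re.compile(r"EVENT_ERROR|sdsp \d+ (?:link \d+|nErrEvents \d+)")


def error_lines(lines):
    """Yield, without the trailing newline, each line related to error events."""
    for line in lines:
        if ERROR_LINE.search(line):
            yield line.rstrip("\n")


def print_error_lines(path):
    """Print the error-event lines of the log file at `path` (path is up to you)."""
    # errors="replace" avoids a crash if the log contains stray non-text bytes.
    with open(path, errors="replace") as log:
        for line in error_lines(log):
            print(line)
```

Call `print_error_lines` with the path of the log file you want to inspect; the location of the latest log file depends on your installation and is not assumed here.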

I'm old-fashioned. Can I still look at ScanErrors.txt?

Yes, of course you can. At CERN try ~/scripts/logtail.

NB: it is now called SctApi.ScanErrors.log. This means it will be archived with the rest of the logs and will contain only errors from the current DAQ session.