9. Data reconstruction

We will now go through the general data reconstruction for data taken with the Septemboard detector. We start with a short introduction to the data analysis framework TimepixAnalysis developed for this thesis, sec. 9.1. Then we cover the parsing of the ASCII based data format produced by TOS (see appendix 17.2.1 for a description of the data format) in sec. 9.2. At this point we briefly discuss our expectation of the properties of the data recorded with the detector, sec. 9.3, as it motivates the kind of reconstruction that is performed. From there the explanation of the data reconstruction starts, sec. 9.4, including cluster finding (sec. 9.4.1) and the calculation of geometric properties (sec. 9.4.2).

The FADC data reconstruction follows in section 9.5. Finally, the scintillators are mentioned in section 9.6.

There is an additional long section in appendix 34 that goes through the software used for the data reconstruction, intended for people using these tools in the future. Appendix 35 shows how the full data reconstruction for all CAST data presented in this thesis is performed.

9.1. TimepixAnalysis and Nim

The data reconstruction software handling the processes mentioned in the remainder of this chapter is the TimepixAnalysis (Schmidt 2022c) 1 framework. It is only a "framework" in a loose sense, as it is a set of programs to parse data from different Timepix (usually GridPix) DAQ software packages and to process and analyze it. In addition, it contains a large number of tools to visualize that data, as well as to analyze auxiliary data like lists of data taking runs, CAST log files and more.

The entire code base is written in the Nim programming language (Rumpf 2022) 2. Nim is a statically typed, compiled language with a Python-like whitespace sensitive syntax, taking inspiration from Pascal and Modula, and in particular ideas regarding type safety from Ada. Further, it has a strong metaprogramming focus with full access to its abstract syntax tree (AST) at compile time, offering Lisp-like macro functionality. This makes it possible to construct powerful and concise domain specific languages (DSLs). Nim by default compiles first to C code, which utilizes the decades of compiler optimization techniques available via GCC (GNU Project 2022) or Clang (Lattner and Adve 2004), while being able to target every platform supported by C (which effectively means almost all). As such it achieves performance on par with C, while providing high-level features normally associated with languages like Python.

Nim was selected as the language of choice for TimepixAnalysis due to its combination of concise and clean syntax, high-level features for productivity, high performance thanks to native code generation, easy interfacing with existing C and C++ code bases and its strong metaprogramming features, which allow boilerplate code to be reduced to a minimum.

Appendix 34.3 contains a detailed overview of the important tools that are part of TimepixAnalysis and how they are used in the context of the data analysis presented in this thesis. Read it if you wish to understand how to recreate the results presented in this thesis.

9.2. TOS data parsing

The first part of the GridPix data reconstruction is the parsing of the raw ASCII data files, presented in appendix 17.2.1. This is implemented in the raw_data_manipulation program 3, part of TimepixAnalysis (Schmidt 2022c). Its main purpose is the conversion of the inefficient ASCII data format to the more appropriate and easier to work with HDF5 4 (The HDF Group 1997) format, a binary file format intended for scientific datasets. While this conversion is the main purpose, pixels with ToT or ToA values outside a user defined range can also be filtered out at this stage 5. Each data taking run is processed separately and is represented by one group (similar to a directory on a file system) in the output HDF5 file. See appendix section 34.3.4.1 for an explanation of the produced data layout.

As the data is being processed anyway, we also compute an occupancy map for each chip in the run at this stage. This allows for a quick glance at the (otherwise unprocessed) data.

9.2.1. Example of running raw_data_manipulation [0/1]    extended

  • [ ] GIVE EXAMPLE -> Need an actual run to work with. Use TPAResources?
raw_data_manipulation -p <path> --runType rtBackground --h5out /tmp/run_foo.h5

9.3. Expectation of event shapes

Based on the theoretical aspects of a gaseous detector as explained in chapter 6 and the expected kinds of signal sources at an experiment like CAST (see chapter 10), we can have a good expectation of the kinds of signals that a GridPix detector records for different types of events.

The signal source we are interested in for an axion helioscope are soft energy X-rays, below \(\SI{10}{keV}\). The main goal in later determining a background rate and computing a physics result from data is to filter out these X-rays from the rest of the data the detector records. The dominant source of background in any gaseous detector at surface level is due to cosmic muons.

Fortunately, muons and X-rays behave very differently in the detector. X-rays generally produce a single photoelectron, which creates further primary electrons in a local region. These drift under transverse diffusion to the readout plane, which effectively gives them a roughly circular shape. Muons on the other hand produce electrons (each of which produces further local primaries) along their entire path through the gas volume. Under most angles this implies their shape is very eccentric, i.e. 'track-like'.

Two example events, one of a \(\sim\SI{5.9}{keV}\) \cefe X-ray and the other of a typical muon, are shown in fig. 1.

gridpix_example_events.svg
Figure 1: Two example events one might see in the detector, left a common background event of a (likely) muon track, which enters the readout plane (hence the slightly triangular shape) and right a classical \(\SI{5.9}{keV}\) X-ray from a \cefe calibration source.

Given the distinct geometric properties of these different types of events and the fact that a GridPix provides extremely high spatial resolution and single electron efficiency, the data reconstruction fully embraces this. Most of the computed properties, which we will introduce in the next sections, are directly related to geometric properties of the events.

9.3.1. Generate example events for known events    extended

While we have two pretty nice events to plot as examples, in principle they are generated by karaPlot (an extensive plotting tool that is part of TPA) and thus not quite suited to a thesis.

As we know the run number and event number, we can just generate them quickly here. The two events (and plots) are:

  • background_event_run267_chip3_event1456_region_crAll_hits_200.0_250.0_centerX_4.5_9.5_centerY_4.5_9.5_applyAll_true_numIdxs_100.svg
  • calibration_event_run266_chip3_event5791_region_crAll_hits_200.0_250.0_centerX_4.5_9.5_centerY_4.5_9.5_applyAll_true_numIdxs_100.svg

so run 267 and 266, events 1456 and 5791.

Ah, I had forgotten that these are not the event numbers, but their indices of the clusters. Therefore, we'll just search for pretty events in the same runs.

# Laptop
#const calib = "/mnt/1TB/CAST/2018_2/CalibrationRuns/Run_266_181107-22-14"
#const back  = "/mnt/1TB/CAST/2018_2/DataRuns/Run_267_181108-02-05"
# Desktop
# All raw files found in `/mnt/4TB/CAST`. The two runs needed here copied to
from std/os import expandTilde
const calib = "~/CastData/data/2018_2/Run_266_181107-22-14" # calibration
const back  = "~/CastData/data/2018_2/Run_267_181108-02-05" # data

const cEv = 5898
const bEv = 1829 # this event is nice
import ingrid / tos_helpers
import std / [strformat, os, strutils, sequtils]
import ggplotnim

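# Construct the path to the TOS data file `dataXXXXXX.txt` for event number `i`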
proc toFile(i: int, path: string): string =
  let z = align($i, 6, '0')
  path / &"data{z}.txt"

proc drawPlot() =   
  let protoFiles = readMemFilesIntoBuffer(@[toFile(cEv, calib.expandTilde), toFile(bEv, back.expandTilde)])
  var df = newDataFrame()
  var names = @["X-ray", "Background"]
  for (pf, name) in zip(protoFiles, names):
    let ev = processEventWithScanf(pf)
    let pix = ev.chips[3].pixels
    if pix.len == 0: return
    let dfL = toDf({ "x" : pix.mapIt(it.x.int), "y" : pix.mapIt(it.y.int),
                     "ToT" : pix.mapIt(it.ch.int), "type" : name })
    df.add dfL
  echo df
  ggplot(df, aes("x", "y", color = "ToT")) +
    facet_wrap("type") +
    geom_point() +
    xlim(0, 256) + ylim(0, 256) +
    #theme_font_scale(2.0) +
    #margin(left = 3, bottom = 3, right = 5) +
    margin(right = 3.5) + 
    #facetHeaderText(font = font(12.0, alignKind = taCenter)) +
    xlab("x [Pixel]", margin = 1.5) + ylab("y [Pixel]", margin = 2) + 
    legendPosition(0.88, 0.0) +
    themeLatex(fWidth = 0.9, width = 800, height = 400, baseTheme = singlePlot) + 
    ggsave("/home/basti/phd/Figs/reco/gridpix_example_events.pdf", width = 800, height = 400, useTeX = true, standalone = true)

drawPlot()

9.4. Data reconstruction

With the data stored in an HDF5 file after processing the raw data with raw_data_manipulation, the actual data and event reconstruction can begin. This is handled by the reconstruction 6 program. It continues from the HDF5 file created before and proceeds to reconstruct all runs in the given input file.

For each run, each GridPix chip is processed sequentially, while all events for that chip are then processed in parallel using multithreading. For each event, the data processing is essentially a two step process:

  1. perform cluster finding, see section 9.4.1.
  2. compute geometric properties for each found cluster, see section 9.4.2.

9.4.1. Cluster finding

The cluster finding algorithm splits a single event into possibly multiple clusters. Clusters are defined based on a certain notion of distance (the details depend on the clustering algorithm used). The multiple clusters from a single event are then treated fully equally for the rest of the analysis. The fact that they originate from the same event has no further relevance (with a slight exception for one veto technique, which utilizes clustering over multiple chips, more on that in section 12.5.3).

There are two different cluster finding algorithms implemented for use in TimepixAnalysis. The default one is strictly used for the general cluster finding as part of the reconstruction; the other is intended to be used for one of the vetoes (again, sec. 12.5.3). The choice is user configurable, however. 7

Default
The default one is the same clustering algorithm introduced for the data reconstruction of the 2014/15 GridPix detector in (Christoph Krieger 2018). It defines a cluster by all pixels connected via squares of side length \(N\) centered around each pixel. It is best thought of as a recursive square neighbor search: for each neighbor found in the search square around a pixel, another search square is started. Once no neighbor finds any neighbors not already part of the cluster, the cluster is finished (see the sketch after this list).
DBSCAN
The secondary clustering algorithm is the \textbf{D}ensity-\textbf{b}ased \textbf{s}patial \textbf{c}lustering of \textbf{a}pplications with \textbf{n}oise (DBSCAN) (Ester et al. 1996) algorithm. In contrast to the default algorithm it is – as the name implies – a density based algorithm. This means it distinguishes points which have many neighbors (high density) from those with few neighbors (low density). The algorithm has a parameter minSamples, which defines the density threshold. If a point has at least minSamples neighbors within a (euclidean) distance of \(ε\) (the second parameter) it is considered a "core point". All core points build a cluster with all other points in their reach. Those points in reach of a core point, which do not themselves have minSamples neighbors, are still part of the cluster. Any point not in reach of a core point is a "noise point". The main advantage of this algorithm over many other more classical algorithms is the ability to separate clusters close to one another, which are not separable by a linear cut. This results in a more humanly "intuitive" clustering. DBSCAN is one of the most widely used clustering algorithms in many scientific fields and even in 2017 was still considered highly relevant (Schubert et al. 2017).
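
To illustrate the default algorithm, the following is a minimal, self-contained sketch of such a recursive square neighbor search (Pixel and findCluster are illustrative names; the actual TimepixAnalysis implementation differs in data types and details):

import std / [sets, deques]

type Pixel = tuple[x, y: int]

proc findCluster(pixels: seq[Pixel], start, n: int,
                 assigned: var HashSet[int]): seq[int] =
  ## Collect the indices of all pixels connected to `start` via
  ## search squares of side length `n` centered on each pixel.
  let half = n div 2
  var queue = initDeque[int]()
  queue.addLast start
  assigned.incl start
  while queue.len > 0:
    let i = queue.popFirst()
    result.add i
    for j in 0 ..< pixels.len:  # check all not yet assigned pixels
      if j notin assigned and
         abs(pixels[j].x - pixels[i].x) <= half and
         abs(pixels[j].y - pixels[i].y) <= half:
        assigned.incl j   # neighbor found: part of the cluster
        queue.addLast j   # and itself a new search center

Calling findCluster for each not yet assigned pixel of an event then splits the event into its clusters.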

Another clustering algorithm (currently not implemented) is CLASSIX (Chen and Güttel 2022), which promises fast clustering based on sorting along the first principal component axis. Based on its properties as presented in its paper it could be an extremely useful algorithm for our application and should be investigated in the future.

9.4.1.1. CLASSIX clustering algorithm [/]    extended
  • [ ] OR SHOULD THIS GO INTO MAIN SECTION BEFORE? OR AS :optional:?

There is one further clustering algorithm, which is extremely exciting and seems like a great candidate for a clustering algorithm for TimepixAnalysis. That is the CLASSIX algorithm, introduced as "a fast and explainable clustering method" (Chen and Güttel 2022).

See the GitHub page with many examples here: https://github.com/nla-group/classix

It is an algorithm which first sorts the data along its first principal component.

9.4.1.2. Clustering bug in MarlinTPC for 2014/15 data    extended

One of the initial goals of TimepixAnalysis was the reproduction of the background rate computed for the 2014/15 data with the MarlinTPC framework. While that whole ordeal wasted a lot of time trying to achieve the exact same results from both frameworks to satisfy other people, among other things it led to the discovery of a clustering bug in MarlinTPC (which was finally the point that let me drop this pursuit).

  • [ ] INSERT DISCUSSION FROM STATUSANDPROGRESS ABOUT MARLIN CLUSTER BUG See section sec:marlin_vs_tpa_output in TPA for the Marlin clustering bug.

9.4.2. Calculation of geometric properties

For each individual cluster the geometric event reconstruction is up next. As the basic differentiator between X-rays and common background events is their circularity, most properties are in some sense related to how eccentric clusters are. Therefore, the first thing to be computed for each cluster is the rotation angle 8.

The rotation angle is found via a non-linear optimization of

\begin{align*} x'_i &= \cos(θ) \left( x_i - \bar{x} \right) · P - \sin(θ) \left( y_i - \bar{y} \right) · P \\ y'_i &= \sin(θ) \left( x_i - \bar{x} \right) · P + \cos(θ) \left( y_i - \bar{y} \right) · P \end{align*}

where \(θ\) is the rotation angle (in the context of the optimization the parameter to be fitted), \(x_i, y_i\) the coordinates of the \(i\text{-th}\) pixel in the cluster, and \(\bar{x}, \bar{y}\) the center coordinates of the cluster. \(P = \SI{55}{μm}\) is the pixel pitch of a Timepix. The resulting variables \(x'_i, y'_i\) define a new rotated coordinate system. From these coordinates, the RMS 9 of each of these new axes is computed via

\begin{align*} x_{\text{RMS}} &= \sqrt{ \frac{1}{N} \left( \sum_i x^{\prime2}_i \right) - \frac{1}{N²} \left( \sum_i x'_i \right)² }\\ y_{\text{RMS}} &= \sqrt{ \frac{1}{N} \left( \sum_i y^{\prime2}_i \right) - \frac{1}{N²} \left( \sum_i y'_i \right)² }. \end{align*}

Based on these we then simply redefine

\begin{align*} σ_{\text{transverse}} &= \text{min}(x_{\text{RMS}}, y_{\text{RMS}}) \\ σ_{\text{longitudinal}} &= \text{max}(x_{\text{RMS}}, y_{\text{RMS}}), \end{align*}

which then define the eccentricity \(ε\) as (see also fig. 2(b))

\[ ε = \frac{σ_{\text{longitudinal}}}{σ_{\text{transverse}}}, \]

guaranteeing \(ε \geq 1\).

During the non-linear optimization, the algorithm attempts to maximize the eccentricity. In a track-like cluster, the maximum eccentricity is found at the rotation angle \(θ\) which points along the longest axis of the cluster. The resulting rotated coordinate system after the fit has converged is illustrated in fig. 2(a).
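
To make the optimization target explicit, the following is a minimal sketch (not the TimepixAnalysis implementation) of the quantity that is maximized over \(θ\); any one dimensional non-linear optimizer applied to it yields the rotation angle:

import std / [math, stats]

proc eccentricityAt(theta: float, xs, ys: seq[float]): float =
  ## Eccentricity of a cluster in the coordinate system rotated by `theta`,
  ## following the equations above. The pixel pitch P cancels in the ratio
  ## and is therefore omitted.
  let xc = xs.mean()
  let yc = ys.mean()
  var xRot = newSeq[float](xs.len)
  var yRot = newSeq[float](ys.len)
  for i in 0 ..< xs.len:
    xRot[i] = cos(theta) * (xs[i] - xc) - sin(theta) * (ys[i] - yc)
    yRot[i] = sin(theta) * (xs[i] - xc) + cos(theta) * (ys[i] - yc)
  # 'RMS' (i.e. standard deviation, cf. footnote 9) along each rotated axis
  let xRms = xRot.standardDeviation()
  let yRms = yRot.standardDeviation()
  result = max(xRms, yRms) / min(xRms, yRms)  # σ_longitudinal / σ_transverse ≥ 1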

Once the rotation angle and therefore the rotated coordinate system of a cluster is defined, most other properties follow in a straightforward fashion. In the rotated coordinate system the long axis of the cluster is called "longitudinal" and the short axis "transverse" in the following. The higher moments skewness and kurtosis are computed for each axis, as well as the length and width of the cluster based on the largest spread of pixels along each axis. In addition to the geometric properties a few other properties, like the number of pixels, are also computed. Three of the most important variables are illustrated in fig. 2. These enter the likelihood cut definition as we will see in sec. 12.1.

Figure 2(a): Rotated axes
Figure 2(b): Eccentricity
Figure 2(c): Fraction in transverse RMS
Figure 2(d): Length divided by transverse RMS
Figure 2: Schematic explanation of the basic cluster reconstruction and the three most important geometric properties. 2(a) defines the rotated coordinate system along the long and short cluster axes, found by non-linear optimization. From the spread along these axes, 2(b), the transverse standard deviation $σ_{\text{transverse}}$ is computed, which enters the eccentricity as the ratio of longitudinal over transverse standard deviation. 2(c) shows the definition of a less obvious variable: the fraction of pixels within a circle of one $σ_{\text{transverse}}$ radius around the cluster center. Similarly, 2(d) shows the full cluster length, defined by the furthest active pixels in the cluster, divided by $σ_{\text{transverse}}$ as another variable. These three variables enter the likelihood cut used for background suppression.

The following is a list of all properties of a single cluster computed by the reconstruction tool. The ig prefix is due to the internal naming convention. All but the likelihood, charge and energy properties are computed during the first pass of the tool, namely in the context discussed above. 10

igEventNumber
The event number the cluster is part of (multiple clusters may share the same event number).
igHits
The number of pixels in the cluster.
igCenterX / igCenterY
The center position of the cluster along the x / y axis of the detector.
igRotationAngle
The rotation angle of the long axis of the cluster over the chip coordinate system.
igLength
The length of the cluster along the long axis in the rotated coordinate system, defined by the furthest pixel at each end in that direction.
igWidth
The equivalent of igLength for the short axis.
igRmsLongitudinal
The root mean square (RMS) along the long axis.
igRmsTransverse
The RMS along the short axis.
igSkewnessLongitudinal / igKurtosisLongitudinal
The skewness / kurtosis along the long axis.
igSkewnessTransverse / igKurtosisTransverse
The skewness / kurtosis along the short axis.
igEccentricity
The eccentricity of the cluster, defined by the ratio of the longitudinal RMS over the transverse RMS.
igLengthDivRmsTrans
The length of the cluster divided by the transverse RMS (see fig. 2(d)).
igFractionInTransverseRms
The fraction of all pixels within a radius of the transverse RMS around the center (see fig. 2(c)).
igTotalCharge
The sum of the ToT calibrated charges of all pixels in the cluster (see sec. 9.4.4).
igEnergyFromCharge
The calibrated energy of the cluster in \(\si{keV}\) (see sec. 11.1).
igLikelihood
The likelihood value of the cluster for the likelihood cut method, explained in detail in section 12.1.

After the calculation of all geometric properties for all events and chips, the data is written to an output HDF5 file (similar in format to the output of raw_data_manipulation) for each run. This concludes the first pass of reconstruction over the data. See appendix section 34.3.5.2 for an explanation of the produced data layout.

9.4.3. Example of data reconstruction [/]    extended

  • [ ] GIVE EXAMPLE
reconstruction -i <h5file> --out /tmp/reco_foo.h5

9.4.4. Data calibration

The next step of the reconstruction is the data calibration. This is a separate pass over the data, as it is optional on the one hand, and on the other requires further inputs beyond the raw data for each GridPix in use (different calibration files).

There are different calibrations to be performed:

  1. the charge calibration via the application of the ToT calibration as introduced in section 8.1.1.
  2. the calculation of the gas gain, introduced previously in section 8.1.2 and more in sec. 11.2.2.
  3. the energy calibration (see sec. 11.1 and 11.3).

The ToT calibration is in principle performed simply by converting each ToT value to an equivalent charge in electrons using the calibration as presented in section 8.1.1. For each GridPix used in a detector, a ToT calibration must be available.

TimepixAnalysis comes with a library and helper program, which manage a simple database of different GridPixes, their calibrations and their validity (in time and the runs they apply to). The user needs to add the chips for which they wish to perform a ToT calibration to the database before the calibration can be performed. See appendix 34.3.10 for a detailed overview. For any chip in the database, the ToT calibration is a single pass over the ToT values of all runs. This generates a calibrated charge for every pixel of every cluster and a combined property, the totalCharge, i.e. the full charge of each cluster.
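
In essence, this pass amounts to the following sketch, where totToCharge is a hypothetical stand-in for the inverted ToT calibration function of sec. 8.1.1, with parameters taken from the chip database:

proc calibrateCluster(
  tots: seq[int], totToCharge: proc (tot: int): float
): tuple[charges: seq[float], totalCharge: float] =
  ## Convert the ToT value of every pixel into a charge in electrons
  ## and accumulate the combined `totalCharge` property of the cluster.
  for tot in tots:
    let q = totToCharge(tot)  # charge in electrons for this pixel
    result.charges.add q
    result.totalCharge += q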

Gas gain values are computed in \(\SI{90}{min}\) time intervals for each chip. This strikes a good balance between sufficient statistics and reduced sensitivity to variations in gas gain due to external effects. As this deserves its own discussion, more on it in sec. 11.2.2.

Finally, while the energy calibration is also handled by reconstruction, we will cover it in section 11, due to its more complex nature.

9.4.4.1. Example of data calibration [/]    extended

The following three commands perform the three calibration steps mentioned above:

ToT calibration
--only_charge
Gas gain calculation
--only_gas_gain
Energy calibration
--only_energy_from_e
reconstruction -i <h5file> --only_charge
reconstruction -i <h5file> --only_gas_gain
reconstruction -i <h5file> --only_energy_from_e

9.4.5. Event duration

During the reconstruction of the data, another important parameter is computed, namely the duration of each individual event. In principle each event has a fixed length, because the Timepix uses a shutter based readout with a predefined shutter length. However, as the FADC is used as an external trigger to close the shutter early if it records a signal, all events with an FADC trigger have a shorter duration.

For events of fixed length the duration is computed from the shutter length as indicated by TOS. In appendix 17.2.1, listing 2, the shutterTime and shutterMode fields are listed. These define the absolute length of the shutter opening in (effectively) a number of clock cycles. The shutterMode acts as a modifier to the number of clock cycles:

\[ t_{\text{clocks}}(\mathtt{mode}, t) = 256^{\mathtt{mode}} · t \]

where \(t\) is the shutterTime and \(\mathtt{mode}\) corresponds to the shutterMode. The available modes are:

  • short: \num{0}
  • long: \num{1}
  • verylong: \num{2}

In case the FADC triggered, the number of clock cycles from shutter opening up to the trigger is also reported in the data files, see appendix sec. 17.2.1, listing 3. With the number of clock cycles the shutter was open, the total event duration can then be computed in either case via:

\[ d(t_{\text{clocks}}) = \frac{t_{\text{clocks}} · 46}{40 · \num{1000000}}. \]
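
Expressed in code, the two formulas combine as follows (a minimal sketch; names are illustrative). For example, shutterTime = 2 in long mode corresponds to \(256^1 · 2 = 512\) clock cycles:

import std / math

type ShutterMode = enum smShort = 0, smLong = 1, smVeryLong = 2

proc eventDuration(shutterTime: int, mode: ShutterMode, fadcClocks = -1): float =
  ## Event duration in seconds. If the FADC triggered, `fadcClocks`
  ## (clock cycles from shutter opening to the trigger) replaces the
  ## predefined shutter length.
  let tClocks = if fadcClocks >= 0: fadcClocks
                else: 256 ^ ord(mode) * shutterTime
  result = tClocks.float * 46.0 / (40.0 * 1_000_000.0)

doAssert eventDuration(2, smLong) == 512.0 * 46.0 / 40e6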

9.5. FADC reconstruction

The data files created from the FADC data sent upon a trigger are essentially memory snapshots of the circular register of the FADC. We will go through the necessary steps to convert that raw data into usable signals, given the FADC settings we use and the data TOS generates from it. See appendix sec. 17.2.1.1 for an overview of the raw FADC data files. For a detailed overview of the FADC readout process see the FADC manual (CAEN 2010) 11.

In TimepixAnalysis FADC data is automatically parsed from the ASCII data files into HDF5 files as part of raw_data_manipulation if FADC files are present. The spectrum reconstruction is done automatically as part of the reconstruction program, but calculation of the baseline, rise and fall time is an optional step.

9.5.1. FADC pedestal calculation

As alluded to in sec. 8.2, the pedestal values need not be taken from a pedestal run recorded before data taking, but can also be extracted from real data, under the condition that a decent fraction of FADC registers in a single FADC event is on the baseline and normally distributed between events.

The idea is to look at an ensemble of values for each register, taken from different events, and to remove for each register all those events in which it was involved in a real signal. Due to the cyclic nature of the FADC registers, different registers capture the signal in each event. At least for typical signals recorded with a GridPix the signal lengths are \(\mathcal{O}(\SI{10}{\percent})\) of the window length, leaving plenty of registers free to recover pedestal information. Regular noise affects things, but is partially taken care of by the truncation and partially cancels out, as real noise is normally distributed around the actual pedestal. This approach is the one used in the data analysis, by calculating:

\begin{equation} \label{fig:reco:fadc_pedestals_trunc_mean} p_i(r) = \text{mean}_{20^{\text{th}}}^{98^{\text{th}}}(\{r_i\}) \end{equation}

where \(p_i(r)\) is the pedestal of register \(r_i\) and the mean is taken over all data \(\{r_i\}\) in that register within the 20-th and 98-th percentile. All data refers to a full data run of \(\sim\SI{1}{day}\). The highly biased nature is due to the real signals being negative. Removing the smallest \(\SI{20}{\percent}\) of the data guarantees that in the vast majority of events the full physical signal is excluded, given the typical signal lengths involved. A small upper percentile cut is used to exclude possible significant outliers towards the top. While such a biased estimator will not recover the real mean (and thus, in case of signal and noise free input data, the real pedestals), a slight bias is irrelevant, as the baseline is still calculated for each reconstructed signal, which corrects any global offset.
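
A minimal sketch of such a percentile truncated mean (truncMean is an illustrative name; the actual implementation processes the data of all 2560 registers of a run at once):

import std / [algorithm, math]

proc truncMean(data: seq[float], low, high: float): float =
  ## Mean of all values between the `low`-th and `high`-th percentile,
  ## e.g. `truncMean(registerValues, 0.20, 0.98)` for the pedestal of
  ## one register, following the equation above.
  var s = data
  s.sort()  # ascending: the negative signal values end up at the front
  let lo = int(s.len.float * low)
  let hi = int(s.len.float * high)
  result = s[lo ..< hi].sum() / float(hi - lo)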

9.5.2. FADC spectrum reconstruction

The first step to reconstruct the FADC signals is to perform the pedestal correction. This is simply done by subtracting the pedestals register by register from the data

\[ N_{i, \text{corr}} = N_{i, \text{raw}} - N_{i, \text{pedestal}} \]

with the raw data \(N_{i, \text{raw}}\) and the pedestals \(N_{i, \text{pedestal}}\) in register \(i\) (as computed according to eq. \eqref{fig:reco:fadc_pedestals_trunc_mean}).

With the pedestals removed, the temporal correction is next, to unfold the data into the correct order. This needs to be performed separately on each of the \(\num{2560}\) registers of each channel. The temporal rotation is performed by shifting all registers by

\[ n_\text{rot} = (\mathtt{TRIG\_REC} - \mathtt{POSTTRIG}) · 20 \]

places to the left. The constants \(\mathtt{TRIG\_REC}\) and \(\mathtt{POSTTRIG}\) are from appendix section 17.2.1.1, written in each data file in the header.

The final step is to convert the ADC values of each register into voltages in \(\si{V}\). Given that the ADC covers the range of \(\SIrange{-1}{1}{V}\) with ADC values 0 to 4096 (16384) in the 12 (14) bit mode 12, the conversion from ADC values to volts is simply

\[ U_i = \frac{N_{i, \text{corr}} - 2048}{2048} \]

when using the 12 bit operating mode for each register.

With these corrections applied, the recorded FADC spectrum is recovered, centered around the trigger position.
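
Put together, the three steps amount to the following for one channel in the 12 bit mode (a condensed sketch; names are illustrative):

proc reconstructFadc(raw, pedestals: seq[float],
                     trigRec, postTrig: int): seq[float] =
  ## Pedestal subtraction, temporal rotation and ADC → V conversion for
  ## one channel of 2560 registers, following the equations above.
  let nReg = raw.len                    # 2560 registers per channel
  let nRot = (trigRec - postTrig) * 20  # registers to shift to the left
  result = newSeq[float](nReg)
  for i in 0 ..< nReg:
    let j = ((i + nRot) mod nReg + nReg) mod nReg  # cyclic left shift
    let corr = raw[j] - pedestals[j]     # N_corr = N_raw - N_pedestal
    result[i] = (corr - 2048.0) / 2048.0 # U = (N_corr - 2048) / 2048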

9.5.3. Signal baseline, rise time and fall time

Assuming a singular event is recorded with the FADC, the main properties of interest of the resulting signal pulse are the signal baseline and, based on that, the rise and fall time.

Computing the position of the baseline is generally a non-trivial problem, as a priori the position, width and number of signals in the spectrum are unknown. A reasonable expectation though is that the majority of points in an event should lie close to the baseline, as the fraction of the FADC window covered by a signal is typically less than a quarter. As such, a somewhat empirical way to compute the baseline \(B\) was chosen, using a biased truncated mean

\[ B = \text{mean}_{30^{\text{th}}}^{95^{\text{th}}}(S) \]

between the \(30^{\text{th}}\) and \(95^{\text{th}}\) percentile of the data. The bias is intended to remove the impact of the negative signal amplitude and remove the worst positive outliers. An optimal solution would perform a rigorous peak finding for a signal pulse, remove those points and compute the mean of the remainder of the points. 13

Once the baseline is defined it can be used to determine both the rise and the fall time. These are generally computed based on the number of registers between the minimum of the signal and some threshold slightly below the baseline (compare with fig. 3) in order to reduce the effect of local noise variations. While configurable, the default value of the threshold \(c_B\) (\(B\) for baseline) is

\[ c_B = B - 0.1 · \left| B - \text{min}(S)\right|, \]

i.e. \(\SI{10}{\percent}\) of the difference between the baseline \(B\) and the minimum value of the spectrum \(S\). For the end of the rise time / beginning of the fall time a similar offset is used. In this case the threshold value \(c_P\) (\(P\) for peak) is defined as

\[ c_P = \text{min}(S) + 0.025 · \left| B - \text{min}(S)\right|, \]

so \(\SI{2.5}{\percent}\) of the amplitude above the minimum of the signal \(S\). The register position where either threshold is crossed is based on where the simple moving average (of window size 5) crosses \(c_B\) and \(c_P\). The number of registers between the \(c_B\) and \(c_P\) crossings defines the rise time (left of the peak) and the fall time (right of the peak). 14 At the used clock frequency of \(\SI{1}{GHz}\) each register corresponds to \(\SI{1}{ns}\) in time.
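
A condensed sketch of these definitions, reusing the hypothetical truncMean from the pedestal section (the real implementation handles edge cases and noise more carefully):

import std / sequtils

proc movingAvg(s: seq[float], window = 5): seq[float] =
  ## Simple moving average, shrinking the window at the edges.
  result = newSeq[float](s.len)
  for i in 0 ..< s.len:
    let lo = max(0, i - window div 2)
    let hi = min(s.len - 1, i + window div 2)
    for j in lo .. hi: result[i] += s[j]
    result[i] /= float(hi - lo + 1)

proc riseAndFallTime(s: seq[float]): tuple[rise, fall: int] =
  ## Registers between the c_B and c_P crossings on either side of the
  ## signal minimum, based on the smoothed spectrum.
  let B = truncMean(s, 0.30, 0.95)  # baseline as defined above
  let amp = abs(B - s.min)
  let cB = B - 0.1 * amp            # threshold slightly below the baseline
  let cP = s.min + 0.025 * amp      # threshold slightly above the minimum
  let avg = movingAvg(s)
  let minIdx = s.minIndex()
  var riseStop = minIdx             # walk left: find the c_P crossing, ...
  while riseStop > 0 and avg[riseStop] < cP: dec riseStop
  var riseStart = riseStop          # ... then continue to the c_B crossing
  while riseStart > 0 and avg[riseStart] < cB: dec riseStart
  var fallStart = minIdx            # walk right analogously
  while fallStart < s.len - 1 and avg[fallStart] < cP: inc fallStart
  var fallStop = fallStart
  while fallStop < s.len - 1 and avg[fallStop] < cB: inc fallStop
  result = (rise: riseStop - riseStart, fall: fallStop - fallStart)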

A reconstructed FADC spectrum including indications for baseline, rise and fall time as well as the minimum is shown in fig. 3 15, together with the corresponding event on the Septemboard.

septem_fadc_run_239_event_1068_region_crAll.svg
Figure 3: Example of a fully reconstructed InGrid event and FADC spectrum from a \SI{5.9}{keV} X-ray recorded with the Septemboard detector during a calibration run. On the left of each plot are all properties computed for the data. In the FADC plot the blue line indicates the baseline. Green vertical: rise time from full to dashed line. Red vertical: point of spectrum minimum. Light red: fall time from dashed to full line. Rise / fall time stops \(\SI{2.5}{\percent}\) before the baseline is reached.

9.5.4. Generate the FADC baseline plot [/]    extended

  • [ ] GENERATE THE PLOT CURRENTLY USED IN THE ABOVE BODY HERE -> The current version comes from the TPA test suite!
import nimhdf5, ggplotnim
import std / [strutils, os, sequtils]
import ingrid / [tos_helpers, fadc_helpers, ingrid_types, fadc_analysis]

proc stripPrefix(s, p: string): string =
  result = s
  result.removePrefix(p)

proc plotIdx(df: DataFrame, fadcData: Tensor[float], runNumber, idx: int) =
  let xmin = df["xmin", int][idx]
  let xminY = df["minvals", float][idx]
  let xminlineX = @[xmin, xmin] # one point for x of min, max
  let fData = fadcData[idx, _].squeeze
  let xminlineY = linspace(fData.min, fData.max, 2)

  let riseStart = df["riseStart", int][idx]
  let fallStop = df["fallStop", int][idx]
  let riseStartX = @[riseStart, riseStart]
  let fallStopX = @[fallStop, fallStop]
  let baseline = df["baseline", float][idx]  
  let baselineY = @[baseline, baseline]
  
  let dfLoc = toDf({ "x"         : toSeq(0 ..< 2560),
                     "baseline"  : baseline,
                     "data"      : fData,
                     "xminX"     : xminlineX, 
                     "xminY"     : xminlineY,
                     "riseStart" : riseStartX,
                     "fallStop"  : fallStopX })
                     # Comparison has to be done by hand unfortunately
  let path = "/t/fadc_spectrum_baseline_$#.pdf" % $idx
  ggplot(dfLoc, aes("x", "data")) +
    geom_line() +
    geom_point(color = color(0.1, 0.1, 0.1, 0.1)) +
    geom_line(aes = aes("x", "baseline"),
              color = "blue") +
    geom_line(data = dfLoc.head(2), aes = aes("xminX", "xminY"),
                     color = "red") +
    geom_line(data = dfLoc.head(2), aes = aes("riseStart", "xminY"),
                     color = "green") +
    geom_line(data = dfLoc.head(2), aes = aes("fallStop", "xminY"),
                     color = "pink") +
    ggtitle("FADC spectrum of run $# and index $#" % [$runNumber, $idx]) +
    xlab("FADC Register") + ylab("FADC signal voltage U [V]") + 
    ggsave(path)
  copyFile(path, "/t/fadc_spectrum_baseline.pdf")

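# Convert an object whose fields are sequences (the reconstructed FADC data)
# into a DataFrame, mapping integer and float types appropriately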
proc toDf[U: object](x: U): DataFrame =
  result = newDataFrame()
  for field, val in fieldPairs(x):
    type T = typeof(val[0]) 
    when T isnot int and T is SomeInteger:
      result[field] = val.asType(int)
    elif T isnot float and T is SomeFloat:
      result[field] = val.asType(float)
    else:
      result[field] = val

proc plotFadc(h5f: H5File, runNumber, sleepTime: int) =
  var run = h5f.readRecoFadcRun(runNumber)
  var data = h5f.readRecoFadc(runNumber)
  var df = data.toDf()
  df["minvals"] = run.minvals
  for idx in 0 ..< df.len:
    plotIdx(df, run.fadc_data, runNumber, idx)
    os.sleep(sleepTime) # qualified call to avoid shadowing by the parameter

proc main(fname: string, runNumber: int, sleep = 1000) =
  var h5f = H5open(fname, "r")
  let fileInfo = h5f.getFileInfo()
  for run in fileInfo.runs:
    if run == runNumber:
      plotFadc(h5f, run, sleep)
      
when isMainModule:
  import cligen
  dispatch main
  • [ ] MAKE PLOT PRETTY AND RERUN FOR THIS: run 281 and index 1533

9.5.5. Generate plot of InGrid and FADC event    extended

The command to produce the plot as seen in the thesis is:

W1=825 W2=675 G_LEFT=0.65 F_LEFT=0.3 L_MARGIN=10 R_MARGIN=4 USE_TEX=true SCALE=1.3 plotData \
       --h5file ~/CastData/data/CalibrationRuns2018_Reco.h5 \
       --runType=rtCalibration \
       --eventDisplay --septemboard \
       --events 1068 --runs 239

The parameters used here are now the defaults, apart from USE_TEX=true and SCALE=1.3.

The resulting plot needs to be copied over to the phd repository manually, as plotData currently does not support outputting a plot to a specific file name (it's intended to produce many plots automatically and act more as a fast way to visualize data under different cuts).

9.5.6. Noise sensitive    extended

Because of the small amplitude of the associated signals induced on the grid, electromagnetic interference is a serious issue with this FADC setup. Ideally, the detector should be installed in a Faraday cage and a short, shielded LEMO cable should be used to connect it to the pre-amplifier and amplifier.

9.5.7. FADC amplifier settings    extended

The FADC does not require a proper calibration in the same sense as the Timepix. However, the amplifier settings have a large impact on the resulting signals. Better measurements of the effect of the different integration / differentiation settings of the amplifier would have been very valuable, but were never performed for lack of time. The same holds for different amplifications, in order to properly understand the data ranges the FADC can record and how the trigger threshold relates to an equivalent energy in \(\si{keV}\) on the center GridPix. This would have made smarter decisions about the settings possible (e.g. optimizing for the lowest possible threshold at which the FADC can still act as a trigger, making it available for lower energies). Only a single set of measurements exists, comparing the FADC integration times of \(\SI{50}{ns}\) and \(\SI{100}{ns}\), which is only partially useful (in particular because the other parameters – differentiation time, amplification etc. – were not properly recorded for these). A further problem is that TOS does not record these parameters anyway, as they are set via physical rotary knobs on the amplifier.

We will return to these things in sec. 11.4 when discussing the impact of noise on the FADC data taking and the parameter changes that were made to mitigate it to an extent.

9.5.8. Example to reconstruct FADC data    extended

As mentioned in the beginning of sec. 9.5 reconstruction is done as part of raw_data_manipulation:

raw_data_manipulation -p <run path> --runType rtBackground --h5out /tmp/run_foo.h5

In reconstruction:

reconstruction -i <h5file> --only_fadc

9.6. Scintillator data

For the scintillator signals we only record a trigger flag and the number of clock cycles between the last scintillator trigger and the moment the FADC triggered. These two pieces of information are part of the header of the Septemboard data files. 16 , 17

Footnotes:

4

HDF5 (The HDF Group 1997) is the Hierarchical Data Format, version 5. It is a binary data format intended for scientific datasets, which uses an in-file layout similar to a virtual file system. Datasets (equivalent to files) are stored in groups (equivalent to directories). Metadata can be attached to either, and linking between datasets, even across files, is supported. It supports a large number of compression filters to reduce the file size of the stored data.

5

These types of cuts are applied at this stage of the processing because, for certain use cases or certain detectors, specific ToT or ToA ranges are of no interest / contain junk data (because of a faulty chip for example). In such cases it is useful to remove the data in this preprocessing stage to lighten the workload for all later stages.

8

Note that the absolute value of the rotation angle is of secondary importance. For X-rays the rotation angle is going to be random, as the definition of a long and short axis in a (theoretically perfect) circle depends on the statistical distribution of the pixels. For muons however, it allows one to map the rotation angle to the incidence angle.

9

The term 'root mean square' is used although we actually refer to the standard deviation of the sample. We follow (Christoph Krieger 2018) here, but this ambiguity is unfortunately often encountered.

11

A PDF of the FADC manual is available here: https://archive.org/details/manualzilla-id-5646050/

12

The FADC can be operated in a 12 or 14-bit mode. We run in the 12-bit mode. Also see 17.2.1.1.

13

Aside from performing peak fitting (which is difficult and requires understanding of the expected signal shapes), another approach might be a local linear smoothing (e.g. a Savitzky-Golay filter with a polynomial of order 1) in a suitable window range. The result would be a much more stable spectrum. This could then be used to compute the numerical derivative, from which all intervals with a slope smaller than some epsilon provide the dataset from which to compute the mean. The tricky aspects would be the choice of window size and the behavior in very noisy events.

14

The naming of the rise and fall time in the context of a negative pulse is slightly confusing. Rise time refers to the negative rise towards the minimum of the pulse and the fall time to the return to the baseline.

15

Excuse the small text for the annotations. They are not important, but may be interesting for some readers!

16

Important note for people potentially investigating the raw data from 2017: There was a small bug in the readout software during the beginning of the 2017 data taking period, which wrote the scintillator trigger clock cycle values into subsequent output files even if no FADC trigger was received (and thus no scintillator trigger was actually read out). However, there is a flag for an FADC trigger. To correctly read the first data runs it is therefore required to not only look at the scintillator trigger clock cycles, but also at whether the FADC actually triggered. This is handled in the analysis framework.

17

In addition to the above bug, there was unfortunately a more serious bug, which rendered the scintillator counts useless during the end of 2017 / beginning of 2018 data taking period. The polarity of the signals was inverted in the detector firmware, resulting in useless "trigger" information.
