34. Software Appendix
In this appendix we go over the software developed and used in the course of this thesis in a bit more detail than in the main thesis (beginning of chapter 9). The focus is more on the technical and usability side than on the physics application. Read this appendix if you
- are simply interested,
- are intending to reproduce the analysis or use (parts of) the software for your own data,
- wish to further process data produced by these tools with your own software.
34.1. Why did I start writing my own analysis framework? extended
Some of you may wonder why I wrote all this code in order to analyze my data. It is a huge undertaking with not much upside in the context of finishing my PhD after all.
The reason boils down to two main points:
- Christoph (Christoph Krieger 2018) used the software framework
MarlinTPC (see also 1). It is an extension
to the 'Marlin' framework intended for the application of TPCs in
the context of the International Linear Collider (ILC).
I had multiple issues with this:
- it is mostly intended for TPCs. While a GridPix is a TPC of sorts, most regular TPCs use strips or pads as a readout and not pixels. This made introducing GridPix support inefficient.
- Christoph's existing code was written for a single GridPix. The addition of the Septemboard and, more importantly, the detector vetoes would have meant a significant time investment in any case. From a first glance it did not seem like MarlinTPC would make for an easy platform on which to introduce the required features.
- it is a framework based on a base program, which is controlled by XML files called 'steering files'. These describe the processes to be run in order. Instead of building small programs that do one thing, the main program is large and decides at runtime which processes to run. All this is built in a very heavy object oriented manner, resulting in massive amounts of boilerplate code for each new feature. In addition, assembling processes in this way results in generally low performance.
- it uses CERN's ROOT internally, of which I am not a fan, especially not for projects that are not LHC scale.
- I was a bit naive in some respects. Less so in terms of underestimating the amount of code I would need to write to reproduce the basic reconstruction of Christoph's code. Yes, that also was quite some work, but manageable. Firstly, I was naive enough to think people would "accept" my results easily in case there were discrepancies between my and Christoph's results (of which there were many). As it turns out, two competing pieces of code often don't produce the exact same results. Understanding and minimizing discrepancies was a serious hindrance. Some boiled down to bugs in my own code, but others to bugs in MarlinTPC.
Secondly, by choosing Nim as the target language I underestimated the amount of code I would have to write completely independent of the actual TimepixAnalysis code base: things like a plotting library, a dataframe library etc. Initially I thought I would either not need these or simply use Python for additional processing, but the joy of building things and my urge for "purity" (few external dependencies) led me down the path of replacing more and more of my dependencies with my own code.
It was a lot of work of course. But while it certainly delayed the end of my thesis by a significant time, I learned way more than I would have otherwise.
On another note: having looked into MarlinTPC again now, the choice was very much the right one. Development has seemingly halted or, at the very least, is not public. The MarlinTPC wiki page (https://znwiki3.ifh.de/MarlinTPC/) mentions a migration to DESY's GitLab (https://gitlab.desy.de/users/sign_in), which is not publicly accessible. Let me be blunt: screw non-public code access!
Second note: REST was not known to me at the time when I started
development. But more importantly, it pretty much follows in the
footsteps of MarlinTPC (in terms of XML steering files plus a base
rest-manager
program, being based on CERN's ROOT, a heavy OOP paradigm etc.). We'll see how REST holds up in 10 years' time (maybe it blooms in the context of BabyIAXO!). At the very least I'm pretty confident I'll be able to get this code up and running with little work at that point.
34.2. Nim extended
As briefly mentioned already in the main body, Nim is a relatively young programming language that offers C-like performance combined with Lisp-like metaprogramming, Python-like whitespace-sensitive syntax with few operators, and an Ada-like type system. Combined, these provide the perfect base for a single developer to be productive and build fast, safe software.
In other words: the language gets out of my way and lets me build stuff that is fast and should in theory be understandable to people coming after me.
34.3. TimepixAnalysis
Introduced in the main part, in sec. 9.1, TimepixAnalysis
(Schmidt 2022c) is the name for the repository containing a large collection
of different programs for the data reconstruction and analysis of
Timepix based detectors.
Generally, the README
in the repository gives an overview of all the
relevant programs, installation instructions and more. For further
details therefore check there or simply open an issue in the
repository (Schmidt 2022c).
Here we will now go over the main programs required to handle the Septemboard data taken at CAST, so that in appendix 35 we can present the commands for the entire CAST data reconstruction.
Note: in the PDF and HTML version of the thesis I provide some links to different parts of the repository. These generally point to GitHub, because the main public repository of TimepixAnalysis is found there. This is mostly out of convenience though: it should be straightforward to map them to the corresponding paths inside your local copy of the repository, which would however be trickier to link to directly.
34.3.1. Common points between all TimepixAnalysis programs
All programs in the TimepixAnalysis repository are command line only. While it would be quite doable to merge the different programs into a single graphical user interface (GUI), I'm personally not much of a GUI person. Each program usually has a large number of (optional) parameters. Keeping the GUI up to date with (in the past, quickly) changing features is just extra work, which I personally did not have any use for (if someone wishes to write a GUI for TimepixAnalysis, I'd be more than happy to mentor though).
Every program uses cligen
2, a command line interface
generator. Based on the definition of the main procedure(s) in the
program, a command line interface is generated. While cligen
provides extremely simplified command line argument parsing for the
developer, it also gives a nice help
screen for every program. For
example, running the first program of the analysis pipeline
raw_data_manipulation
with the -h
or --help
option:
raw_data_manipulation -h
yields the help screen as shown in listing 16 3. Keep this in mind if you are unsure about how to use any of the programs mentioned here.
Further, there is a TOML configuration file in the repository
(Analysis/ingrid/config.toml
from the repository root), which
controls many aspects of the different programs. Most of these can be
overwritten by command line arguments to the appropriate programs and
some also via environment variables. See the extended thesis for
information about this; it is mentioned where important.
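For example, both raw_data_manipulation and reconstruction expose a --config option (see listing 16 and sec. 34.3.5.1), so pointing a program at a modified copy of the configuration file could look roughly as follows (all paths here are placeholders):
raw_data_manipulation -p /path/to/run_directory --runType calib --config /path/to/my_config.toml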
Usage:
  main [REQUIRED,optional-params]
Version: 44c0c91 built on: 2023-12-06 at 13:01:35
Options:
  -h, --help                              print this cligen-erated help
  --help-syntax                           advanced: prepend,plurals,..
  -p=, --path=           string  REQUIRED set path
  -r=, --runType=        RunTypeKind REQUIRED
                                          Select run type (Calib | Back | Xray)
                                          The following are parsed case insensetive:
                                          Calib = {"calib", "calibration", "c"}
                                          Back = {"back", "background", "b"}
                                          Xray = {"xray", "xrayfinger", "x"}
  -o=, --out=            string  ""       Filename of output file. If none given
                                          will be set to run_file.h5.
  -n, --nofadc           bool    false    Do not read FADC files.
  -i, --ignoreRunList    bool    false    If set ignores the run list 2014/15
                                          to indicate using any rfOldTos run
  -c=, --config=         string  ""       Path to the configuration file to use.
                                          Default is config.toml in directory of
                                          this source file.
  ...
  -t, --tpx3             bool    false    Convert data from a Timepix3 H5 file to
                                          TPA format instead of a Tpx1 run directory
  ...
34.3.2. Dependencies
TimepixAnalysis mainly has a single noteworthy external dependency,
namely the HDF5 (The HDF Group 1997) library. The vast majority of code (inside
the repository itself and its dependencies) is pure Nim. Two optimization libraries written in C, mpfit (Levenberg-Marquardt) 4 and NLopt 5, are wrapped from Nim and are further minor dependencies. Local compilation and
installation of these is trivial and explained in the TimepixAnalysis
README.
For those programs related to the multilayer perceptron (MLP) training or usage, PyTorch (Paszke et al. 2019) is an additional dependency via Flambeau (SciNim contributors 2023). Flambeau installs a suitable PyTorch version for you.
Common other dependencies are the cairo
graphics library 6
and a working BLAS and LAPACK installation.
34.3.3. Compilation
Nim being a compiled language means we need to compile the programs
mentioned below. It can target a C or C++ backend (among others). The
compilation commands differ slightly between the different programs
and can depend on usage. The likelihood
program below for example
can be compiled for the C or C++ backend. In the latter case, the MLP
as a classifier is compiled in.
Generally, compilation is done via:
nim c -d:release foo.nim
where foo.nim
is any of the programs below. -d:release
tells the
Nim compiler to compile with optimizations (you can compile with
-d:danger
for even faster, but less safe, code). Replace c
by
cpp
to compile to the C++ backend.
See the TimepixAnalysis README for further details on how to compile each program.
Unless otherwise specified, each program mentioned below is located in
Analysis/ingrid
from the root of the TimepixAnalysis repository.
34.3.4. raw_data_manipulation
raw_data_manipulation
is the first step of the analysis
pipeline. Essentially, it parses the data generated by TOS (see section 17.2.1 for an explanation of it) and stores it in a compressed HDF5 (The HDF Group 1997) data file.
The program is fed a directory containing a TOS run via the -p /
--path
argument. Either a directory containing a single run (i.e. a
data taking period ranging typically from minutes to days in length),
or a directory that itself contains multiple TOS run directories. Runs
compressed as gzipped tarballs (.tar.gz) are also supported.
All data files contained in a run directory will then be parsed in a
multithreaded way. The files are memory mapped and parsed in parallel
into a Run
data structure, which itself contains Event
structures.
If FADC files are present in a directory, these will also be parsed
into FadcEvent
structures in a similar fashion, unless explicitly
disabled via the --nofadc
option.
Each run is then written into the output HDF5 file as a 'group' (HDF5 terminology). The meta data about each run and event are stored as 'attributes' and additional 'datasets', respectively. The structure of the produced HDF5 file is shown in sec. 34.3.4.1.
In addition, the tool also supports input from HDF5 files containing the raw data from a Timepix3 detector. That data is parsed and reprocessed into the same kind of file structure.
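To illustrate, a typical invocation to parse a directory of background runs without the FADC data might look like the following (the directory and file names are placeholders; the options are those shown in listing 16):
raw_data_manipulation -p /path/to/background_runs --runType back --out background_runs.h5 --nofadc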
34.3.4.1. HDF5 data layout generated by raw_data_manipulation
Listing 17 shows the layout of the
data stored in the HDF5 files after the raw_data_manipulation
program has processed the TOS run folders. The data is structured in
groups based on each run, chip and the FADC (if available). Generally
each "property" is stored in its own dataset for performance reasons
to allow faster access to individual subsets of the data (read only
the hits, only \(x/y\) data, etc.). While HDF5 supports even
heterogeneous compound datasets (that is different data types in
different "columns" of a 2D like dataset), these are only used
sparingly and not at all in the raw_data_manipulation
output, as
reading individual columns from these is inefficient.
- runs
  - run_<number>
    - chip_0                        # one for each chip in the event
      - Hits                        # number of hits in each event
      - Occupancy                   # a 2D occupancy map of this run
      - ToT                         # all ToT values of this run
      - raw_ch                      # the ToT/ToA values recorded for each event (ragged data)
      - raw_x                       # the x coordinates recorded for each event (ragged data)
      - raw_y                       # the y coordinates recorded for each event (ragged data)
    - chip_i                        # all other chips
      - ...
    - fadc                          # if available
      - eventNumber                 # event number of each entry
                                    # (not all events have FADC data)
      - raw_fadc                    # raw FADC data (uncorrected, all 10240 registers)
      - trigger_record              # temporal correction factor for each event
    - fadcReadout                   # flag if FADC was readout in each event
    - fadcTriggerClock              # clock cycle FADC triggered
    - szintillator trigger clocks   # datasets for each scintillator
    - timestamp                     # timestamp of each event
  - run_i                           # all other runs
    - ...
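If you wish to inspect the layout of a generated file yourself, the standard HDF5 command line tools shipped with the HDF5 library can be used, for example (the file name is a placeholder):
h5ls -r background_runs.h5
which recursively prints all groups and datasets contained in the file.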
34.3.5. reconstruction
After the raw data has been converted to storage in HDF5, the
reconstruction
tool is used to start the actual analysis of the
data. The program receives an input HDF5 file via the -i / --input
argument. As the name implies, the first stage of data analysis is in
the form of reconstructing the basic properties of each event. In this
stage all events are processed in a multithreaded way. The steps for
cluster finding and geometric cluster reconstruction (as mentioned in
sec. 9.4) are performed and the data is
written to the desired output file given by -o / --outfile
.
The produced output HDF5 file then also acts as the input file for
reconstruction
for all further, optional reconstruction steps. These are mentioned at different parts of the thesis, but we briefly explain them here; a sketch of a typical sequence of calls follows the list below.
- --only_fadc: Performs the reconstruction of the FADC data to calculate FADC values such as rise and fall times.
- --only_fe_spec: If the input file contains ⁵⁵Fe calibration runs, creates the ⁵⁵Fe spectra and performs fits to them. Also performs the energy calibration for each run.
- --only_charge: Performs the ToT calibration of all runs to compute the detected charges in electrons. Requires each chip to be present in the InGrid database (see sec. 34.3.10).
- --only_gas_gain: Computes the gas gain in the desired interval lengths via Pólya fits.
- --only_gain_fit: If the input file contains ⁵⁵Fe calibration runs, performs the fit of energy calibration runs against the gas gain of each interval. Required to perform energy calibration in background runs.
- --only_energy_from_e: Performs the energy calibration for each cluster in the input file.
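A sketch of a typical sequence of calls on a background file might thus look as follows (file names are placeholders; the exact commands used for the CAST data are given in appendix 35):
reconstruction -i background_runs.h5 -o reco_background.h5
reconstruction -i reco_background.h5 --only_fadc
reconstruction -i reco_background.h5 --only_charge
reconstruction -i reco_background.h5 --only_gas_gain
reconstruction -i reco_background.h5 --only_energy_from_e
where --only_gain_fit would first be run on the corresponding ⁵⁵Fe calibration file, as it is required before --only_energy_from_e can be used on background data.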
34.3.5.1. Command line interface of reconstruction
extended
Usage:
  main [REQUIRED,optional-params]
InGrid reconstruction and energy calibration.

NOTE: When calling reconstruction without any of the --only_ flags, the input file
has to be a H5 file resulting from raw_data_manipulation. In the other cases the
input is simply a file resulting from a prior reconstruction call!

The optional flags are given roughly in the order in which the full analysis chain
requires them to be run. If unsure on the order, check the runAnalysisChain.nim file.

Version: 6f8ed08 built on: 2023-11-17 at 18:36:40
Options:
  -h, --help                               print this cligen-erated help
  --help-syntax                            advanced: prepend,plurals,..
  -i=, --input=          string   REQUIRED set input
  -o=, --outfile=        string   ""       Filename and path of output file
  -r=, --runNumber=      int      none     Only work on this run
  -c, --create_fe_spec   bool     false    Toggle to create Fe calibration spectrum
                                           based on cuts Takes precedence over
                                           --calib_energy if set!
  --only_fadc            bool     false    If this flag is set, the reconstructed FADC
                                           data is used to calculate FADC values such as
                                           rise and fall times among others, which are
                                           written to the H5 file.
  --only_fe_spec         bool     false    Toggle to /only/ create the Fe spectrum for
                                           this run and perform the fit of it. Will try
                                           to perform a charge calibration, if possible.
  --only_charge          bool     false    Toggle to /only/ calculate the charge for each
                                           TOT value based on the TOT calibration. The
                                           ingridDatabase.h5 needs to be present.
  --only_gas_gain        bool     false    Toggle to /only/ calculate the gas gain for the
                                           runs in the input file based on the polya fits
                                           to time slices defined by gasGainInterval.
                                           ingridDatabase.h5 needs to be present.
  --only_gain_fit        bool     false    Toggle to /only/ calculate the fit mapping the
                                           energy calibration factors of the 55Fe runs to
                                           the gas gain values for each time slice.
                                           Required to calculate the energy in any run
                                           using only_energy_from_e.
  --only_energy_from_e   bool     false    Toggle to /only/ calculate the energy for each
                                           cluster based on the Fe charge spectrum vs gas
                                           gain calibration
  --only_energy=         float    none     Toggle to /only/ perform energy calibration
                                           using the given factor. Takes precedence over
                                           --create_fe_spec if set. If no runNumber is
                                           given, performs energy calibration on all runs
                                           in the HDF5 file.
  --clusterAlgo=         ClusteringAlgorithm none
                                           The clustering algorithm to use. Leave at
                                           caDefault unless you know what you're doing.
  -s=, --searchRadius=   int      none     The radius in pixels to use for the default
                                           clustering algorithm.
  -d=, --dbscanEpsilon=  float    none     The radius in pixels to use for the DBSCAN
                                           clustering algorithm.
  -u=, --useTeX=         bool     none     Whether to use TeX to produce plots instead
                                           of Cairo.
  --config=              string   ""       Path to the configuration file to use.
  -p=, --plotOutPath=    string   none     set plotOutPath
34.3.5.2. HDF5 data layout generated by reconstruction
The HDF5 file generated by reconstruction
follows closely the one
from raw_data_manipulation
. The main difference is that within each
chip group now each chip has a different number of entries in the
datasets, as each entry now corresponds to a single cluster, not an
event from the detector. In some events multiple clusters may be reconstructed on a single chip, while other events may be entirely empty. This
means an additional eventNumber
dataset is required for each chip,
which maps back each cluster to a corresponding event.
Aside from that, the other major difference is simply that each chip group contains a larger number of datasets, as each computed cluster property is stored in its own dataset. Additional new datasets are also created during the data calibration (charge calibration, computation of the gas gain, etc.).
Listing 19 shows the layout in a
similar fashion to the equivalent for raw_data_manipulation
before.
- reconstruction
  - run_<number>
    - chip_0                        # one for each chip in the event
      - datasets for each property
      - optional datasets for calibrations
    - chip_i                        # all other chips
      - ...
    - fadc                          # if available
      - datasets for each FADC property
    - common datasets               # copied from `raw_data_manipulation` input
  - run_i                           # all other runs
    - ...
34.3.6. cdl_spectrum_creation
This is a helper program responsible for the treatment of the X-ray
reference data taken at the CAST Detector Lab in Feb. 2019. It
receives an HDF5 file as input that is fully reconstructed using
raw_data_manipulation
and reconstruction
containing all runs taken
in the CDL. An additional Org table is used as reference to map each
run to the correct target/filter kind, found in
resources/cdl_runs_2019.org
.
The program performs the fits to the correct fluorescence lines for
each run based on the target/filter kind in use. It can also produce a
helper HDF5 file called calibration-cdl-2018.h5
via the genCdlFile
argument, which contains all CDL data split by target/filter
kind. This file is used in the context of the likelihood cut method to
produce the reference distributions for each cluster property used.
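As a rough sketch of how the reference file is produced (note: only the genCdlFile argument is mentioned above; the input flag name used here is an assumption on my part, check cdl_spectrum_creation -h for the actual interface):
cdl_spectrum_creation -i reco_cdl_2019.h5 --genCdlFile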
34.3.7. likelihood
The likelihood
program is the (historically named) tool that applies
the classifier and any selection of vetoes to an input file. The input
files are a fully reconstructed background HDF5 file, corresponding
calibration runs and the calibration-cdl-2018.h5
file mentioned
above. It has a large number of command line options to adjust the
classifier that is used, software efficiency, vetoes, region of the
chip to cut to, whether tracking data or background data is selected
and more. The program writes the remaining clusters (with additional
meta information) to the HDF5 file given by the --h5out
argument. The structure is essentially identical to that of the
reconstruction
tool (the data is stored in a likelihood
group
instead of a reconstruction
group).
The selection of tracking or non-tracking data requires information
about when solar trackings took place as attributes inside of the
background HDF5 files. These are added using the cast_log_reader
,
see sec. 34.3.11.
It is also used directly to estimate the random coincidences of the septem and line vetoes, as mentioned in sec. 12.5.5.
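A very rough sketch of an invocation might look as follows. Note that only --h5out is mentioned above; all other flag and value names here are assumptions for illustration and the -h output of likelihood is the authoritative reference:
likelihood -f reco_background.h5 --h5out lhood_background.h5 --region crGold --cdlFile calibration-cdl-2018.h5
with further options selecting the classifier, software efficiency, vetoes and whether tracking or non-tracking data is used.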
34.3.8. determineDiffusion
directory
The Analysis/ingrid/determineDiffusion
directory contains the
library / binary to empirically determine the gas diffusion parameters
from input data, as explained in
sec. 12.4.3. It can either be
compiled as a standalone program or be used as a library.
34.3.9. nn
directory
The Analysis/ingrid/nn
subdirectory in the TimepixAnalysis
repository contains the programs related to the training and
evaluation of the multilayer perceptrons (MLP) used in the thesis. Of
note is the train_ingrid
program, which is the program to train a
network. It allows customizing the network to be trained via command line arguments describing the number of neurons, hidden layers, optimizers, activation functions and so forth. The extended
thesis contains the command to train the best performing network.
Secondly, the simulate_xrays
program is a helper program to produce
an HDF5 file containing simulated X-rays as described in
sec. 12.4.2. It makes use of the
fake_event_generator.nim
file in the ingrid
directory, which
contains the actual logic.
34.3.10. InGridDatabase
The InGrid database is both a library that is part of the TimepixAnalysis repository (the InGridDatabase directory) and a binary tool, as well as the name for a very simple 'database' storing information about different GridPix chips.
At its core the 'database' part is an HDF5 file containing chip
calibrations (ToT, SCurve, …) mapped to timestamps or run numbers in
which these are applicable. This allows (mainly) the reconstruction
program to retrieve the required calibrations automatically without
user input based on the given input files.
To utilize it, the databaseTool
needs to be compiled as a
binary. Chips are added to the database using this tool. A directory
describing the applicable run period and containing calibration files
for the chip needs to follow the format seen for example in:
https://github.com/Vindaar/TimepixAnalysis/tree/master/resources/ChipCalibrations/Run2
for the Run-2 period of the Septemboard. The runPeriod.toml
file
describes the applicability of the data, see listing
20 for the file in this case. For
each chip, there is simply a directory with the calibration files as
produced by TOS and an additional chipInfo.txt
file, see listing
21. Note that the runPeriod
name needs
to match the name of one of the run periods listed in the TOML file.
The databaseTool
also allows performing fits to the calibration
data, if needed (for example to analyze SCurves or the raw ToT
calibration data).
title = "Run period 2 of CAST, 2017/18" # list of the run periods defined in the file runPeriods = ["Run2"] [Run2] start = 2017-10-30 stop = 2018-04-11 # either as a sequence of run numbers validRuns = [ 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,100,101,102,103 104,105,106,107,108,109,110,111,112,113,114,115,116,117 118,119,120,121,122,123,124,125,126,127,128,145,146,147 148,149,150,151,152,153,154,155,156,157,158,159,160,161 162,163,164,165,166,167,168,169,170,171,172,173,174,175 176,177,178,179,180,181,182,183,184,185,186,187,188,189 ] # or as simply a range given as start and stop values firstRun = 76 lastRun = 189
chipName: H10 W69
runPeriod: Run2
board: SeptemH
chipNumber: 3
Info: This calibration data is valid for Run 2 starting until March 2018!
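To actually register such a directory in the database, the databaseTool is invoked on it. The following is purely illustrative; the subcommand and flag names are assumptions on my part and the TimepixAnalysis README documents the real interface:
databaseTool add --chip ~/TimepixAnalysis/resources/ChipCalibrations/Run2/chip3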
34.3.11. cast_log_reader
The LogReader/cast_log_reader
is a utility to work with the slow control and
tracking log files produced by CAST. It can parse and analyze the
different log file formats used over the years and provide different
information (magnet operation statistics for example). For the
Septemboard detector it provides the option to parse the tracking
log files and add the tracking information to the background HDF5
files.
The program first parses the log files and determines valid solar
trackings. Then, given an HDF5 data file containing the background
data, each solar tracking is mapped to a background run. One background
run may have zero or more solar trackings attached to it. In the final
step the solar tracking start and stop information is added to each
run as additional meta data. During the later stages of processing in
other TimepixAnalysis
programs, notably likelihood
, this meta data
is then used to only consider background (non tracking) or solar
tracking data.
The CAST log files relevant for the Septemboard detector can be found together with the Septemboard CAST data.
34.3.11.1. Adding tracking information to HDF5 files extended
See sec. 20.6.1 on how to use
cast_log_reader
to add the tracking information to the HDF5 files.
34.3.12. mcmc_limit_calculation
mcmc_limit_calculation
is the 'final' tool relevant for this
thesis. As the name implies it performs the limit calculation using
Markov Chain Monte Carlo (MCMC) as explained in detail in chapter
13. See the extended thesis on how to use it.
34.3.13. Tools
directory
A multitude of tools for various things that were analyzed over the years. Includes things like computing the gas properties of the Septemboard gas mixture and the detection efficiency.
34.3.14. resources
directory
A large number of resources, required or simply useful, about different data taking periods, efficiencies, log files and more, which are small enough to be part of a non-LFS git repository.
34.3.15. Plotting
directory
From a practical analysis point of view, the Plotting
directory is
one of the most interesting parts of the repository. It contains
different tools to visualize data at various stages of the analysis
pipeline. The most relevant are mentioned briefly here.
34.3.15.1. plotBackgroundClusters
plotBackgroundClusters produces plots of the distribution of cluster centers left after application of the likelihood program. This is
used to produce figures like 23
and fig. #fig:background:background_suppression_comparison.
34.3.15.2. plotBackgroundRate
plotBackgroundRate
is the main tool to visualize background (or raw
data) spectra. All such plots in the thesis are produced with
it. Input files are reconstructed HDF5 files or the result of the
likelihood
program.
34.3.15.3. plotCalibration
plotCalibration
is a tool to produce visualizations of the different
Timepix calibration steps, e.g. ToT calibration, SCurve scans and so
on. The figures in sec. 19.1 are produced
with it.
34.3.15.4. plotData
plotData
is a very versatile tool to produce a variety of different
plots. It can produce histograms of the different geometric
properties, occupancy maps, event displays and more. If desired, it can
produce a large number of plots for an input data file in one go. It
is very powerful, because it can receive an arbitrary number of cuts
on any dataset present in the input. This allows one to produce
visualizations for any desired subset of the data. For example to
produce event displays or histograms for only those events with
specific geometric properties. It is an exceptionally useful tool to
understand certain subsets of data that appear 'interesting'.
In sec. 30.2 we mention non-noisy FADC events in a region of the rise time / skewness space of fig. 187. These are easily selected and investigated using plotData
. Also
fig. 190 and
fig. 34 are produced with it, among
others. Generally, it favors information (density) over aesthetically
pleasing visualizations though.
34.4. Other libraries relevant for TimepixAnalysis
A few other libraries not part of the TimepixAnalysis repository bear mentioning, due to their importance. They were written alongside TimepixAnalysis.
- ggplotnim: A ggplot2 7 inspired plotting library. All plots in this thesis are produced with it.
- Datamancer: A dplyr 8 inspired data frame library.
- nimhdf5: A high level interface to the HDF5 (The HDF Group 1997) library, somewhat similar to h5py 9 for Python.
- Unchained: A library to perform zero runtime overhead, compile-time checking and conversion of physical units. Exceptionally useful to avoid bugs due to wrong unit conversions and a big help when dealing with natural units.
- Measuremancer: A library to deal with measurements with uncertainties. Performs automatic Gaussian error propagation when performing calculations with measurements.
- xrayAttenuation: A library dealing with the interaction of X-rays with matter (gases and solids). It is used to calculate things like the absorption in the detector gas and the reflectivity of the X-ray telescope in this thesis.
- TrAXer: The raytracer used to compute the axion image, expanded on in appendix 37.
Footnotes:
You can have a $HOME/.config/cligen/config
configuration
file to adjust the output style (color, column widths, drop entire
columns etc.).