35. Full data reconstruction (Appendix)
We will now go over how to fully reconstruct all CAST data from the Run-2 and Run-3 data taking campaigns all the way up to the computation of the observed limit. This assumes the following:
- all CAST data files are located in a directory accessible via a DATA environment variable. Inside should be the directory structure containing the directories 2017, 2018, 2018_2 and CDL_2019, as downloaded from Zenodo (Schmidt 2024).
- all binaries used below are compiled and their location is in your PATH variable. There is a bin directory in the repository root, which contains symbolic links to the binaries if they are placed next to their source files. I recommend adding this directory to your PATH; see the sketch below.
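For concreteness, a minimal setup sketch. The paths are hypothetical; adjust them to wherever you unpacked the Zenodo archive and cloned the TimepixAnalysis repository.
# Hypothetical paths, adjust to your system.
export DATA=~/CastData/data            # contains 2017, 2018, 2018_2, CDL_2019
export TPXDIR=~/src/TimepixAnalysis    # root of the TimepixAnalysis clone
export PATH=$PATH:$TPXDIR/bin          # make the compiled binaries callable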
Note: There is also the Analysis/ingrid/runAnalysisChain program, which automates the majority of these calls. See the section below on how it replaces the data parsing and reconstruction. Here I wish to illustrate the actual steps and not hide any of them.
Furthermore, the commands below produce the correct results assuming the Analysis/ingrid/config.toml file is used as committed as part of the phd tag in the TimepixAnalysis repository.
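To ensure a matching configuration, you can check out the phd tag before compiling the binaries. A minimal sketch, assuming TPXDIR points to your TimepixAnalysis clone as set up above:
cd $TPXDIR
git checkout phd   # use config.toml as committed for the thesis results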
35.1. Raw data parsing and reconstruction
If this is all in place, first perform the raw data parsing.
Run-2, background and calibration:
raw_data_manipulation -p $DATA/2017/DataRuns \
    --runType rtBackground \
    --out $DATA/DataRuns2017_Raw.h5
raw_data_manipulation -p $DATA/2017/CalibrationRuns \
    --runType rtCalibration \
    --out $DATA/CalibrationRuns2017_Raw.h5
Run-3:
raw_data_manipulation -p $DATA/2018_2/DataRuns \
    --runType rtBackground \
    --out $DATA/DataRuns2018_Raw.h5
raw_data_manipulation -p $DATA/2018_2/CalibrationRuns \
    --runType rtCalibration \
    --out $DATA/CalibrationRuns2018_Raw.h5
Next, the initial data reconstruction (geometric properties):
Run-2:
reconstruction -i $DATA/DataRuns2017_Raw.h5 \
    -o $DATA/DataRuns2017_Reco.h5
reconstruction -i $DATA/CalibrationRuns2017_Raw.h5 \
    -o $DATA/CalibrationRuns2017_Reco.h5
Run-3:
reconstruction -i $DATA/DataRuns2018_Raw.h5 \
    -o $DATA/DataRuns2018_Reco.h5
reconstruction -i $DATA/CalibrationRuns2018_Raw.h5 \
    -o $DATA/CalibrationRuns2018_Reco.h5
Now, the next steps of the reconstruction (charge calibration, gas gain and FADC reconstruction):
DATA=~/CastData/data
for typ in Data Calibration; do
    for year in 2017 2018; do
        file="${DATA}/${typ}Runs${year}_Reco.h5"
        reconstruction -i $file --only_charge
        reconstruction -i $file --only_fadc
        reconstruction -i $file --only_gas_gain
    done
done
where we simply loop over the Data and Calibration prefixes and the two years.
With this done, we can perform the ⁵⁵Fe calibration fits:
reconstruction -i $DATA/CalibrationRuns2017_Reco.h5 --only_fe_spec
reconstruction -i $DATA/CalibrationRuns2018_Reco.h5 --only_fe_spec
and then finally fit the energy calibration factors determined from each ⁵⁵Fe fit against the corresponding gas gain:
reconstruction -i $DATA/CalibrationRuns2017_Reco.h5 --only_gain_fit
reconstruction -i $DATA/CalibrationRuns2018_Reco.h5 --only_gain_fit
This then allows us to calibrate the energy for each file:
reconstruction -i $DATA/DataRuns2017_Reco.h5 --only_energy_from_e
reconstruction -i $DATA/DataRuns2018_Reco.h5 --only_energy_from_e
reconstruction -i $DATA/CalibrationRuns2017_Reco.h5 --only_energy_from_e
reconstruction -i $DATA/CalibrationRuns2018_Reco.h5 --only_energy_from_e
35.2. Parse and reconstruct the CDL data
In order to use the likelihood cut method we also need to parse and reconstruct the CDL data. Note that this can be done before even reconstructing any of the CAST data files (with the exception of --only_energy_from_e, which is optional anyway).
raw_data_manipulation -p $DATA/CDL_2019/ -r Xray \
    -o $DATA/CDL_2019/CDL_2019_Raw.h5
reconstruction -i $DATA/CDL_2019/CDL_2019_Raw.h5 \
    -o $DATA/CDL_2019/CDL_2019_Reco.h5
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_charge
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_fadc
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_gas_gain
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_energy_from_e
With this file we can then run cdl_spectrum_creation in order to produce the calibration-cdl-2018.h5 file:
cdl_spectrum_creation $DATA/CDL_2019/CDL_2019_Reco.h5 \
    --genCdlFile --year=2018
35.3. Add tracking information to background files
To add the tracking information, the CAST log files are needed. The default path (which may of course be a symbolic link) is resources/LogFiles/tracking-logs from the TimepixAnalysis root.
for year in 2017 2018; do
    cast_log_reader tracking \
        -p $TPXDIR/resources/LogFiles/tracking-logs \
        --startTime "2017/01/01" \
        --endTime "2018/12/31" \
        --h5out "${DATA}/DataRuns${year}_Reco.h5"
done
Here we assume TPXDIR is an environment variable pointing to the root of the TimepixAnalysis repository. Feel free to adjust if your path differs.
35.4. Using runAnalysisChain
All of the commands above can also be performed in one go by using runAnalysisChain:
./runAnalysisChain \
    -i $DATA \
    --outpath $DATA \
    --years 2017 --years 2018 \
    --calib --back --cdl \
    --raw --reco \
    --logL \
    --tracking
which tells the runAnalysisChain helper to parse all data files for Run-2 and Run-3, do the same for the CDL data files, reconstruct with all steps, compute the likelihood values and add the tracking information to the background data files. It makes the same assumptions about the location of the data files as above. In particular, if the tracking log files are in a different location, add the --trackingLogs argument with the correct path, as in the sketch below.
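A sketch of such a call; the log file path is merely a placeholder for wherever your tracking logs reside:
./runAnalysisChain \
    -i $DATA \
    --outpath $DATA \
    --years 2017 --years 2018 \
    --calib --back --cdl \
    --raw --reco \
    --logL \
    --tracking \
    --trackingLogs /path/to/tracking-logs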
35.5. Applying a classifier
To apply a classifier we use the likelihood program. We will show how to apply it to a single file here. Generally, calling likelihood manually is not needed; more on that below.
Let's apply the likelihood cut method using all vetoes for the Run-2 data:
likelihood \
    -f $DATA/DataRuns2017_Reco.h5 \
    --h5out $DATA/classifier/run2_lnL_80_all_vetoes.h5 \
    --region=crAll \
    --cdlYear=2018 \
    --lnL --signalEfficiency=0.8 \
    --scintiveto --fadcveto --septemveto --lineveto \
    --vetoPercentile=0.99 \
    --cdlFile=$DATA/CDL_2019/calibration-cdl-2018.h5 \
    --calibFile=$DATA/CalibrationRuns2017_Reco.h5
which should be mostly self-explanatory. The --signalEfficiency argument is the software efficiency of the likelihood cut. --vetoPercentile refers to the FADC veto cut. The method is applied to the entire center chip (--region=crAll). A --tracking flag could be added to apply the classifier only to the tracking data (without it, as in the call above, it is applied only to the non-tracking data); see the sketch below.
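For instance, a minimal sketch of the same call restricted to the tracking data; the output file name is purely illustrative:
# Same call as above, but only applied to the tracking data
# (output name chosen for illustration).
likelihood \
    -f $DATA/DataRuns2017_Reco.h5 \
    --h5out $DATA/classifier/run2_lnL_80_all_vetoes_tracking.h5 \
    --region=crAll \
    --cdlYear=2018 \
    --lnL --signalEfficiency=0.8 \
    --scintiveto --fadcveto --septemveto --lineveto \
    --vetoPercentile=0.99 \
    --cdlFile=$DATA/CDL_2019/calibration-cdl-2018.h5 \
    --calibFile=$DATA/CalibrationRuns2017_Reco.h5 \
    --tracking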
As explained in sec. 12.6, many different veto setups were considered. For this reason a helper tool, createAllLikelihoodCombinations (excuse the verbose name, heh), exists to simplify the process of calling likelihood. It is located in the Analysis directory.
For example:
./createAllLikelihoodCombinations \
    --f2017 $DATA/DataRuns2017_Reco.h5 \
    --f2018 $DATA/DataRuns2018_Reco.h5 \
    --c2017 $DATA/CalibrationRuns2017_Reco.h5 \
    --c2018 $DATA/CalibrationRuns2018_Reco.h5 \
    --regions crAll \
    --vetoSets "{fkMLP, fkFadc, fkScinti, fkSeptem, fkLineVeto, fkExclusiveLineVeto}" \
    --mlpPath $MLP/<mlp_of_choice>.pt \
    --fadcVetoPercentile 0.99 \
    --signalEfficiency 0.85 --signalEfficiency 0.90 \
    --signalEfficiency 0.95 --signalEfficiency 0.98 \
    --out $DATA/classifier/mlp \
    --cdlFile $DATA/CDL_2019/calibration-cdl-2018.h5 \
    --multiprocessing \
    --jobs 6 \
    --dryRun
would apply the MLP classifier (here with a dummy name) with successive additions of all vetoes in vetoSets at four different signal efficiencies and store the HDF5 output files in $DATA/classifier/mlp. Multiple --vetoSets arguments can be given in one call. Also, each fk entry can be prefixed with a + to indicate that this veto should not be run on its own, which allows for flexible setups; see the sketch below. As indicated by --multiprocessing and --jobs 6, this would run 6 processes in parallel. Note that each process can peak at up to 10 GB of RAM (depending on the used classifier and vetoes).
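As an illustration of the + prefix, consider the following hypothetical call; the veto set contents, efficiency and output path are illustrative, reusing only names from the example above.
# Hypothetical: two veto sets in one call. The '+' prefix on
# '+fkLineVeto' means the line veto is only applied on top of the
# preceding vetoes in that set, never as a standalone setup.
./createAllLikelihoodCombinations \
    --f2017 $DATA/DataRuns2017_Reco.h5 \
    --f2018 $DATA/DataRuns2018_Reco.h5 \
    --c2017 $DATA/CalibrationRuns2017_Reco.h5 \
    --c2018 $DATA/CalibrationRuns2018_Reco.h5 \
    --regions crAll \
    --vetoSets "{fkMLP, fkFadc, fkScinti}" \
    --vetoSets "{fkMLP, fkFadc, fkScinti, fkSeptem, +fkLineVeto}" \
    --mlpPath $MLP/<mlp_of_choice>.pt \
    --fadcVetoPercentile 0.99 \
    --signalEfficiency 0.95 \
    --out $DATA/classifier/mlp \
    --cdlFile $DATA/CDL_2019/calibration-cdl-2018.h5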
35.6. Computing limits
To compute limits we now feed each of the produced HDF5 output files of the likelihood program (in pairs of one Run-2 and one Run-3 output file) into mcmc_limit_calculation. As the number of combinations can be quite large, this can again be automated using the Analysis/runLimits program.
Let's continue with the example from above and pretend we wish to run 1000 toy candidate sets for each input pair, but only for those that include the line veto. This can be accomplished by using the --prefix argument to runLimits, which understands standard glob patterns.
./runLimits \
    --path $DATA/classifier/mlp \
    --prefix "lhood_c18_R2_crAll_*line*" \
    --exclude "_septem_" \
    --outpath $DATA/limits/mlp/ \
    --energyMin 0.2 --energyMax 12.0 \
    --axionModel <path_to_differential_axion_flux>.csv \
    --axionImage <path_to_axion_image>.csv \
    --combinedEfficiencyFile <path_to_detection_efficiency>.csv \
    --switchAxes \
    --nmc 1000 \
    --dryRun
Note that the CSV files mentioned here are found in the resources directory of the repository of this thesis. See the extended thesis on how to produce them.
To produce the observed limit, we run:
mcmc_limit_calculation \
    limit \
    -f <path_to_background_run2>.h5 \
    -f <path_to_background_run3>.h5 \
    --tracking <path_to_tracking_run2>.h5 \
    --tracking <path_to_tracking_run3>.h5 \
    --axionModel <path_to_differential_axion_flux>.csv \
    --axionImage <path_to_axion_image>.csv \
    --combinedEfficiencyFile <path_to_detection_efficiency>.csv \
    --switchAxes \
    --path "" \
    --years 2017 --years 2018 \
    --σ_p 0.05 \
    --energyMin 0.2 --energyMax 12.0 \
    --limitKind lkMCMC \
    --outpath ~/phd/Figs/trackingCandidates/ \
    --suffix ""
where <path_to_*_run> refers to the output HDF5 files of the likelihood program for the best setup, as described in sec. 13.13. --σ_p refers to the position uncertainty to use. The signal and background uncertainties have the real values set as their defaults.
With this done, you have successfully reproduced all the final results of the thesis! As mentioned in chapter 3, see the extended thesis for commands, code snippets and more information about how exactly each figure and table is produced.
35.6.1. Produce the axion model, axion image and detection efficiency file (extended)
For the axion model and axion image, see sec. 37.4.3. For the detection efficiency see sec. 13.10.4.4.1.