35. Full data reconstruction    Appendix

We will now go over how to fully reconstruct all CAST data from the Run-2 and Run-3 data taking campaigns all the way up to the computation of the observed limit. This assumes the following:

  • all CAST data files are located in a directory pointed to by a DATA environment variable. Inside it should be the directory structure as downloaded from Zenodo (Schmidt 2024), containing the directories 2017, 2018, 2018_2 and CDL_2019.
  • all binaries used below are compiled and their locations are in your PATH variable. There is a bin directory in the repository root, which contains symbolic links to the binaries, provided they are compiled next to their source files. I recommend adding this directory to your PATH, as sketched below.
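
One way to set this up is via your shell profile. The paths here are examples for a local setup, not fixed requirements; adjust them to your own checkout:

export DATA=~/CastData/data          # contains the 2017, 2018, 2018_2 and CDL_2019 directories
export TPXDIR=~/src/TimepixAnalysis  # root of the TimepixAnalysis repository (example path)
export PATH="$TPXDIR/bin:$PATH"      # bin/ holds symlinks to the compiled binaries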

Note: There is also the Analysis/ingrid/runAnalysisChain program, which automates the majority of these calls. See the section below on how it replaces the manual data parsing and reconstruction. Here I wish to illustrate the actual steps rather than hide any of them.

Furthermore, the commands below produce the correct results assuming the Analysis/ingrid/config.toml file is used as committed as part of the phd tag in the TimepixAnalysis repository.

35.1. Raw data parsing and reconstruction

If this is all in place, first perform the raw data parsing.

Run-2, background and calibration:

raw_data_manipulation -p $DATA/2017/DataRuns \
                      --runType rtBackground \
                      --out $DATA/DataRuns2017_Raw.h5
raw_data_manipulation -p $DATA/2017/CalibrationRuns \
                      --runType rtCalibration \
                      --out $DATA/CalibrationRuns2017_Raw.h5

Run-3:

raw_data_manipulation -p $DATA/2018_2/DataRuns \
                      --runType rtBackground \
                      --out $DATA/DataRuns2018_Raw.h5
raw_data_manipulation -p $DATA/2018_2/CalibrationRuns \
                      --runType rtCalibration \
                      --out $DATA/CalibrationRuns2018_Raw.h5
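
The four parsing calls can equivalently be written as a loop, in the same style as the reconstruction loop further below. Note that the Run-3 input data lives in the 2018_2 directory, while the output files are named after 2018:

for year in 2017 2018; do
    dir=$year
    [ "$year" = "2018" ] && dir=2018_2   # Run-3 raw data is stored in 2018_2
    raw_data_manipulation -p $DATA/$dir/DataRuns \
                          --runType rtBackground \
                          --out $DATA/DataRuns${year}_Raw.h5
    raw_data_manipulation -p $DATA/$dir/CalibrationRuns \
                          --runType rtCalibration \
                          --out $DATA/CalibrationRuns${year}_Raw.h5
done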

Next, the initial data reconstruction (geometric properties):

Run-2:

reconstruction -i $DATA/DataRuns2017_Raw.h5 \
               -o $DATA/DataRuns2017_Reco.h5
reconstruction -i $DATA/CalibrationRuns2017_Raw.h5 \
               -o $DATA/CalibrationRuns2017_Reco.h5

Run-3:

reconstruction -i $DATA/DataRuns2018_Raw.h5 \
               -o $DATA/DataRuns2018_Reco.h5
reconstruction -i $DATA/CalibrationRuns2018_Raw.h5 \
               -o $DATA/CalibrationRuns2018_Reco.h5

Now, the next steps of the reconstruction (charge calibration, gas gain and FADC reconstruction):

# DATA is assumed to be set as described above, e.g. DATA=~/CastData/data
for typ in Data Calibration; do
    for year in 2017 2018; do
        file="${DATA}/${typ}Runs${year}_Reco.h5"
        reconstruction -i $file --only_charge
        reconstruction -i $file --only_fadc
        reconstruction -i $file --only_gas_gain
    done
done

where we simply loop over the Data and Calibration prefixes and years.

With this done, we can perform the ⁵⁵Fe calibration fits:

reconstruction -i $DATA/CalibrationRuns2017_Reco.h5 --only_fe_spec
reconstruction -i $DATA/CalibrationRuns2018_Reco.h5 --only_fe_spec

and then finally fit the energy calibration factors determined from each ⁵⁵Fe fit against the corresponding gas gain:

reconstruction -i $DATA/CalibrationRuns2017_Reco.h5 --only_gain_fit
reconstruction -i $DATA/CalibrationRuns2018_Reco.h5 --only_gain_fit

This then allows us to calibrate the energy in each file:

reconstruction -i $DATA/DataRuns2017_Reco.h5 --only_energy_from_e
reconstruction -i $DATA/DataRuns2018_Reco.h5 --only_energy_from_e
reconstruction -i $DATA/CalibrationRuns2017_Reco.h5 --only_energy_from_e
reconstruction -i $DATA/CalibrationRuns2018_Reco.h5 --only_energy_from_e
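
If the standard HDF5 command line tools are installed, an optional sanity check of the resulting file layout is possible at this point (purely illustrative, not required for any later step):

h5ls -r $DATA/DataRuns2017_Reco.h5 | head -n 20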

35.2. Parse and reconstruct the CDL data

In order to use the likelihood cut method we also need to parse and reconstruct the CDL data. Note that this can be done before even reconstructing any of the CAST data files (with the exception of --only_energy_from_e, which is optional anyway).

raw_data_manipulation -p $DATA/CDL_2019/ -r Xray -o $DATA/CDL_2019/CDL_2019_Raw.h5
reconstruction -i $DATA/CDL_2019/CDL_2019_Raw.h5 -o $DATA/CDL_2019/CDL_2019_Reco.h5
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_charge
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_fadc
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_gas_gain
reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 --only_energy_from_e
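
These four reconstruction passes mirror the ones for the CAST files. Equivalently, they can be written as a loop over the option flags:

for flag in --only_charge --only_fadc --only_gas_gain --only_energy_from_e; do
    reconstruction -i $DATA/CDL_2019/CDL_2019_Reco.h5 $flag
done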

With this file we can then run cdl_spectrum_creation in order to produce the calibration-cdl-2018.h5 file:

cdl_spectrum_creation $DATA/CDL_2019/CDL_2019_Reco.h5 \
                      --genCdlFile --year=2018

35.3. Add tracking information to background files

To add the tracking information, the CAST log files are needed. The default path (which may of course be a symbolic link) is resources/LogFiles/tracking-logs relative to the TimepixAnalysis root.

for year in 2017 2018; do
    cast_log_reader tracking \
                -p $TPXDIR/resources/LogFiles/tracking-logs \
                --startTime "2017/01/01" \
                --endTime "2018/12/31" \
                --h5out "${DATA}/DataRuns${year}_Reco.h5"
done

Here we assume TPXDIR is an environment variable pointing to the root of the TimepixAnalysis repository. Feel free to adjust this if your path differs.

35.4. Using runAnalysisChain

All of the commands above can also be performed in one go using runAnalysisChain:

./runAnalysisChain \
    -i $DATA \
    --outpath $DATA \
    --years 2017 --years 2018 \
    --calib --back --cdl \
    --raw --reco \
    --logL \
    --tracking

which tells the runAnalysisChain helper to parse all data files for Run-2 and Run-3, do the same for the CDL data files, run all reconstruction steps, compute the likelihood values and add the tracking information to the background data files. It makes the same assumptions about the location of the data files as the manual commands above.

In particular, if the tracking log files are in a different location, add the --trackingLogs argument with the correct path, as in the following sketch.
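
For example, with the log directory from the previous section:

./runAnalysisChain -i $DATA --outpath $DATA \
    --years 2017 --years 2018 \
    --calib --back --cdl --raw --reco --logL --tracking \
    --trackingLogs $TPXDIR/resources/LogFiles/tracking-logs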

35.5. Applying a classifier

To apply a classifier we use the likelihood program. We will now show how to apply it to a single file. Generally, calling likelihood manually is not needed; more on that below.

Let's apply the likelihood cut method using all vetoes for the Run-2 data:

likelihood \
    -f $DATA/DataRuns2017_Reco.h5 \
    --h5out $DATA/classifier/run2_lnL_80_all_vetoes.h5 \
    --region=crAll \
    --cdlYear=2018 \
    --lnL --signalEfficiency=0.8 \
    --scintiveto --fadcveto --septemveto --lineveto \
    --vetoPercentile=0.99 \
    --cdlFile=$DATA/CDL_2019/calibration-cdl-2018.h5 \
    --calibFile=$DATA/CalibrationRuns2017_Reco.h5

which should be mostly self-explanatory. The --signalEfficiency argument is the software efficiency of the likelihood cut. --vetoPercentile refers to the FADC veto cut. The method is applied to the entire center chip (--region=crAll). A --tracking flag could be added to apply the classifier only to the tracking data (as written above, it is applied only to non-tracking data); see the sketch below.
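
For instance, the same call restricted to tracking data only would look as follows (the output file name is merely illustrative):

likelihood \
    -f $DATA/DataRuns2017_Reco.h5 \
    --h5out $DATA/classifier/run2_lnL_80_all_vetoes_tracking.h5 \
    --region=crAll \
    --cdlYear=2018 \
    --lnL --signalEfficiency=0.8 \
    --scintiveto --fadcveto --septemveto --lineveto \
    --vetoPercentile=0.99 \
    --cdlFile=$DATA/CDL_2019/calibration-cdl-2018.h5 \
    --calibFile=$DATA/CalibrationRuns2017_Reco.h5 \
    --tracking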

As explained in sec. 12.6, many different veto setups were considered. For this reason a helper tool, createAllLikelihoodCombinations (excuse the verbose name, heh), exists to simplify the process of calling likelihood. It is located in the Analysis directory.

For example:

./createAllLikelihoodCombinations \
    --f2017 $DATA/DataRuns2017_Reco.h5 \
    --f2018 $DATA/DataRuns2018_Reco.h5 \
    --c2017 $DATA/CalibrationRuns2017_Reco.h5 \
    --c2018 $DATA/CalibrationRuns2018_Reco.h5 \
    --regions crAll \
    --vetoSets "{fkMLP, fkFadc, fkScinti, fkSeptem, fkLineVeto, fkExclusiveLineVeto}" \
    --mlpPath $MLP/<mlp_of_choice>.pt \
    --fadcVetoPercentile 0.99 \
    --signalEfficiency 0.85 --signalEfficiency 0.90 \
    --signalEfficiency 0.95 --signalEfficiency 0.98 \
    --out $DATA/classifier/mlp \
    --cdlFile $DATA/CDL_2019/calibration-cdl-2018.h5 \
    --multiprocessing \
    --jobs 6 \
    --dryRun

would apply the MLP classifier (here with a placeholder name) with successive additions of all vetoes in --vetoSets at 4 different signal efficiencies and store the HDF5 output files in $DATA/classifier/mlp. Multiple --vetoSets arguments can be given in one call. Each fk entry can also be prefixed with a + to indicate that this veto should not be run on its own, which allows flexible combinations without computing every standalone case; see the fragment below. As indicated by --multiprocessing and --jobs 6, this call would run 6 processes in parallel. Note that each process can peak at up to 10 GB of RAM (depending on the classifier and vetoes used). The --dryRun flag performs a dry run without starting the actual likelihood calls; remove it to run the combinations.
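
As an illustration of the + prefix, a call might pass two sets like the following fragment (an invented combination, purely for illustration):

    --vetoSets "{fkMLP, +fkFadc, +fkScinti}" \
    --vetoSets "{fkMLP, fkSeptem, +fkLineVeto}"

where, per the + prefix, fkFadc, fkScinti and fkLineVeto are never run on their own.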

35.6. Computing limits

To compute limits, we now feed the HDF5 output files produced by the likelihood program (in pairs of Run-2 and Run-3 output files) into mcmc_limit_calculation. As the number of combinations can be quite large, this can again be automated using the Analysis/runLimits program.

Let's continue with the example from above and suppose we wish to run 1000 toy candidate sets for each input pair, but only for those files that include the line veto. This can be accomplished using the --prefix argument of runLimits, which understands standard glob patterns.

./runLimits \
    --path $DATA/classifier/mlp \
    --prefix "lhood_c18_R2_crAll_*line*" \
    --exclude "_septem_" \
    --outpath $DATA/limits/mlp/ \
    --energyMin 0.2 --energyMax 12.0 \
    --axionModel <path_to_differential_axion_flux>.csv \
    --axionImage <path_to_axion_image>.csv \
    --combinedEfficiencyFile <path_to_detection_efficiency>.csv \
    --switchAxes \
    --nmc 1000 \
    --dryRun
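
A quick way to roughly preview which files such a pattern (together with the --exclude string) would select, before starting any potentially long limit computations, is an ordinary shell glob:

ls $DATA/classifier/mlp/lhood_c18_R2_crAll_*line* | grep -v "_septem_"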

Note that the CSV files mentioned here are found in the resources directory of the repository of this thesis. See the extended thesis for how to produce them.

To produce the observed limit, we run:

mcmc_limit_calculation \
    limit \
    -f <path_to_background_run2>.h5 \
    -f <path_to_background_run3>.h5 \
    --tracking <path_to_tracking_run2>.h5 \
    --tracking <path_to_tracking_run3>.h5 \
    --axionModel <path_to_differential_axion_flux>.csv \
    --axionImage <path_to_axion_image>.csv \
    --combinedEfficiencyFile <path_to_detection_efficiency>.csv \
    --switchAxes \
    --path "" \
    --years 2017 --years 2018 \
    --σ_p 0.05 \
    --energyMin 0.2 --energyMax 12.0 \
    --limitKind lkMCMC \
    --outpath ~/phd/Figs/trackingCandidates/ \
    --suffix ""

where the <path_to_*_run> placeholders refer to the output HDF5 files of the likelihood program for the best setup, as described in sec. 13.13. --σ_p is the position uncertainty to use. The signal and background uncertainties default to their real values.

With this done, you have successfully reproduced the final results of the thesis! As mentioned in chapter 3, see the extended thesis for commands, code snippets and more information about exactly how each figure and table is produced.

35.6.1. Produce the axion model, axion image and detection efficiency file    extended

For the axion model and axion image, see sec. 37.4.3. For the detection efficiency see sec. 13.10.4.4.1.

Note: sections marked 'extended' are part of the extended version of the thesis.