## 28. Appendix: Morphing of CDL reference spectra

This appendix contains further figures about the interpolation of the reference distributions entering the likelihood method, as discussed in sec. 12.2.6 and sec. 12.2.7. Fig. 1 directly shows the impact of interpolating the reference distributions on the likelihood values (colored points for each target/filter combination) and on the cut value (black line). Both the points and the cut values become smoother when utilizing the interpolation. Note that even the interpolation is not perfectly smooth, as the underlying reference data is still binned into a histogram.

Further, sec. 28.3 contains tilemaps of the non-interpolated reference data for each dataset and sec. 28.4 shows the fully interpolated space for each dataset. Finally, sec. 28.5 contains the histograms of the binwise linear interpolations, where for each ridge one dataset is skipped.

See the extended thesis for the development notes of the interpolation technique, including different approaches (e.g. binwise spline interpolation instead of linear interpolation).

\clearpage

### 28.1. Generate all morphing / tile related plots extended

In principle the plots are produced with the same command as in sec. 12.2.7.2, but here we wish to have wider versions that are not side-by-side:

```sh
FWIDTH=0.9 HEIGHT=420 WRITE_PLOT_CSV=true USE_TEX=true ./cdlMorphing \
    --cdlFile ~/CastData/data/CDL_2019/calibration-cdl-2018.h5 \
    --outpath ~/phd/Figs/CDL/cdlMorphing/fWidth0.9/
```

### 28.3. Tilemap of each likelihood dataset

The tilemaps shown here should be compared to the resulting raster plots of the next section, sec. 28.4. For each dataset entering the likelihood method, the reference distribution computed for each target/filter combination is spread over its applicable energy range. In the plots the colormap encodes the height of the distribution. The fluorescence lines that define the X-ray energy of each target are indicated. This illustrates how a cluster energy is mapped to a probability from each of the 8 target/filter distributions. The distinct jump in the distributions is visible in the form of a cut along the horizontal axis at different energies.

\clearpage

### 28.4. Interpolation of each likelihood dataset

In contrast to the plots of the previous section, sec. 28.3, all plots of the different datasets entering the likelihood method here show the distribution in property / energy space after binwise linear interpolation. Compared to the previous plots this results in a smooth transition over the entire energy range.

\clearpage

### 28.5. Binwise linear interpolations for each likelihood dataset

The figures in this section show the binwise linear interpolation obtained by skipping the dataset at each ridge and interpolating using the nearest neighbors above/below in energy. These are the same as shown in fig. #fig:cdl:cdl_morphing_frac_known_lines for all three datasets (i.e. fig. 9 is the same). As mentioned in the main text, skipping the dataset at each ridge means the interpolation errors are much larger than those encountered in practice (as shown in the previous section, sec. 28.4).

### 28.6. Notes on CDL morphing development extended

This section of the (not exported) appendix contains our notes about building and implementing the linear interpolation of CDL data (when talking about the 'current approach' below, we refer to the old, non-interpolating approach). It should give an understanding of what was considered and why the final choice is a linear interpolation.

One problem with the current approach of utilizing the CDL data is that the reference distributions for the different logL variables are discontinuous between two energy bins. This means that if a cluster moves from one bin to another, it suddenly faces a very different cut for each property.
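To make this concrete, here is a small hypothetical sketch of such hard energy binning (the boundaries and dataset names below are made up, not the actual CDL ranges):

```python
# hypothetical energy-range boundaries (keV) -- NOT the actual CDL ranges
bounds = [0.4, 0.7, 1.2, 2.1, 3.2, 4.9, 6.9]
datasets = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"]

def dataset_for(energy):
    """Hard binning: pick the dataset of the first range containing the energy."""
    for bound, dset in zip(bounds, datasets):
        if energy < bound:
            return dset
    return datasets[-1]

# two clusters with almost identical energy end up being compared
# against entirely different reference distributions
print(dataset_for(1.19), dataset_for(1.21))  # → D3 D4
```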

It might be possible to morph CDL spectra between two energies, that is, to allow interpolation between the shapes of two neighboring reference datasets.

The discontinuity is the likely cause of the sudden steps visible in the background rate. With a fully morphed function these should hopefully disappear.

#### 28.6.1. References & ideas

Read up on morphing of different functions:

- in HEP: https://indico.cern.ch/event/507948/contributions/2028505/attachments/1262169/1866169/atlas-hcomb-morphwshop-intro-v1.pdf
- https://mathematica.stackexchange.com/questions/208990/morphing-between-two-functions
- https://mathematica.stackexchange.com/questions/209039/convert-symbolic-to-numeric-code-speed-up-morphing

Aside from morphing, the theory of optimal transport seems to be directly related to such problems:

- https://de.wikipedia.org/wiki/Optimaler_Transport (funny, this can be described by topology using GR lingo; there's no English article on this)
- https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)

This seems to imply that, given some functions \(f(x)\), \(g(x)\), we are looking for the transport map \(T\) with \(T: f(x) \rightarrow g(x)\), in the language of transportation theory.

See the linked article about the Wasserstein metric:

- https://en.wikipedia.org/wiki/Wasserstein_metric in particular the section about its connection to the optimal transport problem

It describes a distance metric between two probability distributions. In that sense the distance between two distributions to be transported is directly related to the Wasserstein distance.
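As a stdlib-only toy illustration (not part of the analysis code): for two normalized histograms on a common bin grid, the 1D Wasserstein-1 distance is simply the integrated absolute difference of their cumulative distributions.

```python
from itertools import accumulate

def wasserstein_1d(p, q, bin_width=1.0):
    """1D Wasserstein-1 distance between two normalized histograms on a
    common bin grid: the integral of |CDF_p - CDF_q| over the grid."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

# moving all probability mass one bin to the right costs exactly one bin width
p = [0.0, 1.0, 0.0, 0.0]
q = [0.0, 0.0, 1.0, 0.0]
print(wasserstein_1d(p, q))  # → 1.0
```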

One of the major constraints of transportation theory is that the transportation has to preserve the integral of the transported function. Technically, this is not the case for our application to the CDL data, due to the different amounts of data available for each target. However, we of course normalize the CDL data and assume that the given data actually is a PDF. At that point each distribution is normalized to 1 and thus each morphed function has to be normalized to 1 as well. This is a decent check for a morph result. If a morphing technique does **not** satisfy this property, we need to renormalize the result.

On the other hand, considering the slides by Verkerke (indico.cern.ch link above), the morphing between two functions can also be interpreted as a simple interpolation problem.

In that sense there are multiple approaches to compute an intermediate step of the CDL distributions.

- visualize the CDL data as a 2D heatmap:
  - value of the logL variable is x
  - energy of each fluorescence line is y
- linear interpolation at a specific energy \(E\) based on the two neighboring CDL distributions (interpolation thus **only** along the y axis)
- spline interpolation over all energy ranges
- KDE along the energy axis (only along **y**). Extend the KDE to 2D?
- bicubic interpolation (-> problematic, because our energies/variables are **not** spread on a rectilinear grid; the energy is not spread evenly)
- other distance based interpolations, i.e. a KD-tree? Simply perform an interpolation based on all neighboring points within a certain distance?

Of these, options 2 and 4 (linear interpolation and KDE) seem the easiest to implement. A KD-tree would also be easy, provided I finally finish that implementation.

We will investigate different ideas in ./../CastData/ExternCode/TimepixAnalysis/Tools/cdlMorphing/cdlMorphing.nim.

##### 28.6.1.1. DONE Visualize CDL data as a 2D heatmap

In each of the following plots each distribution (= target filter combination) is normalized to 1.

First let's visualize all of the CDL data as a scatter plot. That is pretty simple and gives an idea of where the lines are and what the shape roughly is, see fig. 11.

Now we can check out what the data looks like if we interpret the whole (value of each variable, Energy) phase space as a tile map. In this way, morphing can be interpreted as performing interpolation along the energy axis in the resulting tile map.

In addition, each figure contains colored lines marking the **start** of each energy range as currently used. So clusters are associated with the distribution defined by the line below them.

Furthermore, the energy of each fluorescence line is plotted in red at the corresponding energy value. This also shows that the intervals and the energies of the lines are highly asymmetric.

##### 28.6.1.2. DONE Morph by linear interpolation bin by bin

Based on the tile maps in the previous section it seems like a decent idea to perform a linear interpolation for any point in between two intervals:

\[ f(x, E) = f_\text{low}(x) \left(1 - \frac{|L_\text{low} - E|}{\Delta E}\right) + f_\text{high}(x) \left(1 - \frac{|L_\text{high} - E|}{\Delta E}\right), \]

where \(f_\text{low,high}\) are the distributions below / above the given energy \(E\). \(L_\text{low,high}\) corresponds to the energy of the fluorescence line corresponding to the distribution below / above \(E\). \(\Delta E\) is the difference in energy between the lower and higher fluorescence lines. \(x\) is the value of the given logL variable.

In code this is:

```nim
let E = ... # given as argument to function
let lineEnergies = getXrayFluorescenceLines()
let refLowT = df.filter(f{string -> bool: `Dset` == refLow})["Hist", float]
let refHighT = df.filter(f{string -> bool: `Dset` == refHigh})["Hist", float]
result = zeros[float](refLowT.size.int)
let deltaE = abs(lineEnergies[idx] - lineEnergies[idx+offset])
# walk over each bin and compute linear interpolation between the two
for i in 0 ..< refLowT.size:
  result[i] = refLowT[i] * (1 - (abs(lineEnergies[idx] - E)) / deltaE) +
              refHighT[i] * (1 - (abs(lineEnergies[idx+offset] - E)) / deltaE)
```
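The same bin-wise interpolation can be sketched in a few lines of Python (purely illustrative, not the actual implementation; the line energies are arbitrary example values) to verify the normalization property mentioned earlier: for any \(E\) between the two lines the two weights sum to 1, so the morph of two normalized histograms stays normalized.

```python
def morph(ref_low, ref_high, e, e_low, e_high):
    """Bin-wise linear interpolation between two normalized reference
    histograms at an energy e between the fluorescence lines e_low, e_high."""
    delta_e = abs(e_low - e_high)
    w_low = 1 - abs(e_low - e) / delta_e
    w_high = 1 - abs(e_high - e) / delta_e
    return [w_low * lo + w_high * hi for lo, hi in zip(ref_low, ref_high)]

low = [0.2, 0.5, 0.3]   # hypothetical normalized reference histograms
high = [0.1, 0.3, 0.6]
morphed = morph(low, high, e=1.2, e_low=0.93, e_high=1.49)
# the weights (1 - |L - E| / dE) sum to 1 for e_low <= e <= e_high,
# so the morphed histogram is normalized up to floating point error
print(abs(sum(morphed) - 1.0) < 1e-9)  # → True
```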

Doing this for a point between two lines is not particularly helpful, because we do not *know* what the distribution in between actually looks like. Instead, for validation we now try to compute the `Cu-EPIC-0.9kV` distribution (corresponding to the \(\text{O K}_{α}\) line at \(\SI{0.525}{\kilo \electronvolt}\)) based on the `C-EPIC-0.6kV` and `Cu-EPIC-2kV` distributions. That means we interpolate the second ridge from the first and third in the CDL ridgeline plots.

This is shown in figs. 15, 16 and 17. The real data for each distribution is shown in red, overlaid with the morphed bin-wise linear interpolation for the second ridge.

##### 28.6.1.3. DONE Compute all reference spectra from neighbors

##### 28.6.1.4. DONE Compute full linear interpolation between fluorescence lines

We can now apply the lessons from the last section to compute arbitrary reference spectra. We will use this to compute a heatmap of all possible energies in between the first and last fluorescence line.

For all three logL variables, these are shown in fig.

#### 28.6.2. KDE approach extended

Using a KDE is problematic, because our data is of course already pre-binned. This leads to a very sparse phase space, which either makes the local prediction around a known distribution good but fails miserably in between them (small bandwidth), or gives decent predictions in between but a pretty bad reconstruction of the known distributions (larger bandwidth).

There is also a strong conflict in bandwidth selection, due to the non-linear steps in energy between the different CDL distributions. This leads to a too large / too small bandwidth at either end of the energy range.
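This bandwidth conflict can be illustrated with a minimal stdlib-only Gaussian KDE sketch (not the actual KDE code used here; the line energies are approximate):

```python
import math

def gauss_kde(centers, weights, bw):
    """Return a 1D Gaussian KDE built from weighted points, e.g. pre-binned
    reference data placed at the fluorescence-line energies."""
    def f(x):
        s = sum(w * math.exp(-0.5 * ((x - c) / bw) ** 2)
                for c, w in zip(centers, weights))
        return s / (bw * math.sqrt(2 * math.pi))
    return f

# approximate CDL fluorescence-line energies in keV (unevenly spaced!)
lines = [0.28, 0.53, 0.93, 1.49, 2.98, 4.51, 5.90, 8.04]
weights = [1.0] * len(lines)

narrow = gauss_kde(lines, weights, bw=0.3)  # good at the lines, holes in between
wide   = gauss_kde(lines, weights, bw=1.5)  # smooth in between, washes out lines

print(narrow(7.0) < wide(7.0))    # → True: narrow bandwidth leaves a hole at 7 keV
print(narrow(5.90) > wide(5.90))  # → True: wide bandwidth washes out the line itself
```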

Fig. 24, 25, 26 show the default bandwidth (Silverman's rule of thumb). In comparison Fig. 27, 28, 29 show the same plot using a much smaller custom bandwidth of 0.3 keV. The agreement is much better, but the actual prediction between the different distributions becomes much worse. Compare fig. 30 (default bandwidth) to fig. 31. The latter has regions of almost no counts, which is obviously wrong.

Note that fig. 30 is also problematic. An effect of a bad KDE input is visible: the bandwidth relative to the number of data points is such that the center region (in energy) has higher values than the edges, because predictions near the boundaries see no signal. (This boundary effect could be corrected for by assuming suitable boundary conditions, e.g. just extending the first/last distributions in the respective ranges. It is not clear, however, in what spacing such a distribution should be placed etc.)

\clearpage


#### 28.6.3. Spline approach

Another idea is to use a spline interpolation. This has the advantage that the existing distributions will be correctly predicted (as for the linear interpolation), but it possibly yields better results between distributions (or when predicting a known distribution).

Figs. 32, 33 and 34 show the prediction using a spline. As for the linear interpolation, each morphed distribution was computed by excluding that distribution from the spline definition and then predicting at the energy of the respective fluorescence line.

The result looks somewhat better in certain areas than the linear interpolation, but has unphysical artifacts in other areas (negative values) while also deviating quite a bit. For that reason it seems like simpler is better in the case of CDL morphing (at least if it's done bin-wise).
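The negative values are a generic overshoot artifact of cubic interpolation near sharp features. A minimal sketch using a Catmull-Rom cubic (not the spline implementation actually used) shows the dip below zero next to an isolated peak:

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate a Catmull-Rom cubic segment between p1 and p2 at t in [0, 1]."""
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)

# non-negative histogram heights with an isolated peak: ..., 0, 1, 0, 0, ...
# interpolate the segment right after the peak (p0 = peak, p1 = p2 = p3 = 0)
val = catmull_rom(1.0, 0.0, 0.0, 0.0, 0.5)
print(val)  # → -0.0625: the cubic dips below zero, unphysical for a PDF
```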

\clearpage

#### 28.6.4. Summary

For the time being we will use the linear interpolation method and see where this leads us. It should definitely be a big improvement over the current interval-based option.

For the results of applying linear interpolation based morphing to the likelihood analysis see section 28.6.5.

#### 28.6.5. Implementation in `likelihood.nim`

Thoughts on the implementation in `likelihood.nim` for CDL morphing:

- Add the interpolation code from `cdlMorphing.nim` in `private/cdl_cuts.nim`.
- Add a field to `config.nim` that describes the morphing technique to be used.
- Add an enum for the possible morphing techniques, `MorphingKind`, with fields `mkNone` and `mkLinear`.
- In `calcCutValueTab` we currently return a `Table[string, float]` mapping target/filter combinations to cut values. This needs to be modified such that we have something that hides away the input -> output mapping and yields what we need. Define a `CutValueInterpolator` type, which is returned instead. It will be a variant object with case `kind: MorphingKind`. This object will allow access to cut values based on:
  - `string`: a target/filter combination.
    - `mkNone`: access the internal `Table[string, float]` as done currently.
    - `mkLinear`: raise an exception, since string access does not make sense here.
  - `float`: an energy in keV.
    - `mkNone`: convert the energy to a target/filter combination and access the internal `Table`.
    - `mkLinear`: access the closest energy distribution and return its cut value.
- In `filterClustersByLogL` replace the `cutTab` name and access by the energy of the cluster instead of converting to a target/filter dataset.

With these steps we should have a working interpolation routine. The code used in the `cdlMorphing.nim` test script needs to be added of course to provide the linearly interpolated logic (see step 0).
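The access pattern described above can be sketched as follows (an illustrative Python mock of the Nim design; the energies and cut values are made up):

```python
from enum import Enum, auto

class MorphingKind(Enum):
    mkNone = auto()
    mkLinear = auto()

class CutValueInterpolator:
    """Illustrative mock of the Nim variant object: hides whether cut
    values are looked up by target/filter name or by energy."""
    def __init__(self, kind, cut_tab=None, energies=None, cuts=None):
        self.kind = kind
        self.cut_tab = cut_tab or {}    # mkNone: target/filter -> cut value
        self.energies = energies or []  # mkLinear: line energies in keV
        self.cuts = cuts or []          # mkLinear: cut value per line energy

    def __getitem__(self, key):
        if isinstance(key, str):  # a target/filter combination
            if self.kind is MorphingKind.mkLinear:
                raise KeyError("string access makes no sense for mkLinear")
            return self.cut_tab[key]
        # otherwise: an energy in keV
        if self.kind is MorphingKind.mkNone:
            raise NotImplementedError("convert energy -> target/filter; elided")
        # mkLinear: return the cut value of the closest line energy
        idx = min(range(len(self.energies)),
                  key=lambda i: abs(self.energies[i] - key))
        return self.cuts[idx]

# made-up line energies and cut values, purely for illustration
interp = CutValueInterpolator(MorphingKind.mkLinear,
                              energies=[0.93, 1.49, 2.98],
                              cuts=[10.0, 12.0, 14.0])
print(interp[1.6])  # → 12.0: the closest line is at 1.49 keV
```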

##### 28.6.5.1. Bizarre Al-Al 4kV behavior with `mkLinear`

After the first implementation we see some very bizarre behavior in the case of linear interpolation for the logL distributions.

This is visible both with `plotCdl.nim` as well as with the plotting code in `likelihood.nim`.

See fig. 35.

**UPDATE**: The issue was a couple of bugs & design choices in the implementation of the linear interpolation in `likelihood_utils.nim`, in particular the design of the DF returned from `getInterpolatedWideDf` and a bug that accessed the full DF instead of the sub-DF in the loop.

The fixed result is shown in fig. 36 and in comparison the result using no interpolation (the reference in a way) in fig. 37.