28. Morphing of CDL reference spectra    Appendix

This appendix contains further figures about the interpolation of the reference distributions entering the likelihood method as discussed in sec. 12.2.6 and sec. 12.2.7. Fig. 1 directly shows the impact of interpolating the reference distributions on the likelihood values (colored points for each different target/filter combination) and the cut value (black line). Both the points and the cut values become smoother utilizing the interpolation. Note that even the interpolation is not perfectly smooth, as the underlying reference data is still binned to a histogram.

Further, section 28.3 contains tilemaps of the not interpolated reference data for each dataset and sec. 28.4 shows the fully interpolated space for each dataset. Finally, sec. 28.5 contains the histograms of the binwise linear interpolations where for each ridge one dataset is skipped.

See the extended thesis for the development notes of the interpolation technique, including different approaches (e.g. binwise spline interpolation instead of linear interpolation).

logL_of_CDL_vs_energy.svg
Figure 1: Figure 140: \(\ln\mathcal{L}\) values for all the cleaned CDL cluster data against the energy of the cluster. Left is the binwise linear interpolation and right is the calculation using the old cite:krieger2018search approach of fixed energy intervals. The bin wise linear interpolation helps to provide a smoother description of the \(\ln\mathcal{L}\) data.

\clearpage

28.1. Generate all morphing / tile related plots    extended

In principle the plots are produced with the same command as in sec. 12.2.7.2, but here we wish to have wider versions that are not side-by-side:

FWIDTH=0.9 HEIGHT=420 WRITE_PLOT_CSV=true USE_TEX=true ./cdlMorphing \
    --cdlFile ~/CastData/data/CDL_2019/calibration-cdl-2018.h5 \
    --outpath ~/phd/Figs/CDL/cdlMorphing/fWidth0.9/

28.2. Generate plot comparing likelihood behavior    extended

Fig. 1 is produced in sec. 12.2.5.2.

28.3. Tilemap of each likelihood dataset

The tilemaps shown here should be compared to the resulting raster plots of the next section, sec. 28.4. For each dataset entering the likelihood method the reference distribution computed for each target/filter combination is spread over its applicable energy range. In the plots the colormap is the height of the distribution. The fluorescence lines are indicated that define the X-ray energy of each target. This illustrates how a cluster energy is mapped to a probability from each of the 8 target/filter distributions. The distinct jump in the distributions is visible in the form of a cut along the horizontal axis at different energies.

cdl_as_tile_morph_mtNone_eccentricity.svg
Figure 2: Figure 141: Tilemap of the eccentricity dataset against energy. The colormap corresponds to the height of the eccentricity distribution at that point. Along the energy axis the data is constant within the applicable energy range of the target/filter kind. The energy of each fluorescence line is indicated.
cdl_as_tile_morph_mtNone_fractionInTransverseRms.svg
Figure 3: Figure 142: Tilemap of the fractionInTransverseRms dataset against energy. The colormap corresponds to the height of the fractionInTransverseRms distribution at that point. Along the energy axis the data is constant within the applicable energy range of the target/filter kind. The energy of each fluorescence line is indicated.
cdl_as_tile_morph_mtNone_lengthDivRmsTrans.svg
Figure 4: Figure 143: Tilemap of the lengthDivRmsTrans dataset against energy. The colormap corresponds to the height of the lengthDivRmsTrans distribution at that point. Along the energy axis the data is constant within the applicable energy range of the target/filter kind. The energy of each fluorescence line is indicated.

\clearpage

28.4. Interpolation of each likelihood dataset

In contrast to the plots of the previous section, sec. 28.3 all plots of the different datasets entering the likelihood method here show the distribution in property / energy space after binwise linear interpolation. Compared to the previous plots it results in a smooth transition over the entire energy range.

cdl_as_raster_interpolated_morph_mtNone_eccentricity.svg
Figure 5: Figure 144: Raster plot of the eccentricity dataset against energy after binwise linear interpolation. The colormap corresponds to the height of the eccentricity distribution at that point. The energy of each fluorescence line is indicated.
cdl_as_raster_interpolated_morph_mtNone_fractionInTransverseRms.svg
Figure 6: Figure 145: Raster plot of the fractionInTransverseRms dataset against energy after binwise linear interpolation. The colormap corresponds to the height of the fractionInTransverseRms distribution at that point. The energy of each fluorescence line is indicated.
cdl_as_raster_interpolated_morph_mtNone_lengthDivRmsTrans.svg
Figure 7: Figure 146: Raster plot of the lengthDivRmsTrans dataset against energy after binwise linear interpolation. The colormap corresponds to the height of the lengthDivRmsTrans distribution at that point. The energy of each fluorescence line is indicated.

\clearpage

28.5. Binwise linear interpolations for each likelihood dataset

The figures in this section show the binwise linear interpolation by skipping the dataset in each ridge and interpolating using the nearest neighbors above/below in energy. These are the same as shown in fig. #fig:cdl:cdl_morphing_frac_known_lines for all three datasets (i.e. fig. 9 is the same). As mentioned in the main text, skipping the dataset at each ridge means the interpolation errors are much larger than used in practice (as shown in the previous section 28.4).

eccentricity_ridgeline_morph_mtLinear_calibration-cdl-2018.h5_2018.svg
Figure 8: Figure 147: Binwise linear interpolation of the eccentricity dataset skipping the target/filter dataset being interpolated.
fractionInTransverseRms_ridgeline_morph_mtLinear_calibration-cdl-2018.h5_2018.svg
Figure 9: Figure 148: Binwise linear interpolation of the fractionInTransverseRms dataset skipping the target/filter dataset being interpolated.
lengthDivRmsTrans_ridgeline_morph_mtLinear_calibration-cdl-2018.h5_2018.svg
Figure 10: Figure 149: Binwise linear interpolation of the lengthDivRmsTrans dataset skipping the target/filter dataset being interpolated.

28.6. Notes on CDL morphing development    extended

This section of the (not exported) appendix contains our notes about building and implementing the linear interpolation of CDL data (when talking about 'current approach' in the below, we refer to the old, non interpolating approach). It should give an understanding of what was considered and why the final choice is a linear interpolation.

One problem with the current approach of utilizing the CDL data is that the reference distributions for the different logL variables are non continuous between two energy bins. This means that if a cluster is moved from one bin to another it suddenly has a very different cut for each property.

It might be possible to morph CDL spectra between two energies. That is to allow interpolation between the shape of two neighboring reference datasets.

This is the likely cause for the sudden steps visible in the background rate. With a fully morphed function this should hopefully disappear.

28.6.1. References & ideas

Read up on morphing of different functions:

Aside from morphing, the theory of optimal transport seems to be directly related to such problems:

This seems to imply that given some functions \(f(x)\), \(g(x)\) we are looking for the transport function \(T\) which maps \(T: f(x) \rightarrow g(x)\) in the language of transportation theory.

See the linked article about the Wasserstein metric:

It describes a distance metric between two probability distributions. In that sense the distance between two distributions to be transported is directly related to the Wasserstein distance.

One of the major constraints of transportation theory is that the transportation has to preserve the integral of the transported function. Technically, this is not the case for our application to the CDL data, due to different amount of data available for each target. However, of course we normalize the CDL data and assume that the given data actually is a PDF. At that point each distribution is normalized to 1 and thus each morphed function has to be normalized to 1 as well. This is a decent check for a morph result. If a morphing technique does not satisfy this property we need to renormalize the result.

On the other hand, considering the slides by Verkerke (indico.cern.ch link above), the morphing between two functions can also be interpreted as a simple interpolation problem.

In that sense there are multiple approaches to compute an intermediate step of the CDL distributions.

  1. visualize the CDL data as a 2D heatmap:
    • value of the logL variable is x
    • energy of each fluorescence line is y
  2. linear interpolation at a specific energy \(E\) based on the two neighboring CDL distributions (interpolation thus only along y axis)
  3. spline interpolation over all energy ranges
  4. KDE along the energy axis (only along y) Extend KDE to 2D?
  5. bicubic interpolation (-> problematic, because our energies/variables are not spread on a rectilinear grid (energy is not spread evenly)
  6. other distance based interpolations, i.e. KD-tree? Simply perform an interpolation based on all neighboring points in a certain distance?

Of these 2 and 4 seem to be the easiest implementations. KD-tree would also be easy, provided I finish the implementation finally.

We will investigate different ideas in ./../CastData/ExternCode/TimepixAnalysis/Tools/cdlMorphing/cdlMorphing.nim.

28.6.1.1. DONE Visualize CDL data as a 2D heatmap

In each of the following plots each distribution (= target filter combination) is normalized to 1.

First let's visualize all of the CDL data as a scatter plot. That is pretty simple and gives an idea where the lines are and what the shape is roughly, fig. 11.

cdl_as_scatter_lengthDivRmsTrans.svg
Figure 11: Figure 150: Scatter plot of the reference dataset for the length / transverse RMS logL variable, colored by the normalized counts for each distribution (each is normalized to 1). The distributions are drawn at the energy of the fluorescence lines.

Now we can check out what the data looks like if we interpret the whole (value of each variable, Energy) phase space as a tile map. In this way, morphing can be interpreted as performing interpolation along the energy axis in the resulting tile map.

Each figure contains in addition as colored lines the start of each energy range as currently used. So clusters will be with the distribution defined by the line below.

In addition the energy of each fluorescence lines is plotted in red at the corresponding energy value. This also shows that the intervals and the energy of the lines are highly asymmetric.

cdl_as_tile_eccentricity.svg
Figure 12: Figure 151: Tile map of the reference dataset for the eccentricity logL variable, colored by the normalized counts for each distribution (each is normalized to 1 along the x axis). Each tile covers the range from start to end interval. In addition in red the energy of each fluorescence line is plotted.
cdl_as_tile_lengthDivRmsTrans.svg
Figure 13: Figure 152: Tile map of the reference dataset for the length / transverse RMS logL variable, colored by the normalized counts for each distribution (each is normalized to 1 along the x axis). Each tile covers the range from start to end interval. In addition in red the energy of each fluorescence line is plotted.
cdl_as_tile_fractionInTransverseRms.svg
Figure 14: Figure 153: Tile map of the reference dataset for the fraction within transverse RMS logL variable, colored by the normalized counts for each distribution (each is normalized to 1 along the x axis). Each tile covers the range from start to end interval. In addition in red the energy of each fluorescence line is plotted.
28.6.1.2. DONE Morph by linear interpolation bin by bin

Based on the tile maps in the previous section it seems like a decent idea to perform a linear interpolation for any point in between two intervals:

where \(f_\text{low,high}\) are the distributions below / above the given energy \(E\). \(L_\text{low,high}\) corresponds to the energy of the fluorescence line corresponding to the distribution below / above \(E\). \(\Delta E\) is the difference in energy between the lower and higher fluorescence lines. \(x\) is the value of the given logL variable.

In code this is:

  let E = ... # given as argument to function
  let lineEnergies = getXrayFluorescenceLines()
  let refLowT = df.filter(f{string -> bool: `Dset` == refLow})["Hist", float]
  let refHighT = df.filter(f{string -> bool: `Dset` == refHigh})["Hist", float]
  result = zeros[float](refLowT.size.int)
  let deltaE = abs(lineEnergies[idx] - lineEnergies[idx+offset])
  # walk over each bin and compute linear interpolation between
  for i in 0 ..< refLowT.size:
    result[i] = refLowT[i] * (1 - (abs(lineEnergies[idx] - E)) / deltaE) +
      refHighT[i] * (1 - (abs(lineEnergies[idx+offset] - E)) / deltaE)

Since doing this for a point between two lines is not particularly helpful, because we do not know what the distribution in between does actually look like. Instead for validation we will now try to compute the Cu-EPIC-0.9kV distribution (corresponding to the \(\text{O K}_{α}\) line at \(\SI{0.525}{\kilo \electronvolt}\)) based on the C-EPIC-0.6kV and Cu-EPIC-2kV distributions. That means the interpolate the second ridge from the first and third in the CDL ridgeline plots.

This is shown in fig. 15, 16 and 17. The real data for each distribution is shown in red and the morphed linear bin-wise interpolation for the second ridge is shown in red.

eccentricity_ridgeline_XrayReferenceFile2018.h5_2018.svg
Figure 15: Figure 154: Interpolation of the Cu-EPIC-0.9kV distribution for the eccentricity logL variable using bin wise linear interpolation based on the C-EPIC-0.6kV and Cu-EPIC-2kV distributions. The real data is shown in the second ridge in red and the morphed interpolation is is shown in blue. The agreement is remarkable for the simplicity of the method.
lengthDivRmsTrans_ridgeline_XrayReferenceFile2018.h5_2018.svg
Figure 16: Figure 155: Interpolation of the Cu-EPIC-0.9kV distribution for the length / transverse RMS logL variable using bin wise linear interpolation based on the C-EPIC-0.6kV and Cu-EPIC-2kV distributions. The real data is shown in the second ridge in red and the morphed interpolation is is shown in blue. The agreement is remarkable for the simplicity of the method.
fractionInTransverseRms_ridgeline_XrayReferenceFile2018.h5_2018.svg
Figure 17: Figure 156: Interpolation of the Cu-EPIC-0.9kV distribution for the fraction in transverse RMS logL variable using bin wise linear interpolation based on the C-EPIC-0.6kV and Cu-EPIC-2kV distributions. The real data is shown in the second ridge in red and the morphed interpolation is is shown in blue. This in particular is the problematic variable, due to the integer nature of the data at low energies. However, even here the interpolation works extremely well.
28.6.1.3. DONE Compute all reference spectra from neighbors

Similar to the plots in the previous section we can now compute all reference spectra based on the next neighboring spectras.

This is done in fig. 18, 19, 20.

eccentricity_ridgeline_morph_all_XrayReferenceFile2018.h5_2018.svg
Figure 18: Figure 157: Interpolation of all reference distribution for the eccentricity logL variable using bin wise linear interpolation based on the neighboring distributions. The real data is shown in red while the morphed data is shown in cyan. In most ridges the agreement is very good.
lengthDivRmsTrans_ridgeline_morph_all_XrayReferenceFile2018.h5_2018.svg
Figure 19: Figure 158: Interpolation of all reference distribution for the length / transverse RMS logL variable using bin wise linear interpolation based on the neighboring distributions. The real data is shown in red while the morphed data is shown in cyan. In most ridges the agreement is very good.
fractionInTransverseRms_ridgeline_morph_all_XrayReferenceFile2018.h5_2018.svg
Figure 20: Figure 159: Interpolation of all reference distribution for the fraction in transverse RMS logL variable using bin wise linear interpolation based on the neighboring distributions. The real data is shown in red while the morphed data is shown in cyan. In most ridges the agreement is very good.
28.6.1.4. DONE Compute full linear interpolation between fluorescence lines

We can now apply the lessons from the last section to compute arbitrary reference spectra. We will use this to compute a heatmap of all possible energies in between the first and last fluorescence line.

For all three logL variables, these are shown in fig.

cdl_as_raster_interpolated_eccentricity.svg
Figure 21: Figure 160: Heatmap of a fully linear interpolated view of the energy / eccentricity phase space in between the first and last fluorescence line.
cdl_as_raster_interpolated_lengthDivRmsTrans.svg
Figure 22: Figure 161: Heatmap of a fully linear interpolated view of the energy / lenthDivRmsTrans phase space in between the first and last fluorescence line.
cdl_as_raster_interpolated_fractionInTransverseRms.svg
Figure 23: Figure 162: Heatmap of a fully linear interpolated view of the energy / fracRmsTrans phase space in between the first and last fluorescence line.

28.6.2. KDE approach    extended

Using a KDE is problematic, because our data is already pre-binned of course. This leads to a very sparse phase space, which either makes the local prediction around a known distribution good, but fails miserably in between them (small bandwidth) or gives decent predictions in between, but pretty bad reconstruction of the known distributions (larger bandwidth).

There is also a strong conflict in bandwidth selection, due to the non-linear steps in energy between the different CDL distributions. This leads to a too large / too small bandwidth at either end of the energy range.

Fig. 24, 25, 26 show the default bandwidth (Silverman's rule of thumb). In comparison Fig. 27, 28, 29 show the same plot using a much smaller custom bandwidth of 0.3 keV. The agreement is much better, but the actual prediction between the different distributions becomes much worse. Compare fig. 30 (default bandwidth) to fig. 31. The latter has regions of almost no counts, which is obviously wrong.

Note that also the fig. 30 is problematic. An effect of a bad KDE input is visible, namely that the bandwidth vs. number of datapoints is such that the center region (in energy) has higher values than the edges, due to the fact that predictions near the boundaries see no signal (this boundary effect could be corrected for by assuming a suitable boundary conditions, e.g. just extending the first/last distributions in the respective ranges. It is not clear however in what spacing such a distribution should be placed etc.

eccentricity_ridgeline_morph_kde.svg
Figure 24: Figure 163: Reconstruction of the different eccentricity CDL distributions using a KDE with the bin counts as weights with the automatically computed bandwidth using Silverman's rule of thumb (about 1.6 keV in this case). Bad match with real data for low energies (top ridges).
fractionInTransverseRms_ridgeline_morph_kde.svg
Figure 25: Figure 164: Reconstruction of the different fraction in transverse RMS CDL distributions using a KDE with the bin counts as weights with the automatically computed bandwidth using Silverman's rule of thumb (about 1.6 keV in this case). Bad match with real data for low energies (top ridges).
lengthDivRmsTrans_ridgeline_morph_kde.svg
Figure 26: Figure 165: Reconstruction of the different length / transverse RMS CDL distributions using a KDE with the bin counts as weights with the automatically computed bandwidth using Silverman's rule of thumb (about 1.6 keV in this case). Bad match with real data for low energies (top ridges).

\clearpage

eccentricity_ridgeline_morph_kde_small_bw.svg
Figure 27: Figure 166: Reconstruction of the different eccentricity CDL distributions using a KDE with the bin counts as weights with a custom bandwidth of 0.3 keV. Bad match with real data for low energies (top ridges).
fractionInTransverseRms_ridgeline_morph_kde_small_bw.svg
Figure 28: Figure 167: Reconstruction of the different fraction in transverse RMS CDL distributions using a KDE with the bin counts as weights with a custom bandwidth of 0.3 keV. Bad match with real data for low energies (top ridges).
lengthDivRmsTrans_ridgeline_morph_kde_small_bw.svg
Figure 29: Figure 168: Reconstruction of the different length / transverse RMS CDL distributions using a KDE with the bin counts as weights with a custom bandwidth of 0.3 keV. Bad match with real data for low energies (top ridges).

\clearpage

eccentricity_raster_kde.svg
Figure 30: Figure 169: Raster of the KDE interpolation for the eccentricity using the automatically determined bandwidth based on Silverman's rule of thumb (about 1.6 keV in this case). Boundary effects are visible due to apparent more activity near the center energies.
eccentricity_raster_kde_small_bw.svg
Figure 31: Figure 170: Raster of the KDE interpolation for the eccentricity using a custom bandwidth of 0.3 keV. Better agreement at the different CDL target energies at the expense of a reasonable prediction between the different regions.

28.6.3. Spline approach

Another idea is to use a spline interpolation. This has the advantage that the existing distributions will be correctly predicted (as for the linear interpolation), but possibly yields better results between distributions (or in the case of predicting a known distributions.

Fig. 32, 33, 34 show the prediction using a spline. Same as for the linear interpolation each morphed distribution was computed by excluding that distribution from the spline definition and then predicting the energy of the respective fluorescence line.

The result looks somewhat better in certain areas than the linear interpolation, but has unphysical artifacts in other areas (negative values) while also deviating quite a bit. For that reason it seems like simpler is better in case of CDL morphing (at least if it's done bin-wise).

eccentricity_ridgeline_morph_spline.svg
Figure 32: Figure 171: Reconstruction of the different eccentricity CDL distributions using a spline interpolation (by excluding each distribution that is being predicted). Prediction sometimes even yields negative values, highlighting the problems of a spline in certain use cases (unphysical results).
fractionInTransverseRms_ridgeline_morph_spline.svg
Figure 33: Figure 172: Reconstruction of the different fraction in transverse RMS CDL distributions using a spline interpolation (by excluding each distribution that is being predicted). Prediction sometimes even yields negative values, highlighting the problems of a spline in certain use cases (unphysical results).
lengthDivRmsTrans_ridgeline_morph_spline.svg
Figure 34: Figure 173: Reconstruction of the different length / transverse RMS CDL distributions using a spline interpolation (by excluding each distribution that is being predicted). Prediction sometimes even yields negative values, highlighting the problems of a spline in certain use cases (unphysical results).

\clearpage

28.6.4. Summary

For the time being we will use the linear interpolation method and see where this leads us. Should definitely be a big improvement over the current interval based option.

For the results of applying linear interpolation based morphing to the likelihood analysis see section 28.6.5.

28.6.5. Implementation in likelihood.nim

Thoughts on the implememntition in likelihood.nim or CDL morphing.

  1. Add interpolation code from cdlMorphing.nim in private/cdl_cuts.nim.
  2. Add a field to config.nim that describes the morphing technique to be used.
  3. Add an enum for the possible morphing techniques, MorphingKind with fields mkNone, mkLinear
  4. In calcCutValueTab we currently return a Table[string, float] mapping target/filter combinations to cut values. This needs to be modified such that we have something that hides away input -> output and yields what we need. Define an CutValueInterpolator type, which is returned instead. It will be a variant object with case kind: MorphingKind. This object will allow access to cut values based on:
    • string: a target/filter combination.
      • mkNone: access internal Table[string, float] as done currently
      • mkLinear: raise exception, since does not make sense
    • float: an energy in keV:
      • mkNone: convert energy to a target/filter combination and access internal Table
      • mkLinear: access the closest energy distribution and return its cut value
  5. in filterClustersByLogL replace cutTab name and access by energy of cluster instead of converted to target/filter dataset

With these steps we should have a working interpolation routine. The code used in the cdlMorphing.nim test script needs to be added of course to provide the linearly interpolated logic (see step 0).

28.6.5.1. Bizarre Al-Al 4kV behavior with mkLinear

After the first implementation we see some very bizarre behavior in the case of linear interpolation for the logL distributions.

This is both visible with the plotCdl.nim as well as plotting code in likelihood.nim.

See fig. 35.

cdl_logl_linear_bizarre_Al_Al_4kV.svg
Figure 35: Figure 174: LogL distribitions after implementing linear interpolation and running with mkLinear. The Al-Al 4kV line is noweher near where we expect it. The code currently recomputes the logL values by default, in which the mkLinear plays a role. The bug has to be somewhere in that part of the interpolation.

UPDATE: The issue was a couple of bugs & design choices in the implementation of the linear interpolation in likelihood_utils.nim. In particular about the design of the DF returned from getInterpolatedWideDf and a bug accessing not the sub DF, but the actual DF in the loop.

The fixed result is shown in fig. 36 and in comparison the result using no interpolation (the reference in a way) in fig. 37.

cdl_logl_linear_fixed.svg
Figure 36: Figure 175: LogL distribitions after implementing linear interpolation and running with mkLinear and after the above mentioned bug has been fixed. This is the same result as for mkNone, see fig. 37.
cdl_logl_no_interp.svg
Figure 37: Figure 176: LogL distribitions after implementing linear interpolation and running with mkLinear and after the above mentioned bug has been fixed. This is the same result as for mkNone, see fig. 37.
Click on any heading marked 'extended' to open it