Final Cut Models¶
The final event selection for Lightning Tracks is defined by two multilayer perceptron (MLP) classifiers—one for Starting Lightning Tracks (SLT) and one for Throughgoing Lightning Tracks (TLT). Unlike the filter models, which identify events based purely on topological features, the final cut models incorporate prior knowledge about the expected signal and background distributions as a function of zenith angle and energy. This physics-informed approach enables the models to learn the realistic signal-to-background ratio across the sky, rather than treating all directions equally.
What is an MLP?¶
A multilayer perceptron is a type of feedforward neural network that maps input features to an output through a sequence of transformations. Mathematically, an MLP with \(L\) hidden layers computes:

\[
f(\mathbf{x}) = \varsigma\bigl(h_{L+1}\bigl(\sigma_L\bigl(h_L\bigl(\cdots \sigma_1\bigl(h_1(\mathbf{x})\bigr)\cdots\bigr)\bigr)\bigr)\bigr)
\]
where each layer \(h_\ell(\mathbf{z}) = \mathbf{W}_\ell \mathbf{z} + \mathbf{b}_\ell\) is an affine transformation with learnable weights \(\mathbf{W}_\ell\) and biases \(\mathbf{b}_\ell\), and \(\sigma_\ell\) is a nonlinear activation function (here, ReLU: \(\sigma(z) = \max(0, z)\)).
The key property of MLPs is their ability to approximate arbitrarily complex nonlinear functions. The depth (number of hidden layers) determines what kinds of decision boundaries the network can represent:
- A single hidden layer can approximate any continuous function, but may require exponentially many neurons.
- Deeper networks can represent hierarchical features more efficiently, capturing increasingly abstract patterns at each layer.
- The final cut models use 3–4 hidden layers with decreasing widths (e.g., 32 → 16 → 8 → 4), creating a funnel that progressively compresses information into a single output probability.
The final layer outputs a single value passed through a sigmoid function \(\varsigma(z) = 1/(1 + e^{-z})\), yielding a probability in \([0, 1]\) that the event is signal rather than background.
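As a concrete illustration, the forward pass above can be sketched in a few lines of NumPy. This is a toy example with randomly initialized weights using the SLT-like 11 → 16 → 8 → 4 → 1 funnel, not the production implementation:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Affine transform + ReLU for each hidden layer, sigmoid on the output."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)
    return sigmoid(weights[-1] @ z + biases[-1])

# Toy network with the SLT-like funnel: 11 inputs -> 16 -> 8 -> 4 -> 1 output
rng = np.random.default_rng(0)
dims = [11, 16, 8, 4, 1]
weights = [0.1 * rng.normal(size=(m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [np.zeros(m) for m in dims[1:]]
score = mlp_forward(rng.normal(size=11), weights, biases)[0]
```

Because the output passes through the sigmoid, `score` always lies strictly between 0 and 1, matching the probability interpretation above.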
Training Philosophy¶
Physics-Informed Weighting¶
The central idea behind the final cut models is to train them on data that reflects our prior knowledge of the signal and background distributions—informed by our flux model assumptions. Within each class, events are sampled according to their physics weights, so the model sees the physically motivated distribution shapes rather than the raw simulation statistics:
- Signal (NuGen): Neutrino simulation from the baseline Snowstorm production, weighted to match the expected flux models. For SLT, the weight combines the atmospheric and astrophysical neutrino fluxes (`weight = atmo + astro`), using the CORSIKA-fitted Gaisser H3a primary spectrum with SIBYLL 2.1 (via Nuflux) for the atmospheric component and a MESE-like power law for the astrophysical component. For TLT, only the astrophysical component is used (`weight = astro`). This teaches the model to favor upgoing events, where the astrophysical signal dominates, and, in the southern sky where atmospheric muons are overwhelming, to retain only the high-energy tail of the distribution where \(S/B \gg 1\) for hard astrophysical spectra (\(\gamma \approx 2\), compared to the atmospheric spectral index of \(\approx 3.7\)).
- Background (CORSIKA): Atmospheric muon simulation from the 20904 production, weighted using the `CorsikaWeight` column (derived from cosmic-ray flux models). This flux is strongly peaked toward the downgoing direction.
Importantly, the two classes are balanced 1:1 during training (see Rejection Sampling below)—the correct absolute normalization between signal and background is not used, as the model would otherwise see almost exclusively CORSIKA events. Instead, the model learns the relative distribution shapes within each class: how signal and background are distributed differently across zenith and energy. This is sufficient for the model to learn where in phase space signal is more prevalent relative to background and to calibrate its score accordingly.
Rejection Sampling¶
In practice, the physics-based weighting is implemented through rejection sampling during training. Each class (NuGen and CORSIKA) maintains its own weighted sampler that draws events with probability proportional to their physics weights. During each training step:
- A batch is assembled with 50% signal and 50% background events.
- Within each class, events are drawn according to their physics weights—events with higher weights (more representative of the expected flux) are sampled more frequently.
- The model sees a distribution that reflects what we expect in data, not the raw simulation statistics.
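The steps above can be sketched with NumPy's weighted `choice`. The weight arrays here are hypothetical stand-ins; the actual training code may instead use a framework sampler such as PyTorch's `WeightedRandomSampler`:

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_batch(sig_weights, bkg_weights, batch_size):
    """Assemble a 50/50 signal/background batch; within each class, event
    indices are drawn with replacement, proportional to physics weights."""
    half = batch_size // 2
    sig_idx = rng.choice(len(sig_weights), size=half, replace=True,
                         p=sig_weights / sig_weights.sum())
    bkg_idx = rng.choice(len(bkg_weights), size=half, replace=True,
                         p=bkg_weights / bkg_weights.sum())
    labels = np.concatenate([np.ones(half), np.zeros(half)])  # 1 = signal
    return sig_idx, bkg_idx, labels

# Stand-ins for NuGen and (much smaller) CORSIKA physics-weight arrays
sig_w = rng.exponential(size=10_000)
bkg_w = rng.exponential(size=200)
sig_idx, bkg_idx, labels = draw_batch(sig_w, bkg_w, 128)
```

Sampling with replacement is what makes the tiny CORSIKA sample usable at all: the same background events reappear across batches, weighted by their physics importance.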
This approach is necessary because the simulation samples have vastly different sizes (Table 1).
| Sample | SLT Events | TLT Events |
|---|---|---|
| NuGen (signal) | ~844,000 | ~3,870,000 |
| CORSIKA (background) | ~16,000 | ~1,690,000 |
Without weighted sampling, the model would see NuGen events far more frequently than their true occurrence rate, leading to poor generalization.
Epoch Definition¶
Unlike conventional machine learning where an “epoch” means one complete pass through the training data, the final cut models use a fixed-step epoch definition. Each epoch consists of a predetermined number of training batches (10 batches of 14,000 events = 140,000 samples per epoch), regardless of the total dataset size.
This definition arises naturally from the rejection sampling approach: since events are drawn with replacement according to their weights, there is no concept of “seeing all the data once.” Instead, training progress is measured by the number of gradient updates, and epochs serve as checkpointing intervals for monitoring convergence.
Model Architectures¶
Both models share a common structure but differ in input features and network depth to accommodate the distinct physics of starting versus throughgoing tracks.
Input Features¶
SLT model (11 input features):

| Feature | Description |
|---|---|
| \(Q_\text{total}\) | Homogenized total charge |
| LCSC starting score | Starting track CNN filter score |
| LCSC upgoing score | Upgoing track CNN filter score |
| RNN zenith | RNN reconstructed zenith angle |
| \(E_\text{reco}\) | MuEX energy estimate |
| RNN \(\sigma\) | RNN angular uncertainty |
| TNF zenith | TNF reconstructed zenith angle |
| TNF \(\sigma\) | TNF angular uncertainty |
| Reco separation | Angular separation between RNN and TNF |
| \(p_\text{miss}\) | STV miss probability |
| \(p_\text{miss}^\text{supp}\) | STV miss probability (stochastically suppressed) |
The STV miss probabilities encode information about whether light was observed in the veto region, which helps identify entering muons masquerading as contained neutrino events.
TLT model (9 input features):

| Feature | Description |
|---|---|
| \(Q_\text{total}\) | Homogenized total charge |
| LCSC upgoing score | Upgoing track CNN filter score |
| LT downgoing score | Downgoing-throughgoing MLP filter score |
| RNN zenith | RNN reconstructed zenith angle |
| \(E_\text{reco}\) | MuEX energy estimate |
| RNN \(\sigma\) | RNN angular uncertainty |
| TNF zenith | TNF reconstructed zenith angle |
| TNF \(\sigma\) | TNF angular uncertainty |
| Reco separation | Angular separation between RNN and TNF |
Throughgoing tracks do not use containment-based features since they originate outside the detector by definition.
Network Structure¶
| Property | SLT | TLT |
|---|---|---|
| Input dimension | 11 | 9 |
| Hidden layers | [16, 8, 4] | [32, 16, 8, 4] |
| Activation | ReLU | ReLU |
| Regularization | Dropout (0.2), L2 (\(\lambda = 0.02\)) | Dropout (0.2), L2 (\(\lambda = 0.01\)) |
The TLT model is deeper (4 hidden layers vs 3) to capture the more complex decision boundary needed for throughgoing track classification, where the distinction between astrophysical and atmospheric backgrounds relies more heavily on zenith-dependent rate expectations.
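Since the two architectures differ only in their layer lists, the weight-matrix shapes implied by the table can be generated by a shared helper (an illustrative sketch, not the production code):

```python
def layer_shapes(n_inputs, hidden):
    """Weight-matrix shapes for an MLP funnel ending in one sigmoid output."""
    dims = [n_inputs] + hidden + [1]
    return list(zip(dims[:-1], dims[1:]))

slt_shapes = layer_shapes(11, [16, 8, 4])     # SLT: 3 hidden layers
tlt_shapes = layer_shapes(9, [32, 16, 8, 4])  # TLT: 4 hidden layers
```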
Preprocessing¶
Two features undergo a logarithmic transformation before entering the network:

\[
Q_\text{total} \to \log_{10} Q_\text{total}, \qquad E_\text{reco} \to \log_{10} E_\text{reco}
\]
These transforms compress the dynamic range of charge and energy—which span several orders of magnitude—into a scale more amenable to gradient-based optimization. All features are then standardized using batch normalization, which learns the mean and variance from the training data and applies the same normalization at inference time.
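A sketch of this preprocessing chain follows. The column indices are illustrative, and plain dataset-level standardization stands in for the learned batch-normalization statistics:

```python
import numpy as np

def preprocess(x, log_cols, mean=None, std=None):
    """Log10-transform the wide-dynamic-range columns (charge, energy),
    then standardize. At training time the statistics are computed here;
    at inference time the stored mean/std are passed in and reused."""
    x = x.astype(float).copy()
    x[:, log_cols] = np.log10(x[:, log_cols])
    if mean is None:
        mean, std = x.mean(axis=0), x.std(axis=0)
    return (x - mean) / std, mean, std

# Toy feature matrix: column 0 = Q_total, column 1 = E_reco, column 2 = zenith
raw = np.array([[1e2, 1e3, 0.5],
                [1e4, 1e6, 2.1],
                [1e3, 1e4, 1.3]])
x_train, mu, sd = preprocess(raw, log_cols=[0, 1])
```

Reusing `mu` and `sd` at inference time mirrors how batch normalization applies its stored training statistics to new events.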
Training and Regularization¶
Loss Function¶
The models minimize binary cross-entropy with L2 regularization:

\[
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda \sum_\ell \lVert \mathbf{W}_\ell \rVert_2^2
\]
where \(y_i \in \{0, 1\}\) is the true label, \(\hat{y}_i\) is the predicted probability, and the L2 term penalizes large weights to prevent overfitting.
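A direct NumPy transcription of this loss (a sketch for illustration, not the training framework's implementation):

```python
import numpy as np

def bce_l2_loss(y_true, y_pred, weight_mats, lam, eps=1e-7):
    """Binary cross-entropy plus an L2 penalty on all weight matrices."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    bce = -np.mean(y_true * np.log(y_pred)
                   + (1.0 - y_true) * np.log(1.0 - y_pred))
    l2 = lam * sum(np.sum(W ** 2) for W in weight_mats)
    return bce + l2

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.2])
loss = bce_l2_loss(y_true, y_pred, weight_mats=[], lam=0.0)  # ~0.164
```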
Epoch Selection¶
Both models were trained for 500 epochs, but the final exported models use the checkpoint from epoch 100. This choice was deliberate: Figure 2 shows that while training loss continues to decrease beyond epoch 100, validation loss plateaus and becomes increasingly noisy. The growing gap between training and validation loss after this point indicates the onset of overfitting.
Figure 3 provides a closer look at the early training dynamics. Both models converge rapidly in the first 50 epochs, then continue to improve more gradually. By epoch 100, the validation loss has largely stabilized, making this a natural stopping point that balances model performance against overfitting risk.
| Model | Train Loss | Val Loss | Train Accuracy | Val Accuracy |
|---|---|---|---|---|
| SLT | 0.242 | 0.299 | 94.8% | 92.9% |
| TLT | 0.214 | 0.214 | 95.3% | 95.3% |
Overtraining Considerations¶
Comparing model output score distributions between training and test sets shows no serious signs of overtraining, except potentially for high-score CORSIKA events where statistics become sparse. The SLT training set contains only ~8,000 CORSIKA events (half of ~16,000 total), meaning these events were resampled many times during training, which could lead to some memorization of individual high-weight events.
For NuGen, there is no meaningful difference between training and test—the model generalizes well to unseen neutrino events. Given the large NuGen sample sizes, the models likely never saw most individual NuGen events more than once during training.
Warning
The first 50% of the CORSIKA sample (used for training) is effectively “burned” for further use in analyses that employ this selection, as the model has seen these events many times.
Score Interpretation¶
The model outputs are not used as hard cuts during processing. Instead, the continuous scores are stored for each event, and final event selection is performed at the analysis stage by applying thresholds. This keeps the processing pipeline maximally flexible: different physics analyses can tune thresholds independently or work with continuous event weights.
For the default Lightning Tracks configuration, events with scores above a tuned threshold (determined by sensitivity optimization) are included in the final sample. The threshold can be adjusted to trade off between sample purity and effective area depending on the analysis requirements.
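This purity-versus-efficiency trade-off can be illustrated with a toy threshold scan. The score distributions below are synthetic beta distributions, not the real selection; the actual threshold comes from sensitivity optimization:

```python
import numpy as np

def cut_metrics(sig_scores, bkg_scores, threshold):
    """Signal efficiency and sample purity for a given score cut."""
    n_sig = int(np.sum(sig_scores >= threshold))
    n_bkg = int(np.sum(bkg_scores >= threshold))
    efficiency = n_sig / len(sig_scores)
    purity = n_sig / (n_sig + n_bkg) if (n_sig + n_bkg) else 0.0
    return efficiency, purity

rng = np.random.default_rng(7)
sig = rng.beta(5, 1, size=1000)  # toy: signal scores pile up near 1
bkg = rng.beta(1, 5, size=1000)  # toy: background scores pile up near 0
eff, pur = cut_metrics(sig, bkg, threshold=0.5)
```

Raising the threshold increases purity at the cost of efficiency (and hence effective area), which is exactly the knob each analysis tunes independently.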