Processed Datasets
This page summarizes all datasets processed for the Lightning Tracks selection and how each is used throughout this documentation. All processed datasets are defined via YAML files in config/datasets/, structured as follows:
- Experimental data (exp/): config/datasets/exp/
- NuGen simulation (nugen/): config/datasets/nugen/baseline/ and config/datasets/nugen/systematics/
- CORSIKA (corsika/): config/datasets/corsika/ (not yet used downstream in LT results)
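The directory layout above can be walked programmatically. The following is a minimal sketch, not part of the Lightning Tracks code base: it assumes the configs carry a .yaml or .yml extension (this page only says "YAML files"), and the helper name list_dataset_configs is hypothetical.

```python
from pathlib import Path


def list_dataset_configs(root="config/datasets"):
    """Group YAML config paths by their top-level category directory.

    Assumption: configs end in .yaml or .yml. The function name is
    hypothetical and only illustrates the layout described above.
    """
    root = Path(root)
    configs = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in {".yaml", ".yml"}:
            # First path component below root, e.g. exp, nugen, or corsika.
            category = path.relative_to(root).parts[0]
            configs.setdefault(category, []).append(path)
    return configs
```

Calling list_dataset_configs() from the repository root would return a dictionary keyed by category, with the baseline and systematics NuGen configs both grouped under nugen.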
Experimental Data
SLT refers to Starting Lightning Tracks, and TLT to Throughgoing Lightning Tracks. Table 1 lists each processed season (2011–2022) along with run counts, livetimes, and event statistics at both filter and final-selection level.
| Season | Runs | Livetime (days) | Sample | Filter events | Filter rate (mHz) | Final events | Final rate (mHz) |
|---|---|---|---|---|---|---|---|
| 2011 | 1101 | 332.71 | SLT | 591665 | 20.582 | 49792 | 1.732 |
| 2011 | 1101 | 332.71 | TLT | 1955185 | 68.015 | 174381 | 6.066 |
| 2012 | 1089 | 323.58 | SLT | 579779 | 20.738 | 48454 | 1.734 |
| 2012 | 1089 | 323.58 | TLT | 1967744 | 70.384 | 172762 | 6.183 |
| 2013 | 1189 | 341.42 | SLT | 607606 | 20.598 | 51421 | 1.743 |
| 2013 | 1189 | 341.42 | TLT | 2058149 | 69.772 | 181755 | 6.162 |
| 2014 | 1198 | 360.48 | SLT | 641979 | 20.612 | 54167 | 1.739 |
| 2014 | 1198 | 360.48 | TLT | 2156126 | 69.227 | 191028 | 6.133 |
| 2015 | 1176 | 361.87 | SLT | 641851 | 20.529 | 54274 | 1.736 |
| 2015 | 1176 | 361.87 | TLT | 2146815 | 68.665 | 192009 | 6.141 |
| 2016 | 1116 | 353.82 | SLT | 633230 | 20.714 | 53876 | 1.762 |
| 2016 | 1116 | 353.82 | TLT | 2119401 | 69.329 | 187941 | 6.148 |
| 2017 | 1305 | 406.45 | SLT | 721487 | 20.545 | 60431 | 1.721 |
| 2017 | 1305 | 406.45 | TLT | 2386987 | 67.972 | 214839 | 6.118 |
| 2018 | 1172 | 365.39 | SLT | 649314 | 20.568 | 54719 | 1.733 |
| 2018 | 1172 | 365.39 | TLT | 2165173 | 68.584 | 193568 | 6.131 |
| 2019 | 979 | 306.51 | SLT | 558449 | 21.087 | 46892 | 1.771 |
| 2019 | 979 | 306.51 | TLT | 1885796 | 71.209 | 164445 | 6.210 |
| 2020 | 1111 | 359.99 | SLT | 641663 | 20.630 | 53745 | 1.728 |
| 2020 | 1111 | 359.99 | TLT | 2113250 | 67.943 | 190030 | 6.110 |
| 2021 | 1340 | 427.74 | SLT | 756587 | 20.472 | 64498 | 1.745 |
| 2021 | 1340 | 427.74 | TLT | 2455923 | 66.454 | 223988 | 6.061 |
| 2022 | 1503 | 478.79 | SLT | 851663 | 20.588 | 72127 | 1.744 |
| 2022 | 1503 | 478.79 | TLT | 2788839 | 67.416 | 252463 | 6.103 |
| All | 14279 | 4418.76 | SLT | 7875273 | 20.628 | 664396 | 1.740 |
| All | 14279 | 4418.76 | TLT | 26199388 | 68.624 | 2339209 | 6.127 |
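The rates in Table 1 are consistent with dividing each event count by the season livetime. The sketch below shows that arithmetic, assuming the rate is a simple average over the quoted livetime; the function name rate_mhz is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def rate_mhz(n_events, livetime_days):
    """Average event rate in mHz for n_events collected over livetime_days."""
    livetime_s = livetime_days * SECONDS_PER_DAY
    return 1e3 * n_events / livetime_s


# 2011 season values taken from Table 1.
print(f"SLT filter level: {rate_mhz(591665, 332.71):.3f} mHz")
print(f"TLT final level:  {rate_mhz(174381, 332.71):.3f} mHz")
```

Because the tabulated livetimes are rounded to two decimals, recomputed rates may differ from the table in the last digit for some seasons.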
Simulation
NuGen
Baseline Production
Our primary baseline simulation is the Snowstorm NuGen production from Simulation Request 13C4, shown in Table 2. These samples were used to train the TNF reconstruction model and were additionally split 50/50 into training and validation sets for the Lightning Tracks final-cut models.
Because TNF was trained on these events, they are considered burned for physics studies, meaning they should not be used to generate Csky NumPy files for sensitivity or performance evaluations.
The validation half, which was held out from final-cut training, continues to serve as the baseline for Data–MC Agreement validation plots.
| Flavor | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|
| \(\nu_e\) | 22614 (E\(^{-1.5}\)) | 22613 (E\(^{-1.5}\)) | 22612 (E\(^{-1}\)) |
| \(\nu_\mu\) | 22646 (E\(^{-1.5}\)) | 22645 (E\(^{-1.5}\)) | 22644 (E\(^{-1}\)) |
| \(\nu_\tau\) | 22633 (E\(^{-1.5}\)) | 22634 (E\(^{-1.5}\)) | 22635 (E\(^{-1}\)) |
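One way to enforce the "burned" rule is a guard that rejects the Table 2 dataset IDs before any Csky NumPy files are produced. This is only a sketch under that assumption; the names BURNED_DATASET_IDS and check_not_burned are hypothetical and do not come from the Lightning Tracks code.

```python
# Dataset IDs from Table 2 (SimReq 13C4), all flavors and energy ranges.
BURNED_DATASET_IDS = frozenset({
    22612, 22613, 22614,  # nu_e
    22644, 22645, 22646,  # nu_mu
    22633, 22634, 22635,  # nu_tau
})


def check_not_burned(dataset_ids):
    """Return the IDs as a list, or raise ValueError if any are burned."""
    dataset_ids = list(dataset_ids)
    burned = sorted(set(dataset_ids) & BURNED_DATASET_IDS)
    if burned:
        raise ValueError(f"burned datasets requested: {burned}")
    return dataset_ids
```

A call such as check_not_burned([22645]) raises, while IDs outside Table 2 pass through unchanged.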
Systematics: Nominal Ensemble
Our systematics studies rely primarily on the Snowstorm ensemble sets from Simulation Request 13D7 (see Table 3). These are the nominal NuGen samples used to produce the Csky signal MC NumPy files, and therefore form the foundation of all Lightning Tracks performance results—including sensitivities, effective areas, PSF calibration, and Data–MC agreement.
| Flavor | 20–100 GeV | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|---|
| \(\nu_\mu\) | 22861 (E\(^{-1.5}\)) | 22852 (E\(^{-1.5}\)) | 22853 (E\(^{-1.5}\)) | 22854 (E\(^{-1}\)) |
| \(\nu_e\) | | 22855 (E\(^{-1.5}\)) | 22856 (E\(^{-1.5}\)) | 22857 (E\(^{-1}\)) |
| \(\nu_\tau\) | | 22858 (E\(^{-1.5}\)) | 22859 (E\(^{-1.5}\)) | 22860 (E\(^{-1}\)) |
Systematics: Hole Ice Variation
To evaluate hole-ice uncertainties, we processed an additional Snowstorm variant (SimReq 13CC), summarized in Table 4. These sets use unified hole-ice parameters (p0 = −0.27, p1 = −0.042) and serve as a controlled deviation from the nominal 13D7 ensemble.
| Flavor | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|
| \(\nu_\mu\) | 22684 (E\(^{-1.5}\)) | 22685 (E\(^{-1.5}\)) | 22686 (E\(^{-1}\)) |
| \(\nu_e\) | 22687 (E\(^{-1.5}\)) | 22688 (E\(^{-1.5}\)) | 22689 (E\(^{-1}\)) |
| \(\nu_\tau\) | 22690 (E\(^{-1.5}\)) | 22691 (E\(^{-1.5}\)) | 22692 (E\(^{-1}\)) |
Systematics: Historical / Legacy NuGen
We also make use of the legacy ESTES NuGen production from SimReq 13AB, shown in Table 5. This set is not a systematics ensemble by construction—it was originally generated as a baseline under SPICE 3.2.1 using an older processing chain. We treat it as a systematic variation to evaluate the impact of outdated simulation and reconstruction infrastructure. In particular, we used it to assess the effect of the Calibration Errata Bug and, more broadly, to understand how Lightning Tracks responds to “ancient” processing conditions.
| Flavor | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|
| \(\nu_\mu\) | 21813 (E\(^{-2}\)) | 21814 (E\(^{-1.5}\)) | 21938 (E\(^{-1}\)) |
| \(\nu_e\) | 21871 (E\(^{-1.5}\)) | | |
| \(\nu_\tau\) | 21867 (E\(^{-2}\)) | 21868 (E\(^{-1.5}\)) | 21939 (E\(^{-1}\)) |
CORSIKA
For CORSIKA, we rely almost exclusively on the 20904 production, at present the only sample with sufficient statistics for meaningful background modeling, especially in the low-energy regime. Although the set is somewhat old and was generated using the SPICE 3.2 ice model, it remains the only viable high-statistics muon-bundle simulation available to us.
The first half of 20904 (split by run number) was used to train all classifier models that depend on atmospheric muon background—namely the final-cut classifiers and the downgoing–throughgoing filter network. The second half is held out for validation and forms the dataset used in our Data–MC Agreement comparisons.
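A split by run number can be sketched as follows. The exact boundary used for 20904 is not stated here, so this assumes "first half" means the lower run numbers after sorting, with any odd run out going to the validation half; split_by_run is a hypothetical helper.

```python
def split_by_run(run_numbers):
    """Split unique run numbers into (training, validation) halves.

    Assumption: the training half is the lower run numbers after sorting.
    With an odd number of runs, the extra run goes to validation.
    """
    runs = sorted(set(run_numbers))
    mid = len(runs) // 2
    return runs[:mid], runs[mid:]


# Illustrative run numbers only, not the real 20904 run list.
train_runs, valid_runs = split_by_run(range(10))
```

Splitting on run number rather than on individual events keeps every event from a given run on the same side of the split, so the two halves share no runs.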