# 7.3.3 Processing steps

The results of all-sky classification were obtained through the following steps.

1. Crossmatch of Gaia with literature to identify objects of known classes (Section 7.3.3).
2. Selection of catalogues to crossmatch and their prioritisation (in case of conflicting information on the same objects).
3. Filtering of sources not satisfying simple statistics (such as colour, magnitude, literature period, amplitude, skewness, and Abbe value computed on magnitudes sorted in time as well as in phase) that are typical of class membership, while allowing for a large range of possible distance, extinction, and reddening.
4. Resampling of sources for a more representative distribution in the sky, in the number of FoV transits, and in magnitude.
5. Pipeline run of the Statistics module on time series pre-processed as described in Section 7.2.3.
6. Generation and selection of classification attributes (Section 7.3.3).
7. Training of a multi-stage classifier with optimized parameters.
8. Application of the multi-stage classifier to the Gaia data.
9. Improvement of the training set (sources and attributes) including high-confidence classifications and iterating steps 3–6 (Section 7.3.3).
10. Training of the improved multi-stage classifier with optimized parameters (Section 7.3.3).
11. Pipeline run of the Statistics and the Classification modules on time series pre-processed as described in Section 7.2.3.
12. Training of contamination-cleaning classifiers and their application to the results of the previous step, for RR Lyrae stars, Cepheids, and SX Phoenicis/$\delta$ Scuti stars (Section 7.3.3).
13. Definition of classification scores of the published results (Section 7.3.3).
14. Assessment of completeness and contamination of the published results (Section 7.3.4).

## Classes

The training set included objects of the classes targeted for publication in Gaia DR2 (listed in bold) as well as other types to reduce the contamination of the published classification results. The full list of object classes, with labels (used in the rest of this section) and corresponding descriptions, follows below.

1. ACEP: Anomalous Cepheids.
2. ACV: $\alpha^{2}$ Canum Venaticorum-type stars.
3. ACYG: $\alpha$ Cygni-type stars.
4. ARRD: Anomalous double-mode RR Lyrae stars.
5. BCEP: $\beta$ Cephei-type stars.
6. BLAP: Blue large amplitude pulsators.
7. CEP: Classical ($\delta$) Cepheids.
8. CONSTANT: Objects whose variations (or absence thereof) are consistent with those of constant sources (Section 7.2.3).
9. CV: Cataclysmic variables of unspecified type.
10. DSCT: $\delta$ Scuti-type stars.
11. ECL: Eclipsing binary stars.
12. ELL: Rotating ellipsoidal variable stars (in close binary systems).
13. FLARES: Magnetically active stars displaying flares.
14. GCAS: $\gamma$ Cassiopeiae-type stars.
15. GDOR: $\gamma$ Doradus-type stars.
16. MIRA: Long period variable stars of the $o$ (omicron) Ceti type (Mira).
17. OSARG: OGLE small amplitude red giant variable stars.
18. QSO: Optically variable quasi-stellar extragalactic sources.
19. ROT: Rotation modulation in solar-like stars due to magnetic activity (spots).
20. RRAB: Fundamental-mode RR Lyrae stars.
21. RRC: First-overtone RR Lyrae stars.
22. RRD: Double-mode RR Lyrae stars.
23. RS: RS Canum Venaticorum-type stars.
24. SOLARLIKE: Stars with solar-like variability induced by magnetic activity (flares, spots, and rotational modulation).
25. SPB: Slowly pulsating B-type stars.
26. SXARI: SX Arietis-type stars.
27. SXPHE: SX Phoenicis-type stars.
28. SR: Long period variable stars of the semiregular type.
29. T2CEP: Type-II Cepheids.

## Crossmatch with literature

Training-set objects are selected from Gaia sources crossmatched with objects associated with known classes in the literature. In order to increase the reliability of crossmatch results, a set of metrics was used in the comparison of Gaia and literature sources, always including the angular separation, and whenever possible also the time-series median magnitude in the $G$ band, the $G_{\mathrm{BP}}-G_{\mathrm{RP}}$ colour, as well as time series quantities characterising the amplitude of variations in the $G$ band such as the range or standard deviation. Such metrics were combined in a multi-dimensional distance which was minimised in an iterative process in order to allow for the tuning of empirical relations between the Gaia and literature photometric quantities (affected in particular by the different bandwidth coverage and sensitivity). The best matches were projected onto planes for all combinations of crossmatch metrics to inspect the corresponding distributions and reduce the chance of mis-matches by applying thresholds to exclude dubious outliers and excessive tails of the distributions. Although this approach sacrificed completeness in some cases, it was considered appropriate for training purposes, given the large number of sources available.
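
As an illustration of how such metrics could be combined, the following sketch computes a dimensionless distance between a Gaia source and a literature object and picks the closest candidate. The normalisation scales, the dictionary keys, and the root-mean-square combination are assumptions made for illustration; they are not the values or the formula used in the Gaia processing. The literature photometry is assumed to be already transformed to the Gaia system with the empirical relations mentioned above.

```python
import math

# Hypothetical normalisation scales for each crossmatch metric (illustrative only).
SCALES = {"sep_arcsec": 1.0, "median_g": 0.5, "bp_rp": 0.3, "range_g": 0.3}


def crossmatch_distance(gaia, lit, sep_arcsec, scales=SCALES):
    """Combine the available metrics into one dimensionless distance.

    The angular separation is always used; a photometric term is added only
    when both the Gaia and the literature value are present.
    """
    terms = [sep_arcsec / scales["sep_arcsec"]]
    for key in ("median_g", "bp_rp", "range_g"):
        if gaia.get(key) is not None and lit.get(key) is not None:
            terms.append((gaia[key] - lit[key]) / scales[key])
    # Root mean square, so the number of available metrics does not bias the result.
    return math.sqrt(sum(t * t for t in terms) / len(terms))


def best_match(lit, candidates):
    """Return the (distance, gaia_source) pair with the smallest distance.

    `candidates` is a list of (gaia_source, separation_in_arcsec) tuples.
    """
    return min(
        ((crossmatch_distance(g, lit, sep), g) for g, sep in candidates),
        key=lambda pair: pair[0],
    )
```

In this sketch a candidate that is slightly farther away on the sky but photometrically consistent can be preferred over a nearer one with discrepant magnitude and colour, which is the purpose of combining the metrics.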

In order to sample as many regions of the sky as possible, cover most of the range of Gaia magnitudes, and include a large number of variability types, a wide variety of catalogues was selected from a larger set, following general reliability considerations, and prioritised in case of conflicting classifications for the same sources. The full list of catalogues employed in the training sets is presented in Table 7.1, including references and crossmatch metrics. Among the more than 750 thousand crossmatched objects available for training, only a small sample (of about 33 thousand sources) was vetted to train classifiers (Section 7.3.3), leaving many reliable crossmatches for the validation of results (Section 7.3.4).

## Classification attributes

About one hundred fifty attributes were computed to characterise sources with photometric (and some astrometric) time series features. Each classifier (described in Section 7.3.3) was tested with a varying number of attributes (e.g., Guyon and Elisseeff 2003) and a subset of 40 attributes represented the union of attributes used by all classifiers. The employed classification attributes are defined below, with units quoted in brackets after the attribute name (unless the attribute is dimensionless).

1. ABBE: The Abbe value (von Neumann 1941, 1942) computed from the magnitudes of FoV transits in the $G$ band.
2. BP_MINUS_RP_COLOUR (mag): The possibly reddened colour index from the median magnitudes in the $G_{\rm BP}$ and $G_{\rm RP}$ bands.
3. BP_MINUS_G_COLOUR (mag): The possibly reddened colour index from the median magnitudes in the $G_{\rm BP}$ and $G$ bands.
4. DENOISED_UNBIASED_UNWEIGHTED_KURTOSIS_MOMENT (mag$^{4}$): The sample-size unbiased and unweighted kurtosis central moment of FoV transit magnitudes in the $G$ band, denoised assuming Gaussian uncertainties (Rimoldini 2014).
5. DENOISED_UNBIASED_UNWEIGHTED_VARIANCE (mag$^{2}$): The sample-size unbiased and unweighted variance of FoV transit magnitudes in the $G$ band, denoised assuming Gaussian uncertainties (Rimoldini 2014).
6. DURATION (d): The duration of the time series from the first to the last FoV transit observation in the $G$ band.
7. G_MINUS_RP_COLOUR (mag): The possibly reddened colour index from the median magnitudes in the $G$ and $G_{\rm RP}$ bands.
8. G_VS_TIME_IQR_ABS_SLOPE (mag d$^{-1}$): The unweighted interquartile range of the absolute values of magnitude changes per unit time between successive FoV transits in the $G$ band.
9. G_VS_TIME_MAX_SLOPE (mag d$^{-1}$): The unweighted 95th percentile of magnitude changes per unit time between successive FoV transits in the $G$ band.
10. G_VS_TIME_MEDIAN_ABS_SLOPE (mag d$^{-1}$): The unweighted median of the absolute values of magnitude changes per unit time between successive FoV transits in the $G$ band.
11. IQR_BP (mag): The unweighted interquartile magnitude range of FoV transits in the $G_{\rm BP}$ band.
12. IQR_RP (mag): The unweighted interquartile magnitude range of FoV transits in the $G_{\rm RP}$ band.
13. LOG_QSO_VAR: The decadic logarithm of the reduced chi-square of FoV transit magnitudes in the $G$ band with respect to a parameterised quasar variance model, represented by $\log_{10}(\chi^{2}_{\mathrm{QSO}}/\nu)$ in Butler and Bloom (2011); see Rimoldini et al. (in preparation) for details on the parameter values for the Gaia data.
14. LOG_NONQSO_VAR: The decadic logarithm of the reduced chi-square of FoV transit magnitudes in the $G$ band not to follow a parameterised quasar variance model, represented by $\log_{10}(\chi^{2}_{\mathrm{False}}/\nu)$ in Butler and Bloom (2011); see Rimoldini et al. (in preparation) for details on the parameter values for the Gaia data.
15. MAD_G (mag): The unweighted median absolute deviation from the median magnitude of FoV transits in the $G$ band.
16. MAX_ABS_SLOPE_HALFDAY (mag d$^{-1}$): The maximum value of the magnitude ranges of FoV transits in the $G$ band within sliding windows of half a day, divided by the time span of the $G$-band observations within such sliding windows.
17. MEAN_G (mag): The unweighted arithmetic mean magnitude of FoV transits in the $G$ band.
18. MEAN_BP (mag): The unweighted arithmetic mean magnitude of FoV transits in the $G_{\rm BP}$ band.
19. MEAN_RP (mag): The unweighted arithmetic mean magnitude of FoV transits in the $G_{\rm RP}$ band.
20. MEDIAN_ABS_SLOPE_HALFDAY (mag d$^{-1}$): The unweighted median of the magnitude ranges of FoV transits in the $G$ band within sliding windows of half a day, divided by the time span of the $G$-band observations within such sliding windows.
21. MEDIAN_ABS_SLOPE_ONEDAY (mag d$^{-1}$): The unweighted median of the magnitude ranges of FoV transits in the $G$ band within sliding windows of one day, divided by the time span of the $G$-band observations within such sliding windows.
22. MEDIAN_G (mag): The unweighted median magnitude of FoV transits in the $G$ band.
23. MEDIAN_BP (mag): The unweighted median magnitude of FoV transits in the $G_{\rm BP}$ band.
24. MEDIAN_RANGE_HALFDAY_TO_ALL: The unweighted median of the magnitude ranges of FoV transits in the $G$ band within sliding windows of half a day, divided by the $G$-band magnitude range of the full time series.
25. MEDIAN_RP (mag): The unweighted median magnitude of FoV transits in the $G_{\rm RP}$ band.
26. NONQSO_PROB: A quantity distributed according to the null-hypothesis distribution of $\chi^{2}_{\mathrm{QSO}}$, given the data, for non-quasar objects, computed from a parameterised quasar variance model with magnitudes of FoV transits in the $G$ band, related to $P(\chi^{2}_{\mathrm{QSO}}|x,\mbox{not quasar})$ in Butler and Bloom (2011); see Rimoldini et al. (in preparation) for details on the parameter values for the Gaia data.
27. NORMALISED_CHI_SQUARE_EXCESS: The difference between the chi-square of FoV transit magnitudes in the $G$ band and the mean of the chi-square distribution expected for constant objects (i.e., the number of degrees of freedom), normalised by the standard deviation of the chi-square distribution of constant objects (i.e., the square root of twice the number of degrees of freedom).
28. OUTLIER_MEDIAN_G: The absolute difference between the most outlying FoV transit magnitude with respect to the median magnitude in the $G$ band, normalised by the uncertainty of the most outlying measurement.
29. PARALLAX (mas): The parallax value of the source derived from a preliminary astrometric solution (Section 7.2.2).
30. PROPER_MOTION (mas yr$^{-1}$): The proper motion of the source projected in the sky derived from a preliminary astrometric solution (Section 7.2.2).
31. PROPER_MOTION_ERROR_TO_VALUE_RATIO: The ratio between the estimated projected proper motion uncertainty and the projected proper motion value of the source, derived from a preliminary astrometric solution (Section 7.2.2).
32. RANGE_G (mag): The magnitude range of FoV transits in the $G$ band.
33. REDUCED_CHI2_G: The reduced chi-square of FoV transit magnitudes in the $G$ band.
34. SIGNAL_TO_NOISE_STDEV_OVER_RMSERR_G: The ratio between the sample-size biased unweighted standard deviation of FoV transit magnitudes in the $G$ band and the root-mean-square of their uncertainties.
35. SKEWNESS_G: The sample-size unbiased and unweighted skewness central moment of FoV transit magnitudes in the $G$ band, normalised by the third power of the unbiased unweighted standard deviation of the same time-series measurements.
36. SKEWNESS_PERCENTILE_5: A robust measure of the skewness of the magnitude distribution of FoV transits in the $G$ band, computed as $(P_{95}+P_{5}-2\,P_{50})/(P_{95}-P_{5})$ where $P_{n}$ is the $n$th unweighted percentile.
37. STETSON_G: The single-band Stetson variability index (Stetson 1996) computed from the magnitudes of FoV transits in the $G$ band, pairing observations within 0.1 days.
38. STETSON_G_BP: The double-band Stetson variability index (Stetson 1996) computed from the magnitudes of FoV transits in the $G$ and $G_{\rm BP}$ bands, pairing observations in different bands within 0.001 days.
39. TRIMMED_RANGE_G (mag): The magnitude range between the 5th and 95th unweighted percentiles of FoV transits in the $G$ band.
40. TRIMMED_RANGE_RP (mag): The magnitude range between the 5th and 95th unweighted percentiles of FoV transits in the $G_{\rm RP}$ band.
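
To make a few of these definitions concrete, the sketch below evaluates ABBE, MAD_G, SKEWNESS_PERCENTILE_5, and NORMALISED_CHI_SQUARE_EXCESS for a list of $G$-band magnitudes. Several details are assumptions rather than statements about the Gaia pipeline: the Abbe value follows the commonly used form $\frac{n}{2(n-1)}\sum_i (x_{i+1}-x_i)^2/\sum_i (x_i-\bar{x})^2$, percentiles use linear interpolation (one of several conventions), and the chi-square is taken about the inverse-variance weighted mean with $n-1$ degrees of freedom.

```python
import math
import statistics


def percentile(values, p):
    """Unweighted percentile with linear interpolation (assumed convention)."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def abbe(mags):
    """Abbe value of a time-sorted series (common definition, assumed here)."""
    n = len(mags)
    mean = statistics.mean(mags)
    num = sum((b - a) ** 2 for a, b in zip(mags, mags[1:]))
    den = sum((m - mean) ** 2 for m in mags)
    return n / (2.0 * (n - 1)) * num / den


def mad(mags):
    """Median absolute deviation from the median."""
    med = statistics.median(mags)
    return statistics.median(abs(m - med) for m in mags)


def skewness_percentile_5(mags):
    """(P95 + P5 - 2 P50) / (P95 - P5), as in the attribute definition."""
    p5, p50, p95 = (percentile(mags, p) for p in (5, 50, 95))
    return (p95 + p5 - 2.0 * p50) / (p95 - p5)


def normalised_chi_square_excess(mags, errs):
    """(chi2 - nu) / sqrt(2 nu), assuming nu = n - 1 and a weighted-mean model."""
    weights = [1.0 / e ** 2 for e in errs]
    wmean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(w * (m - wmean) ** 2 for w, m in zip(weights, mags))
    nu = len(mags) - 1
    return (chi2 - nu) / math.sqrt(2.0 * nu)
```

A smoothly trending series gives an Abbe value well below one, whereas uncorrelated noise gives a value close to one, which is why the attribute is useful both on time-sorted and on phase-sorted magnitudes.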

## Classification models

A hierarchical structure of Random Forest (Breiman 2001) classifiers identified objects in progressively more detailed (groups of) classes. For Gaia DR2, we focused on high-amplitude variable stars, so objects with negligible or low amplitude variations were first separated from the high amplitude ones, which were then split into the types and subtypes of interest by subsequent classifiers.

Every Random Forest classifier was configured with unlimited tree depth and with the minimum number of instances per class at the leaves set to one. Other configuration parameters (the number of trees nTree and the number of attributes mTry tested to best split the data at a given node of a tree), the training-set classes to identify (specified by the labels defined in Section 7.3.3), and the selected attributes (described in Section 7.3.3) are listed below for each classifier. Aggregations of types are denoted by connecting single type labels with an underscore (unless indicated otherwise in brackets).

1. Random Forest classifier configured with nTree=400 and mTry=10.
   1. Training set:
      1. 14 684 CONSTANT;
      2. 3885 LOW_AMPLITUDE_VARIABLE (ACV, ACYG, BCEP, low-amplitude DSCT_GDOR, ELL, FLARES, GCAS, GDOR, OSARG, ROT, SOLARLIKE, SPB, SXARI);
      3. 14 999 OTHER_VARIABLE (ACEP, ARRD, BLAP, CEP, CV, DSCT, ECL, MIRA, QSO, RRAB, RRC, RRD, RS, SR, SXPHE, T2CEP).
   2. Attributes: BP_MINUS_G_COLOUR, BP_MINUS_RP_COLOUR, DENOISED_UNBIASED_UNWEIGHTED_VARIANCE, DURATION, G_MINUS_RP_COLOUR, G_VS_TIME_MEDIAN_ABS_SLOPE, IQR_BP, IQR_RP, LOG_NONQSO_VAR, LOG_QSO_VAR, MAD_G, MEDIAN_ABS_SLOPE_ONEDAY, MEDIAN_BP, MEDIAN_G, MEDIAN_RP, NONQSO_PROB, NORMALISED_CHI_SQUARE_EXCESS, OUTLIER_MEDIAN_G, RANGE_G, REDUCED_CHI2_G, SIGNAL_TO_NOISE_STDEV_OVER_RMSERR_G, SKEWNESS_PERCENTILE_5, STETSON_G, STETSON_G_BP, and TRIMMED_RANGE_RP.
2. Random Forest classifier configured with nTree=321 and mTry=4 (not relevant to the classification results published in Gaia DR2, but still described for details on the objects of low-amplitude types employed).
   1. Training set:
      1. 363 ACV_ACYG_BCEP_GCAS_SPB_SXARI (combination of poorly represented low-amplitude objects characterized by multiperiodic, pulsating, rotating, or irregular light variations);
      2. 866 DSCT_GDOR_LOW_AMPLITUDE (DSCT, GDOR, and DSCT-GDOR hybrids with low amplitude variations);
      3. 397 ELL;
      4. 996 OSARG;
      5. 1247 SOLARLIKE_FLARES_ROT.
   2. Attributes: BP_MINUS_RP_COLOUR, DURATION, G_MINUS_RP_COLOUR, IQR_RP, LOG_QSO_VAR, MEAN_BP, MEAN_G, PARALLAX, and PROPER_MOTION.
3. Random Forest classifier configured with nTree=336 and mTry=3.
   1. Training set: 10 BLAP, 711 CEP_ACEP_T2CEP, 518 CV, 1326 DSCT_SXPHE, 3861 ECL, 1945 MIRA_SR, 1996 QSO, 4108 RRAB_RRC_RRD_ARRD, and 500 RS.
   2. Attributes: ABBE, BP_MINUS_RP_COLOUR, DENOISED_UNBIASED_UNWEIGHTED_VARIANCE, G_MINUS_RP_COLOUR, G_VS_TIME_MAX_SLOPE, MEAN_G, MEAN_RP, MEDIAN_ABS_SLOPE_ONEDAY, MEDIAN_RANGE_HALFDAY_TO_ALL, NORMALISED_CHI_SQUARE_EXCESS, PARALLAX, PROPER_MOTION, PROPER_MOTION_ERROR_TO_VALUE_RATIO, RANGE_G, and SKEWNESS_G.
4. Random Forest classifier configured with nTree=202 and mTry=3.
   1. Training set: 2922 RRAB, 969 RRC, 197 RRD, and 20 ARRD.
   2. Attributes: BP_MINUS_RP_COLOUR, DENOISED_UNBIASED_UNWEIGHTED_KURTOSIS_MOMENT, G_VS_TIME_IQR_ABS_SLOPE, G_VS_TIME_MAX_SLOPE, NORMALISED_CHI_SQUARE_EXCESS, STETSON_G, and TRIMMED_RANGE_G.
5. Random Forest classifier configured with nTree=135 and mTry=3.
   1. Training set: 99 ACEP, 455 CEP, and 157 T2CEP.
   2. Attributes: BP_MINUS_RP_COLOUR, DURATION, LOG_NONQSO_VAR, LOG_QSO_VAR, MAX_ABS_SLOPE_HALFDAY, MEAN_G, MEDIAN_ABS_SLOPE_HALFDAY, and MEDIAN_RP.
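
The way the five classifiers listed above chain together can be sketched as a routing table. The nTree and mTry values are those quoted in the list; the stage names, the routing inferred from the training-set labels, and the rule of following the most probable label at each stage are assumptions for illustration. The classifiers themselves are represented by arbitrary callables returning a probability per label.

```python
# Configuration values quoted in the text; stage names are hypothetical.
STAGES = {
    "top": {"n_tree": 400, "m_try": 10},
    "low_amplitude": {"n_tree": 321, "m_try": 4},
    "high_amplitude": {"n_tree": 336, "m_try": 3},
    "rr_lyrae": {"n_tree": 202, "m_try": 3},
    "cepheid": {"n_tree": 135, "m_try": 3},
}

# Routing inferred from the training-set labels (an assumption of this sketch).
ROUTES = {
    ("top", "LOW_AMPLITUDE_VARIABLE"): "low_amplitude",
    ("top", "OTHER_VARIABLE"): "high_amplitude",
    ("high_amplitude", "RRAB_RRC_RRD_ARRD"): "rr_lyrae",
    ("high_amplitude", "CEP_ACEP_T2CEP"): "cepheid",
}


def classify(source, classifiers, stage="top"):
    """Follow the most probable label down the hierarchy.

    `classifiers` maps a stage name to a callable that takes a source and
    returns a dict {label: probability}. The result is the list of
    (stage, label, probability) tuples visited on the way down.
    """
    path = []
    while stage is not None:
        probs = classifiers[stage](source)
        label = max(probs, key=probs.get)
        path.append((stage, label, probs[label]))
        # Stop when the label has no more detailed classifier.
        stage = ROUTES.get((stage, label))
    return path
```

If a library such as scikit-learn were used for the individual stages, nTree and mTry would correspond to the `n_estimators` and `max_features` parameters of `RandomForestClassifier`, with `max_depth=None` and `min_samples_leaf=1` approximating the unlimited depth and single-instance leaves described above. The software actually used is not stated in this section.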

## Semi-supervised classification

Semi-supervised classification was applied to constant objects, RR Lyrae stars, and long period variables, in order to improve their representation in the training set as follows.

1. High-confidence classifications of such classes were selected as candidate training sources.
2. Candidate training objects were filtered by the statistics mentioned in item 3 of Section 7.3.3, except for the literature period and the Abbe value computed on phase-sorted magnitudes (not available for results classified without period computation).
3. Filtered candidate training objects were selected to cover regions in the sky and/or magnitude intervals that lacked proper representation in the training set.
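
A minimal sketch of the first and last steps is given below: high-confidence candidates are added only to bins (in magnitude or sky position) where the training set is sparse. The probability threshold, the target number of sources per bin, the `prob` key, and the binning function are all hypothetical, and the statistical filtering of the second step is assumed to have been applied to `candidates` beforehand.

```python
from collections import Counter


def fill_underrepresented(training, candidates, bin_of,
                          min_prob=0.9, target_per_bin=2):
    """Select high-confidence candidates for sparsely populated bins.

    `bin_of` maps a source to a hashable bin identifier (for instance a
    magnitude interval or a sky pixel); thresholds are illustrative.
    """
    counts = Counter(bin_of(s) for s in training)
    selected = []
    # Consider the strongest candidates first.
    for cand in sorted(candidates, key=lambda c: c["prob"], reverse=True):
        if cand["prob"] < min_prob:
            break  # all remaining candidates are below the threshold
        b = bin_of(cand)
        if counts[b] < target_per_bin:
            selected.append(cand)
            counts[b] += 1
    return selected
```

With one-magnitude bins, for example, `bin_of` could be `lambda s: int(s["g"])`, so that candidates are accepted only at magnitudes where fewer than `target_per_bin` training sources exist.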

## Contamination cleaning

The contamination of preliminary classification results was reduced with the help of dedicated classifiers applied to RR Lyrae stars, Cepheids, and SX Phoenicis/$\delta$ Scuti stars, separately for each type, as follows.

1. Samples of true positives and false positives (according to crossmatched objects) were selected from the candidates of the previous classification stage.
2. Classification attributes were generated and selected.
3. A binary classifier of true positives versus false positives (in similar amounts) was trained and optimized.
4. The preliminary classification candidates (above some minimal level of classification probability depending on the type) were processed by the binary classifier (item 3) and objects classified as true positives with a minimum probability of 50 per cent were retained.
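
The final selection step can be sketched as a two-stage filter. The 50 per cent threshold is the one quoted above; the type-dependent minimum probabilities, the dictionary keys, and the representation of each binary classifier as a callable returning the true-positive probability are assumptions of this sketch.

```python
def clean_candidates(candidates, binary_classifiers, min_class_prob,
                     min_tp_prob=0.5):
    """Retain candidates that pass both probability thresholds.

    `candidates` are dicts with a "type" and a "class_prob" from the
    preliminary classification; `binary_classifiers` maps each type to a
    callable returning the probability of being a true positive;
    `min_class_prob` maps each type to its (hypothetical) minimum
    preliminary classification probability.
    """
    kept = []
    for cand in candidates:
        ctype = cand["type"]
        if cand["class_prob"] < min_class_prob[ctype]:
            continue  # too weak to be processed by the cleaning classifier
        if binary_classifiers[ctype](cand) >= min_tp_prob:
            kept.append(cand)
    return kept
```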

## Classification score

The results of the contamination-cleaning classifiers are associated with classification scores, which express the confidence of the classifier given the training set; such scores should therefore not be interpreted as true probabilities. The scores of Gaia DR2 classification results are obtained by linearly mapping the internal classifier probabilities to values within a range from zero to one (from the weakest to the strongest candidate), for each variability type.
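
A linear mapping of this kind can be sketched as a min-max rescaling per variability type. The sketch assumes that the endpoints are the lowest and highest probabilities among the retained candidates of each type, and it assigns a score of one when a type has a single probability value; both choices, and the dictionary keys, are illustrative assumptions.

```python
from collections import defaultdict


def classification_scores(results):
    """Rescale probabilities to [0, 1] separately for each type.

    `results` is a list of dicts with keys "type" and "prob". Scores are
    returned in the same order as the input.
    """
    by_type = defaultdict(list)
    for r in results:
        by_type[r["type"]].append(r["prob"])

    scores = []
    for r in results:
        lo, hi = min(by_type[r["type"]]), max(by_type[r["type"]])
        # Weakest candidate -> 0, strongest -> 1; degenerate case -> 1.
        scores.append((r["prob"] - lo) / (hi - lo) if hi > lo else 1.0)
    return scores
```

Because the rescaling is done per type, a score of 0.5 for one type does not correspond to the same internal probability as a score of 0.5 for another type.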