7.3.3 Processing steps

The results of all-sky classification were obtained through the following steps.

  1. Crossmatch of Gaia with literature to identify objects of known classes (Section 7.3.3).

  2. Selection of catalogues to crossmatch and their prioritisation (in case of conflicting information on the same objects).

  3. Filtering of sources not satisfying simple statistics (such as colour, magnitude, literature period, amplitude, skewness, and Abbe value computed on magnitudes sorted in time as well as in phase) that are typical of class membership, while allowing for a large range of possible distance, extinction, and reddening.

  4. Resampling of sources for a more representative distribution in the sky, in the number of FoV transits, and in magnitude.

  5. Pipeline run of the Statistics module on time series pre-processed as described in Section 7.2.3.

  6. Generation and selection of classification attributes (Section 7.3.3).

  7. Training of a multi-stage classifier with optimised parameters.

  8. Application of the multi-stage classifier to the Gaia data.

  9. Improvement of the training set (sources and attributes) including high-confidence classifications and iterating steps 3–6 (Section 7.3.3).

  10. Training of the improved multi-stage classifier with optimised parameters (Section 7.3.3).

  11. Pipeline run of the Statistics and the Classification modules on time series pre-processed as described in Section 7.2.3.

  12. Training of contamination-cleaning classifiers and their application to the results of the previous step, for RR Lyrae stars, Cepheids, and SX Phoenicis/δ Scuti stars (Section 7.3.3).

  13. Definition of classification scores of the published results (Section 7.3.3).

  14. Assessment of completeness and contamination of the published results (Section 7.3.4).

Classes

The training set included objects of the classes targeted for publication in Gaia DR2 (listed in bold) as well as other types to reduce the contamination of the published classification results. The full list of object classes, with labels (used in the rest of this section) and corresponding descriptions, follows below.

  1. ACEP: Anomalous Cepheids.

  2. ACV: α² Canum Venaticorum-type stars.

  3. ACYG: α Cygni-type stars.

  4. ARRD: Anomalous double-mode RR Lyrae stars.

  5. BCEP: β Cephei-type stars.

  6. BLAP: Blue large amplitude pulsators.

  7. CEP: Classical (δ) Cepheids.

  8. CONSTANT: Objects whose variations (or absence thereof) are consistent with those of constant sources (Section 7.2.3).

  9. CV: Cataclysmic variables of unspecified type.

  10. DSCT: δ Scuti-type stars.

  11. ECL: Eclipsing binary stars.

  12. ELL: Rotating ellipsoidal variable stars (in close binary systems).

  13. FLARES: Magnetically active stars displaying flares.

  14. GCAS: γ Cassiopeiae-type stars.

  15. GDOR: γ Doradus-type stars.

  16. MIRA: Long period variable stars of the o (omicron) Ceti type (Mira).

  17. OSARG: OGLE small amplitude red giant variable stars.

  18. QSO: Optically variable quasi-stellar extragalactic sources.

  19. ROT: Rotation modulation in solar-like stars due to magnetic activity (spots).

  20. RRAB: Fundamental-mode RR Lyrae stars.

  21. RRC: First-overtone RR Lyrae stars.

  22. RRD: Double-mode RR Lyrae stars.

  23. RS: RS Canum Venaticorum-type stars.

  24. SOLARLIKE: Stars with solar-like variability induced by magnetic activity (flares, spots, and rotational modulation).

  25. SPB: Slowly pulsating B-type stars.

  26. SXARI: SX Arietis-type stars.

  27. SXPHE: SX Phoenicis-type stars.

  28. SR: Long period variable stars of the semiregular type.

  29. T2CEP: Type-II Cepheids.

Crossmatch with literature

Training-set objects are selected from Gaia sources crossmatched with objects associated with known classes in the literature. In order to increase the reliability of crossmatch results, a set of metrics was used in the comparison of Gaia and literature sources, always including the angular separation, and whenever possible also the time-series median magnitude in the G band, the GBP-GRP colour, as well as time series quantities characterising the amplitude of variations in the G band such as the range or standard deviation. Such metrics were combined in a multi-dimensional distance which was minimised in an iterative process in order to allow for the tuning of empirical relations between the Gaia and literature photometric quantities (affected in particular by the different bandwidth coverage and sensitivity). The best matches were projected onto planes for all combinations of crossmatch metrics to inspect the corresponding distributions and reduce the chance of mis-matches by applying thresholds to exclude dubious outliers and excessive tails of the distributions. Although this approach sacrificed completeness in some cases, it was considered appropriate for training purposes, given the large number of sources available.
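
The combination of crossmatch metrics into a single distance can be illustrated with the minimal Python sketch below. It is not the procedure used in the Gaia processing: the per-metric scales, the dictionary keys, and the combination in quadrature are assumptions made for the example, and the literature photometry is assumed to have been already transformed to the Gaia system (the iterative tuning of the empirical relations is not shown).

```python
import math

# Hypothetical per-metric scales (assumptions for illustration only):
# each difference is divided by its scale so that metrics become comparable.
SCALES = {"sep_arcsec": 1.0, "mag": 0.5, "colour": 0.3, "range": 0.3}


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation (arcsec) between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable at small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600.0


def crossmatch_distance(gaia, lit):
    """Combine the available metrics in quadrature; missing ones are skipped."""
    sep = angular_separation_arcsec(gaia["ra"], gaia["dec"], lit["ra"], lit["dec"])
    terms = [sep / SCALES["sep_arcsec"]]  # the separation is always included
    for key in ("mag", "colour", "range"):
        if gaia.get(key) is not None and lit.get(key) is not None:
            terms.append((gaia[key] - lit[key]) / SCALES[key])
    return math.sqrt(sum(t * t for t in terms))


def best_match(gaia, candidates):
    """Return the literature candidate with the smallest combined distance."""
    return min(candidates, key=lambda lit: crossmatch_distance(gaia, lit))


gaia_source = {"ra": 10.0, "dec": -5.0, "mag": 15.2, "colour": 0.8}
literature = [
    {"id": "A", "ra": 10.0001, "dec": -5.0, "mag": 17.9},  # nearest, but too faint
    {"id": "B", "ra": 10.0002, "dec": -5.0, "mag": 15.3},  # farther, consistent magnitude
]
print(best_match(gaia_source, literature)["id"])  # B
```

In this toy case the nearest neighbour on the sky is rejected in favour of the candidate with a consistent magnitude, which is the kind of mis-match that the additional metrics are meant to prevent.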

To sample as many regions of the sky as possible, cover most of the range of Gaia magnitudes, and include a large number of variability types, a wide variety of catalogues was selected from a larger set, following general reliability considerations, and prioritised in case of conflicting classifications for the same sources. The full list of catalogues employed in the training sets is presented in Table 7.1, including references and crossmatch metrics. Among the more than 750 thousand crossmatched objects available for training, only a small sample (of about 33 thousand sources) was vetted to train classifiers (Section 7.3.3), leaving many reliable crossmatches for the validation of results (Section 7.3.4).
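
A minimal Python sketch of how conflicting classifications might be resolved by catalogue priority is given below. The catalogue names and their ranking are placeholders invented for the example; the actual priority order is not specified in this section.

```python
# Hypothetical priority table (an assumption for this example):
# a lower number means a more trusted catalogue.
CATALOGUE_PRIORITY = {"CAT_A": 0, "CAT_B": 1, "CAT_C": 2}


def resolve_class(labels):
    """Pick one class for a source from (catalogue, class_label) pairs.

    The label from the highest-priority catalogue wins; catalogues absent
    from the priority table are ranked last.
    """
    last = len(CATALOGUE_PRIORITY)
    _, label = min(labels, key=lambda pair: CATALOGUE_PRIORITY.get(pair[0], last))
    return label


# The same source is listed as RRC in one catalogue and RRAB in another.
print(resolve_class([("CAT_C", "RRC"), ("CAT_A", "RRAB")]))  # RRAB
```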

Table 7.1: Crossmatch of (mostly) variable objects from the literature selected for the training set. The Table includes names of surveys and/or variability types (specified by the labels defined in Section 7.3.3), references, and crossmatch metrics: angular separation (AS), time-series median G-band magnitude (M) and GBP-GRP colour (C), time-series G-band magnitude range (R) and standard deviation (SD).
Description | Reference | Crossmatch metrics
ASAS All-Star Catalog: solar-like stars | Messina et al. (2010a, 2011) | AS, M
ASAS variables in Kepler | Pigulski et al. (2009) | AS, M
BCEP stars | Stankov and Handler (2005) | AS, M
Catalina cataclysmic variables | Drake et al. (2014a) | AS
Catalina periodic variables | Drake et al. (2014b) | AS, M, R
Catalina RRab stars (paper I) | Drake et al. (2013a) | AS, M, R
Catalina RRab stars (paper II) | Drake et al. (2013b) | AS, M, R
Catalina RRab stars (SSS) | Torrealba et al. (2015) | AS, M, R
CoRoT Rotational Modulation | De Medeiros et al. (2013) | AS, M, C
DSCT and GDOR stars | Bradley et al. (2015); Sarro et al. (2013); Uytterhoeven et al. (2011) | AS, M
EROS-II Beat Cepheids | Marquette et al. (2009) | AS
Gaia DR1 (RR Lyrae & Cepheids) | Clementini et al. (2016) | AS, M
GDOR stars | Debosscher et al. (2007); Kahraman Aliçavuş et al. (2016) | AS, M, C
Hipparcos periodic variables and constants | ESA (1997); van Leeuwen (2007b) | AS, M, C, R
ICRF2 Quasars | Ma et al. (2009) | AS, M
Kepler Flares | Shibayama et al. (2013); Walkowicz et al. (2011); Wu et al. (2015) | AS, M
Kepler Rotational Modulation | Reinhold and Gizon (2015) | AS, M, C
LINEAR periodic variables | Palaversa et al. (2013) | AS, M, SD
M37 Flares | Chang et al. (2015b) | AS
NSVS Red variables | Woźniak et al. (2004) | AS, M, R
NSVS RRab stars | Kinemuchi et al. (2006) | AS, M, R
OGLE-IV Blue large amplitude pulsators | Pietrukowicz et al. (2017) | AS, M
OGLE-IV Cataclysmic variables | Mróz et al. (2015) | AS
OGLE-IV Cepheids and RR Lyrae (LMC, SMC) | Soszyński et al. (2015b, d, 2016b) | AS, M, C, R
OGLE-IV Eclipsing binaries (bulge) | Soszyński et al. (2016a) | AS, M, C, R
OGLE-IV Eclipsing binaries (LMC, SMC) | Pawlak et al. (2016) | AS, M, C, R
OGLE-IV GSEP constant candidates | Soszyński et al. (2012) (1) | AS, M, C, SD
OGLE-IV GSEP variables | Soszyński et al. (2012) | AS, M
OGLE-IV RR Lyrae stars (bulge) | Soszyński et al. (2014) | AS, M, C, R
OGLE-IV Short period binaries | Soszyński et al. (2015a) | AS, M, C, R
Pan-STARRS1 RR Lyrae stars | Sesar et al. (2017) | AS, M
Rotational Modulation | Stauffer et al. (2007); Collier Cameron et al. (2009); Hartman et al. (2009); Meibom et al. (2009); Messina et al. (2010b); Delorme et al. (2011); Meibom et al. (2011a, b); Moraux et al. (2013); Kovács et al. (2014); Meibom et al. (2015); Chang et al. (2015a); Barnes et al. (2015); Douglas et al. (2016); Covey et al. (2016) | AS
RR Lyrae in ω Centauri globular cluster | Braga et al. (2016) | AS, M
RR Lyrae in M3 | Benkő et al. (2006) | AS, M
RR Lyrae in M15 | Corwin et al. (2008) | AS, M
RR Lyrae in ultra-faint dwarf spheroidals | Dall’Ora et al. (2006); Siegel (2006); Kuehn et al. (2008); Greco et al. (2008); Watkins et al. (2009); Moretti et al. (2009); Musella et al. (2009, 2012); Clementini et al. (2012); Dall’Ora et al. (2012); Boettcher et al. (2013); Garofalo et al. (2013); Sesar et al. (2014); Vivas et al. (2016) | AS, M
SDSS DSCT and RR Lyrae stars | Süveges et al. (2012) | AS, M, C
SDSS-PS1-Catalina RR Lyrae stars | Abbas et al. (2014) | AS, M
SDSS Standard stars | Ivezić et al. (2007) | AS, M, C
Solar-like activity in the Pleiades | Hartman et al. (2010) | AS, M
SPB and BCEP stars | Selected by Peter De Cat (2) | AS, M
SPB stars | Niemczura (2003) | AS, M

(1) Selection of the least varying sources at ftp://ftp.astrouw.edu.pl/ogle/ogle4/GSEP/maps/.
(2) Selection of P. De Cat available at http://www.ster.kuleuven.ac.be/~peter/Bstars/.

Classification attributes

About 150 attributes were computed to characterise sources with photometric (and some astrometric) time series features. Each classifier (described in Section 7.3.3) was tested with a varying number of attributes (e.g., Guyon and Elisseeff 2003); the union of the attributes used by all classifiers amounts to a subset of 40 attributes. The employed classification attributes are defined below, with units quoted in brackets after the attribute name (unless the attribute is dimensionless).

  1. ABBE: The Abbe value (von Neumann 1941, 1942) computed from the magnitudes of FoV transits in the G band.

  2. BP_MINUS_RP_COLOUR (mag): The possibly reddened colour index from the median magnitudes in the GBP and GRP bands.

  3. BP_MINUS_G_COLOUR (mag): The possibly reddened colour index from the median magnitudes in the GBP and G bands.

  4. DENOISED_UNBIASED_UNWEIGHTED_KURTOSIS_MOMENT (mag⁴): The sample-size unbiased and unweighted kurtosis central moment of FoV transit magnitudes in the G band, denoised assuming Gaussian uncertainties (Rimoldini 2014).

  5. DENOISED_UNBIASED_UNWEIGHTED_VARIANCE (mag²): The sample-size unbiased and unweighted variance of FoV transit magnitudes in the G band, denoised assuming Gaussian uncertainties (Rimoldini 2014).

  6. DURATION (d): The duration of the time series from the first to the last FoV transit observation in the G band.

  7. G_MINUS_RP_COLOUR (mag): The possibly reddened colour index from the median magnitudes in the G and GRP bands.

  8. G_VS_TIME_IQR_ABS_SLOPE (mag d⁻¹): The unweighted interquartile range of the absolute values of magnitude changes per unit time between successive FoV transits in the G band.

  9. G_VS_TIME_MAX_SLOPE (mag d⁻¹): The unweighted 95th percentile of magnitude changes per unit time between successive FoV transits in the G band.

  10. G_VS_TIME_MEDIAN_ABS_SLOPE (mag d⁻¹): The unweighted median of the absolute values of magnitude changes per unit time between successive FoV transits in the G band.

  11. IQR_BP (mag): The unweighted interquartile magnitude range of FoV transits in the GBP band.

  12. IQR_RP (mag): The unweighted interquartile magnitude range of FoV transits in the GRP band.

  13. LOG_QSO_VAR: The decadic logarithm of the reduced chi-square of FoV transit magnitudes in the G band with respect to a parameterised quasar variance model, represented by log₁₀(χ²_QSO/ν) in Butler and Bloom (2011).

  14. LOG_NONQSO_VAR: The decadic logarithm of the reduced chi-square of FoV transit magnitudes in the G band computed under the hypothesis that the source does not follow a parameterised quasar variance model, represented by log₁₀(χ²_False/ν) in Butler and Bloom (2011).

  15. MAD_G (mag): The unweighted median absolute deviation from the median magnitude of FoV transits in the G band.

  16. MAX_ABS_SLOPE_HALFDAY (mag d⁻¹): The maximum value of the magnitude ranges of FoV transits in the G band within sliding windows of half a day, divided by the time span of the G-band observations within such sliding windows.

  17. MEAN_G (mag): The unweighted arithmetic mean magnitude of FoV transits in the G band.

  18. MEAN_BP (mag): The unweighted arithmetic mean magnitude of FoV transits in the GBP band.

  19. MEAN_RP (mag): The unweighted arithmetic mean magnitude of FoV transits in the GRP band.

  20. MEDIAN_ABS_SLOPE_HALFDAY (mag d⁻¹): The unweighted median of the magnitude ranges of FoV transits in the G band within sliding windows of half a day, divided by the time span of the G-band observations within such sliding windows.

  21. MEDIAN_ABS_SLOPE_ONEDAY (mag d⁻¹): The unweighted median of the magnitude ranges of FoV transits in the G band within sliding windows of one day, divided by the time span of the G-band observations within such sliding windows.

  22. MEDIAN_G (mag): The unweighted median magnitude of FoV transits in the G band.

  23. MEDIAN_BP (mag): The unweighted median magnitude of FoV transits in the GBP band.

  24. MEDIAN_RANGE_HALFDAY_TO_ALL: The unweighted median of the magnitude ranges of FoV transits in the G band within sliding windows of half a day, divided by the G-band magnitude range of the full time series.

  25. MEDIAN_RP (mag): The unweighted median magnitude of FoV transits in the GRP band.

  26. NONQSO_PROB: A quantity distributed according to the null-hypothesis distribution of χ²_QSO, given the data, for non-quasar objects, computed from a parameterised quasar variance model with magnitudes of FoV transits in the G band, related to P(χ²_QSO | x, not quasar) in Butler and Bloom (2011).

  27. NORMALISED_CHI_SQUARE_EXCESS: The difference between the chi-square of FoV transit magnitudes in the G band and the mean of the chi-square distribution expected for constant objects (i.e., the number of degrees of freedom), normalised by the standard deviation of the chi-square distribution of constant objects (i.e., the square root of twice the number of degrees of freedom).

  28. OUTLIER_MEDIAN_G: The absolute difference between the most outlying FoV transit magnitude and the median magnitude in the G band, normalised by the uncertainty of the most outlying measurement.

  29. PARALLAX (mas): The parallax value of the source derived from a preliminary astrometric solution (Section 7.2.2).

  30. PROPER_MOTION (mas yr⁻¹): The proper motion of the source projected in the sky derived from a preliminary astrometric solution (Section 7.2.2).

  31. PROPER_MOTION_ERROR_TO_VALUE_RATIO: The ratio between the estimated projected proper motion uncertainty and the projected proper motion value of the source, derived from a preliminary astrometric solution (Section 7.2.2).

  32. RANGE_G (mag): The magnitude range of FoV transits in the G band.

  33. REDUCED_CHI2_G: The reduced chi-square of FoV transit magnitudes in the G band.

  34. SIGNAL_TO_NOISE_STDEV_OVER_RMSERR_G: The ratio between the sample-size biased unweighted standard deviation of FoV transit magnitudes in the G band and the root-mean-square of their uncertainties.

  35. SKEWNESS_G: The sample-size unbiased and unweighted skewness central moment of FoV transit magnitudes in the G band, normalised by the third power of the unbiased unweighted standard deviation of the same time-series measurements.

  36. SKEWNESS_PERCENTILE_5: A robust measure of the skewness of the magnitude distribution of FoV transits in the G band, computed as (P_95 + P_5 − 2 P_50)/(P_95 − P_5), where P_n is the nth unweighted percentile.

  37. STETSON_G: The single-band Stetson variability index (Stetson 1996) computed from the magnitudes of FoV transits in the G band, pairing observations within 0.1 days.

  38. STETSON_G_BP: The double-band Stetson variability index (Stetson 1996) computed from the magnitudes of FoV transits in the G and GBP bands, pairing observations in different bands within 0.001 days.

  39. TRIMMED_RANGE_G (mag): The magnitude range between the 5th and 95th unweighted percentiles of FoV transits in the G band.

  40. TRIMMED_RANGE_RP (mag): The magnitude range between the 5th and 95th unweighted percentiles of FoV transits in the GRP band.
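
A few of the attributes defined above can be sketched in Python as follows. This is an illustration rather than the pipeline code: the function names are invented for the example, the percentile convention (linear interpolation) and the use of the inverse-variance weighted mean as the reference level of the chi-square are assumptions, and the Abbe value is written in its commonly quoted form, n/(2(n−1)) times the ratio of the summed squared successive differences to the summed squared deviations from the mean.

```python
import math
import statistics


def percentile(values, q):
    """Unweighted q-th percentile (0-100), linear interpolation (assumed convention)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def abbe_value(mags):
    """Abbe value of time-sorted magnitudes: close to 1 for uncorrelated
    noise, close to 0 for smooth trends."""
    n = len(mags)
    mean = statistics.fmean(mags)
    successive = sum((b - a) ** 2 for a, b in zip(mags, mags[1:]))
    scatter = sum((m - mean) ** 2 for m in mags)
    return n / (2.0 * (n - 1)) * successive / scatter


def mad(mags):
    """Median absolute deviation from the median (cf. MAD_G)."""
    med = statistics.median(mags)
    return statistics.median([abs(m - med) for m in mags])


def trimmed_range(mags):
    """Range between the 5th and 95th percentiles (cf. TRIMMED_RANGE_G)."""
    return percentile(mags, 95) - percentile(mags, 5)


def skewness_percentile_5(mags):
    """Robust skewness (P95 + P5 - 2 P50) / (P95 - P5)."""
    p5, p50, p95 = (percentile(mags, q) for q in (5, 50, 95))
    return (p95 + p5 - 2.0 * p50) / (p95 - p5)


def normalised_chi_square_excess(mags, errs):
    """(chi2 - dof) / sqrt(2 dof), with chi2 taken about the weighted mean (assumption)."""
    weights = [1.0 / e ** 2 for e in errs]
    wmean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - wmean) / e) ** 2 for m, e in zip(mags, errs))
    dof = len(mags) - 1
    return (chi2 - dof) / math.sqrt(2.0 * dof)
```

For a monotonic series such as 1, 2, 3, 4, 5 mag the Abbe value is 0.25, well below the value of about one expected for uncorrelated noise, which is why this statistic helps to separate smooth variability from scatter.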

Classification models

A hierarchical structure of Random Forest (Breiman 2001) classifiers identified objects in progressively more detailed (groups of) classes. For Gaia DR2, we focused on high-amplitude variable stars, so objects with negligible or low-amplitude variations were first separated from the high-amplitude ones, which were then split into the types and subtypes of interest by subsequent classifiers.

Every Random Forest classifier was configured with unlimited depth and with the minimum number of instances per class at the leaves set to one. Other configuration parameters (the number of trees nTree and the number of attributes mTry tested to best split the data at a given node of a tree), the training-set classes to identify (specified by the labels defined in Section 7.3.3), and the selected attributes (described in Section 7.3.3) are listed below for each classifier. Aggregations of types are denoted by connecting single type labels with an underscore (unless indicated otherwise in brackets).

  1. Random Forest classifier configured with nTree=400 and mTry=10.

    (a) Training set:

      i. 14 684 CONSTANT;

      ii. 3885 LOW_AMPLITUDE_VARIABLE (ACV, ACYG, BCEP, low-amplitude DSCT_GDOR, ELL, FLARES, GCAS, GDOR, OSARG, ROT, SOLARLIKE, SPB, SXARI);

      iii. 14 999 OTHER_VARIABLE (ACEP, ARRD, BLAP, CEP, CV, DSCT, ECL, MIRA, QSO, RRAB, RRC, RRD, RS, SR, SXPHE, T2CEP).

    (b) Attributes: BP_MINUS_G_COLOUR, BP_MINUS_RP_COLOUR, DENOISED_UNBIASED_UNWEIGHTED_VARIANCE, DURATION, G_MINUS_RP_COLOUR, G_VS_TIME_MEDIAN_ABS_SLOPE, IQR_BP, IQR_RP, LOG_NONQSO_VAR, LOG_QSO_VAR, MAD_G, MEDIAN_ABS_SLOPE_ONEDAY, MEDIAN_BP, MEDIAN_G, MEDIAN_RP, NONQSO_PROB, NORMALISED_CHI_SQUARE_EXCESS, OUTLIER_MEDIAN_G, RANGE_G, REDUCED_CHI2_G, SIGNAL_TO_NOISE_STDEV_OVER_RMSERR_G, SKEWNESS_PERCENTILE_5, STETSON_G, STETSON_G_BP, and TRIMMED_RANGE_RP.

  2. Random Forest classifier configured with nTree=321 and mTry=4 (not relevant to the classification results published in Gaia DR2, but still described for details on the objects of low-amplitude types employed).

    (a) Training set:

      i. 363 ACV_ACYG_BCEP_GCAS_SPB_SXARI (combination of poorly represented low-amplitude objects characterised by multiperiodic, pulsating, rotating, or irregular light variations);

      ii. 866 DSCT_GDOR_LOW_AMPLITUDE (DSCT, GDOR, and DSCT-GDOR hybrids with low-amplitude variations);

      iii. 397 ELL;

      iv. 996 OSARG;

      v. 1247 SOLARLIKE_FLARES_ROT.

    (b) Attributes: BP_MINUS_RP_COLOUR, DURATION, G_MINUS_RP_COLOUR, IQR_RP, LOG_QSO_VAR, MEAN_BP, MEAN_G, PARALLAX, and PROPER_MOTION.

  3. Random Forest classifier configured with nTree=336 and mTry=3.

    (a) Training set: 10 BLAP, 711 CEP_ACEP_T2CEP, 518 CV, 1326 DSCT_SXPHE, 3861 ECL, 1945 MIRA_SR, 1996 QSO, 4108 RRAB_RRC_RRD_ARRD, and 500 RS.

    (b) Attributes: ABBE, BP_MINUS_RP_COLOUR, DENOISED_UNBIASED_UNWEIGHTED_VARIANCE, G_MINUS_RP_COLOUR, G_VS_TIME_MAX_SLOPE, MEAN_G, MEAN_RP, MEDIAN_ABS_SLOPE_ONEDAY, MEDIAN_RANGE_HALFDAY_TO_ALL, NORMALISED_CHI_SQUARE_EXCESS, PARALLAX, PROPER_MOTION, PROPER_MOTION_ERROR_TO_VALUE_RATIO, RANGE_G, and SKEWNESS_G.

  4. Random Forest classifier configured with nTree=202 and mTry=3.

    (a) Training set: 2922 RRAB, 969 RRC, 197 RRD, and 20 ARRD.

    (b) Attributes: BP_MINUS_RP_COLOUR, DENOISED_UNBIASED_UNWEIGHTED_KURTOSIS_MOMENT, G_VS_TIME_IQR_ABS_SLOPE, G_VS_TIME_MAX_SLOPE, NORMALISED_CHI_SQUARE_EXCESS, STETSON_G, and TRIMMED_RANGE_G.

  5. Random Forest classifier configured with nTree=135 and mTry=3.

    (a) Training set: 99 ACEP, 455 CEP, and 157 T2CEP.

    (b) Attributes: BP_MINUS_RP_COLOUR, DURATION, LOG_NONQSO_VAR, LOG_QSO_VAR, MAX_ABS_SLOPE_HALFDAY, MEAN_G, MEDIAN_ABS_SLOPE_HALFDAY, and MEDIAN_RP.
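
The multi-stage structure can be pictured with the following Python sketch. The nTree and mTry values are transcribed from the five classifiers described in this subsection, whereas the routing between stages (which output label feeds which classifier) is inferred here from the training-set labels and is therefore an assumption, as is the interface of the stand-in models.

```python
# Stage configuration: nTree and mTry as listed for each classifier; the
# parent label that routes a source to a stage is an inference from the
# training-set labels (assumption). Stage 1 is the root of the hierarchy.
STAGES = {
    1: {"n_tree": 400, "m_try": 10, "parent_label": None},
    2: {"n_tree": 321, "m_try": 4, "parent_label": "LOW_AMPLITUDE_VARIABLE"},
    3: {"n_tree": 336, "m_try": 3, "parent_label": "OTHER_VARIABLE"},
    4: {"n_tree": 202, "m_try": 3, "parent_label": "RRAB_RRC_RRD_ARRD"},
    5: {"n_tree": 135, "m_try": 3, "parent_label": "CEP_ACEP_T2CEP"},
}


def classify(source, models):
    """Walk down the hierarchy until a label has no dedicated sub-classifier.

    `models` maps a stage number to a callable returning a label for `source`
    (a stand-in for a trained Random Forest; hypothetical interface).
    Returns the list of labels assigned along the way.
    """
    children = {cfg["parent_label"]: stage for stage, cfg in STAGES.items()}
    stage, path = children[None], []
    while stage is not None:
        label = models[stage](source)
        path.append(label)
        stage = children.get(label)  # None when the label is a final class
    return path


# Stub models that always return the same label, for illustration only.
stub_models = {
    1: lambda s: "OTHER_VARIABLE",
    3: lambda s: "RRAB_RRC_RRD_ARRD",
    4: lambda s: "RRC",
}
print(classify({}, stub_models))  # ['OTHER_VARIABLE', 'RRAB_RRC_RRD_ARRD', 'RRC']
```

If a stage were reproduced with scikit-learn (an assumption, since the software used is not named here), nTree and mTry would correspond approximately to the n_estimators and max_features arguments of RandomForestClassifier, with max_depth=None and min_samples_leaf=1.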

Semi-supervised classification

Semi-supervised classification was applied to constant objects, RR Lyrae stars, and long period variables, in order to improve their representation in the training set as follows.

  1. High-confidence classifications of such classes were selected as candidate training sources.

  2. Candidate training objects were filtered by the statistics mentioned in item 3 of Section 7.3.3, except for the literature period and the Abbe value computed on phase-sorted magnitudes (not available for results classified without period computation).

  3. Filtered candidate training objects were selected to cover regions in the sky and/or magnitude intervals that lacked proper representation in the training set.
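
The three steps above can be sketched in Python as follows. The score threshold, the target number of sources per bin, the binning in sky position and magnitude, and the dictionary keys are all hypothetical choices made for the example; the values actually adopted are not given in this section.

```python
from collections import Counter


def select_new_training_sources(candidates, training_counts, min_score=0.9,
                                target_per_bin=100,
                                passes_filters=lambda cand: True):
    """Add high-confidence candidates to under-represented (sky, magnitude) bins.

    candidates: dicts with 'score', 'sky_bin', and 'mag_bin' keys (hypothetical).
    training_counts: Counter of (sky_bin, mag_bin) already in the training set.
    passes_filters: stand-in for the statistical filters of step 2.
    """
    counts = Counter(training_counts)  # copy, so the input is left unchanged
    selected = []
    # Strongest candidates first, so each bin is filled with the most reliable ones.
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if cand["score"] < min_score:
            break  # step 1: only high-confidence classifications
        if not passes_filters(cand):
            continue  # step 2: class-typical statistics
        key = (cand["sky_bin"], cand["mag_bin"])
        if counts[key] < target_per_bin:  # step 3: fill poorly covered bins
            counts[key] += 1
            selected.append(cand)
    return selected
```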

Contamination cleaning

The contamination of preliminary classification results was reduced with the help of dedicated classifiers applied to RR Lyrae stars, Cepheids, and SX Phoenicis/δ Scuti stars, separately for each type, as follows.

  1. Samples of true positives and false positives (according to crossmatched objects) were selected from the candidates of the previous classification stage.

  2. Classification attributes were generated and selected.

  3. A binary classifier of true positives versus false positives (in similar amounts) was trained and optimised.

  4. The preliminary classification candidates (above some minimal level of classification probability depending on the type) were processed by the binary classifier (item 3) and objects classified as true positives with a minimum probability of 50 per cent were retained.
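
The final selection (item 4) amounts to two successive cuts, as in the Python sketch below. Only the 50 per cent threshold comes from the text; the per-type preliminary thresholds, the dictionary keys, and the callable standing in for the trained binary classifier are assumptions made for illustration.

```python
def clean_candidates(candidates, min_prelim_prob, true_positive_prob,
                     min_tp_prob=0.5):
    """Keep candidates that pass both the preliminary and the cleaning stage.

    candidates: dicts with 'type' and 'prob' (preliminary probability).
    min_prelim_prob: per-type minimum preliminary probability (hypothetical values).
    true_positive_prob: callable returning the binary classifier's
        true-positive probability for a candidate (stand-in for a trained model).
    """
    kept = []
    for cand in candidates:
        if cand["prob"] < min_prelim_prob[cand["type"]]:
            continue  # below the type-dependent preliminary threshold
        if true_positive_prob(cand) >= min_tp_prob:
            kept.append(cand)  # classified as a true positive
    return kept
```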

Classification score

The results of the contamination-cleaning classifiers are associated with classification scores, which express the confidence of the classifier given the training set; such scores should therefore not be interpreted as true probabilities. The scores of Gaia DR2 classification results are obtained by linearly mapping the internal classifier probabilities to values within a range from zero to one (from the weakest to the strongest candidate), for each variability type.
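
One simple way to realise such a linear mapping is a min-max rescaling per variability type, sketched below in Python. Taking the end points of the mapping from the weakest and strongest retained candidates is an assumption of this example; the section does not specify how they were chosen.

```python
def classification_scores(probabilities):
    """Linearly rescale the probabilities of one variability type to [0, 1].

    The weakest candidate maps to 0 and the strongest to 1 (assumed end points).
    """
    lo, hi = min(probabilities), max(probabilities)
    if hi == lo:
        # Degenerate case with identical probabilities: arbitrary choice of 1.
        return [1.0 for _ in probabilities]
    return [(p - lo) / (hi - lo) for p in probabilities]


print(classification_scores([0.5, 0.75, 1.0]))  # [0.0, 0.5, 1.0]
```

Because the rescaling is applied separately to each type, equal scores for candidates of different types do not imply equal classifier probabilities.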