Gaia Data Release 3 Documentation

11.3 Apsis modules

11.3.14 Quasar Classifier (QSOC)

Author(s): Ludovic Delchambre


The Quasar Classifier (QSOC) module aims to determine the redshift, z, of the sources that are classified as quasars by the DSC module (see Section 11.3.2 for more details).


The determination of the redshift by QSOC is based on

  1. the internally calibrated BP/RP spectra as sampled by SMSgen (see Section 11.3.1);

  2. the DSC Combmod probability, used to select the sources to be processed, namely those having classprob_dsc_combmod_quasar ≥ 0.01;

  3. BP/RP rest-frame quasar templates coming from the weighted principal component analysis (Delchambre 2015) of continuum-subtracted spectra of quasars from the twelfth data release of the Sloan Digital Sky Survey Quasar Catalog (Pâris et al. 2017, hereafter DR12Q) and simulated through the BP/RP simulator (see Section 11.2.3). See Delchambre et al. (2022) for more details on the way these templates were built.


The determination of the redshift of quasars by QSOC is based on the fact that the redshift,

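$$z = \frac{\lambda_{\rm obs} - \lambda_{\rm rest}}{\lambda_{\rm rest}} = \frac{\lambda_{\rm obs}}{\lambda_{\rm rest}} - 1, \qquad (11.41)$$

turns into a simple offset once considered on a logarithmic wavelength scale,

$$Z = \ln(z + 1) = \ln\lambda_{\rm obs} - \ln\lambda_{\rm rest}, \qquad (11.42)$$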

where we assume that a given spectral feature at rest-frame wavelength $\lambda_{\rm rest}$ is observed at wavelength $\lambda_{\rm obs}$. Accordingly, given a set of rest-frame templates, $\mathbf{T}$, and an observation vector, $\mathbf{s}$, that are sampled on the same logarithmic wavelength scale, $\lambda_i = \lambda_0 L^i$, where $\lambda_0$ is a reference wavelength and $L$ is the logarithmic wavelength sampling we use (here $\ln L = 0.001$ or, equivalently, $L \approx 1.001$), the problem of finding the optimal shift, $k$, between $\mathbf{T}$ and $\mathbf{s}$ can be formulated as a $\chi^2$ minimization problem through

$$\chi^2(k) = \sum_i \frac{1}{\sigma_i^2} \left( s_i - \sum_j a_{j,k}\, T_{i+k,j} \right)^2 \qquad (11.43)$$

where $\sigma_i$ is the uncertainty on $s_i$ and the $a_{j,k}$ are the coefficients that fit $\mathbf{T}$ to $\mathbf{s}$ in a least-squares sense when a shift $k$ is applied to the templates. The redshift associated with the shift $k$ is then simply retrieved through $z = L^k - 1$. Sub-sampling precision is then obtained by fitting a quadratic curve in the vicinity of the selected shift. For reasons explained in Delchambre (2016) and in Delchambre et al. (2022), we compute a reversed and shifted version of this $\chi^2$,

$${\rm ccf}(k) = C - \chi^2(k), \qquad (11.44)$$

which we refer to as a cross-correlation function (CCF), where $C$ is a constant. Figure 11.73 illustrates the CCF of a quasar spectrum from the Sloan Digital Sky Survey (SDSS) against quasar templates from Delchambre (2018). Whereas no spectral features can be modelled at shifts associated with local minima, local maxima correspond to the fit of some of the template emission lines to the observed spectrum, while the global maximum usually corresponds to the most probable redshift. As the BP and RP spectra are distinct, we should also note that the effective CCF is actually the sum of two CCFs,

$${\rm ccf}(k) = {\rm ccf}_{\rm bp}(k) + {\rm ccf}_{\rm rp}(k), \qquad (11.45)$$

each associated with a given part of the spectrum.
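The shift search of Equations 11.43 and 11.44 can be sketched as follows. This is a minimal illustration and not the QSOC implementation: the function names are hypothetical, the constant C is arbitrarily taken as the chi-square of a model with all coefficients set to zero, a single spectrum is used instead of the BP and RP pair, and the wavelength grids are assumed to be arranged so that a shift k maps to z = L^k - 1.

```python
import numpy as np

LN_L = 0.001  # logarithmic wavelength sampling, ln L, as quoted in the text


def chi2_at_shift(s, sigma, T, k):
    """Weighted least-squares chi-square of Equation 11.43 for one shift k.

    s, sigma : observed fluxes and their uncertainties, shape (n,)
    T        : rest-frame templates, shape (m, n_templates), with m >= n + k
    """
    n = s.size
    Tk = T[k:k + n, :]                         # templates shifted by k samples
    A = Tk / sigma[:, None]                    # weight rows by 1 / sigma_i
    b = s / sigma
    a, *_ = np.linalg.lstsq(A, b, rcond=None)  # coefficients a_{j,k}
    r = b - A @ a
    return float(r @ r)


def ccf(s, sigma, T, shifts):
    """CCF of Equation 11.44 evaluated on a set of integer shifts."""
    C = float(np.sum((s / sigma) ** 2))        # assumption: one possible constant C
    return np.array([C - chi2_at_shift(s, sigma, T, k) for k in shifts])


def refine_peak(shifts, values):
    """Sub-sample peak position from a parabola through the maximum and its neighbours."""
    i = int(np.argmax(values))
    if i == 0 or i == len(values) - 1:
        return float(shifts[i])                # no neighbour on one side
    y0, y1, y2 = values[i - 1], values[i], values[i + 1]
    denom = y0 - 2.0 * y1 + y2
    offset = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom
    return float(shifts[i]) + offset           # assumes unit spacing between shifts


def shift_to_redshift(k):
    """z = L**k - 1, written with ln L for numerical convenience."""
    return float(np.expm1(k * LN_L))
```

With both spectra available, the same `ccf` function would be evaluated once per spectrum and the two results added, as in Equation 11.45.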

Figure 11.73: (Top) Illustration of the cross-correlation function associated with the SDSS J000313.08+274044.9 quasar standing at z=2.2329. (Bottom) Specific fits corresponding to: (a) a local maximum, (b) a local minimum and (c) the global maximum of the cross-correlation function. 15 emission line templates as well as 5 continuum templates out of the online material from Delchambre (2018) were used for the purpose of the present illustration.

The shift that is selected by QSOC is the one that is associated with the highest score, as defined by

$$S(k) = w_0 \times \left[\chi_r^2(k)\right]^p + w_1 \times \left[Z_{\rm score}(k)\right]^p, \qquad (11.46)$$

where $w_0 = 0.71413$, $w_1 = 0.28587$ and $p = 0.24365$ (see Delchambre et al. 2022 for details). In this last equation, $\chi_r^2(k)$ is the chi-square ratio, ccfratio_qsoc, defined as the ratio of the value of the CCF evaluated at $k$ to the maximum of the CCF,

$$\chi_r^2(k) = \frac{{\rm ccf}(k)}{\max_k({\rm ccf})} \qquad (11.47)$$

and $Z_{\rm score}(k)$, zscore_qsoc, is an indicator of the presence of quasar emission lines. A $Z_{\rm score}$ close to one indicates that all the emission lines that we expect at redshift $z$ are found in the spectra, while missing a single emission line often leads to a very low $Z_{\rm score}$ (see Delchambre et al. 2022 for details). In order to help the final user filter out potentially erroneous redshifts, we further define a binary processing flag, flags_qsoc, which is based on these quality indicators as well as on the number of spectral transits and the G magnitude of the source.
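As an illustration of Equations 11.46 and 11.47, the score can be evaluated on a handful of candidate shifts as sketched below. The function name and the input values are hypothetical, and the computation of the Zscore itself (described in Delchambre et al. 2022) is not reproduced: it is taken here as a given input in the range [0, 1].

```python
import numpy as np

W0, W1, P = 0.71413, 0.28587, 0.24365  # weights and exponent quoted in the text


def qsoc_score(ccf_values, zscores):
    """Return the score S(k) and the chi-square ratio for each candidate shift."""
    ccf_values = np.asarray(ccf_values, dtype=float)
    zscores = np.asarray(zscores, dtype=float)
    # Equation 11.47; clipping at zero is an assumption made so that the
    # fractional power below is always defined.
    chi2_ratio = np.clip(ccf_values / ccf_values.max(), 0.0, None)
    score = W0 * chi2_ratio ** P + W1 * np.clip(zscores, 0.0, None) ** P
    return score, chi2_ratio


# Hypothetical candidates: the second has the highest CCF but a very low Zscore.
score, ratio = qsoc_score([50.0, 100.0, 90.0], [0.90, 0.05, 0.95])
best = int(np.argmax(score))
```

In this made-up example the third candidate obtains the highest score even though the second one is the global maximum of the CCF, which shows how the Zscore term can override the chi-square ratio.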

Finally, the redshift reported by QSOC follows a log-normal distribution of mean $Z = \ln(z+1)$ and variance $\sigma_Z^2 = \sigma_k^2 \ln^2 L$, where $\sigma_k$ is the uncertainty on the selected shift $k$, such that its lower and upper confidence bounds, taken respectively as its 0.15866 and 0.84134 quantiles, are given by

$$z_{\rm low} = \exp(Z - \sigma_Z) - 1 \quad {\rm and} \quad z_{\rm up} = \exp(Z + \sigma_Z) - 1. \qquad (11.48)$$
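A short numerical sketch of Equation 11.48 is given below. The function name and its inputs (a sub-sample shift k and its uncertainty sigma_k) are hypothetical; the sketch only applies the relations z = L^k - 1 and sigma_Z = sigma_k ln L stated above.

```python
import math

LN_L = 0.001  # ln L, as quoted in the text


def redshift_interval(k, sigma_k, ln_L=LN_L):
    """Return (z, z_low, z_up) for a shift k with uncertainty sigma_k."""
    Z = k * ln_L              # Z = ln(z + 1), because z = L**k - 1
    sigma_Z = sigma_k * ln_L  # sigma_Z**2 = sigma_k**2 * ln(L)**2
    z = math.expm1(Z)
    z_low = math.expm1(Z - sigma_Z)   # 0.15866 quantile
    z_up = math.expm1(Z + sigma_Z)    # 0.84134 quantile
    return z, z_low, z_up


z, z_low, z_up = redshift_interval(k=1173.0, sigma_k=5.0)
```

The two quantiles are those of a Gaussian at minus and plus one standard deviation, so the interval is symmetric in ln(z + 1) but asymmetric in z.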

More information on the QSOC method and on the computation of the Zscore can be found in Delchambre et al. (2022).


The output of the QSOC module can be found in the following fields of the Gaia qso_candidates table:

  1. redshift_qsoc: The quasar redshift, z, from Equation 11.41

  2. redshift_qsoc_lower and redshift_qsoc_upper: The lower and upper confidence bounds, $z_{\rm low}$ and $z_{\rm up}$, corresponding to the 16% and 84% quantiles of z, respectively, as given by Equation 11.48

  3. ccfratio_qsoc: The chi-square ratio, $\chi_r^2$, from Equation 11.47

  4. zscore_qsoc: The $Z_{\rm score}$ from Equation 11.46

  5. flags_qsoc: The QSOC processing flags, zwarn.


QSOC produces redshift predictions in the Gaia qso_candidates table only for sources with a DSC Combmod probability classprob_dsc_combmod_quasar ≥ 0.01 that additionally have either

  • flags_qsoc ≤ 16, meaning that the BP/RP spectra are considered reliable (i.e. flag Z_BADSPEC is not set, such that flags_qsoc < 16) or that the processing of the source raises no warning flag even though the spectrum was considered unreliable (i.e. flags_qsoc = 16), or

  • flags_qsoc > 16 but added by other contributors to the Gaia QSO table, as indicated in the source_selection_flags field of the qso_candidates table.
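The publication rule above can be written as a small predicate. This is only a sketch of the stated criteria: the function and argument names are hypothetical, the probability threshold is read as a lower bound, and the `added_by_other_contributor` argument stands for whatever information is encoded in source_selection_flags, whose decoding is not described here.

```python
def has_published_redshift(classprob_quasar, flags_qsoc,
                           added_by_other_contributor=False):
    """Return True if a QSOC redshift is expected in qso_candidates."""
    # Assumption: the DSC Combmod probability must be at least 0.01.
    if classprob_quasar < 0.01:
        return False
    # Reliable spectrum (flags_qsoc < 16), or unreliable spectrum
    # without any other warning (flags_qsoc == 16).
    if flags_qsoc <= 16:
        return True
    # flags_qsoc > 16: kept only if the source was added by another contributor.
    return added_by_other_contributor


examples = [
    has_published_redshift(0.50, 0),
    has_published_redshift(0.50, 16),
    has_published_redshift(0.50, 17),
    has_published_redshift(0.50, 17, added_by_other_contributor=True),
    has_published_redshift(0.001, 0),
]
```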



We provide here a summary of the QSOC performance and refer the user to Delchambre et al. (2022) for a more detailed analysis of these results. QSOC performance is assessed by comparing the predicted redshifts against values from the literature. For this purpose, we cross-matched 6,375,063 sources having redshift estimates from QSOC with 790,776 quasars having spectroscopically confirmed redshifts in the Milliquas 7.2 catalogue of Flesch (2021) (i.e. type = 'Q' in Milliquas). The 1′′ search radius we used then allows us to extract 439,127 QSOC sources with literature redshifts. We should however stress that neither these redshifts nor the G magnitudes of the cross-matched sources follow realistic distributions, as they inherit the selection and observational biases that are present in both the Milliquas catalogue and in Gaia. Accordingly, the numbers reported here should be taken with caution. That being said, a direct comparison of these predictions shows that 279,850/439,127 = 63.73% of the sources have an absolute error on the predicted redshift, |Δz|, that is lower than 0.1, while this ratio rises to 89,107/91,320 = 97.58% if only flags_qsoc = 0 sources are considered.
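The success criterion used here, |Δz| < 0.1, and the two quoted percentages can be reproduced with a few lines. The helper name and the three example redshift pairs are hypothetical; the counts are those given in the text.

```python
import numpy as np


def fraction_within(z_pred, z_true, tol=0.1):
    """Fraction of sources whose absolute redshift error is below tol."""
    z_pred = np.asarray(z_pred, dtype=float)
    z_true = np.asarray(z_true, dtype=float)
    return float(np.mean(np.abs(z_pred - z_true) < tol))


# Made-up predictions: two of the three fall within 0.1 of the literature value.
demo = fraction_within([1.00, 2.05, 3.50], [1.02, 2.00, 3.00])

# Ratios quoted in the text, recomputed from the source counts.
all_sources = 100.0 * 279_850 / 439_127
clean_sources = 100.0 * 89_107 / 91_320
print(f"{all_sources:.2f}% {clean_sources:.2f}%")  # prints "63.73% 97.58%"
```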

Figure 11.74: Histogram of the logarithmic redshift error, $\Delta Z = \ln(z+1) - \ln(z_{\rm true}+1)$, between the QSOC redshift, $z$, and the literature redshift, $z_{\rm true}$, for 439,127 sources contained in the Milliquas 7.2 catalogue. Bin width is 0.01.

In Figure 11.74, we show the distribution of the logarithmic redshift error, $\Delta Z = \ln(z+1) - \ln(z_{\rm true}+1)$, between the QSOC redshift, $z$, and the literature redshift, $z_{\rm true}$, for the 439,127 sources we previously identified. In addition to the number of good predictions, the distribution of this logarithmic redshift error provides a direct visualisation of the mismatches between common quasar emission lines. We can see that most of the predictions have $\Delta Z \approx 0$ and are accordingly in good agreement with their literature values. The emission line mismatches mainly occur with respect to two specific emission lines, C iii] and Mg ii, while the most frequent mismatch occurs when the C iv emission line is misidentified as the Lyα emission line, for reasons explained in Delchambre et al. (2022). By requiring that flags_qsoc = 0, we can mitigate the effect of these emission line mismatches without affecting the central peak of correct predictions too much.

Figure 11.75: Distribution of the normalized logarithmic error on the QSOC redshift for sources with |Δz|<0.1 amongst 439,127 sources contained in the Milliquas 7.2 catalogue. Bin width is 0.1.

Considering now only the $|\Delta z| < 0.1$ predictions, so as to isolate the central peak of Figure 11.74, and computing the normalized logarithmic error as $\Delta Z/\sigma_Z$, where $\sigma_Z = [\ln(z_{\rm up}+1) - \ln(z_{\rm low}+1)]/2$, we can see that the distribution of $\Delta Z/\sigma_Z$ approximately follows a Gaussian distribution of median 0.00744 and standard deviation, extrapolated from the inter-quartile range, of 1.05335 (median 0.002163 and standard deviation 1.139733 if only flags_qsoc = 0 observations are considered). The normality of $\Delta Z/\sigma_Z$ is expected from Section 11.3.14, though large tails are present that come from systematics and from a smooth background of random predictions associated with very low signal-to-noise ratio (SNR) observations.
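The normalized logarithmic error and a standard deviation derived from the inter-quartile range can be computed as sketched below. The function names are hypothetical, and the conversion factor between inter-quartile range and standard deviation is the usual Gaussian one (about 1.349), which is an assumption since the text does not state the factor that was used.

```python
import numpy as np


def normalised_log_error(z, z_low, z_up, z_true):
    """Delta Z / sigma_Z, with sigma_Z taken as half the logarithmic interval width."""
    delta_Z = np.log1p(z) - np.log1p(z_true)
    sigma_Z = 0.5 * (np.log1p(z_up) - np.log1p(z_low))
    return delta_Z / sigma_Z


def robust_std(x):
    """Standard deviation extrapolated from the inter-quartile range.

    Assumption: Gaussian conversion, IQR = 2 * 0.6744898 * sigma.
    """
    q25, q75 = np.percentile(x, [25, 75])
    return (q75 - q25) / (2.0 * 0.6744898)
```

Because it relies on quartiles only, this estimate is largely insensitive to the heavy tails mentioned above, which would inflate an ordinary sample standard deviation.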

Figure 11.76: Fraction of observations having an absolute error on the predicted redshift, |Δz| < 0.1, for 439,127 sources contained in the Milliquas 7.2 catalogue (Top) and for those additionally having flags_qsoc = 0 or flags_qsoc = 16 (Bottom) with respect to G magnitude and literature (Left) or QSOC (Right) redshift. Fractions are computed over magnitude bins of width 0.1 and redshift bins of width 0.05.

Finally, in Figure 11.76, we plot the fraction of sources with $|\Delta z| < 0.1$ with respect to literature/QSOC redshift and Gaia G magnitude for all the 439,127 predictions found in the Milliquas catalogue and for those where we encounter no processing issues (i.e. flags_qsoc values 1–8 are not set, such that flags_qsoc = 0 or flags_qsoc = 16). This plot can be regarded as a way to evaluate the purity and completeness of the QSOC predictions, as it directly provides the fraction of $|\Delta z| < 0.1$ observations in terms of predicted redshift (i.e. purity) and in terms of 'true' redshift, taken here to be the literature redshift (i.e. completeness). We can note an overall decrease in performance as we go to fainter objects, which is an obvious consequence of the generally lower SNR of these objects. Regarding the QSOC completeness (left part of Figure 11.76), we can note two problematic regions, at $0.9 < z_{\rm true} < 1.3$ and around $z_{\rm true} \approx 2$. The first is due to the fact that only the Mg ii emission line is covered in the BP/RP spectra of $0.9 < z_{\rm true} < 1.3$ quasars, and the second to the misidentification of the C iv emission line as Lyα in $z_{\rm true} \approx 2$ quasars, as explained in Delchambre et al. (2022). The lower purity seen in quasars having QSOC redshift $z < 0.2$ or $z > 4$ comes from the rarity of these very low/high redshift quasars, such that any false predictions towards these sparsely populated regions weigh heavily on the final fraction of observations. Again, appropriate cuts on flags_qsoc allow these wrong predictions to be confined to limited regions of the G magnitude/redshift space, as seen by comparing the lower and upper panels of Figure 11.76.
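The binned fractions described above can be sketched with a two-dimensional histogram. The function name and the toy data are hypothetical; binning against the literature redshift gives a completeness-like map, and binning against the QSOC redshift gives a purity-like map.

```python
import numpy as np


def binned_fraction(g_mag, z_axis, good, g_edges, z_edges):
    """Fraction of 'good' predictions in each (G magnitude, redshift) bin.

    Bins that contain no source are returned as NaN.
    """
    g_mag = np.asarray(g_mag, dtype=float)
    z_axis = np.asarray(z_axis, dtype=float)
    good = np.asarray(good, dtype=bool)
    total, _, _ = np.histogram2d(g_mag, z_axis, bins=[g_edges, z_edges])
    hits, _, _ = np.histogram2d(g_mag[good], z_axis[good], bins=[g_edges, z_edges])
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total > 0, hits / total, np.nan)


# Toy data: three sources, one of which has a wrong predicted redshift.
g = np.array([17.05, 17.05, 18.05])
z_true = np.array([1.0, 1.0, 2.0])
z_pred = np.array([1.02, 1.5, 2.01])
good = np.abs(z_pred - z_true) < 0.1
g_edges, z_edges = [17.0, 18.0, 19.0], [0.5, 1.5, 2.5]

completeness = binned_fraction(g, z_true, good, g_edges, z_edges)
purity = binned_fraction(g, z_pred, good, g_edges, z_edges)
```

The bin widths quoted in the caption of Figure 11.76 (0.1 in magnitude and 0.05 in redshift) would be passed through `g_edges` and `z_edges`.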


A few points of attention should be kept in mind while using QSOC predictions:

  • The set of sources processed by QSOC aims to be complete rather than pure (i.e. we deliberately set a very low threshold on the DSC Combmod probability, classprob_dsc_combmod_quasar ≥ 0.01). As a consequence, we expect most of the processed sources to be stellar contaminants rather than genuine quasars, so users interested in purer samples should consider stricter selection rules such as classlabel_dsc_joint = 'quasar' or those described in Gaia Collaboration et al. (2022b, Section 8).

  • QSOC is, by construction, designed to process Type-I/core-dominated quasars with broad emission lines in the optical and accordingly yields only poor predictions for galaxies, type-II AGN and BL Lacertae/blazar objects, which would in any case have flags_qsoc ≠ 0.

  • The presence of the sole Mg ii emission line in the observed BP/RP spectra of quasars at 0.9 < z < 1.3 can lead to a higher rate of degeneracy between redshifts in this particular range. Also, the mismatch between the C iv and Lyα emission lines in z ≈ 2 quasars is a frequent case of misidentification. These degeneracies are more frequent amongst faint sources.

  • Because SMSgen, described in Section 11.3.1, does not provide covariance matrices on the integrated flux (see Creevey et al. 2022), the computed $\chi^2$ from Equation 11.43 is systematically underestimated and is consequently not published in Gaia DR3. The computed redshift and associated confidence bounds, $z_{\rm low}$ and $z_{\rm up}$ from Equation 11.48, though appropriately re-scaled, might sporadically suffer from this limitation.

  • Requiring that flags_qsoc = 0 may discard a large number of predictions, especially at faint magnitudes. For users interested in quasars fainter than G ≈ 19 mag, we hence suggest using a less stringent cut of the form flags_qsoc = 0 or flags_qsoc = 16, where we encounter no processing issue (i.e. flags 1–8 are not set) even when the BP/RP spectra are unreliable (i.e. flag 16 can be set).
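The two cuts discussed in the last point can be combined into a magnitude-dependent selection, sketched below. The function name, the default limit of 19 mag and the choice of applying the relaxed cut only to sources fainter than that limit are assumptions made for illustration. The bit mask assumes that the warnings other than the unreliable-spectrum flag (value 16) occupy the flag values 1, 2, 4 and 8.

```python
import numpy as np

OTHER_WARNINGS = 0b01111  # assumption: flag values 1, 2, 4 and 8


def select_reliable(flags_qsoc, g_mag, faint_limit=19.0):
    """Boolean mask: strict cut for bright sources, relaxed cut for faint ones."""
    flags = np.asarray(flags_qsoc, dtype=int)
    g_mag = np.asarray(g_mag, dtype=float)
    strict = flags == 0                       # no flag set at all
    relaxed = (flags & OTHER_WARNINGS) == 0   # flags_qsoc is 0 or 16
    return np.where(g_mag > faint_limit, relaxed, strict)


mask = select_reliable([0, 16, 0, 16, 3], [18.0, 18.0, 20.0, 20.0, 20.0])
```

For flags_qsoc values between 0 and 31, testing that none of the lower four bits is set is equivalent to requiring flags_qsoc = 0 or flags_qsoc = 16.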