# 2.4.9 Crossmatch processing

Author(s): Javier Castañeda

The crossmatch (XM) provides the link between Gaia detections and entries in the Gaia working catalogue. It consists of a single source link for each detection, and consequently a list of linked detections for each source. When a detection has more than one source candidate fulfilling the match criterion, in principle only one is linked, the principal match, while the others are registered as ambiguous matches.

A first, preliminary crossmatching pre-processing is done on a daily basis, in IDT, initially mainly to bootstrap downstream DPAC systems during the first months of the mission, and later predominantly to process the most recent data before it reaches cyclic pre-processing in IDU. By definition, such daily crossmatching cannot be completely accurate, as some data will typically arrive with a delay of some hours or even days to IDT.

On the other hand, the final crossmatching (also for the present release) is executed by IDU over the complete set of accumulated data. This provides better consistency since having all of the data available allows a more efficient resolution of dense sky regions, multiple stars, high proper motion sources, and other complex cases. Additionally, in the cyclic processing, the crossmatch is revised using improvements in the working catalogue, in the calibrations, and in the removal of spurious detections (see Section 2.4.9).

Some of the crossmatching algorithms and tasks are nearly identical in the daily and cyclic executions, but the most important ones are only executed in the final crossmatching done by IDU.

For the cyclic executions of the crossmatch, the data volume is small. However, the number of detections at the end of the mission will reach $\sim 10^{11}$ records. Ideally, the crossmatch would handle all these detections in a single process, but this is clearly not an efficient approach, especially when deploying the software on a computer cluster. The solution is to arrange the detections by a spatial index, such as HEALPix (Górski et al. 2005), and then distribute and treat the arranged groups of detections separately. However, this solution presents some disadvantages:

• Complicated treatment of detections close to the boundaries of the adopted spatial arrangement.

• Complicated handling of detections of high proper motion stars which cannot be easily bounded to any fixed region.

• Repeated accessing to time-based data, such as attitude and geometric calibration, from spatially distributed jobs.

These issues could, in principle, be solved but would introduce more complexity into the software. Therefore, another procedure that is better adapted to Gaia operations has been developed. This procedure splits the crossmatch task into three different steps.

Detection Processor

In the first step (described in Section 2.4.9 and Section 2.4.9), the input observations are processed in time order to compute the detection sky coordinates and obtain the preliminary source candidates for each individual detection.

Sky Partitioner

In the second step (described in Section 2.4.9), results from the previous step are grouped according to the source candidates provided for each individual detection. The objective is to determine isolated groups of detections, all located in a rather small and confined sky region, which are related to each other through their source candidates. This step does not perform any scientific processing but provides an efficient spatial data arrangement by solving region boundary issues and high proper motion scenarios. It therefore acts as a bridge between the time-based and the final, spatial-based processing.

Match Resolver

In this final step (described in Section 2.4.9), the crossmatch is resolved and the final data products are produced. This step is ultimately a spatial-based processing where all detections from a given isolated sky region are treated together, thus taking into account all observations of the sources within that region from the different scans.

In the following subsections, we describe the main processing steps and algorithms involved in the crossmatching, focusing on the cyclic (final) case.

## Sky coordinates determination

The images detected on board, in the real-time analysis of the sky mapper data, are propagated to their expected transit positions in the first strip of astrometric CCDs, AF1, i.e., their transit time and AC column are extrapolated and expressed as a reference acquisition pixel. This pixel is the key to all further on-board operations and to the identification of the transit. For consistency, the crossmatch does not use any image analysis other than the on-board detection, and is therefore based on the reference pixel of each detection, even if the actual image in AF1 may be slightly offset from it. This decision was made because, in general, we do not have the same high-resolution SM and AF1 images on ground as the ones used on board.

The first step of the crossmatch is the determination of the sky coordinates of the Gaia detections, but only for those considered genuine. As mentioned, the sky coordinates are computed using the reference acquisition pixel in AF1. The precision is therefore limited by the pixel resolution as well as by the precision of the on-board image parameter determination. The conversion from the observed positions on the focal plane to celestial coordinates, e.g., right ascension and declination, involves several steps and reference systems, as shown in Figure 2.14.

The reference system for the source catalogue is the Barycentric Celestial Reference System (BCRS/ICRS), which is a quasi-inertial, relativistic reference system that is non-rotating with respect to distant extra-galactic objects. Gaia observations are more naturally expressed in the Centre-of-Mass Reference System (CoMRS), which is defined from the BCRS by special relativistic coordinate transformations. This system moves with the Gaia spacecraft and is defined to be kinematically non-rotating with respect to the BCRS/ICRS. The BCRS is used to define the positions of the sources and to model the light propagation from the sources to Gaia. Observable proper directions towards the sources as seen by Gaia are then defined in the CoMRS. The computation of observable directions requires several pieces of additional data such as the Gaia orbit in the Solar system, Solar system ephemerides, etc. As a next step, we introduce the Scanning Reference System (SRS), which is co-moving and co-rotating with the body of the Gaia spacecraft, and is used to define the satellite attitude. Celestial coordinates in the SRS differ from those in the CoMRS only by a spatial rotation given by the attitude quaternions. The attitude used to derive the sky coordinates for the crossmatch is the initial attitude reconstruction OGA1 described in Section 2.4.5.

We now introduce separate reference systems for each telescope, called the Field-of-View Reference Systems (FoVRSs) with their origins at the centre of mass of the spacecraft and with the primary axis pointing to the optical centre of each of the fields, while the third axis coincides with the one of the SRS. Spherical coordinates in this reference system, the already mentioned field angles ($\eta,\zeta$), are defined for convenience of the modelling of the observations and instruments. Celestial coordinates in each of the FoVRSs differ from those in the SRS only by a fixed nominal spatial rotation around the spacecraft rotation axis, namely by half the basic angle of $\Gamma=106.5^{\circ}$ (Section 1.1.3).
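The fixed rotation between the SRS and a FoVRS can be illustrated with a minimal sketch. It assumes a sign convention that the text does not state (preceding field of view centred at $+\Gamma/2$, following one at $-\Gamma/2$, measured from the SRS x axis around the spin axis); the function name `srs_to_field_angles` and the `FOV_SIGN` table are hypothetical and not part of the Gaia software.

```python
import math

BASIC_ANGLE_DEG = 106.5  # nominal basic angle quoted in the text

# Assumed sign convention (not stated in the text): the preceding field of
# view is centred at +Gamma/2 and the following one at -Gamma/2, measured
# from the SRS x axis around the spin (z) axis.
FOV_SIGN = {"preceding": +1.0, "following": -1.0}


def srs_to_field_angles(u, fov):
    """Return the field angles (eta, zeta) in radians for a unit vector u
    given in the SRS, as seen in the chosen field of view."""
    offset = FOV_SIGN[fov] * math.radians(BASIC_ANGLE_DEG) / 2.0
    x, y, z = u
    # Rotate about the z axis so that the centre of the field of view
    # lies on the x axis of the FoVRS.
    x_f = x * math.cos(offset) + y * math.sin(offset)
    y_f = -x * math.sin(offset) + y * math.cos(offset)
    eta = math.atan2(y_f, x_f)                # along-scan field angle
    zeta = math.asin(max(-1.0, min(1.0, z)))  # across-scan field angle
    return eta, zeta


# A direction pointing at the centre of the preceding field of view
# should have field angles close to (0, 0).
half = math.radians(BASIC_ANGLE_DEG) / 2.0
eta, zeta = srs_to_field_angles((math.cos(half), math.sin(half), 0.0), "preceding")
```

Because the rotation is about the shared third axis, only $\eta$ changes between the two fields of view; $\zeta$ is the same in the SRS and in both FoVRSs.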

Finally, and through the optical projections of each instrument, we reach the Focal Plane Reference System (FPRS), which is the natural system for expressing the location of each CCD and each pixel. It is also convenient to extend the FPRS to express the relevant parameters of each detection, specifically the field of view, CCD, TDI gate, and pixel. This is the Window Reference System (WRS). In practical applications, the relation between the WRS and the FoVRS must be modelled. This is done through a geometric calibration, expressed as corrections to nominal field angles as detailed in Section 3.3.6.

The geometric calibration used in the daily pipeline is derived by the First-Look system in the ‘One-Day Astrometric Solution’ (ODAS; Section 2.4.5) whereas the calibration for the cyclic system is produced by the Astrometric Global Iterative Solution (AGIS; Section 3.4.2).

## Scene determination

The scene is in charge of providing a prediction of the objects scanned by the two fields of view of Gaia according to the spacecraft attitude and orbit, the planetary ephemerides, and the source catalogue. It was originally introduced to track the illumination history of the CCD columns for the parametrization of the CTI mitigation. However, this information is also relevant for:

• The astrophysical background estimation and the LSF/PSF profile calibration, to identify nearby sources that may be affecting a given observation. The scene can easily reveal if the transit is disturbed or polluted by a parasitic source.

• The crossmatch, to identify sources that will probably not be detected directly, but still leave many spurious detections, for example from diffraction spikes or internal reflections.

Therefore, the scene does not only include the sources actually scanned by both fields of view but it also identifies:

• Sources without the corresponding Gaia observations. This can happen in case of, e.g.:

  • Very bright sources (brighter than $\sim 6^{\rm th}$ magnitude) or transits of solar-system objects (SSOs) that are not detected in the Sky Mapper (SM) or not confirmed in the first CCD of the Astrometric Field (AF1).

  • Fast SSOs, detected in SM but not successfully confirmed in AF1.

  • High-density regions in which the on-board resources (windows) are insufficient in number to cover all detected objects.

  • Close neighbours on the sky for which the detection and acquisition of two separate observations is impossible (Section 1.1.3).

  • Data losses due to on-board storage overflow, data transfer issues, or processing errors.

• Sources falling into the edges and between CCD rows.

• Sources falling out of both fields of view but so bright that they may disturb or pollute nearby observations.

It must be noted that the scene is established not from the individual observations, but from the catalogue sources and planetary ephemerides. The scene is therefore limited by the completeness and quality of those input tables.

## Spurious detections identification

The Gaia on-board detection software was built to detect point-like images on the SM CCDs and to autonomously discriminate star images from cosmic rays, etc. For this, parametrised criteria of the image shape are used, which need to be calibrated and tuned. There is clearly a trade-off between a high detection probability for stars at the faint end and keeping the detections from bright-star diffraction spikes (and other disturbances) at a minimum. A study of the detection capability, in particular for non-saturated stars, double stars, unresolved external galaxies, and asteroids is provided by de Bruijne et al. (2015).

The main problem with spurious detections arises from the fact that they are numerous (15–20% of all detections), and that each of them may lead to the creation of a (spurious) new source during the crossmatch. Therefore, the detections need to be classified as either genuine or spurious, so that only the former are considered in the crossmatch.

The main categories of spurious detections found in the data so far are:

• Spurious detections around and along the diffraction spikes of sources brighter than approximately 16 mag. For very bright stars, there may be hundreds or even thousands of spurious detections generated in a single transit, especially along the diffraction spikes in the AL direction, see Figure 2.15 for an extreme example.

• Spurious detections in one telescope originating from a very bright source in the other telescope, due to unexpected light paths and reflections within the payload.

• Spurious detections from major planets. These transits can pollute large sky regions with thousands of spurious detections, see Figure 2.16, but they can be easily removed.

• Detections from extended and diffuse objects. Figure 2.17 shows that Gaia is actually detecting not only stars but also filamentary structures of high surface brightness. These detections are not strictly spurious, but require a special treatment. Their CCD images may differ too much from the expected PSF or LSF profile, in which case they may be classified as spurious, meaning that some of them may not be present in Gaia DR2.

• Duplicated detections produced from slightly asymmetric images where more than one local maximum is detected. These produce redundant observations and must be identified during the crossmatch.

• Spurious detections due to cosmic rays. A few manage to get through the on-board filters, but these are relatively harmless as they happen randomly across the sky.

• Spurious detections due to background noise or hot CCD columns. Most are caught on-board, so they are few and cause no serious problems.

For Gaia DR2, the signal in the several CCDs of each transit has been analysed, looking for sufficiently consistent measurements throughout the transit, which helps to identify many of the spurious detections described in the last two categories. Any spurious detections of these kinds that still remain have no impact on the published data, as they occur randomly on the sky and therefore have no corresponding stellar images in the astrometric (AF) CCDs.

Besides the analysis of raw data samples to discard spurious detections due to cosmic rays, background noise, or hot CCD columns, for Gaia DR2 we mainly identify spurious detections around bright sources. To do so, we locate the transits of the bright sources, either using the actual Gaia detections of those sources or the predicted transits obtained from the scene, and select all detections contained within a pre-defined set of boxes centred on the brightest transit. The selected detections are then analysed, and they are classified as spurious if certain distance and magnitude criteria are met. These pre-defined boxes have been parametrised with the features and patterns seen in the actual data according to the magnitude of the source producing the spurious detections.
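The box-based selection can be sketched as follows. This is only an illustration of the idea under stated assumptions: the `Box` fields, the helper `is_spurious`, the use of along-scan (AL) and across-scan (AC) offsets in arcseconds, and all numerical values in `EXAMPLE_BOXES` are hypothetical placeholders, not the parametrisation used for Gaia DR2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangular region centred on the bright transit (hypothetical model)."""
    half_al_arcsec: float  # half-size along scan
    half_ac_arcsec: float  # half-size across scan
    min_delta_mag: float   # detection must be at least this much fainter


# Placeholder values for illustration only; the real boxes depend on the
# magnitude of the bright source and are not given in the text.
EXAMPLE_BOXES = [Box(half_al_arcsec=2.0, half_ac_arcsec=0.5, min_delta_mag=3.0)]


def is_spurious(d_al, d_ac, mag, bright_mag, boxes):
    """Classify a detection from its AL/AC offset (arcsec) to the bright
    transit and its magnitude relative to the bright source."""
    for box in boxes:
        inside = abs(d_al) <= box.half_al_arcsec and abs(d_ac) <= box.half_ac_arcsec
        if inside and (mag - bright_mag) >= box.min_delta_mag:
            return True
    return False
```

A detection that lies inside a box but is nearly as bright as the central source is kept as genuine in this sketch, which mirrors the combination of distance and magnitude criteria described above.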

For very bright sources (brighter than 6 mag) and for the major planets, this model has been extended. For these cases, larger areas around the predicted transits are considered. In addition, both fields of view are scanned for possible spurious detections.

Identifying spurious detections around fainter sources (down to 16 mag) is more difficult, since there are often only very few of them, or none at all. In these cases, a multi-epoch treatment is required to know if a given detection is genuine or spurious, i.e., checking if more transits are in agreement and resolve to the same new source entry. These cases will be addressed in future data releases as the data reduction cycles progress and more information from that sky region becomes available.

Finally, spurious new sources can also be introduced by excursions of the on-ground attitude reconstruction used to project the detections on the sky (i.e., short intervals of large errors in OGA1), leading to misplaced detections. Therefore, the attitude is carefully analysed to identify and clean up these excursions before the crossmatch is run.

## Detection processor

This processing step is in charge of providing an initial list of source candidates for each individual observation.

The first step is the determination of the sky coordinates as described in Section 2.4.9. This step is executed in multiple tasks split by time interval blocks. All Gaia observations enter this step, with the exception of Virtual Objects and data from dedicated calibration campaigns. In addition, all observations positively classified as spurious detections are filtered out.

Once the observation sky coordinates are available, they are compared with a list of sources. In this step, the observation-to-source matching, the sources that cover the sky seen by Gaia in the time interval of each task are extracted from the Gaia catalogue. These sources are propagated to the relevant epoch, taking into account parallax, proper motion, orbital motion, etc.
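As a rough illustration of the epoch propagation, the sketch below applies a first-order proper motion correction only. It deliberately ignores parallax and orbital motion, breaks down close to the celestial poles, and is not the propagation model used in the pipeline; the function name and argument names are hypothetical.

```python
import math

MAS_TO_DEG = 1.0 / 3.6e6  # milliarcseconds to degrees


def propagate_position(ra_deg, dec_deg, pmra_cosdec_mas_yr, pmdec_mas_yr, dt_yr):
    """First-order propagation of a catalogue position by dt_yr years.

    pmra_cosdec_mas_yr is the proper motion in right ascension already
    multiplied by cos(dec), as is customary in astrometric catalogues.
    Parallax and orbital motion are ignored in this sketch.
    """
    dec_new = dec_deg + pmdec_mas_yr * dt_yr * MAS_TO_DEG
    ra_shift = pmra_cosdec_mas_yr * dt_yr * MAS_TO_DEG / math.cos(math.radians(dec_deg))
    ra_new = (ra_deg + ra_shift) % 360.0
    return ra_new, dec_new
```

For a high proper motion source the shift accumulated between the catalogue epoch and the observation epoch can exceed the match radius, which is why the propagation must precede the distance comparison.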

The candidate sources are selected based on a pure angular distance criterion. The decision to use only distance was taken because the position of a source changes slowly and predictably, whereas other parameters such as the magnitude may change in an unpredictable way. Additionally, the initial Gaia catalogue is quite heterogeneous, exhibiting different accuracies and errors, which suggests the need for a match criterion that depends on the provenance of the source data. In later stages of the mission, when the source catalogue is dominated by Gaia astrometry, this dependency can be removed and the criterion should then be updated to take advantage of the better accuracy of the detection in the along-scan direction. At that point, it will be possible to use separate along- and across-scan criteria, or to use an ellipse with the major axis oriented across scan, which will benefit the resolution of the most complex cases.
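A pure angular distance criterion can be sketched as below. The sketch assumes a single fixed match radius and a brute-force loop over the sources; the function names, the dictionary layout of `sources`, and the nearest-first ordering of the result are illustrative choices, not a description of the IDU implementation.

```python
import math


def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation in arcseconds (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def select_candidates(detection, sources, radius_arcsec):
    """Return [(source_id, separation_arcsec), ...] for all sources within
    radius_arcsec of the detection, sorted from nearest to farthest.

    detection is an (ra_deg, dec_deg) pair; sources maps a source
    identifier to its propagated (ra_deg, dec_deg) position.
    """
    ra, dec = detection
    found = []
    for source_id, (s_ra, s_dec) in sources.items():
        sep = angular_separation_arcsec(ra, dec, s_ra, s_dec)
        if sep <= radius_arcsec:
            found.append((sep, source_id))
    return [(source_id, sep) for sep, source_id in sorted(found)]
```

A provenance-dependent criterion, as suggested in the text, could be obtained by replacing the single `radius_arcsec` with a per-source radius.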

A special case is the treatment of observations of solar system objects. The processing of these objects is the responsibility of CU4 (Section 1.2.2) and, for this reason, no special considerations have been implemented in the crossmatch. These observations will have Gaia Catalogue entries created on a daily basis by IDT and those entries will remain, so the corresponding observations will be matched again and again to their respective sources without any major impact on the other observations.

Additional processing may be required when observations without source candidates are left after the observation-to-source matching process. In principle, this situation should be rare since IDT has already treated all observations before IDU runs. However, unmatched observations may arise because of IDT processing failures, updates in the detection classification, updates in the source catalogue, or simply the usage of a stricter match criterion in IDU. This additional process therefore takes the unmatched observations and creates temporary sources as needed, so that no unmatched observations remain after a second run of the source-matching process. The new sources created by these tasks will ultimately be resolved (by confirmation or deletion) in the last crossmatch step.

Summarising, the result of this first step is a set of MatchCandidates for the whole accumulated mission data. Each MatchCandidate corresponds to a single detection and contains a list of source candidates. Together with the MatchCandidates, an auxiliary table is produced to track the number of links created to each source, the SourceLinksCount table. Results are stored in a spatial-based structure using HEALPix (Górski et al. 2005), for convenience of the subsequent processing steps.
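The two products of this step can be pictured with a minimal data-structure sketch. The field names `detection_id` and `source_ids` and the helper `count_source_links` are assumptions made for illustration; the text does not describe the actual layout of the MatchCandidate or SourceLinksCount records.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchCandidate:
    """One detection with its list of source candidates (hypothetical fields)."""
    detection_id: int
    source_ids: List[int] = field(default_factory=list)


def count_source_links(match_candidates):
    """Build a SourceLinksCount-like table: number of detections linked to
    each source. A source listed twice for one detection is counted once."""
    counts = Counter()
    for candidate in match_candidates:
        counts.update(set(candidate.source_ids))
    return counts
```

In this sketch, a link count of one identifies a source that appears as a candidate of a single detection only, which is the kind of bookkeeping a later grouping step can exploit.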

## Sky partitioner

The sky partitioner task is in charge of grouping the results of the observation-to-source matching (Section 2.4.9) according to the source candidates provided for each individual detection. The purpose of this process is to create self-contained groups of MatchCandidates. The process starts by loading all MatchCandidates for a given sky region. From the loaded entries, the unique list of matched sources is identified and the corresponding SourceLinksCount information is loaded. Once loaded, a recursive process is followed to find the isolated and self-contained groups of detections and sources. The final result of this process is a set of MatchCandidateGroups (as shown in Figure 2.18) in which all the input observations are included. In summary, within a group, all observations are related to each other by links to source candidates. Consequently, sources present in a given group are not present in any other group.
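Finding self-contained groups amounts to computing the connected components of the graph whose nodes are detections and sources and whose edges are the candidate links. The sketch below uses an iterative union-find structure instead of the recursive process mentioned in the text; it is an equivalent formulation chosen for brevity, not the IDU implementation, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def partition_match_candidates(candidates):
    """Group detections and sources into isolated, self-contained groups.

    candidates maps a detection identifier to an iterable of candidate
    source identifiers. Returns a list of (detections, sources) set pairs.
    """
    parent = {}

    def find(node):
        # Follow parent links to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Detections and sources live in separate identifier spaces, so the
    # nodes are tagged to avoid collisions between equal numbers.
    for detection_id, source_ids in candidates.items():
        det_node = ("det", detection_id)
        parent.setdefault(det_node, det_node)
        for source_id in source_ids:
            src_node = ("src", source_id)
            parent.setdefault(src_node, src_node)
            union(det_node, src_node)

    groups = defaultdict(lambda: (set(), set()))
    for node in parent:
        kind, ident = node
        detections, sources = groups[find(node)]
        (detections if kind == "det" else sources).add(ident)
    return list(groups.values())
```

For example, detections 1, 2 and 3 with candidates `[10]`, `[10, 11]` and `[11]` end up in one group, because detection 2 bridges sources 10 and 11, while a detection linked only to source 20 forms a group of its own.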

In early runs, there is a certain risk of ending up with unmanageably big groups. For those cases, a limit on the number of sources per group has been introduced so that the processing is not stopped. The adopted approach may create spurious or duplicated sources in the overlapped area of these groups. However, as the cyclic processing progresses, these cases should disappear (groups will be reduced) due to better precision in the catalogue, improved attitude and calibration, and the adoption of a smaller match radius. So far, such cases have not been encountered, and the practical limit for the number of sources per group has therefore not yet been reached.

After this process, each MatchCandidateGroup can be processed independently from the others as the observations and sources from two different groups are fully independent.

## Crossmatch resolution

The final step of the crossmatch is the most complex, namely resolving the final matches and consolidating the new sources. Three main cases need to be solved:

• Duplicate matches: when two (or more) detections close in time are matched to the same source. This will typically be either newly resolved binaries or spurious double detections.

• Duplicate sources: when a pair of sources from the catalogue has never been observed simultaneously (so that two detections have never been identified within the same time frame) but shares the same matches. This can be caused by double entries in the working catalogue.

• Unmatched observations: observations without any valid source candidate.
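The first of these cases, duplicate matches, can be detected with a simple scan over the matches of each source. The sketch below is illustrative only: the tuple layout of `matches`, the function name, and the time window `max_dt_s` are assumptions, and the text does not specify how "close in time" is defined in the pipeline.

```python
from collections import defaultdict


def find_duplicate_matches(matches, max_dt_s):
    """Find detections close in time that are matched to the same source.

    matches is an iterable of (detection_id, source_id, time_s) tuples.
    Returns {source_id: [[detection_id, ...], ...]}, listing for each source
    the runs of two or more detections whose consecutive observation times
    differ by at most max_dt_s seconds.
    """
    by_source = defaultdict(list)
    for detection_id, source_id, time_s in matches:
        by_source[source_id].append((time_s, detection_id))

    duplicates = {}
    for source_id, items in by_source.items():
        items.sort()
        runs, current = [], [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= max_dt_s:
                current.append(item)
            else:
                runs.append(current)
                current = [item]
        runs.append(current)
        close = [[det for _, det in run] for run in runs if len(run) > 1]
        if close:
            duplicates[source_id] = close
    return duplicates
```

Each run returned by this sketch is a conflict that the resolution step has to settle, either by keeping one detection on the source and creating new sources for the others, or by recognising a spurious double detection.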

For the first cyclic processing and Gaia DR1, the resolution algorithm was based on a nearest-neighbour solution in which the conflict between two given observations was resolved independently from the other observations included in the group. This was a simple and quick conflict resolution algorithm. However, this approach did not minimise the number of new sources created when more than two observations close in time had the same source as primary match. The crossmatch resolution algorithm in Gaia DR2 is much more sophisticated. In particular, it uses tailored clustering and resolution algorithms in which all the relations between the observations contained in each group are taken into account to generate the best possible resolution.