1.3.4 Processing configuration

Author(s): Rocio Guerra, Javier Castañeda, Chantal Panem, Francesca De Angeli, Krzysztof Nienartowicz, Rosario Messineo, Jos de Bruijne

DPCE

The Data Processing Centre at ESAC (DPCE) is part of the Gaia Science Operations Centre (SOC; Section 1.1.5) at ESAC, Madrid, Spain. DPCE operates software from:

  • CU1:

    • Main DataBase (MDB; Section 1.2.3);

    • MOC Interface Task (MIT);

    • De-compression and Calibration Services (DCS);

    • Payload Operations System (POS);

    • Gaia Transfer System (GTS);

  • CU3:

    • Initial Data Treatment (IDT; Section 2.4.2);

    • First Look (FL; Section 2.5.2);

    • Astrometric Global Iterative Solution (AGIS; Section 3.4.2);

  • DPCE:

    • IDT/FL DataBase (IDTFLDB);

    • Daily PipeLine (DPL);

    • Gaia Observing Schedule Tool (GOST).

DPCE has responsibility for interactions with the Mission Operations Centre (MOC; Section 1.1.5) regarding payload calibration and operations. As the central ’hub’ in the Gaia Science Ground Segment, DPCE is the interface between the MOC and other DPCs in DPAC (Section 1.2.2). DPCE tasks include science-data retrieval from MOC and distribution to the other DPCs (after IDT/FL processing) as well as distribution of the MDB, i.e., receiving processed data from the other DPCs and assembly of the subsequent version after DPAC processing (Section 9.2.1). Data transfers make use of the internet-based Gaia Transfer System. The current bandwidth between the hub – DPCE – and the other DPCs is 1 Gb s⁻¹; this can be increased as needed for subsequent processing cycles.
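
As an illustration of what this link implies, the sketch below (Python, assuming a fully utilised, sustained 1 Gb s⁻¹ connection, which is optimistic for real transfers through the Gaia Transfer System) estimates the time needed to move a given data volume; the 107 TB used in the example is the Gaia DR2 MDB input volume quoted later in this section.

    # Back-of-the-envelope transfer-time estimate for the DPCE hub link.
    # Assumption: a sustained, fully utilised 1 Gb/s link; real transfers
    # through the Gaia Transfer System will typically achieve less.

    def transfer_time_days(volume_tb, bandwidth_gbps=1.0):
        """Days needed to move volume_tb terabytes over a bandwidth_gbps link."""
        bits = volume_tb * 1e12 * 8              # TB -> bits (decimal prefixes)
        seconds = bits / (bandwidth_gbps * 1e9)  # bits / (bits per second)
        return seconds / 86400.0

    # Example: the 107 TB of MDB input data quoted below for Gaia DR2.
    print(f"{transfer_time_days(107):.1f} days at 1 Gb/s")  # about 9.9 days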

DPCE performs both daily and cyclic processing (Section 1.2.3).

Daily processing

Daily processing at DPCE aims to transform the raw observation data from the spacecraft into a form usable for further processing and to provide an initial treatment, producing higher-level data. The input to daily processing is the telemetry stream from the spacecraft, with initial input data from CU3 and CU5 in the form of catalogues and calibration data. The daily processing started during the commissioning phase of Gaia and will continue until the end of (nominal and possibly extended) spacecraft operations.

The principal software systems (MIT, IDT, and FL) have run with versions 17.0, 17.1, 18.0, 19.0, and 19.1 (including a number of patch releases to fix specific issues found during operations). The following table shows the main data types produced at DPCE that are input to Gaia DR2, together with the associated total counts:

Data type Total counts Section
ApBackgroundRecordDt 3 342 098 Section 2.4.6
AstroElementary 42 081 775 514 Section 3.1
BamElementary 4 869 335 Section 1.3.4
BaryVeloCorr 35 559
BiasRecordDt 1 459 188 Section 2.4.6
Oga1 37 552 Section 2.4.5
PhotoElemSmo 32 499 859
PhotoElementary 43 715 701 165 Chapter 5
AcShifts 409 406 192 Section 2.2.2
AcShiftsSmearing 241 452 010 Section 2.2.2
AocsAttitude 303 342
AstroObsSpecialVo 15 401 346
AstroObservation 51 865 925 469 Section 2.4.3
AstroObservationVo 220 422 510
BamObservation 4 845 216 Section 1.3.4
ChargeInjection 85 144 231 Section 2.4.6
FailedBamObs 320 Section 1.3.4
GateInfoAstro 614 816 533
GateInfoPhoto 26 374 219
GateInfoRaw 1 565 544 272
GateModePkt 643 312
ObjectLogAFXP 58 685 100 022 Section 2.2.2
ObjectLogRVS 3 501 424 892 Section 2.2.2
ObjectLogRvsRotSmear 2 017 768 922 Section 2.2.2
PhotoObsSpecialVo 15 401 346
PhotoObservation 52 096 462 106 Section 2.4.3
PhotoObservationSmo 32 425 656
PhotoObservationVo 221 723 923
PreScan 1 555 233 Section 2.2.2
RvsResolution 1 359 Section 2.2.2
SifPkt 21 764 524 Section 2.2.2
SpectroObsSpecialVo 519 832
SpectroObservation 3 082 880 510 Section 2.4.3
SpectroObservationVo 112 045 072
Statistics 1 564 968
StatsVariant 1 562 416
WfsPkt 1 297 171 Section 1.1.3
ZoomGateModePkt 2 699 317

Cyclic processing

The cyclic processing at DPCE consists of running AGIS and operating the MDB:

  • AGIS produces the main astrometric solution for the mission (Section 3.4.2). The applicable version for Gaia DR2 is 19.1, which generated 2 499 374 508 sources (before CU9 validation and filtering).

  • The MDB is the central repository for the Gaia mission data (Section 1.2.3). The input data volume for the Gaia DR2 generation is 107 TB. The MDB comprises a number of tools required for handling the input/output actions, assessing the data accountability, and checking the data consistency. In particular, the MDB Integrator collects all output data processed during the data-processing cycle plus previous inputs, and unifies them into a unique table of Gaia sources for publication (Section 9.2), as illustrated conceptually in the sketch below.
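
The sketch below (Python; hypothetical field names and input tables, not the actual MDB data model) illustrates what such a unification into a single source table amounts to.

    # Conceptual sketch of the MDB Integrator idea: combine per-system outputs
    # into one record per source, keyed on the source identifier.
    # Field names and inputs are hypothetical, not the MDB data model.

    astrometry = {1000001: {"ra": 10.5, "dec": -3.2}}   # e.g. from AGIS
    photometry = {1000001: {"g_mean_mag": 15.3}}        # e.g. from PhotPipe
    radial_vel = {1000001: {"rv_kms": 12.7}}            # e.g. from CU6

    def integrate(*tables):
        """Merge dictionaries keyed on sourceId into a single source table."""
        merged = {}
        for table in tables:
            for source_id, fields in table.items():
                merged.setdefault(source_id, {}).update(fields)
        return merged

    sources = integrate(astrometry, photometry, radial_vel)
    print(sources[1000001])   # one unified record per source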

DPCB

Background

The Data Processing Centre of Barcelona (DPCB) is embedded in the Gaia DPAC group at the University of Barcelona (UB) and the Institute for Space Studies of Catalonia (IEEC), in close cooperation with the Barcelona Supercomputing Centre (BSC) and the Consorci de Serveis Universitaris de Catalunya (CSUC), also in Barcelona, Spain. The operational DPCB hardware is provided by BSC, whereas the team at the UB/IEEC carries out the management, operations, development, and software tests. The responsibilities of DPCB are the execution of the:

  • Gaia Transfer System (CU1-GTS), for the reception of input data on a daily basis from DPCE and for the transfer of data produced at DPCB to the DPCE hub;

  • Intermediate Data Updating (CU3-IDU), one of the major processing tasks in the cyclic processing in charge of regenerating all the intermediate data as described in Section 2.4.2;

  • Simulations, specifically CU2-GASS and CU2-GOG, which simulate satellite telemetry and the final Gaia catalogue, respectively (Section 1.2.4). GASS simulations have been key in the (pre-launch) development and testing of software products across DPAC. GOG simulations are still being generated and have been an essential part of the CU9 software validation and testing.

The main focus of DPCB during operations is the execution of the cyclic processing, running several stages of IDU during each data processing cycle. In some cases, depending on the inputs available and specifically in the early processing cycles, only part of these subsystems has been run, whereas, in later stages of the mission, some of these subsystems have been executed repeatedly within a given cycle.

In addition to the execution of IDU, DPCB is also responsible for the integration of the IDU software into the execution environment within the available resources, in particular the MareNostrum III supercomputer. This computer, hosted at BSC, offered a peak performance of 1.1 Petaflops and 100.8 TB of memory during Gaia DR2 operations. In June 2017, this supercomputer was upgraded to MareNostrum IV, increasing the peak performance to 13.7 Petaflops and the memory to 390.8 TB.

The design and implementation of IDU and its integration in DPCB present a variety of interesting challenges, covering not only the purely scientific problems that appear in any data-reduction process but also technical issues that arise during the processing of the large amount of data that Gaia generates. In particular, DPCB has developed an efficient and flexible execution framework, including tailored data-access routines, efficient data formats, and an autonomous application in charge of handling and checking the correctness of all input data entering or produced by IDU.

Through IDU, DPCB is responsible for the reprocessing of all the accumulated astrometric data collected from the spacecraft, adding the latest measurements, and recomputing the IDU outputs using the latest calibrations to obtain improved scientific results. These improved results are the starting point for the next iterative reduction loop, which includes AGIS and PhotPipe (Section 3.4.6).

Gaia DR2

For Gaia DR2, the following IDU processes have been executed:

  • On-ground Attitude Reconstruction (OGA): initial attitude determination to compute preliminary sky coordinates for all observations to allow matching them with catalogue sources;

  • Scene: predicts the CCD transit times of sources given an input catalogue and the spacecraft attitude;

  • Detection classifier: flags spurious detections which have to be ignored in the subsequent processes;

  • Cross-match: matches observations to sources;

  • CCD Bias and AstroPhysical Background (APD) calibrations;

  • Image-Parameter Determination (IPD): computes the location and flux estimations for each individual CCD window using the latest calibrations;

  • Validation: provides technical and scientific consistency checks of the IDU products.

These processes are nominally executed after a given data segment is closed and DPCB has received all associated data. The tasks are executed sequentially in a single run over the full data set. In the current case, these tasks included all the data up to and including data segment 2 (Table 1.4).
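
As a purely illustrative sketch of this sequential, single-run execution over a closed data segment, the Python fragment below chains hypothetical placeholder functions in the order of the tasks listed above; it is not the actual IDU framework.

    # Minimal sketch of a sequential cyclic run over a closed data segment,
    # following the IDU task order listed above. The task functions are
    # hypothetical placeholders, not the actual IDU modules.

    def oga(data):        return {**data, "attitude": "preliminary"}   # attitude reconstruction
    def scene(data):      return {**data, "predicted_transits": []}    # predicted CCD transit times
    def classifier(data): return {**data, "spurious_flags": []}        # flag spurious detections
    def crossmatch(data): return {**data, "matches": {}}               # observation-to-source match
    def bias_apd(data):   return {**data, "calibrations": {}}          # bias and background calibrations
    def ipd(data):        return {**data, "image_parameters": {}}      # image-parameter determination
    def validation(data): return {**data, "validated": True}           # consistency checks

    def run_idu_segment(segment):
        """Run all tasks sequentially over a full, closed data segment."""
        for task in (oga, scene, classifier, crossmatch, bias_apd, ipd, validation):
            segment = task(segment)
            print(f"{task.__name__} finished")
        return segment

    run_idu_segment({"segment": "segment-2"})   # hypothetical segment label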

In the Gaia DR2 processing cycle, IDU version 19.0.0 was used. All processing, including validation activities and delivery of the data to DPCE, was performed in 48 days on the MareNostrum III supercomputer, in the period 19 July to 5 September 2016. During this period, 51 712 381 972 observations and 30 TB of input data were successfully processed.

The number of new sources that were erroneously created as a result of spurious detections has been significantly reduced owing to the integration of more sophisticated cross-match algorithms and improved spurious-detection models. Nevertheless, spurious detections are still the main cause of catalogue pollution and additional effort is still committed to improve the detection classification in forthcoming releases.

The CCD Bias and AstroPhysical Background calibration and Image Parameter Determination tasks were executed for the first time during Gaia DR2 processing. The downstream processing systems, which for Gaia DR1 had been based on the daily processing products, thus received improved inputs; the inclusion of these new tasks in the cyclic pipeline has improved the quality of the data published in Gaia DR2 (see Chapter 2).

The total number of computing hours consumed and the total volume of data produced by each process during Gaia DR2 processing, excluding intermediate-data arrangement and validation tasks and activities, are shown in Table 1.11. When accounting for all operations in MareNostrum III, the total number of CPU hours approaches 600 000.

Table 1.11: Total CPU hours and data size at DPCB for Gaia DR2 activities, only including successful runs.
Task CPU hours Data volume [Gigabytes]
On-ground Attitude Reconstruction 1 000 375
Scene 8 959 3 085
Detection Classifier 5 431 84
Cross-match (at observation level) 150 022 1 420
Bias and AstroPhysical Background (APD) 58 755 1 560
Image-Parameter Determination (IPD) 165 078 18 910
Total 389 245 25 434
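
The per-task figures in Table 1.11 can be cross-checked against the quoted totals with a trivial sum, as in the Python snippet below (task labels abbreviated).

    # Cross-check of the Table 1.11 totals (CPU hours, data volume in GB).
    tasks = {
        "OGA":         (1_000,   375),
        "Scene":       (8_959,   3_085),
        "Classifier":  (5_431,   84),
        "Cross-match": (150_022, 1_420),
        "Bias/APD":    (58_755,  1_560),
        "IPD":         (165_078, 18_910),
    }
    cpu_hours = sum(cpu for cpu, _ in tasks.values())
    volume_gb = sum(vol for _, vol in tasks.values())
    print(cpu_hours, volume_gb)   # 389245 and 25434, as quoted in the table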

In addition to the operational activities described above, DPCB also provides support to the development and testing of IDT and other related products. These activities use mainly resources in CSUC. Being involved in the development of IDT has provided the DPCB team unique knowledge and expertise on the processing of raw spacecraft data. This, in turn, has contributed to significant improvements in the software developed for the execution of IDU.

DPCC

Background

The Data Processing Centre at CNES (DPCC) is located at the Centre National d’Etudes Spatiales, Toulouse, France. It is responsible for running the CU4, CU6, and CU8 processing chains (Table 1.3). This includes both daily (Section 1.2.3) and cyclic processing (Section 1.2.3). DPCC is also in charge of backing up the Gaia science telemetry archive as well as the Main DataBase (MDB, located at DPCE; Section 1.2.3), which acts as the central hub of all Gaia science and calibration data. DPCC faces several challenges:

  • a large number of elements to handle: dozens of database tables, containing up to 80 billion rows;

  • complex processing with temporal constraints and resource sharing: short delays for daily processing have to be balanced against deadlines for cyclic processing;

  • the cyclic chains process a growing amount of data from one data-processing cycle to the next, with significant added complexity in the software modules: the volume of data produced and the required amount of processing can only be assessed based on estimations;

  • a large data volume: some 3 PB of processed data are foreseen at the end of the nominal, five-year mission (not counting intermediate data generated at DPCC).

DPCC uses Hadoop and Map/Reduce technologies as the core of its framework. Hadoop offers a highly distributed framework, with a powerful distributed file system (HDFS), to manage the resource allocation to each processing chain. Scalability of the solution allows an incremental purchase of hardware in order to follow the growing needs in terms of data volume and processing power over the five years of the nominal mission. For the CU4 Solar-System-Object (SSO) chain (see below), a dedicated Cassandra database is used. The current operational platform contains 172 computing nodes with 3500 cores and 19 TB of memory and offers a total storage capacity of 2.6 PB. The final, operational DPCC cluster will require some 5800 cores and 4.5 PB of storage. The MDB backup system is based on temporary disk servers (64 TB) and a robotic tape system.
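
The map/reduce pattern at the heart of this framework can be illustrated with a minimal, single-process Python analogue (here counting transits per source identifier); this is only a sketch of the programming model, with hypothetical record fields, and bears no relation to the actual DPCC chains.

    # Single-process analogue of the map/reduce pattern used at DPCC
    # (Hadoop/HDFS in production). Example task: count transits per source.
    # Records and field names are hypothetical.
    from collections import defaultdict

    records = [
        {"source_id": 42, "transit_id": 1},
        {"source_id": 42, "transit_id": 2},
        {"source_id": 7,  "transit_id": 3},
    ]

    # Map phase: emit (key, value) pairs.
    mapped = ((rec["source_id"], 1) for rec in records)

    # Shuffle and reduce phase: aggregate values per key.
    counts = defaultdict(int)
    for key, value in mapped:
        counts[key] += value

    print(dict(counts))   # {42: 2, 7: 1}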

DPCC is responsible for all aspects related to the hardware used to process the science data on its cluster (from purchase to maintenance and system administration) as well as for the development and maintenance of the software infrastructure required to process and archive the input and output data. DPCC runs the scientific software modules developed within the CU4, CU6, and CU8 coordination units in DPAC and delivers the data back to the CU4/CU6/CU8 scientists and to the MDB at DPCE. DPCC plays a fundamental role in the validation, pre-integration, and integration of the scientific software modules into the final framework, up to the final qualification of the overall system. All operational aspects, from data deliveries to pipeline operations, are also under DPCC responsibility.

The DPCC data-reception chain automatically analyses and indexes the 200 GB of data that are received daily as input to the daily processing chains. For each daily chain, DPCC publishes the main results of the chain execution, the log files, and the execution reports on a Web portal (GAIAWEB) for analysis and monitoring by the CU4/CU6/CU8 scientists. For the cyclic processing chains, data are delivered to DPCC at agreed times in batches of several tens of TB. The scientific results of the cyclic chains, which are much smaller in volume, are transferred back to the MDB.

Gaia DR2

In addition to the two daily processing chains that run at DPCC (CU4 and CU6), three cyclic chains (CU4, CU6, and CU8) ran at DPCC from March to October 2017, during the second data-processing cycle (often referred to as Cycle-02):

  • CU4

    The CU4 object-processing chains process eclipsing binaries identified by the variability coordination unit (CU7) plus all objects not processed or identified in the astrometric (CU3), photometric (CU5), or spectroscopic (CU6) coordination units. CU4 objects include Non-Single Stars (NSSs), Solar-System Objects (SSOs) such as comets and asteroids, and Extended Objects (EOs) such as unresolved galaxies. Gaia DR2 only contains results from the SSO chains (see Chapter 4). The SSO Short-Term chain (SSO-ST; Chapter 4) ran on a (quasi-)daily basis to process newly discovered SSOs and to generate science alerts to IMCCE for ground-based follow-up through the Gaia Follow-Up Network for Solar System Objects (Gaia-FUN-SSO; https://gaiafunsso.imcce.fr/). The SSO Long-Term chain (SSO-LT; Chapter 4) ran several times on the 22 months of science data that form the basis for Gaia DR2 (Table 1.4). SSO-LT processed 318 290 transits of known objects. Each run lasted less than 10 days. Some 1100 cores, 6.6 TB of memory, and 55 TB of disk space were allocated to the CU4 chain. It published 190 GB of validation data to CU4 scientists and delivered 450 MB of pre-validation results for 14 124 objects to the MDB.

  • CU6

    The CU6 spectroscopic processing chains process and analyse the data obtained with the Radial Velocity Spectrometer (RVS; Cropper et al. 2018). The goals of the spectroscopic processing system, which is detailed in Chapter 6, are:

    • to monitor the health of the RVS spectrometer and to calibrate its characteristics;

    • to provide radial and rotational velocities;

    • to issue variability and multiplicity diagnostics;

    • to alert on objects that require a rapid ground-based follow-up; and

    • to provide clean, calibrated spectra.

    The CU6 bias non-uniformity calibration chain (UC1; Section 6.3.1) ran several times, triggered by the reception of calibration data (Section 1.3.3). The CU6 daily calibration chain (UC2; Section 6.2.2) has been running on a (quasi-)daily basis since the middle of the first data-processing cycle and performs a daily RVS calibration (wavelength, along-scan LSF, across-scan LSF, diffuse background, and photometric calibration) and also derives the radial velocities of stars. The CU6 spectroscopic chain (CU6 Global R2; Section 6.1.2) processed the 22 months of science data that form the basis for Gaia DR2 (Table 1.4). Some 1900 cores, 13.4 TB of memory, and 450 TB of disk space were allocated to the CU6 chain. It published 5.6 TB of validation data to CU6 scientists and delivered 2.4 TB of pre-validation radial-velocity data for 10 million objects to the MDB.

  • CU8

    The CU8 astrophysical-parameter processing chain (Apsis; Chapter 8) classifies Gaia objects and estimates their astrophysical parameters. The Apsis chain ran several times on the 22 months of science data that form the basis for Gaia DR2 (Table 1.4). Each run lasted less than 2 days. Some 1100 cores, 7 TB of memory, and 40 TB of disk space were allocated to the CU8 chain. It published 1.5 TB of validation data to CU8 scientists and delivered 48 GB of pre-validation results for 170 million objects brighter than magnitude 17 to the MDB.

DPCI

Background

The Gaia broad-band photometric and spectro-photometric BP/RP data are processed at the Data Processing Centre located at the Institute of Astronomy (IoA, University of Cambridge, United Kingdom). This centre is also referred to as DPCI. DPCI is responsible for all aspects related to the hardware used to process the data, from purchase to maintenance and system administration, as well as for the development of the software infrastructure required to run the scientific modules developed within CU5 on the DPCI cluster. DPCI plays a fundamental role in the integration of those modules into the pipeline. All operational aspects, from data deliveries to pipeline operation, are under DPCI responsibility.

The pipeline that processes the broad-band photometric and the spectro-photometric data, called PhotPipe, is explained in detail in Chapter 5. For the production of the data included in Gaia DR2, PhotPipe releases 19.1.0, 19.2.0, 19.3.0, and 19.4.0 have been used. In PhotPipe, the scientific modules are implemented as a series of ’Processing Elements’ that can be assembled into a workflow or ’Recipe’. Recipes are defined using a custom-designed Domain Specific Language (DSL), called Scylla, which is based on functional programming.
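
The idea of assembling small Processing Elements into a Recipe by functional composition can be illustrated with the rough Python analogue below; the element names are hypothetical and the snippet is not the Scylla DSL itself, only an illustration of the compositional principle.

    # Rough analogue of assembling Processing Elements into a Recipe by
    # function composition. Element names are hypothetical; this is not the
    # Scylla DSL, only an illustration of the functional idea behind it.
    from functools import reduce

    def compose(*elements):
        """Chain processing elements left to right into a single recipe."""
        return lambda data: reduce(lambda d, element: element(d), elements, data)

    # Hypothetical processing elements operating on a per-transit record.
    subtract_bias = lambda t: {**t, "flux": t["flux"] - t.get("bias", 0.0)}
    apply_gain    = lambda t: {**t, "flux": t["flux"] * t.get("gain", 1.0)}

    recipe = compose(subtract_bias, apply_gain)
    print(recipe({"flux": 100.0, "bias": 2.0, "gain": 1.1}))   # flux = 107.8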

During operations, data is received daily at DPCI. The automatic data-handling system records new deliveries and stores metadata into a database. A process converts the input data from the DPAC format into the DPCI-internal data model that is optimized for the PhotPipe processing. Modelling of the complex data stream is done using yet another DSL, called Charybdis, which has been developed at DPCI.

PhotPipe operates in cyclic mode, i.e., PhotPipe operations start when all data for a data segment has been received and when the results of IDU (image-parameter determination and cross-matching in particular; Section 2.4.2) and AGIS (attitude, calibrations, and source astrometry; Section 3.4.2) from the same cycle have been received. The processing in PhotPipe can be divided into the following steps:

  • ingestion and pre-processing of data, including the computation of CCD bias corrections, heliotropic angles, predicted and extrapolated positions, and the creation of types optimized for the PhotPipe processing, by joining several inputs coming from different upstream systems;

  • BP/RP pre-processing and initial calibration, in particular background (this covers the straylight component only for Gaia DR2) and along- and across-scan geometric calibrations;

  • internal calibration of the integrated fluxes, including the initialisation of the photometric internal reference system and all the internal calibrations required to remove all instrumental effects (time-link calibration, gate and window class-link calibration, large- and small-scale geometric calibrations);

  • calibration of the BP/RP instrument model, taking the effect of variations in the flux response and line-spread function with time as well as across the BP and RP CCDs into account;

  • external calibration, creating the link between the internal photometric reference system, for both photometric and spectral data, and the absolute one, thus allowing comparisons of Gaia data with other catalogues;

  • export of the data produced by PhotPipe to the MDB for integration with results from other DPAC systems, for distribution of the data to downstream users within DPAC, and for creation of selections to be released to the public in formal data releases.

At each data-processing cycle, PhotPipe re-processes all data collected by Gaia since the start of the nominal mission. Particularly in the first cycles, significant enhancements in the software and algorithms materialise as a result of a continuously improving understanding of the character of the data. The cyclic nature of the DPAC processing ensures that these improvements affect all the science data collected so far.

The large data volume produced by Gaia (some 26 billion field-of-view transits per year), the complexity of its data stream, and the self-calibrating approach adopted in the science data processing pose unique challenges in terms of scalability, reliability, and robustness of both the software pipelines and the operational infrastructure on which they are executed. To cope with these challenges, DPCI has adopted Hadoop and Map/Reduce as the core technologies for its infrastructure.

Gaia DR2

During Gaia DR2 processing, sometimes referred to as Cycle 2 operations, over 51.7 billion field-of-view transits entered the PhotPipe processing. Only a fraction of these (40.9 billion) were cross-matched to a source, thus enabling further processing in PhotPipe. The remaining 10.7 billion transits were considered spurious detections and therefore not associated with any source.

In its current status, the software can successfully handle nominal cases, including truncated windows and unexpected TDI-gate/window-class configurations (so-called complex TDI-gate cases, i.e., windows affected by two TDI gates, are currently not treated). All processing in PhotPipe is transit-based. The source mean photometry is produced by accumulating the calibrated epoch photometry for all transits cross-matched to the same source.
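
A minimal sketch of such an accumulation is shown below; the inverse-variance weighting and the record layout are illustrative assumptions, not necessarily the scheme actually used in PhotPipe.

    # Minimal sketch of accumulating calibrated epoch photometry into a mean
    # value per source. The inverse-variance weighting is an illustrative
    # assumption, not necessarily the PhotPipe scheme.
    from collections import defaultdict
    import numpy as np

    # (source_id, calibrated_flux, flux_error) per transit (hypothetical values).
    epochs = [
        (1, 100.0, 1.0),
        (1, 102.0, 2.0),
        (2,  50.0, 0.5),
    ]

    per_source = defaultdict(list)
    for source_id, flux, err in epochs:
        per_source[source_id].append((flux, err))

    for source_id, obs in per_source.items():
        flux = np.array([f for f, _ in obs])
        weight = 1.0 / np.array([e for _, e in obs]) ** 2
        mean_flux = np.sum(weight * flux) / np.sum(weight)
        print(source_id, round(mean_flux, 2))   # weighted mean flux per source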

In the following, a detailed accounting of the operations for Cycle 2 is provided, including how the number of calibrated transits has been affected by various problems encountered. Note that even though some transits do not have all the required information available, they are nevertheless brought forward in the processing and stored: information that is currently missing for non-nominal situations might become available in future cycles as a result of advances in the processing software, allowing recovery of currently non-treated transits.

  • The operations started with 51 715 656 100 AstroObservations (AOs), 51 715 475 265 PhotoObservations (POs), and 51 712 381 972 AstroElementaries (AEs).

  • Out of all field-of-view transits, 41 190 636 992 had valid image-parameter-determination results for all SM/AF CCDs. Incomplete transits have also been processed but are more likely to drop off in case of other problems affecting the available CCD transits.

  • Out of the 51 712 381 972 field-of-view transits available, 40 966 546 433 had a valid match, and 10 737 486 581 were blacklisted (i.e., considered to be spurious detections; Section 2.4.9). Blacklisted transits have not been calibrated.

  • The above numbers refer to the entire period that entered the processing for Cycle 2. Detailed accounting for the period used for the initialisation of the photometric reference system (covering approximately one year of observations) provides the following figures:

    • Out of the 29 728 761 613 input field-of-view transits, 1 366 386 were filtered out as being collected during decontamination or refocus activities (Table 1.6). Integrated BP/RP and spectral-shape coefficients (SSCs; Section 5.1.3) were computed for a total of 17 596 705 837 field-of-view transits. The missing transits were flagged as Truncated, EliminatedSamples, GateRelease, MissingWindow, or having a distance to the last Charge Injection < 0 (i.e., having a charge injection within the window). For a number of transits, the computation of predicted positions, which are critical for further scientific processing, was not successful: these failures affected 201 367 425 transits in BP and 229 254 199 transits in RP. Other possible causes of reductions in the number of transits available for the photometric calibration are: no bias correction (i.e., no ASD7 ObjectLog coverage), missing heliotropic coordinates (i.e., no attitude coverage), no background estimate (i.e., insufficient VO coverage), or failure in the computation of the spectral-shape coefficients for either BP or RP. Epoch photometry entries numbered 23 475 908 195. Clearly, many of these are incomplete, thus requiring ad hoc calibrations with assumed default colour information.

    • Indeed, when performing the first accumulation of calibrated data, a catalogue of 2 298 160 540 sources was created. Through the iterations, this number was significantly reduced owing to incompleteness of the colour information, colour restrictions imposed by the model used for the time-link calibration, or missing calibrations (mostly at the bright end). The catalogue of mean photometry created from the period used for the initialisation of the photometric system (after five iterations of large-scale calibration and two iterations between large- and small-scale calibrations) contained a total of 1 527 436 167 sources.

  • The final photometric catalogue, before integration with the astrometric data and before filtering resulting from validation activities, contains 2 572 718 795 sources.

DPCG

Background

The Data Processing Centre in Geneva (DPCG) is embedded in the Gaia DPAC group at the Astronomical Observatory of the University of Geneva, Switzerland. Its physical location is the Integral Science Data Centre (ISDC), a part of Geneva Observatory, in Versoix near Geneva. DPCG runs the Integrated Variability Pipeline (IVP), which is the CU7 variability pipeline integrated into DPCG’s software and hardware infrastructure. Throughout the pipeline processing, data are visualised for monitoring and quality-control purposes. DPCG performs cyclic processing that depends on the input provided by the CU3, CU4, CU5, CU6, and CU8 systems. The overall DPCG processing aims at providing a characterisation and classification of the variability aspects of the celestial objects observed by Gaia. The information DPCG will provide at the end of the mission comprises:

  • Reconstructed time series;

  • Time-series statistics for all objects and all time series;

  • Identification of variable objects;

  • Identification of periodic objects;

  • Characterisation of variable objects: significant period values, amplitudes, phases, and models;

  • Classification of the objects: a probability vector providing the probability per object to be of a given variability type;

  • Additional attributes that depend on the classification of the objects to a given variability type.

The IVP extracts attributes which are specific to the objects belonging to specific classification types. This output of the IVP is transferred to DPCE and integrated into the MDB (Section 1.2.3) from where it is used as input for further processing, mainly by CU4, CU6, and CU8.
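
As an illustration of the kind of analysis involved in identifying and characterising periodic objects, the sketch below runs a Lomb-Scargle periodogram (astropy) on a synthetic, irregularly sampled light curve; it is a generic textbook example, not the CU7/IVP implementation.

    # Illustrative period search on an irregularly sampled light curve using a
    # Lomb-Scargle periodogram (astropy). Generic example, not the CU7 code.
    import numpy as np
    from astropy.timeseries import LombScargle

    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0.0, 600.0, 80))           # observation times [days]
    true_period = 3.7                                   # days (synthetic)
    mag = (15.0 + 0.3 * np.sin(2 * np.pi * t / true_period)
           + rng.normal(0.0, 0.02, t.size))             # noisy magnitudes

    frequency, power = LombScargle(t, mag).autopower(minimum_frequency=0.01,
                                                     maximum_frequency=1.0)
    best_period = 1.0 / frequency[np.argmax(power)]
    print(f"best period ~ {best_period:.2f} d")          # close to 3.7 d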

The DPCG hardware has the following elements:

  • Database nodes: these provide the storage space and input/output capabilities as well as the computing capabilities needed for the big-data-scale development, testing, and operational needs of DPCG. Eight nodes form a Postgres-XL parallel database cluster based on Postgres 9.5, in the development of which DPCG has been actively involved. To boost the analytical power of the group and to allow spectra processing, an expansion of the cluster to 12 nodes, together with increased storage capacity and new database indexing techniques, is planned for the next data-processing cycle.

  • Processing nodes: these provide the bulk of the CPU processing power for DPCG. The nodes use the Sun Grid Engine (SGE) batch system to launch pipeline runs that process sources in parallel on a high-performance-computing cluster. An Apache Ignite binding is in preparation as a possible replacement of SGE for Gaia DR3. DPCG deploys a computing cluster composed of three containers accounting for 850 cores (with hyper-threading). DPCG also provides a powerful ’interactive node’ that can be used to run the pipeline, R, or Python analytics.

  • Broker nodes: these provide middleware for message exchange between the various parts of the system. The broker nodes are currently based on a single-node Active MQ server.

  • Monitoring nodes: these provide web and application server(s), hosting for instance a visualisation tool, the continuous-integration tool, and centralised logging based on ElasticSearch.

  • Off-line backup system: this provides backup and recovery functionality of the Postgres database.

  • Data-exchange node: the front-end machine for data exchange, through the Gaia Transfer System, between DPCE and DPCG.

Gaia DR2

The Integrated Variability Pipeline (IVP) is built in a modular fashion, and selected parts of the variability analysis can be included or excluded through a configuration file. During normal operations, all ’scientific’ analyses are executed. The Gaia DR2 variability processing, sometimes referred to as Cycle 2 processing, has been performed with release 20.1.0 of the following modules:

  • VariDataExchange;

  • VariConfiguration;

  • VariObjectModel;

  • VariFramework;

  • VariStatistics;

  • VariCharacterisation;

  • VariClassification;

  • VariSpecificObjects.

From the DPCG perspective, Gaia DR2 processing has been staged in three phases: ’initial’ (processing Gaia DR1 data), ’intermediate’ (processing intermediate Cycle-2 photometry), and ’final’ (processing the final Cycle-2 photometry from CU5). During the intermediate and final phases, time series have been reconstructed for all 1 671 380 106 sources with photometry. This processing also involved applying barycentric corrections and cross-matching sources with literature catalogues. Several hundred runs have been executed in order to fine-tune the configurations of the algorithms listed above.
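
The barycentric-correction step can be illustrated with the short astropy sketch below; Gaia itself uses the spacecraft orbit around L2, so the ground-based observer location used here is only a simplified stand-in, and the snippet is not the DPCG implementation.

    # Illustrative barycentric time correction with astropy. Gaia uses the
    # spacecraft ephemeris; a ground-based site is used here as a stand-in.
    import astropy.units as u
    from astropy.coordinates import EarthLocation, SkyCoord
    from astropy.time import Time

    target = SkyCoord(ra=10.68 * u.deg, dec=41.27 * u.deg)      # example source
    site = EarthLocation(lat=28.76 * u.deg, lon=-17.88 * u.deg,
                         height=2400 * u.m)                      # stand-in observer
    times = Time([2457389.5, 2457390.5], format="jd", scale="utc",
                 location=site)                                   # observation epochs

    ltt = times.light_travel_time(target, kind="barycentric")     # light-travel time
    times_bary = times.tdb + ltt                                  # barycentric JD (TDB)
    print(times_bary.jd)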

During Cycle 2 processing, two branches have been used: one for sources having at least two observations (all sources with photometry) and, as in Cycle 1 and Gaia DR1, one for sources with at least 20 observations. DPCG exported preliminary results for both branches, encompassing some 3 million classified sources, but validation removed most of these as variables. After thorough validation of the final Cycle 2 results, roughly half a million sources were retained. During validation, the results of some packages were removed completely, while strict filtering was applied in (some) other cases.

DPCT

Background

The Data Processing Centre at Torino (DPCT) provides the infrastructure and operational support to the Astrometric Verification Unit (AVU) activities of CU3 and to the Italian participation in the Gaia data processing tasks. AVU is responsible for the development and maintenance of the following CU3 software products:

  • AVU/AIM: the Astrometric Instrument Model data analysis software system, in charge of processing the telemetry of the astrometric data in order to monitor and analyse the astrometric-instrument response with time;

  • AVU/BAM: the Basic Angle Monitoring software system, in charge of processing the BAM device telemetry in order to monitor and analyse the BAM behaviour with time;

  • GSR: the mathematical and numerical framework of the Global Sphere Reconstruction, in charge of verifying the global astrometric results produced by AGIS.

Daily processing

The daily pipelines of the AVU/AIM and AVU/BAM systems have run with versions 19.0, 20.0, and 21.0 (including a number of patch releases to fix specific issues found during operations).

The AVU/AIM daily pipeline has been running with the following modules: Ingestion, Raw Data Processing, Monitoring, Daily Calibration, Fine Selection, Report, and Monthly Diagnostics. The AVU/AIM processing strategy is based on time, with each AVU/AIM run defined over 24 hours of observed data. The AVU/AIM pipeline starts with selecting AstroObservations with window classes 0–2 (Table 1.1). The Raw Data Processing module processes AstroObservations of all window classes and estimates image parameters such as the centroid and flux. In Gaia DR2 processing, the AVU/AIM system used a PSF/LSF bootstrapping library including dedicated image-profile templates for each CCD, spectral-type bin, and window class. The Monitoring module is dedicated to extracting information on the instrument health, astrometric-instrument calibration parameters, image quality, and comparison between AVU/AIM and IDT outputs. The Daily Calibration module is devoted to the Gaia signal-profile reconstruction on a daily basis. Its workflow also includes diagnostics and validation functions. The calibration-related diagnostics include the image-moment variations over the focal plane. An automatic tool performs validation of the reconstructed image profiles before using them within the AVU/AIM chain. Depending on the scanning law and sky conditions, AVU/AIM treated between 2 and 11 million AstroObservations per day for Gaia DR2 processing. In runs with more than 5 million AstroObservations, a filter was activated to process, within each bin (defined on several instrument and observation parameters as well as time intervals), the minimum amount of data needed for results of adequate quality.

The AVU/BAM daily pipeline has been running with the following modules: Ingestion, Pre-Processing, Raw Data Processing, Monitoring, Weekly Analysis, Calibration, Extraction and Report. In the Raw Data Processing module, the following algorithms are run: Raw Data Processing, Gaiometro, Gaiometro2D, DFT, Chi Square, BAMBin, and comparison with the IDT BamElementary. The AVU/BAM system has two run strategies named IDT and H24. In the IDT strategy, used from commissioning to December 2015 (covering Gaia DR1), an AVU/BAM run is defined when a transfer containing the BAM data is received at DPCT. The processing is started automatically without any check on the input data. In the H24 strategy, an AVU/BAM run is defined based on 24 hours of data and the processing starts automatically when the data availability reaches a threshold (e.g., 98%). The AVU/BAM system has been processing with the H24 strategy since December 2015 to produce AVU/BAM analyses at regular intervals. It takes about one hour to execute all modules in the pipeline.

DPCT houses data stores which are used to provide data services to all AVU data processing and analysis activities. In particular, the DPCT database repository, implemented with Oracle technology, collects all data received as input to, and generated by, the DPCT pipelines. This configuration keeps the data online for additional analyses not implemented in the pipelines. At the end of Gaia DR2 processing, the size of the DPCT database was 280 TB. In order to ensure that the automatic data reception and ingestion processes are executed without data losses, DPCT has implemented and executed a set of procedures to guarantee the consistency of the data in the DPCT database. Data-consistency checks are executed on all DPCT data stores and at different times, e.g., before and after data are used in the data-reduction pipelines. The DPCT data-consistency checks are working as expected, i.e., the data-management pipelines are reliable.

The following list provides the main data types produced at DPCT and subsequently delivered to the MDB at DPCE during Gaia DR2 processing:

  • BamElementaryT;

  • Bav;

  • CalibratedBav.

The output and findings of AVU/AIM and AVU/BAM, provided in the daily and periodic reports, have been used to check the instrument health by performing cross-checks with other DPAC systems providing the same instrument measurements.

Cyclic processing

The GSR pipeline is composed of the following modules: Ingestion, System Coefficient Generation, Solver, Solution Analysis, De-Rotation and Comparison, and Extraction and Report. The Ingestion step reads the billions of AstroElementaries and matches them to sources to populate the GSR data store. The System Coefficient Generation module calculates the parameter coefficients of the system of linearised equations to be solved to produce the GSR solution. The Solver module implements the LSQR algorithm for solving this system of linearised equations and is the only module running on the Fermi supercomputer at CINECA. The Solver finds a GSR solution, while the Solution Analysis module checks the exit status of the solution algorithm and raises an alert in case of problems revealed by the stopping conditions implemented in the LSQR algorithm. The De-Rotation and Comparison module converts the GSR solution into a format compatible with that of AGIS. It also ’de-rotates’ the AGIS solution back into its internal reference frame to allow comparison with GSR. GSR results are collected in the final report.
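
The algorithmic core of the Solver module, LSQR applied to a large sparse least-squares system, can be illustrated on a toy problem with scipy, as in the sketch below; the real GSR system has billions of equations and a dedicated HPC implementation, so this only demonstrates the building block and the stopping-condition flag it returns.

    # Toy illustration of solving a sparse linearised least-squares system
    # with LSQR (scipy). Not the GSR implementation, which runs on an HPC
    # system with billions of equations.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import lsqr

    rng = np.random.default_rng(1)
    A = sparse_random(2000, 300, density=0.01, format="csr", random_state=1)
    x_true = rng.normal(size=300)                       # 'true' parameters
    b = A @ x_true + rng.normal(scale=1e-3, size=2000)  # noisy observations

    x_est, istop, itn = lsqr(A, b, atol=1e-10, btol=1e-10)[:3]
    # istop encodes which stopping condition terminated the iterations.
    print(istop, itn, np.linalg.norm(x_est - x_true))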

The GSR system successfully completed a demonstration run on simulated data with release 19.0.0. With the aim of performing, for the first time, a validation run of GSR on real data, GSR was upgraded to version 19.0.1. This upgrade added some features needed for the treatment of real data, e.g., the capability to process the commanded and the corrective attitude, and the corrections to the nominal value of the basic angle. At the time of writing (February 2018), the validation runs are ongoing, and reporting of the comparisons with AGIS is planned for Gaia DR3.

During Gaia DR2 processing, the AVU/BAM cyclic pipeline has been activated in order to refine the BAM Fourier analysis and the fringe-parameter estimations. The improved fringe-parameter estimation results from a fringe-cleaning algorithm which identifies and masks pixels affected by cosmic rays. The refined Fourier analysis, which is systematically made over 10- and 50-revolution time intervals, results from detrending the data with a third-order polynomial, after which the algorithm estimates the first 20 Fourier terms.
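
A minimal numerical sketch of this detrend-then-Fourier step is given below: a third-order polynomial trend is removed from a synthetic basic-angle time series, after which the first harmonics of the spin period are fitted by linear least squares. The synthetic signal, the 6-hour spin period, and the plain least-squares harmonic fit are illustrative assumptions, not the AVU/BAM implementation.

    # Sketch of the detrend-then-Fourier step: remove a third-order polynomial
    # trend, then fit the first Fourier terms of the spin period by linear
    # least squares. Synthetic data; not the AVU/BAM code.
    import numpy as np

    spin_period_h = 6.0                                          # assumed spin period [h]
    t = np.linspace(0.0, 10 * spin_period_h, 4000)               # ~10 revolutions [h]
    signal = (0.02 * t - 1e-4 * t**2                             # slow trend (synthetic)
              + 0.5 * np.cos(2 * np.pi * t / spin_period_h)      # first harmonic
              + 0.1 * np.sin(4 * np.pi * t / spin_period_h))     # second harmonic

    detrended = signal - np.polyval(np.polyfit(t, signal, 3), t) # remove cubic trend

    n_terms = 20                                                 # first 20 Fourier terms
    omega = 2 * np.pi / spin_period_h
    design = np.column_stack(
        [np.cos(k * omega * t) for k in range(1, n_terms + 1)] +
        [np.sin(k * omega * t) for k in range(1, n_terms + 1)])
    coeffs, *_ = np.linalg.lstsq(design, detrended, rcond=None)
    print(coeffs[0], coeffs[n_terms + 1])   # approx. 0.5 (cos, k=1) and 0.1 (sin, k=2)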