5.7.3 Computing arrangements
The alerts cluster is co-located with, but distinct from, DPCI. Our cluster is built around a large PostgreSQL relational database management system (RDBMS). A dedicated 32-core server holds the primary copies of the databases, and an identical machine holds a secondary, read-only copy of each database, kept current by synchronous replication. These replica databases were originally installed as protection against data loss, a role that has never been exercised; they now serve to increase performance by spreading the query load across two servers. Each database server has 176 TB of bulk-data space, arranged as a RAID 6 array. The alerts computers are interconnected, and connected to DPCI, by an InfiniBand network.
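The way the replica is used to spread query load can be sketched as follows; this is an illustrative Python fragment rather than the pipeline code itself, and the host names, credentials and choice of driver (psycopg2) are placeholder assumptions:

    # Illustrative sketch only: writes go to the primary server, read-only
    # queries to the synchronous replica.  Host names and DSN contents are
    # placeholders, not the real configuration.
    import psycopg2

    PRIMARY_DSN = "host=alerts-db-primary dbname=alerts user=pipeline"
    REPLICA_DSN = "host=alerts-db-replica dbname=alerts user=pipeline"

    def run_readonly(sql, params=()):
        """Run a SELECT on the read-only replica to spare the primary."""
        with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()

    def run_write(sql, params=()):
        """Run an INSERT/UPDATE on the primary; the replica is kept current
        automatically by PostgreSQL streaming replication."""
        with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
            cur.execute(sql, params)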
The alerts pipeline is divided into several applications: a batch application for data ingestion and primary data reduction; a batch application for filtering candidate alerts; a web application for human assessment of the candidates that survive automatic filtering; and further web applications for publishing selected alerts. The data-reduction application, which handles the greatest volume of data in each run, is assigned its own 64-core server. An identical server is held in readiness to take over should the first one fail. In practice, this reserve server is used for testing, for exploration of new algorithms, and occasionally for large-scale rearrangements of the main database. Each data-reduction server has 22 TB of data space to receive new data from DPCI. This holds roughly six months of data and allows reprocessing, to correct errors or improve results, over this span of the mission.
The other applications run in virtual machines hosted on a pair of smaller servers. There is also a disc-server for back-ups of the main databases, with a capacity of 284 TB.
The main data-reduction application was written in Java, for performance and to conform to DPAC standards. The web applications were written in Python, using the Django framework, for ease of development. The batch-filtering application was also written in Python.
The most notable feature of the computing design is that the bulk of the data is stored in the PostgreSQL database. Compare this with the more common arrangement in which only metadata are stored in the RDBMS, while the bulk of the data is stored in flat files. Ingesting both the full input data-set and the results of processing significantly reduces the amount of code needed simply to move data, and allows great flexibility in writing the applications that consume, analyse and display the results. However, it is challenging to obtain sufficient throughput from the relational architecture. Many design choices in the database and in the applications are forced by the need to reduce a day's data in six hours or less.
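As a hypothetical illustration of this flexibility (the table and column names below are invented, and the fragment is not taken from the pipeline), an application that consumes the results can combine raw transits and derived quantities in a single SQL join, because both reside in the same database:

    # Hypothetical example: raw transits and reduction results sit in the same
    # RDBMS, so a consuming application needs one join, not a file cross-match.
    # Table and column names are invented for illustration.
    import psycopg2

    QUERY = """
        SELECT t.source_id, t.obs_time, t.flux, r.alert_score
          FROM transits AS t
          JOIN reduction_results AS r USING (source_id)
         WHERE r.alert_score > %s
    """

    def candidates_above(threshold, dsn="host=alerts-db-replica dbname=alerts"):
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(QUERY, (threshold,))
            return cur.fetchall()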
As of mid-February 2021, the PostgreSQL system occupies 110 TB of its storage (with some unknown and variable fraction of this reclaimable for new data), almost all of which space is taken up in ingesting and recording the transit data. This is exceptionally large for a database that is extended daily and in which roughly 5% of the contents must be scanned for daily processing.
The data-set is far too large to fit comfortably into a single table for each record type, and is therefore partitioned by source position into level-5 ($N_{\rm side} = 32$) HEALPix pixels; there are 12,288 tables for each record type that deals with individual transits or sources. This partitioning is much coarser than the level-12 mesh used elsewhere by DPAC to group Gaia sources into catalogues. The level-5 mesh was chosen because it is the closest match to the width of the Gaia field of view. Empirically, it is more efficient to query the historic transit-data for a whole HEALPix pixel, selecting transits only for those sources that have received new data in the current run, than to make one query per source with new data. The level-5 mesh simultaneously minimises the number of these expensive queries and the number of rows touched by a query that relate to sources outside the current scan.
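The partitioning can be made concrete with a short sketch using the healpy package; the table-naming scheme and the choice of nested pixel ordering are our illustrative assumptions, not a description of the actual schema:

    # Level-5 HEALPix partitioning: N_side = 2**5 = 32, giving
    # 12 * 32**2 = 12,288 pixels, and hence 12,288 tables per record type.
    # The table-naming convention and nested ordering are assumptions.
    import healpy as hp

    LEVEL = 5
    NSIDE = 2 ** LEVEL              # 32
    NPIX = hp.nside2npix(NSIDE)     # 12288

    def pixel_for_source(ra_deg, dec_deg):
        """Level-5 HEALPix pixel index (0..12287) for a source position."""
        return hp.ang2pix(NSIDE, ra_deg, dec_deg, nest=True, lonlat=True)

    def transit_table_for_source(ra_deg, dec_deg):
        """Hypothetical name of the per-pixel transit table for this source."""
        return f"transit_{pixel_for_source(ra_deg, dec_deg):05d}"

    # Example: the partition holding transits for a source near the Galactic Centre.
    print(transit_table_for_source(266.4, -29.0))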
The queries to ingest and to raise data cannot be straightforward in such a highly stressed system; they have to be highly tuned for performance. This precludes the use of generic object-relational-mapping libraries to generate queries. Instead, the application code forms its queries directly. To achieve the required throughput, considerable parallelism is needed, and the number of threads accessing the database has to be chosen carefully. We knew from pre-mission tests that the application was not the bottleneck and could be given sufficient threads of execution to saturate the database engine. Later experiments showed that database performance, for this application, is limited by CPU power on the database server, not by I/O capacity; the complexity of the queries dominates. Therefore, the ideal arrangement is to have one application thread accessing the database per CPU core on the database server: 32 such threads in our system. The data-reduction servers have 64 CPU cores; to exploit these, we run two second-level threads per HEALPix pixel to consume the raised data from the memory of the data-reduction server (sketched below).

We feel that we have reached a practical limit in the size of database that can be hosted on a single RDBMS server. Any major expansion (beyond the routine addition of daily data for the remaining mission) or rearrangement of the database would drive us to a multi-server solution with partitioning across computers. This would be massively more expensive, and we are happy that we have managed to do the work with single servers. We note that while it is possible to tune the daily data-flow for acceptable performance, whole-database operations are uncomfortably slow. These include data-model changes that add columns; weekly back-ups of the database; and, especially, changes to the IDT working-catalogue that force us to update the transit-source mapping for the whole data-set. The latter operation requires down-time of at least one week. When using PostgreSQL at this scale, such routine operations become significant campaigns that require careful planning.
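The threading arrangement might be outlined as follows; this is an illustrative Python outline rather than the Java production code, and the function names and the way the raised rows are split between consumers are assumptions:

    # Illustrative outline of the thread arrangement: 32 database-access
    # threads (one per CPU core of the database server) raise historic data
    # for whole pixels, and two consumer threads per pixel reduce the raised
    # rows on the 64-core data-reduction server.  The function bodies and the
    # interleaved split of rows are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    DB_THREADS = 32            # one database-access thread per DB-server core
    CONSUMERS_PER_PIXEL = 2    # two second-level threads per HEALPix pixel

    def raise_pixel_data(pixel):
        """Placeholder for the tuned query fetching historic transits for a
        whole pixel, restricted to sources with new data in the current run."""
        return []

    def consume(pixel, rows):
        """Placeholder for the data reduction applied to part of a pixel."""
        pass

    def process_run(pixels_in_run):
        with ThreadPoolExecutor(max_workers=DB_THREADS) as raisers, \
             ThreadPoolExecutor(max_workers=DB_THREADS * CONSUMERS_PER_PIXEL) as consumers:
            pending = []
            for pixel in pixels_in_run:
                raised = raisers.submit(raise_pixel_data, pixel)
                # Hand the raised rows to two consumer threads for this pixel.
                for part in range(CONSUMERS_PER_PIXEL):
                    pending.append(consumers.submit(
                        lambda p=pixel, k=part, f=raised:
                            consume(p, f.result()[k::CONSUMERS_PER_PIXEL])))
            for job in pending:
                job.result()    # propagate any errors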