iDigBio API Software

Technical Details of iDigBio APIs

This document describes the software and server systems used to serve the iDigBio APIs. Please refer to the main iDigBio API document for end-user information about the APIs.


Software

The core of our search API service is Elasticsearch. This distributed document store is where we keep the current versions of all our specimen and media records. Both indexed fields and raw data are kept in this store. User queries expressed in the iDigBio query syntax are translated into Elasticsearch queries, and the JSON results from Elasticsearch are then formatted and passed through to the user.
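
As a rough illustration of that translation step, the sketch below maps a couple of simple iDigBio query terms onto an Elasticsearch bool query. This is a minimal sketch of the idea only: the real translator in the idigbio-search-api repository supports many more query types (geo, full-text, exists, and so on), and the field handling here is a simplified assumption.

    # Minimal sketch of the query-translation idea; not the production code.
    def translate(rq):
        """Translate a simple iDigBio record query (rq) into an
        Elasticsearch query body. Field handling is simplified."""
        must = []
        for field, value in rq.items():
            if isinstance(value, dict) and value.get("type") == "range":
                bounds = {k: v for k, v in value.items()
                          if k in ("gte", "lte", "gt", "lt")}
                must.append({"range": {field: bounds}})
            else:
                must.append({"term": {field: value}})
        return {"query": {"bool": {"must": must}}}

    print(translate({"genus": "acer", "minelevation": {"type": "range", "gte": "100"}}))
    # {'query': {'bool': {'must': [{'term': {'genus': 'acer'}},
    #                              {'range': {'minelevation': {'gte': '100'}}}]}}}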

The search API server is written in Node.js and is available on GitHub under the GPL3 license. Mapping requests, including image tile generation, are handled with the Mapnik library.
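
To give a sense of what the tile-generation step involves, here is a minimal sketch using Mapnik's Python bindings. The production service is the Node.js code linked above, and the style file and extent below are placeholders rather than iDigBio's actual configuration.

    # Render one 256x256 map tile with Mapnik. "points.xml" is a placeholder
    # style file defining the data source and symbolizers; the extent here is
    # a simple lon/lat box, not a real web-mercator tile boundary.
    import mapnik

    m = mapnik.Map(256, 256)
    mapnik.load_map(m, "points.xml")
    m.zoom_to_box(mapnik.Box2d(-180, -90, 180, 90))

    im = mapnik.Image(256, 256)
    mapnik.render(m, im)
    im.save("tile.png", "png")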

The download API is written in Python using Flask. It generates tasks that are sent to the Celery queue server, which loads the needed data directly from Elasticsearch and generates a Darwin Core archive on disk. This archive is then loaded into the Ceph object storage cluster, and a record of the download is inserted into a PostgreSQL database. API requests for download status check this database, and a trigger generates an email to the download requester if they supplied an address.
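
A heavily simplified sketch of that flow is shown below. Because the download code is not public, every name and helper in it is hypothetical; the stub functions stand in for the real Elasticsearch, Ceph, and PostgreSQL steps.

    # Hypothetical sketch of the download flow; all names are illustrative.
    from celery import Celery

    app = Celery("downloads", broker="redis://localhost:6379/0")  # placeholder broker

    def fetch_records(query):              # stands in for an Elasticsearch scan
        return [{"scientificname": "Acer rubrum"}]

    def write_dwca(records):               # stands in for Darwin Core archive writing
        return "/tmp/download.zip"

    def store_archive(path):               # stands in for the Ceph upload
        return "https://example.org/downloads/download.zip"

    def record_status(query, url, email):  # stands in for the PostgreSQL insert
        pass

    @app.task
    def generate_download(query, email=None):
        records = fetch_records(query)
        path = write_dwca(records)
        url = store_archive(path)
        record_status(query, url, email)   # the status API reads this row
        if email:                          # email trigger described above
            print(f"would email {email}: {url}")
        return url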

The record and media APIs are also written in Python using Flask. These APIs talk directly to the PostgreSQL server that stores all versions of all data iDigBio has, as well as the relational links between specimens and media. They look up individual records, fetch their attributes, format them as JSON, and pass them back to the users.
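
As an illustration of the shape of such a service, here is a minimal Flask sketch of a record-lookup endpoint. The route, table, and column names are invented for this example, since the actual record and media API code is not public.

    # Hypothetical record-lookup endpoint; schema and route are illustrative.
    import psycopg2
    from flask import Flask, abort, jsonify

    app = Flask(__name__)
    conn = psycopg2.connect("dbname=idigbio")  # placeholder connection string

    @app.route("/records/<uuid>")
    def view_record(uuid):
        with conn.cursor() as cur:
            cur.execute("SELECT data FROM records WHERE uuid = %s", (uuid,))
            row = cur.fetchone()
        if row is None:
            abort(404)
        return jsonify({"uuid": uuid, "data": row[0]})  # attributes as JSON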

The code for the download, record, and media APIs is currently not public.

Hardware

Elasticsearch, PostgreSQL, and Ceph each run on a dedicated hardware cluster engineered for the specific needs of that application.

API server apps are deployed as Docker containers to a pool of virtual machines that run the API and other services such as the Celery server. The virtual machines are run in a shared XenServer pool along with iDigBio's hosted VPS machines and other infrastructure. A Redis virtual machine provides a short-lived cache of recent API request results to buffer repeated API requests, often the result of page reloads in the portal interface.
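
The caching behavior is the standard cache-aside pattern; the sketch below shows the idea with an illustrative key scheme and TTL rather than iDigBio's actual settings.

    # Cache-aside sketch: repeated requests (e.g. portal page reloads) are
    # served from Redis; key format and TTL are assumptions for illustration.
    import hashlib
    import json

    import redis

    cache = redis.Redis(host="localhost", port=6379)
    TTL_SECONDS = 300  # short-lived cache

    def cached_search(query, run_query):
        key = "api:" + hashlib.sha1(
            json.dumps(query, sort_keys=True).encode()).hexdigest()
        hit = cache.get(key)
        if hit is not None:
            return json.loads(hit)          # repeat request: cache hit
        result = run_query(query)           # first request: run the real search
        cache.setex(key, TTL_SECONDS, json.dumps(result))
        return result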