Install & run
The easiest way to run this project is with Docker Compose:
git clone https://gitlab.com/panoramax/server/meta-catalog.git
cd meta-catalog/
docker compose -p meta-catalog up --build
Then, the meta-catalog is accessible at localhost:9000 (without any data for the moment).
You can also run each service without Docker, as explained in each service's documentation.
Database & migrations
Each service needs access to a PostGIS database.
With the Docker Compose approach, the database is provided and its schema is kept up to date; otherwise you need to provide it yourself.
Info
In all the following examples, the database will be:
- On localhost:5432
- Named as panoramax
- And accessed with username:password
Thus, the connection string will be postgresql://username:password@localhost:5432/panoramax.
When using docker compose, the database schema is updated automatically by the migrations service.
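As a minimal sketch, the connection string above can be assembled from its parts in a shell session. The variable names (DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME, DB_URL) are illustrative choices, not names required by the services, and the optional check assumes the psql client is installed and the PostGIS extension is enabled.

```shell
# Placeholder connection settings matching the examples in this page (adapt to your setup).
# These variable names are hypothetical helpers, not settings read by the services.
DB_USER="username"
DB_PASSWORD="password"
DB_HOST="localhost"
DB_PORT="5432"
DB_NAME="panoramax"

# Assemble the connection string used by the following examples
DB_URL="postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
echo "$DB_URL"

# Optional check, only attempted when psql is available:
# verifies that the database is reachable and that PostGIS is installed
if command -v psql >/dev/null 2>&1; then
    PGCONNECT_TIMEOUT=5 psql "$DB_URL" -c "SELECT postgis_version();" \
        || echo "Could not reach the database or PostGIS is missing"
fi
```

If the password contains special characters, it most likely needs to be percent-encoded before being placed in the URL.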
Loading data
The harvester directory contains the code needed to harvest the data from several instances into the meta-catalog.
You need to add instances to the federation with the add-instance command.
stac-harvester add-instance --db "postgresql://username:password@localhost:5432/panoramax" my_instance --instance-url https://my-instance.com/api
Tip
You can add a .env file with DB_URL=<cnx_string> to avoid passing the --db parameter every time.
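For instance, such a file could be created as follows. This is only a sketch: it assumes the .env file is read from the directory where stac-harvester is run, and it uses the example connection string from this page.

```shell
# Example connection string from this page (replace with your own)
DB_URL="postgresql://username:password@localhost:5432/panoramax"

# Write the .env file in the current directory
# (assumption: this is the directory from which stac-harvester is launched)
printf 'DB_URL=%s\n' "$DB_URL" > .env

# The file contains credentials, so restrict its permissions
chmod 600 .env
cat .env
```

With this file in place, the --db parameter should no longer be needed on the command line.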
Then you can harvest all instances with the harvest-all command:
stac-harvester harvest-all --db "postgresql://username:password@localhost:5432/panoramax"
By default, the harvester crawls all instances that have not been crawled in the last 5 minutes.
For production, this command should be run by a cron job (every minute seems a good default).
You can also crawl a specific instance with:
stac-harvester harvest --db "postgresql://username:password@localhost:5432/panoramax" <instance_name>
This is useful to force a sync, or to do a full harvest (crawl all the data, not only the new data, see the harvester documentation):
stac-harvester harvest --db "postgresql://username:password@localhost:5432/panoramax" --full-harvest <instance_name>
It's easy: set the INSTANCES_FILE environment variable to the path of the CSV file, then run the additional Docker Compose file, which contains the harvester and a scheduler (like a cron):
INSTANCES_FILE=<CSV_FILE> docker compose -p meta-catalog -f docker-compose.yml -f docker-compose-harvester.yml up -d
This will crawl the requested instances every 5 minutes.
You can also just run the Python code directly:
cd harvester
python -m pip install --upgrade virtualenv
virtualenv .venv
source .venv/bin/activate
pip install -e .
stac-harvester harvest-all --db "postgresql://username:password@localhost:5432/panoramax"
All CLI parameters can be listed with stac-harvester --help.
If you want to run a systemd service to crawl automatically, you can check the detailed documentation section.
If you want to try our codebase without crawling all public instances, you can also reuse our public data exports.
Harvest type
By default the harvest is incremental, meaning that only the data that has changed since the last harvest is crawled (using a filter on the collections endpoint like ?filter=updated > {last_harvest}).
Sometimes it can be useful to crawl all the data again, for example after a major update of the instance. This can be done with the --full-harvest option (or with the INCREMENTAL_HARVEST=false environment variable).
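To make the incremental filter more concrete, here is a rough sketch of how such a request URL could be assembled by hand. It is not the harvester's actual code: the /collections path under the instance URL, the timestamp value and its format are assumptions made for illustration, and only the two characters that need escaping in this particular filter are percent-encoded.

```shell
# Instance URL reused from the add-instance example
INSTANCE_URL="https://my-instance.com/api"
# Hypothetical timestamp of the previous harvest (the real format may differ)
LAST_HARVEST="2024-01-01T00:00:00Z"

# Filter following the pattern described above: updated > {last_harvest}
FILTER="updated > ${LAST_HARVEST}"

# Percent-encode the spaces and the '>' sign so the filter is valid in a query string
ENCODED_FILTER="$(printf '%s' "$FILTER" | sed -e 's/ /%20/g' -e 's/>/%3E/g')"

# Assumption: the collections endpoint lives at <instance-url>/collections
COLLECTIONS_URL="${INSTANCE_URL}/collections?filter=${ENCODED_FILTER}"
echo "$COLLECTIONS_URL"
```

A full harvest would simply omit the filter and crawl every collection again.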
Instance configuration
The instance configuration (from the instance's /configuration endpoint) is synchronized every day in the database.
It can also be updated manually by running:
cd harvester
python -m pip install --upgrade virtualenv
virtualenv .venv
source .venv/bin/activate
pip install -e .
stac-harvester sync-configuration --db "postgresql://username:password@localhost:5432/panoramax" --instance instance_name --instance another_instance_name
Deleting an instance
Deleting an instance needs to be done in the database for the moment:
This will cascade delete all the collections, items, providers (users) and harvests linked to it.
Install standalone API
The API is written in Rust.
The API is started as the api service in the docker compose file.
Info
You need Rust to build the API. The best option is to follow the official documentation, but if you want a quick way to do this:
Note
You should have an up-to-date Rust version, at least 1.81.0.
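The version requirement can be checked before building. The following is a small sketch, assuming a sort implementation that supports version sorting (-V, as in GNU coreutils); the version_ge helper and the variable names are hypothetical and not part of the project.

```shell
# Minimum Rust version stated above
REQUIRED_RUST="1.81.0"

# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), which is an assumption about the local sort tool.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v rustc >/dev/null 2>&1; then
    # `rustc --version` prints something like "rustc 1.81.0 (hash date)"
    INSTALLED_RUST="$(rustc --version | awk '{print $2}')"
    if version_ge "$INSTALLED_RUST" "$REQUIRED_RUST"; then
        echo "rustc $INSTALLED_RUST is recent enough"
    else
        echo "rustc $INSTALLED_RUST is older than $REQUIRED_RUST, consider running: rustup update"
    fi
else
    echo "rustc not found, install Rust first"
fi
```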
To build and run the API:
If you want to run the API as a systemd service, you can use and adapt the example configuration file.
Data export
If you want to publish exports of the data, you can use simple cron jobs to do so.
pg_dump
The easiest backup is done using the pg_dump binary, exporting the public and stats schemas.
The dump uses zstd compression at level 5, which seems a good balance between speed and compression.
Write the following script in a file, for example /usr/local/bin/panoramax-dump.sh:
#!/bin/bash
# Stop at the first error so a failed dump never replaces the previous one
set -e -u -o pipefail
# Script to dump the panoramax database
BACKUP_DIR="/data/pano-dump/pg_dump"
DATE=$(date +%Y%m%dT%H%M%S)
FILENAME="$BACKUP_DIR/panoramax-backup-$DATE.dump"
echo "Backing up Panoramax to $FILENAME"
time pg_dump --clean --if-exists --format c --dbname panoramax --no-owner --no-privileges --no-comments --schema 'public' --schema 'stats' --compress zstd:5 --file "$FILENAME"
# Replace the base file
mv "$FILENAME" "$BACKUP_DIR/panoramax.dump"
echo "Panoramax backup completed"
Then add a cron entry like this to run it every week, at midnight on Sunday:
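A minimal sketch of such an entry, assuming the script was saved at the path suggested above and made executable (chmod +x). The snippet only prints the crontab line; the install command is left as a comment because it modifies the current user's crontab.

```shell
# Crontab fields: minute hour day-of-month month day-of-week command
# "0 0 * * 0" means 00:00 every Sunday
CRON_LINE="0 0 * * 0 /usr/local/bin/panoramax-dump.sh"
echo "$CRON_LINE"

# To append it to the current user's crontab (keeps existing entries):
# (crontab -l 2>/dev/null; echo "$CRON_LINE") | crontab -
```

The user running this job needs to be able to connect to the panoramax database and to write in the backup directory.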
Backup restoration
To restore the backup, you can check the data section.
Parquet export
Apache Parquet is a powerful column-oriented data format, built as a modern alternative to CSV files. GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.
The federated catalog can export its data following the STAC GeoParquet specification (see the data section for more details).
The GeoParquet can be generated by copying the geoparquet_export view.
Prerequisites
In order to do this, you need the PG Parquet extension installed on your database.
Write the following script in a file, for example /usr/local/bin/panoramax-parquet-export.sh:
#!/bin/bash
set -e -u -o pipefail
# Script to export the panoramax database as GeoParquet
BACKUP_DIR="/data/pano-dump/geoparquet"
DATE=$(date +%Y%m%dT%H%M%S)
FILENAME="$BACKUP_DIR/panoramax-$DATE.parquet"
echo "Exporting Panoramax to $FILENAME"
time psql -d panoramax -c "\COPY (select * from geoparquet_export) TO '$FILENAME' WITH (FORMAT 'parquet')"
# Replace the base file
mv "$FILENAME" "$BACKUP_DIR/panoramax.parquet"
echo "Panoramax export completed"
Then add a cron entry like this to run it every week, at 2am on Sunday (as the postgres user if possible):
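One possible way to do this, sketched under the assumption that your system reads files from /etc/cron.d (where each line carries an extra user field) and that the script was saved at the path suggested above. The file name panoramax-parquet-export is a hypothetical choice. The snippet only prints the line; the install command is left as a comment because it needs root privileges.

```shell
# /etc/cron.d fields: minute hour day-of-month month day-of-week user command
# "0 2 * * 0" means 02:00 every Sunday, run as the postgres user
CRON_D_LINE="0 2 * * 0 postgres /usr/local/bin/panoramax-parquet-export.sh"
echo "$CRON_D_LINE"

# To install it system-wide (hypothetical file name, requires root):
# echo "$CRON_D_LINE" | sudo tee /etc/cron.d/panoramax-parquet-export
```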