Install & run
The easiest way to run this project is by using Docker compose:
git clone https://gitlab.com/panoramax/server/meta-catalog.git
cd meta-catalog/
docker compose -p meta-catalog up --build
Then, the meta-catalog is accessible at http://localhost:9000 (without any data at first).
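To quickly check that it responds, you can query it from the command line. This is only a sketch: the exact path of the STAC landing page is an assumption here (based on the /api suffix used in the public URLs later on this page), so adapt it if needed.
curl http://localhost:9000/api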
You can also run each service without Docker, as explained in each service's documentation.
Database & migrations
Each service needs access to a PostGIS database.
If you use the Docker Compose approach, the database is provided and its schema kept up to date; otherwise you need to provide one yourself.
Note: in all the following examples, the database is on localhost:5432, is named panoramax and is accessed with username:password (thus the connection string is postgresql://username:password@localhost:5432/panoramax).
When using docker compose, the database schema is updated automatically by the migrations service.
Loading data
The harvester directory contains the code needed to harvest the data from several instances into the meta-catalog.
The harvester depends on a simple CSV file describing the instances to crawl.
The CSV should have the following columns:
id: identifier of the instance
api_url: root URL of the STAC catalog. Note: for GeoVisio instances, don't forget the trailing /api
You can check the config file of the French public Panoramax instances for an example.
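As an illustration, such a file could look like the following sketch (a header row is assumed, and the instance identifiers and URLs below are made up):
id,api_url
example-instance,https://panoramax.example.org/api
another-instance,https://stac.example.com/api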
It's easy: provide the path to the CSV file through the INSTANCES_FILE environment variable and run the additional Docker Compose file containing the harvester and a scheduler (acting like a cron):
INSTANCES_FILE=<CSV_FILE> docker compose -p meta-catalog -f docker-compose.yml -f docker-compose-harvester.yml up -d
This will crawl the requested instances every 5 minutes.
You can also run the Python code directly:
cd harvester
python -m pip install --upgrade virtualenv
virtualenv .venv
source .venv/bin/activate
pip install -e .
stac-harvester harvest --db "postgresql://username:password@localhost:5432/panoramax" --instances-files ../panoramax-fr-instances.csv --target-hostname=https://panoramax.fr/api --collections-limit=5
This will import 5 collections from the panoramax.ign.fr instance into your database.
Note: the --target-hostname parameter is important since all STAC links from the crawled API will be rewritten to use this hostname. It should be the public hostname of your meta-catalog API. Your meta-catalog can still be served behind several URLs or behind a proxy, since the API respects the provided Host header for all its STAC links.
You can also give the parameters directly, without a configuration file:
stac-harvester harvest --db "postgresql://username:password@localhost:5432/panoramax" --instance-name ign --instance-url https://panoramax.ign.fr/api --target-hostname=https://panoramax.fr/api
All CLI parameters can be listed with stac-harvester --help.
If you want to run a systemd service to crawl automatically, you can check the detailed documentation section.
Harvest type
By default the harvest is incremental, meaning that only the data that has changed since the last harvest is crawled (using a filter on the collections endpoint like ?filter=updated > {last_harvest}).
Sometimes it can be useful to crawl all the data again, for example when there is a major update on the instance. This can be done with the --full-harvest option (or with the INCREMENTAL_HARVEST=false environment variable).
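For example, reusing the parameters from the command shown earlier, a full harvest could be triggered like this (a sketch based on the options documented above):
stac-harvester harvest --db "postgresql://username:password@localhost:5432/panoramax" --instance-name ign --instance-url https://panoramax.ign.fr/api --target-hostname=https://panoramax.fr/api --full-harvest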
Instance configuration
The instance configuration (from the instance's /configuration endpoint) is synchronized into the database every day.
It can also be updated manually by running:
cd harvester
python -m pip install --upgrade virtualenv
virtualenv .venv
source .venv/bin/activate
pip install -e .
stac-harvester sync-configuration --db "postgresql://username:password@localhost:5432/panoramax" --instance instance_name --instance another_instance_name
Install standalone API
The API is written in Rust.
The API is started as the api service in the docker compose file.
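If you only want to start that containerized API from the Compose project, you can target the service directly; this is a sketch that assumes the service is simply named api, as mentioned above:
docker compose -p meta-catalog up --build api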
Info: you need Rust to build the API. The best way is to follow the official documentation, but if you want a quick way to do it:
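A common quick way is the official rustup installer one-liner (shown here as a sketch; review the script before piping it to your shell):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh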
Note: you should have an up-to-date Rust version, at least 1.81.0.
To build and run the API:
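A sketch of what this usually looks like with cargo; the api/ directory name and the DB_URL environment variable are assumptions, so check the API's own documentation for the exact configuration:
cd api
cargo build --release
# Hypothetical: the connection string variable name may differ in the actual API
DB_URL="postgresql://username:password@localhost:5432/panoramax" cargo run --release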
If you want to run the API as a systemd service, you can use and adapt the example configuration file.
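For reference, a minimal unit file might look like the sketch below; the binary path, user and environment variable name are hypothetical, so take the real values from the example configuration file mentioned above:
[Unit]
Description=Panoramax meta-catalog API
After=network.target postgresql.service

[Service]
# Hypothetical paths and variable names, adapt them to your setup
Environment=DB_URL=postgresql://username:password@localhost:5432/panoramax
ExecStart=/usr/local/bin/meta-catalog-api
User=panoramax
Restart=on-failure

[Install]
WantedBy=multi-user.target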
Data export
If you want to publish exports of the data, you can use simple cron jobs to do so.
pg_dump
The easiest backup is done using the pg_dump binary, exporting the public and stats schemas.
It uses zstd compression at level 5, since it seems a good balance between speed and compression ratio.
Write the following script in a file, for example /usr/local/bin/panoramax-dump.sh:
#!/bin/bash
# Script to dump the panoramax database
BACKUP_DIR="/data/pano-dump/pg_dump"
DATE=$(date +\%Y\%m\%dT\%H\%M\%S)
FILENAME="$BACKUP_DIR/panoramax-backup-$DATE.dump"
echo "Backuping Panoramax to $FILENAME"
time pg_dump --clean --if-exists --format c --dbname panoramax --no-owner --no-privileges --no-comments --schema 'public' --schema 'stats' --compress zstd:5 --file $FILENAME
# Replace the base file
mv $FILENAME $BACKUP_DIR/panoramax.dump
echo "Panoramax backup completed"
and add a cron entry to run it every week, at midnight on Sunday:
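Assuming the script is saved at /usr/local/bin/panoramax-dump.sh and made executable, such a crontab entry (crontab -e) could look like this sketch:
0 0 * * 0 /usr/local/bin/panoramax-dump.sh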
Backup restoration
To restore the backup, you can check the data section.
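For reference, restoring such a dump with pg_restore could look like the following sketch (it assumes the target database already exists and uses the dump path from the script above; the data section has the authoritative instructions):
pg_restore --clean --if-exists --no-owner --dbname panoramax /data/pano-dump/pg_dump/panoramax.dump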
Parquet export
Apache Parquet is a powerful column-oriented data format, built as a modern alternative to CSV files. GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.
The federated catalog can export its data following the STAC GeoParquet specification (see the data section for more details).
The GeoParquet file can be generated by copying the geoparquet_export view.
Prerequisites
In order to do this, you need the PG Parquet extension installed on your database.
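Once the extension is installed on the server, it still needs to be enabled in the database. This is a sketch assuming the extension is named pg_parquet; check its documentation for the exact installation steps:
psql -d panoramax -c "CREATE EXTENSION IF NOT EXISTS pg_parquet;"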
Write the following script in a file, for example /usr/local/bin/panoramax-parquet-export.sh:
#!/bin/bash
set -e -u -o pipefail
# Script to export the panoramax database to GeoParquet
BACKUP_DIR="/data/pano-dump/geoparquet"
DATE=$(date +\%Y\%m\%dT\%H\%M\%S)
FILENAME="$BACKUP_DIR/panoramax-$DATE.parquet"
echo "Exporting Panoramax to $FILENAME"
time psql -d panoramax -c "\COPY (select * from geoparquet_export) TO '$FILENAME' WITH (FORMAT 'parquet')";
# Replace the base file
mv "$FILENAME" "$BACKUP_DIR/panoramax.parquet"
echo "Panoramax export completed"
and add a cron entry to run it every week, at 2am on Sunday (as the postgres user if possible):
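A sketch of such an entry in /etc/cron.d, assuming the script path used above (the sixth field is the system user running the job, here postgres as suggested):
0 2 * * 0 postgres /usr/local/bin/panoramax-parquet-export.sh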