Skip to content

Using the Parquet file with duckdb

DuckDB is a great tool to query data in various formats using SQL. We can leverage it to only read parts of the huge Parquet file, in order to extract only the data we need.

You'll also need the spatial extension. It is installed with:

INSTALL SPATIAL;
LOAD SPATIAL;

Filters

There are many ways to use the Parquet file, we'll show here some examples of how to handle the data.

The Parquet file contains almost all Panoramax data (extracted at the end of each week), so the resulting file is huge, the first step is often to filter it to only get the data you need.

Extract all the data in a bounding box

If we want to extract all the data in Paris' bounding box, we can use the following query:

COPY (
    SELECT
        *
    FROM
        'https://api.panoramax.xyz/data/geoparquet/panoramax.parquet'
    WHERE
        bbox.xmin > 2.2241
        and bbox.ymin > 48.8155
        and bbox.xmax < 2.4533
        and bbox.ymax < 48.9023
)
TO 'paris.parquet'
WITH
    (FORMAT 'parquet');

This will create a new paris.parquet file, in the current directory.

The export should be quite fast

The export should be quite fast (for a query on a 20G file) because Parquet chunks are made with pictures near each others, so the query can skip a lot of chunks.

Extract data on a precise boundary

In the previous example, we extract data on a rough bounding box, but we can have an export on a precise geometry to only get the data on a city/region/country.

This example is inspired by an Overture example

-- We use overturemaps divisions datasets to get the ID of the region/city/country we want to extract
SET variable division_id = (
    SELECT
        id
    FROM
        read_parquet('s3://overturemaps-us-west-2/release/2025-10-22.0/theme=divisions/type=division/*.parquet')
WHERE
        names.primary = 'Paris' and country = 'FR' and capital_of_divisions is not null
LIMIT 1
);

-- Fetch the bounds and geometry of the Region, and store it in a table
CREATE OR REPLACE TABLE bounds AS (
    SELECT
        id AS division_id, names.primary, geometry, bbox
    FROM
        read_parquet('s3://overturemaps-us-west-2/release/2025-10-22.0/theme=divisions/type=division_area/*.parquet')
    WHERE
        division_id = getvariable('division_id')
);

-- Extract the bounds and geometry of the division into variables for faster table scan
SET variable xmin = (select bbox.xmin FROM bounds);
SET variable ymin = (select bbox.ymin FROM bounds);
SET variable xmax = (select bbox.xmax FROM bounds);
SET variable ymax = (select bbox.ymax FROM bounds);
SET variable boundary = (select geometry FROM bounds);

-- Create a local GeoParquet file with exactly the data contained in the geometry of the chosen division
COPY(
    SELECT
        *
    FROM
        'https://api.panoramax.xyz/data/geoparquet/panoramax.parquet'
    WHERE
        bbox.xmin > getvariable('xmin')
        AND bbox.xmax < getvariable('xmax')
        AND bbox.ymin > getvariable('ymin')
        AND bbox.ymax < getvariable('ymax')
        AND ST_INTERSECTS(
            getvariable('boundary'),
            geometry
        )
) TO 'paris.parquet';

Filter based on the semantic

Filtering of the semantic is a bit more complex because the semantic field is an array.

Here is a query to filter the items that have a traffic sign tag (be it on the picture, or on an annotation (a specific part of the picture)):

SELECT
    *
FROM
    'https://api.panoramax.xyz/data/geoparquet/panoramax.parquet'
WHERE
    length(list_filter(semantics, s -> s.key = 'osm|traffic_sign')) > 0
LIMIT 10;

We can also do some statistics on the used tags with a query like:

WITH all_semantics AS (
  SELECT semantics FROM 'https://api.panoramax.xyz/data/geoparquet/panoramax.parquet'
)
, sem_tags AS (
  -- We unnest the semantics, to get one row for each semantic tag
  SELECT tags
  FROM all_semantics
  , UNNEST(semantics) AS semantic_struct(tags)
  WHERE tags.key NOT LIKE '%[%]%' -- we do not want the qualifiers here
)
SELECT
    tags.key, tags.value, count(*) AS nb_tags
FROM sem_tags
GROUP BY tags.key, tags.value;

Filter based on the users

We can get all the data of a given user with the query:

COPY(
    SELECT
        *
    FROM
        'https://api.panoramax.xyz/data/geoparquet/panoramax.parquet'
    WHERE
        providers[1].name = 'PanierAvide' -- The pictures can have several providers, and the first one is the Panoramax account
) TO 'panieravide.parquet' WITH (format 'parquet');

Conversion

GeoParquet to JSON

We can also convert the generated Parquet file (it's better to do it on smaller Parquet files, on the whole file, the resulting file will be huge) to a JSON file.

COPY(
    SELECT
        -- we can select only the fields we want, or use * to select all
        id,
        geometry,
        datetime, -- time of the picture
        assets[1].href, -- url of the picture
        semantics
    FROM 'extract.parquet'
) TO 'extract.json' WITH (FORMAT JSON);

GeoParquet to CSV

We can also convert easily to get a CSV file.

COPY(
    SELECT
        id,
        geometry,
        semantics
    FROM 'extract.parquet'
) TO 'extract.csv' WITH (FORMAT CSV);