Data structure & storage

File organisation

All Bilby Quant Data is stored as Parquet files organised in a hierarchical directory structure by year and month. This structure is mirrored in the Portal interface for easy navigation and download.

Critical: Date partitioning by insertion date

⚠️ IMPORTANT

All files are partitioned by the inserted_at timestamp, not the published_at timestamp.

This is the date on which a document entered our processing pipeline, which may differ significantly from when it was originally published. This design choice has important implications:

Documents are organised by when they were ingested into the system, not when they were published by their original sources. A document published in 2019 but scraped by Bilby in November 2024 will appear in a file dated November 2024, not 2019.

This approach ensures point-in-time reproducibility: The state of the dataset on any given date can be reconstructed precisely, because new data is always appended to new files rather than modifying existing ones. Any documents scraped tomorrow will be added as new daily files, preserving the integrity of previously published files that clients may have already downloaded.

However, this choice can lead to unusually large files if Bilby performs significant backfilling during a particular date range. When historical documents are batch-processed and inserted on a single day, all those documents will appear in that day's file, potentially necessitating a split into multiple parts (indicated by the _p01, _p02 suffixes). A file dated 2024-11-15 might contain thousands of documents published across many years, all inserted into the pipeline on that single day.

To filter documents by their actual publication date — which is often what you want for analysis — use the published_at field within the data itself.
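
For example, a minimal sketch of filtering on publication date, assuming pandas with a Parquet engine (such as pyarrow) installed, the file name shown below existing in your local copy, and published_at stored as a timestamp column:

import pandas as pd

df = pd.read_parquet("2024/11/daily_2024-11-15.parquet")

# The file is partitioned by inserted_at, so it can mix publication years;
# restrict to documents actually published in 2019.
published_2019 = df[df["published_at"].between("2019-01-01", "2019-12-31")]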

Directory structure

root/
├── 2021/
│   ├── combined_yearly_2021.parquet
│   ├── 01/
│   │   ├── combined_monthly_2021-01.parquet
│   │   ├── daily_2021-01-01.parquet
│   │   ├── daily_2021-01-02.parquet
│   │   └── ...
│   ├── 02/
│   └── ...
├── 2022/
├── 2023/
├── 2024/
└── 2025/
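
The layout above maps directly to file paths. As an illustration (a sketch only; the root directory name is hypothetical and depends on where you mirror the Portal files):

from datetime import date

root = "bilby_quant_data"            # hypothetical local root
d = date(2024, 10, 12)

daily_path   = f"{root}/{d:%Y}/{d:%m}/daily_{d:%Y-%m-%d}.parquet"
monthly_path = f"{root}/{d:%Y}/{d:%m}/combined_monthly_{d:%Y-%m}.parquet"
yearly_path  = f"{root}/{d:%Y}/combined_yearly_{d:%Y}.parquet"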

Year folders

Each year folder (2021–2025) contains:

  • Annual file: A single parquet file containing all documents inserted during that year
    • Example: combined_yearly_2024.parquet
    • If the data volume is large: combined_yearly_2024_p0X.parquet (where X = 1, 2, 3...)
  • Month subdirectories: Twelve folders (01–12), one for each month

Month folders

Each month folder contains:

  • Monthly file: A single parquet file containing all documents inserted during that month
    • Example: combined_monthly_2024-10.parquet
    • If the data volume is large: combined_monthly_2024-10_p0X.parquet (where X = 1, 2, 3...)
  • Daily files: Individual parquet files for each day of the month
    • Example: daily_2024-10-12.parquet
    • If the data volume is large: daily_2024-10-12_p0X.parquet (where X = 1, 2, 3...)
    • For the current month, only files up to today's date are available

File naming conventions

All files follow consistent naming patterns:

  • Daily files: daily_YYYY-MM-DD.parquet or daily_YYYY-MM-DD_p0X.parquet
  • Monthly files: combined_monthly_YYYY-MM.parquet or combined_monthly_YYYY-MM_p0X.parquet
  • Annual files: combined_yearly_YYYY.parquet or combined_yearly_YYYY_p0X.parquet

The _p01, _p02, _p03, etc. suffix indicates that the file has been split into multiple parts due to the volume of data inserted on that date. This typically occurs when Bilby processes a large batch of historical documents on a single day.
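
When working with a date that may have been split, it is simplest to read every matching file together. A minimal sketch, assuming a local copy laid out as in the directory tree above and pandas with a Parquet engine installed; the glob pattern matches both the single-file and the split (_p01, _p02, ...) cases:

import glob
import pandas as pd

# Collect every part belonging to one day and combine them.
parts = sorted(glob.glob("2024/11/daily_2024-11-15*.parquet"))
day = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)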

Temporal granularity

The hierarchical structure allows you to download data at three temporal granularities to suit your workflow:

  • Daily files: Ideal for incremental updates. Download only new days to keep your local copy current (see the sketch after this list).
  • Monthly files: Useful for moderate historical analysis or monthly backtesting workflows.
  • Annual files: Best for comprehensive historical analysis or initial dataset downloads.
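
An incremental-update sketch using daily files, assuming a local mirror under root arranged exactly as in the directory tree above; root and cutoff are illustrative values, and the lexicographic date comparison works because the file names are zero-padded:

from pathlib import Path
import pandas as pd

root = "bilby_quant_data"     # hypothetical local root
cutoff = "2024-10-31"         # last day already present in your local store

daily_files = sorted(Path(root).glob("*/*/daily_*.parquet"))
new_files = [
    p for p in daily_files
    if p.name.removeprefix("daily_")[:10] > cutoff
]

if new_files:
    updates = pd.concat(map(pd.read_parquet, new_files), ignore_index=True)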

File format

All files are in Apache Parquet format, a columnar storage format that provides:

  • Efficient compression
  • Fast read performance for analytical queries
  • Wide compatibility with data analysis tools (Python pandas, R, Spark, etc.)

All file granularities (daily, monthly, and annual) share the same schema, so you can safely concatenate files from different time periods without worrying about structural differences.
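
For instance, a minimal sketch of combining files of different granularities and periods, assuming pandas with a Parquet engine installed and using file names shown earlier in this section:

import pandas as pd

# Daily, monthly, and annual files share one schema, so they concatenate cleanly.
frames = [
    pd.read_parquet("2023/combined_yearly_2023.parquet"),
    pd.read_parquet("2024/10/combined_monthly_2024-10.parquet"),
]
combined = pd.concat(frames, ignore_index=True)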