Data structure & storage
File organisation
All Bilby Quant Data is stored as Parquet files organised in a hierarchical directory structure by year and month. This structure is mirrored in the Portal interface for easy navigation and download.
Critical: Date partitioning by insertion date
⚠️ IMPORTANT
All files are partitioned by the inserted_at timestamp, not the published_at timestamp.
The inserted_at timestamp records the date on which a document entered our processing pipeline, which may differ significantly from when it was originally published. This design choice has important implications:
Documents are organised by when they were ingested into the system, not when they were published by their original sources. A document published in 2019 but scraped by Bilby in November 2024 will appear in a file dated November 2024, not 2019.
This approach ensures point-in-time reproducibility: the state of the dataset on any given date can be reconstructed precisely, because new data is always appended as new files rather than modifying existing ones. Any documents scraped tomorrow will be added as new daily files, preserving the integrity of previously published files that clients may have already downloaded.
However, this choice can lead to unusually large files if Bilby performs
significant backfilling during a particular date range. When historical
documents are batch-processed and inserted on a single day, all those
documents will appear in that day's file, potentially necessitating a split
into multiple parts (indicated by the _p01, _p02 suffixes). A file dated
2024-11-15 might contain thousands of documents published across many years,
all inserted into the pipeline on that single day.
To filter documents by their actual publication date — which is often what you
want for analysis — use the published_at field within the data itself.
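As a minimal sketch of this filtering in pandas (the file name is illustrative; the published_at and inserted_at columns are the ones described above):

```python
import pandas as pd

# Toy frame standing in for one daily file; in practice you would load it with
# df = pd.read_parquet("daily_2024-11-15.parquet")
df = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "published_at": pd.to_datetime(["2019-06-01", "2024-11-10", "2019-12-31"]),
    "inserted_at": pd.to_datetime(["2024-11-15"] * 3),
})

# Filter by actual publication date, not insertion date
docs_2019 = df[df["published_at"].dt.year == 2019]
print(len(docs_2019))  # 2 of the 3 documents were published in 2019
```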
Directory structure
root/
├── 2021/
│ ├── combined_yearly_2021.parquet
│ ├── 01/
│ │ ├── combined_monthly_2021-01.parquet
│ │ ├── daily_2021-01-01.parquet
│ │ ├── daily_2021-01-02.parquet
│ │ └── ...
│ ├── 02/
│ └── ...
├── 2022/
├── 2023/
├── 2024/
└── 2025/
Year folders
Each year folder (2021–2025) contains:
- Annual file: a single parquet file containing all documents inserted
  during that year
  - Example: combined_yearly_2024.parquet
  - If the data volume is large: combined_yearly_2024_p0X.parquet (where X = 1, 2, 3...)
- Month subdirectories: twelve folders (01–12), one for each month
Month folders
Each month folder contains:
- Monthly file: a single parquet file containing all documents inserted
  during that month
  - Example: combined_monthly_2024-10.parquet
  - If the data volume is large: combined_monthly_2024-10_p0X.parquet (where X = 1, 2, 3...)
- Daily files: individual parquet files for each day of the month
  - Example: daily_2024-10-12.parquet
  - If the data volume is large: daily_2024-10-12_p0X.parquet (where X = 1, 2, 3...)
  - For the current month, only files up to today's date are available
File naming conventions
All files follow consistent naming patterns:
- Daily files: daily_YYYY-MM-DD.parquet or daily_YYYY-MM-DD_p0X.parquet
- Monthly files: combined_monthly_YYYY-MM.parquet or combined_monthly_YYYY-MM_p0X.parquet
- Annual files: combined_yearly_YYYY.parquet or combined_yearly_YYYY_p0X.parquet
The _p01, _p02, _p03, etc. suffix indicates that the file has been split
into multiple parts due to the volume of data inserted on that date. This
typically occurs when Bilby processes a large batch of historical documents on a
single day.
Temporal granularity
The hierarchical structure allows you to download data at three different temporal granularities, to suit your workflow:
- Daily files: Ideal for incremental updates. Download only new days to keep your local copy current.
- Monthly files: Useful for moderate historical analysis or monthly backtesting workflows.
- Annual files: Best for comprehensive historical analysis or initial dataset downloads.
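For the incremental daily workflow, the expected file paths for a date range can be generated directly from the naming conventions above (this helper is our own illustration; split files would add the _p0X suffix to each name):

```python
from datetime import date, timedelta

def daily_file_names(start: date, end: date):
    """Yield the expected path, relative to the dataset root, of each daily
    file in [start, end], following the year/month/daily_YYYY-MM-DD.parquet
    layout described above."""
    d = start
    while d <= end:
        yield f"{d.year}/{d:%m}/daily_{d:%Y-%m-%d}.parquet"
        d += timedelta(days=1)

# Spans a month boundary: two October paths, then one November path
print(list(daily_file_names(date(2024, 10, 30), date(2024, 11, 1))))
```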
File format
All files are in Apache Parquet format, a columnar storage format that provides:
- Efficient compression
- Fast read performance for analytical queries
- Wide compatibility with data analysis tools (Python pandas, R, Spark, etc.)
All file granularities — daily, monthly, and annual — share the same schema. As a result, you can safely concatenate files from different time periods, without worrying about structural differences.
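A minimal sketch of cross-granularity concatenation with pandas (toy frames stand in for files; in practice each would come from pd.read_parquet on a monthly or daily file):

```python
import pandas as pd

# Stand-ins for one monthly file and one daily file. Because every
# granularity shares the same schema, pd.concat needs no column
# reconciliation between them.
monthly = pd.DataFrame({
    "doc_id": [1, 2],
    "published_at": pd.to_datetime(["2024-10-03", "2024-10-20"]),
})
daily = pd.DataFrame({
    "doc_id": [3],
    "published_at": pd.to_datetime(["2024-11-01"]),
})

combined = pd.concat([monthly, daily], ignore_index=True)
print(len(combined))  # 3 rows under one shared schema
```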