Changelog

November 2025

Major Platform Update

We've restructured the Bilby Quant Data platform with significant changes to data delivery and model offerings.

Data Access Changes

All data is now accessed exclusively through the Portal interface — there is no API in the current build
The three-dataset structure (Basic, Commodities, GICS) has been replaced with a single unified dataset
Data is organised in a hierarchical directory structure by year and month, with daily, monthly, and annual parquet files available
Files are partitioned by inserted_at (pipeline insertion date) rather than publication date
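
Because files are partitioned by inserted_at rather than publication date, collecting everything inserted over a period means listing one daily file per insertion date. The sketch below shows only that path arithmetic. The root name and the `<root>/<YYYY>/<MM>/<YYYY-MM-DD>.parquet` naming scheme are assumptions made for illustration; the real layout is the one shown in the Portal and may differ.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

# ASSUMPTION: the root name and file naming scheme are hypothetical;
# check the Portal for the real layout.
ROOT = PurePosixPath("bilby-data")


def daily_file(inserted_on: date) -> PurePosixPath:
    """Path of the daily file for one pipeline insertion date."""
    return (
        ROOT
        / f"{inserted_on:%Y}"
        / f"{inserted_on:%m}"
        / f"{inserted_on:%Y-%m-%d}.parquet"
    )


def daily_files(start: date, end: date) -> list[PurePosixPath]:
    """Daily file paths for every insertion date from start to end, inclusive."""
    days = (end - start).days
    return [daily_file(start + timedelta(days=i)) for i in range(days + 1)]


# A range that crosses a month boundary spans two month directories.
paths = daily_files(date(2025, 10, 30), date(2025, 11, 2))
for p in paths:
    print(p)
```

Each path could then be loaded with a parquet reader such as pandas.read_parquet. A document published in one month may sit in a file for a later insertion date, so filtering by publication date has to happen after loading.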

Model Updates

We've narrowed our machine learning pipeline to two core models:

Entity Extraction: Identifies and classifies key entities mentioned in documents, including people, organisations (government bodies, companies, NGOs, IGOs), geo-political entities, currency mentions, events, and initiatives
Policy Lifecycle Classification: Labels documents according to their stage in the policy development process (not policy, informing, deciding, implementing)
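
As a rough illustration of working with the four lifecycle stages, the sketch below tallies documents per stage. The key name `policy_lifecycle`, the `id` key, and the exact label strings are assumptions; the Field Documentation gives the real names.

```python
from collections import Counter

# ASSUMPTION: the label strings and the "policy_lifecycle" key are illustrative.
STAGES = ("not policy", "informing", "deciding", "implementing")

# Synthetic documents standing in for rows loaded from a parquet file.
documents = [
    {"id": "a", "policy_lifecycle": "informing"},
    {"id": "b", "policy_lifecycle": "implementing"},
    {"id": "c", "policy_lifecycle": "informing"},
    {"id": "d", "policy_lifecycle": "not policy"},
]

counts = Counter(doc["policy_lifecycle"] for doc in documents)
# Report every stage in lifecycle order, including stages with zero documents.
stage_counts = {stage: counts.get(stage, 0) for stage in STAGES}
print(stage_counts)
```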

The sentiment, importance, and sector classification models from previous versions have been deprecated.

Field Changes

Field naming conventions have been updated throughout the dataset. Please refer to the Field Documentation for current field names and descriptions.

April 2025

Full Document Text Support Added

We've added full document text to all API endpoints. You can now retrieve the complete text of documents, in both English and the original language, from any of our datasets.

The fields are:

title_english: The title of the document in English
title_source_language: The title of the document in the source language
body_english: The complete body text of the document in English
body_source_language: The complete body text of the document in its original language
subhead_english: The subhead of the document in English
subhead_source_language: The subhead of the document in the source language
summary_english: The summary of the document in English
summary_source_language: The summary of the document in the source language

These new fields are available across all datasets (Basic, Commodities, and GICS) and can be accessed through the standard API endpoints. This enhancement allows for:

More comprehensive document analysis
Direct access to source material
Advanced NLP and text mining capabilities
Building your own models on the complete text

To use these fields, simply include them in your query parameters. For large documents, pagination is supported with the standard offset and limit parameters.
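
A minimal sketch of how paginated request URLs for these fields might be built, as the API stood at this release. Only the offset and limit parameters and the field names come from this changelog. The base URL and the `fields` parameter name are hypothetical placeholders, so consult the Text Fields documentation for the actual request format.

```python
from urllib.parse import urlencode

# ASSUMPTIONS: the base URL and the "fields" parameter name are hypothetical
# placeholders; only "offset" and "limit" are named in this changelog.
BASE_URL = "https://api.example.com/basic"
TEXT_FIELDS = ["title_english", "body_english", "body_source_language"]


def page_url(offset: int, limit: int) -> str:
    """Build the request URL for one page of results."""
    params = {"fields": ",".join(TEXT_FIELDS), "offset": offset, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"


# URLs for the first three pages of 100 rows each.
urls = [page_url(offset, 100) for offset in range(0, 300, 100)]
for url in urls:
    print(url)
```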

For more details, see our Text Fields documentation.

October 2024

The first release of the Bilby Quant API. This REST API serves numerical and categorical data derived from Chinese government policy documents dating from the beginning of 2017 to the present day. In all cases, each row in any table corresponds to one document. However, for the commodities and GICS datasets, information from a single document may be repeated across many rows (for example, because the document is relevant to two or more commodities).

With the first release, we introduce three datasets: basic documents, commodity documents, and GICS documents.

Basic data — All Bilby documents from 2017 to the present day are represented. Each document is represented only once. Aside from standard metadata, the columns reflect GICS sector relevance, sentiment, importance and policy maturity. They do not provide:

More granular GICS tree classification information (i.e. down to industry group, industry or sub-industry level).
Information relating to commodities.

Commodities-related data — Only a subset of the Bilby documents, deemed by our models to be relevant to commodities, is represented. A single document may be represented across more than one row, if it is deemed relevant to more than one commodity. In addition to all columns from the Basic data silo, this silo includes commodity-related columns — see the documentation for a complete description of the columns.

GICS-related data — Only a subset of the Bilby documents, deemed by our models to be relevant to one or more nodes in the GICS classification tree (i.e. sectors, industry groups, industries and sub-industries) is represented. A single document may be represented across more than one row, if it is relevant to more than one GICS node. In addition to all columns from the Basic data silo, this silo includes GICS node-related columns — see the documentation for a complete description of the columns. Note that for this version of the API, the GICS tree classification labels are implemented by a suite of independent models. For this reason, it is possible that a single document is labelled as relevant to several nodes in the GICS tree.
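
Because one document can occupy several rows in the commodities and GICS silos, a row count is not a document count. The sketch below groups repeated rows back into one entry per document. The column names `document_id` and `commodity` are assumptions for illustration; the documentation lists the real columns.

```python
from collections import defaultdict

# ASSUMPTION: "document_id" and "commodity" are illustrative column names.
rows = [
    {"document_id": "doc-1", "commodity": "copper"},
    {"document_id": "doc-1", "commodity": "iron ore"},
    {"document_id": "doc-2", "commodity": "copper"},
]

# Group the repeated rows back into one entry per document.
commodities_by_document = defaultdict(list)
for row in rows:
    commodities_by_document[row["document_id"]].append(row["commodity"])

print(len(rows))                     # number of rows
print(len(commodities_by_document))  # number of distinct documents
```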

Notes on the GICS and commodities data silos:

A document row is deemed relevant to a GICS node or a commodity label if both of the following conditions are satisfied:

The value of the field theme_relevance_score exceeds a fixed threshold, and
The document row is among either the 300 that are most relevant to the given GICS node, or the 600 that are most relevant to the given commodity.
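
The two conditions can be sketched as a filter over the rows for a single label. The caps of 300 and 600 come from the text above. The threshold value is a placeholder, since this changelog does not state the real one, and the sketch assumes the ranking is taken over all rows for that one label.

```python
# ASSUMPTION: the threshold value is a placeholder; the changelog does not
# state the real one. The caps of 300 (GICS) and 600 (commodities) are from the text.
THRESHOLD = 0.2
TOP_N = {"gics": 300, "commodity": 600}


def relevant_rows(rows: list[dict], kind: str, threshold: float = THRESHOLD) -> list[dict]:
    """Rows for one label that pass both the top-N cap and the score threshold."""
    ranked = sorted(rows, key=lambda r: r["theme_relevance_score"], reverse=True)
    top = ranked[: TOP_N[kind]]
    return [r for r in top if r["theme_relevance_score"] > threshold]


# 1,000 synthetic rows for a single label, with scores from 0.000 to 0.999.
rows = [{"theme_relevance_score": i / 1000} for i in range(1000)]
print(len(relevant_rows(rows, "commodity")))
print(len(relevant_rows(rows, "gics")))
```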

The numerical threshold on theme_relevance_score that determines relevance to a GICS node or commodity is set quite low, so that inclusion is permissive (equivalently, the underlying classification model has high recall and low precision compared to an oracle). This decision, along with the inclusion of theme_relevance_score in the columns served by the API, is by design: users can always cull more documents by enforcing stricter relevance cutoffs, if desired.
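
A stricter cutoff can be applied on the client side once rows are retrieved. In the sketch below, only the field name theme_relevance_score comes from this changelog. The `title` key, the score values, and the cutoff values are made up for illustration.

```python
# Synthetic rows; only the "theme_relevance_score" field name comes from the text.
rows = [
    {"title": "A", "theme_relevance_score": 0.35},
    {"title": "B", "theme_relevance_score": 0.62},
    {"title": "C", "theme_relevance_score": 0.91},
]


def apply_cutoff(rows: list[dict], cutoff: float) -> list[dict]:
    """Keep only rows whose relevance score is at least the cutoff."""
    return [r for r in rows if r["theme_relevance_score"] >= cutoff]


# Raising the cutoff trades recall for precision: fewer rows survive.
for cutoff in (0.3, 0.6, 0.9):
    print(cutoff, [r["title"] for r in apply_cutoff(rows, cutoff)])
```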