Changelog
April 2025
Full Document Text Support Added
We've added full document text to all API endpoints. You can now access the complete text of documents in both English and the original language when retrieving data through any of our datasets.
The fields are:
- title_english: The title of the document in English
- title_source_language: The title of the document in the source language
- body_english: The complete body text of the document in English
- body_source_language: The complete body text of the document in its original language
- subhead_english: The subhead of the document in English
- subhead_source_language: The subhead of the document in the source language
- summary_english: The summary of the document in English
- summary_source_language: The summary of the document in the source language
These new fields are available across all datasets (Basic, Commodities, and GICS) and can be accessed through the standard API endpoints. This enhancement allows for:
- More comprehensive document analysis
- Direct access to source material
- Advanced NLP and text mining capabilities
- Building your own models on the complete text
To use these fields, simply include them in your query parameters. For large documents, pagination is supported with the standard offset and limit parameters.
For more details, see our Text Fields documentation.
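As a rough illustration of requesting text fields with pagination, the sketch below builds a request URL without sending it. The base URL, the dataset path, and the name of the field-selection parameter (fields) are assumptions made for the example and are not confirmed by this changelog; only the eight field names and the offset and limit parameters come from the text above.

```python
from urllib.parse import urlencode

# Hypothetical values: the real base URL, dataset path, and the name of the
# field-selection parameter are not given in this changelog.
BASE_URL = "https://api.example.com/v1"

# Field names as listed in the April 2025 entry.
TEXT_FIELDS = [
    "title_english", "title_source_language",
    "body_english", "body_source_language",
    "subhead_english", "subhead_source_language",
    "summary_english", "summary_source_language",
]


def build_query_url(dataset, fields, offset=0, limit=100):
    """Build a URL asking for selected text fields, one page at a time."""
    unknown = [f for f in fields if f not in TEXT_FIELDS]
    if unknown:
        raise ValueError(f"unknown text fields: {unknown}")
    # "fields" is an assumed parameter name; offset and limit are from the changelog.
    params = {"fields": ",".join(fields), "offset": offset, "limit": limit}
    return f"{BASE_URL}/{dataset}?{urlencode(params)}"


def page_offsets(total_rows, limit):
    """Return the offset of each page needed to cover total_rows rows."""
    return list(range(0, total_rows, limit))


url = build_query_url("basic", ["title_english", "body_english"], offset=200, limit=100)
print(url)
```

Requesting only the fields you need keeps responses small, which matters most for the two body fields.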
October 2024
The first release of the Bilby Quant API. This API serves numerical and categorical data derived from policy documents from the Chinese government, from the beginning of 2017 to the present day, as a REST API. In all cases, each row in any table corresponds to one document. However, for the commodities and GICS datasets, information from a single document may be repeated across many rows (for example, because the document is relevant to two or more commodities).
With the first release, we introduce three datasets: basic documents, commodity documents, and GICS documents.
Basic data — All Bilby documents from 2017 to the present day are represented. Each document is represented only once. Aside from standard metadata, the columns reflect GICS sector relevance, sentiment, importance and policy maturity. They do not provide:
- More granular GICS tree classification information (i.e. down to industry group, industry or sub-industry level).
- Information relating to commodities.
Commodities-related data — Only a subset of the Bilby documents, deemed by our models to be relevant to commodities, is represented. A single document may be represented across more than one row, if it is deemed relevant to more than one commodity. In addition to all columns from the Basic data silo, this silo includes commodity-related columns — see the documentation for a complete description of the columns.
GICS-related data — Only a subset of the Bilby documents, deemed by our models to be relevant to one or more nodes in the GICS classification tree (i.e. sectors, industry groups, industries and sub-industries) is represented. A single document may be represented across more than one row, if it is relevant to more than one GICS node. In addition to all columns from the Basic data silo, this silo includes GICS node-related columns — see the documentation for a complete description of the columns. Note that for this version of the API, the GICS tree classification labels are implemented by a suite of independent models. For this reason, it is possible that a single document is labelled as relevant to several nodes in the GICS tree.
Notes on the GICS and commodities data silos:
A document row is deemed relevant either to a GICS node or a commodity label if both of two conditions are satisfied:
- The value of the field theme_relevance_score exceeds a fixed threshold, and
- The document row is among either the 300 that are most relevant to the given GICS node, or the 600 that are most relevant to the given commodity.
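The two conditions can be sketched as a filter followed by a top-N cut. This is only an illustration of the rule as described, not the server-side implementation: the threshold value used here is made up (the actual value is not published in this changelog), and ranking "most relevant" by theme_relevance_score is an assumption.

```python
def relevant_rows(rows, threshold, top_n):
    """Keep rows that satisfy both conditions for one GICS node or commodity.

    Condition 1: theme_relevance_score exceeds the threshold.
    Condition 2: the row is among the top_n highest-scoring rows
    (assumes relevance is ranked by theme_relevance_score).
    """
    above = [r for r in rows if r["theme_relevance_score"] > threshold]
    ranked = sorted(above, key=lambda r: r["theme_relevance_score"], reverse=True)
    return ranked[:top_n]


# Made-up rows for a single commodity; 0.2 is a hypothetical threshold.
rows = [
    {"doc": "a", "theme_relevance_score": 0.9},
    {"doc": "b", "theme_relevance_score": 0.1},
    {"doc": "c", "theme_relevance_score": 0.5},
    {"doc": "d", "theme_relevance_score": 0.3},
]
kept = relevant_rows(rows, threshold=0.2, top_n=2)
print([r["doc"] for r in kept])
```

Per the text above, top_n would be 300 for a GICS node and 600 for a commodity.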
The numerical threshold on theme_relevance_score that determines relevance to a GICS node or commodity is set quite low, so that inclusion is permissive (equivalently, the underlying classification model will have high recall and low precision, compared to an oracle). This decision, along with the inclusion of theme_relevance_score in the columns served by the API, is by design: users can always cull more documents by enforcing stricter relevance cutoffs, if desired.
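A stricter cutoff can be applied on the client side once rows have been retrieved. The sketch below assumes each row is a dict carrying theme_relevance_score as a number where higher means more relevant; the sample rows and cutoff values are made up for illustration.

```python
def cull(rows, cutoff):
    """Keep only rows whose theme_relevance_score is at least `cutoff`."""
    return [r for r in rows if r["theme_relevance_score"] >= cutoff]


# Made-up rows standing in for an API response.
rows = [
    {"doc": "a", "theme_relevance_score": 0.15},
    {"doc": "b", "theme_relevance_score": 0.42},
    {"doc": "c", "theme_relevance_score": 0.67},
    {"doc": "d", "theme_relevance_score": 0.91},
]

# Sweep a few cutoffs to see how many rows survive each one.
for cutoff in (0.1, 0.5, 0.8):
    print(cutoff, len(cull(rows, cutoff)))
```

Raising the cutoff trades recall for precision, so it is worth checking how many rows remain before settling on a value.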