Datasets — Overview

In total, we have three datasets: basic documents, commodity documents, and GICS documents.

Basic data — All Bilby documents from 2017 to the present day are represented. Each document is represented only once. Aside from standard metadata, the columns reflect GICS sector relevance, sentiment, importance and policy maturity. They do not provide:

  1. More granular GICS tree classification information (i.e. down to industry group, industry or sub-industry level).
  2. Information relating to commodities.

Commodities-related data — Only a subset of the Bilby documents, deemed by our models to be relevant to commodities, is represented. A single document may be represented across more than one row, if it is deemed relevant to more than one commodity. In addition to all columns from the Basic data silo, this silo includes commodity-related columns — see the documentation for a complete description of the columns.

GICS-related data — Only a subset of the Bilby documents, deemed by our models to be relevant to one or more nodes in the GICS classification tree (i.e. sectors, industry groups, industries and sub-industries), is represented. A single document may be represented across more than one row if it is relevant to more than one GICS node. In addition to all columns from the Basic data silo, this silo includes GICS node-related columns; see the documentation for a complete description of the columns. Note that in this version of the API, the GICS tree classification labels are produced by a suite of independent models. As a result, a single document may be labelled as relevant to several nodes in the GICS tree.
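Because a document can occupy several rows in the commodities and GICS silos, it is often convenient to collapse the rows back to one entry per document. The following is a minimal sketch in plain Python. It assumes the rows have already been loaded as dictionaries; the field names document_id and label are hypothetical placeholders, so substitute the actual column names given in the documentation.

```python
from collections import defaultdict


def labels_per_document(rows, id_field="document_id", label_field="label"):
    """Group silo rows by document, collecting every label attached to it.

    `id_field` and `label_field` are hypothetical column names used for
    illustration only; they are not taken from the API documentation.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_field]].append(row[label_field])
    return dict(grouped)


# Illustrative rows: doc-1 is relevant to two commodities, so it has two rows.
example_rows = [
    {"document_id": "doc-1", "label": "copper"},
    {"document_id": "doc-1", "label": "lithium"},
    {"document_id": "doc-2", "label": "copper"},
]
per_document = labels_per_document(example_rows)
```

The same grouping works for GICS nodes, where a document may carry labels from several levels of the tree.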

Notes on the GICS and commodities data silos:

A document row is deemed relevant to a GICS node or to a commodity label if both of the following conditions are satisfied:

  1. The value of the field theme_relevance_score exceeds a fixed threshold; and
  2. The document row is among either the 300 rows most relevant to the given GICS node or the 600 rows most relevant to the given commodity.

The threshold on theme_relevance_score that determines relevance to a GICS node or commodity is set quite low, so that inclusion is permissive (equivalently, the underlying classification model will have high recall and low precision compared to an oracle). This decision and the inclusion of theme_relevance_score in the columns served by the API are both by design: users can always cull more documents by enforcing stricter relevance cutoffs, if desired.
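The two inclusion conditions, and the stricter client-side cutoff that users may apply afterwards, can be sketched as follows. This is an illustration under stated assumptions, not the service's actual implementation: it assumes that "most relevant" means ranked by theme_relevance_score, the field name label is a hypothetical placeholder, and the threshold value in the example is invented because the real one is not stated here.

```python
GICS_TOP_N = 300       # per-node limit stated in this overview
COMMODITY_TOP_N = 600  # per-commodity limit stated in this overview


def select_relevant(rows, threshold, top_n, label_field="label"):
    """Keep rows scoring above `threshold`, then at most `top_n` per label.

    Assumes ranking is by theme_relevance_score; `label_field` is a
    hypothetical column name.
    """
    by_label = {}
    for row in rows:
        if row["theme_relevance_score"] > threshold:
            by_label.setdefault(row[label_field], []).append(row)
    selected = []
    for label_rows in by_label.values():
        ranked = sorted(
            label_rows, key=lambda r: r["theme_relevance_score"], reverse=True
        )
        selected.extend(ranked[:top_n])
    return selected


def apply_stricter_cutoff(rows, cutoff):
    """Client-side culling: keep only rows scoring above a stricter cutoff."""
    return [r for r in rows if r["theme_relevance_score"] > cutoff]


# Illustrative data; the threshold 0.1 and top_n=2 are made-up demo values.
example_rows = [
    {"label": "copper", "theme_relevance_score": 0.30},
    {"label": "copper", "theme_relevance_score": 0.90},
    {"label": "copper", "theme_relevance_score": 0.60},
    {"label": "copper", "theme_relevance_score": 0.05},
    {"label": "wheat", "theme_relevance_score": 0.20},
]
served = select_relevant(example_rows, threshold=0.1, top_n=2)
strict = apply_stricter_cutoff(served, cutoff=0.5)
```

Filtering by the threshold first and then taking the top rows per label yields the same set as checking both conditions independently, since the rows above the threshold are always the highest-ranked ones.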