Entity extraction
Bilby Quant Data includes comprehensive entity extraction performed on the English text of each document. Our Named Entity Recognition (NER) model identifies and classifies key entities such as people, organisations, locations, and events mentioned in policy documents.
Entity fields
extracted_entities_en
- Type: JSON Array
- Description: Array of entity objects extracted from the document. Each object represents a single entity mention with its classification, location in the text, and confidence score.
- Nullable: Yes
extracted_entities_en_count
- Type: Integer
- Description: Total count of entities extracted from the document. This represents the length of the extracted_entities_en array.
- Nullable: Yes
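Because extracted_entities_en_count is defined as the length of extracted_entities_en, the two fields can be cross-checked when loading records. The following is a minimal sketch, not an official client: the record dict, the function name entity_count_matches, and the handling of a JSON-encoded string are assumptions made for illustration.

```python
import json


def entity_count_matches(record):
    """Return True when the count field agrees with the array length."""
    entities = record.get("extracted_entities_en")
    count = record.get("extracted_entities_en_count")

    # Both fields are nullable; with a null on either side there is nothing to compare.
    if entities is None or count is None:
        return True

    # Assumption: some export formats deliver the array as a JSON-encoded string.
    if isinstance(entities, str):
        entities = json.loads(entities)

    return len(entities) == count
```

A record that fails this check is worth inspecting before its entities are used in any aggregate.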
Entity object structure
Each entity in the extracted_entities_en array has the following structure:
{
  "extracted_entity_text": "string",
  "extracted_entity_type": "string",
  "score": "float",
  "start": "integer",
  "end": "integer",
  "occurrence_count": "integer",
  "model": "string",
  "timestamp": "string",
  "source_document_uuid": "string"
}
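One way to work with this structure in Python is to map each object onto a typed record. The sketch below is illustrative only: the Entity dataclass and parse_entity helper are hypothetical names, and the sample object is assembled from the per-field examples given in the field definitions rather than taken from a real record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    extracted_entity_text: str
    extracted_entity_type: str
    score: float
    start: int
    end: int
    occurrence_count: int
    model: str
    timestamp: str
    source_document_uuid: str


def parse_entity(obj):
    # Entity(**obj) raises TypeError if a key is missing or unexpected,
    # which surfaces schema drift early.
    return Entity(**obj)


# Sample object assembled from the documented per-field examples.
example = {
    "extracted_entity_text": "Chongqing Notary Office",
    "extracted_entity_type": "Government Body",
    "score": 0.9938766956329346,
    "start": 779,
    "end": 802,
    "occurrence_count": 1,
    "model": "entity_extraction_china_english_v2",
    "timestamp": "2025-10-31T03:19:10.879146+00:00",
    "source_document_uuid": "d3f50ef8-5660-b523-863e-c24265401749",
}

entity = parse_entity(example)
print(entity.extracted_entity_text, entity.extracted_entity_type)
```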
Field definitions
extracted_entity_text
- Type: String
- Description: The exact text of the entity as it appears in the source document.
- Example:
"Chongqing Notary Office"
extracted_entity_type
- Type: String
- Description: The entity type classification assigned by the NER model. See the Entity types section below for all possible values.
- Example:
"Government Body"
score
- Type: Float
- Description: Model confidence score for the entity classification, ranging from 0.0 to 1.0. Higher values indicate greater confidence.
- Example:
0.9938766956329346
- Note: You may want to apply threshold filtering (e.g., score > 0.9) depending on your use case and tolerance for false positives.
start
- Type: Integer
- Description: Character position where the entity begins in the source document (zero-indexed).
- Example:
779
end
- Type: Integer
- Description: Character position where the entity ends in the source document (zero-indexed, exclusive).
- Example:
802
occurrence_count
- Type: Integer
- Description: Sequential count of how many times this specific entity has appeared in the document up to this point.
- Example:
1
- Note: The same entity may appear multiple times in a document. This field tracks each sequential appearance.
model
- Type: String
- Description: Name/identifier of the language model used for entity extraction.
- Example:
"entity_extraction_china_english_v2"
timestamp
- Type: String
- Description: ISO 8601 timestamp of when the entity was extracted.
- Example:
"2025-10-31T03:19:10.879146+00:00"
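A timestamp in this form can be parsed with Python's standard library. This short sketch uses the example value above; the variable name extracted_at is chosen for illustration.

```python
from datetime import datetime

# datetime.fromisoformat accepts the "+00:00" offset form shown in the example.
extracted_at = datetime.fromisoformat("2025-10-31T03:19:10.879146+00:00")

# The result is timezone-aware, so it can be compared safely with other aware datetimes.
print(extracted_at.date(), extracted_at.utcoffset())
```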
source_document_uuid
- Type: String
- Description: Unique identifier linking the entity to its source document.
- Example:
"d3f50ef8-5660-b523-863e-c24265401749"
Entity types
Our NER model classifies entities into eleven categories:
1. Person
Individual human beings, including political figures, business leaders, officials, and other notable individuals mentioned in documents.
Examples:
- Government officials: "Xi Jinping", "Chen Lu"
- Business leaders: "Eddie Yongming Wu"
- Other individuals referenced in policy contexts
2. Company
Business entities including corporations, state-owned enterprises (SOEs), private companies, and publicly traded firms. This includes both full company names and stock ticker symbol mentions.
Examples:
- "Alibaba Group"
- "Bank of Communications"
- Stock tickers: "NYSE: BABA", "HKEX: 9988"
3. Party body
Organisational units within the Communist Party of China (CPC) structure, including committees, commissions, working groups, and other party-specific administrative bodies.
Examples:
- "Central Political and Legal Affairs Commission"
- "Politburo of the Chinese Communist Party"
- Party committees at various administrative levels
4. Government body
Official government administrative organisations, ministries, bureaus, departments, and regulatory agencies at national, provincial, or local levels. Distinct from Party Bodies, these are formal state apparatus entities.
Examples:
- "Ministry of Civil Affairs"
- "Chongqing Municipal Civil Affairs Bureau"
- "Elderly Services Section of the Jiangbei District Civil Affairs Bureau"
5. NGO (Non-Governmental Organisation)
Non-profit, voluntary organisations that operate independently from government control, including charitable foundations, advocacy groups, think tanks, and civil society organisations.
Examples:
- "China Foundation for Poverty Alleviation"
- "China Red Cross Foundation"
- "Amity Foundation"
- Community development organisations
6. IGO (Intergovernmental Organisation)
International organisations formed by treaty or agreement between multiple sovereign states, operating across national boundaries to address shared interests or challenges.
Examples:
- "Asian Infrastructure Investment Bank" (AIIB)
- "Association of Southeast Asian Nations" (ASEAN)
- United Nations agencies
- Regional development banks
7. GPE (Geo-Political Entity)
Geographic locations with political or administrative significance, including countries, provinces, cities, districts, regions, and other territorially-defined administrative units.
Examples:
- Countries: "China"
- Cities/Municipalities: "Chongqing", "Beijing"
- Districts: "Jiangbei District", "Yuzhong"
- Provinces/Regions: "Sichuan", "Zhejiang"
8. Currency mention
References to monetary amounts, financial values, or currency-denominated figures, including the currency unit and numerical value.
Examples:
- "2.0447 million yuan"
- "5.579 million yuan"
- Dollar amounts: "$130.25 billion"
9. Event
Significant occurrences, activities, or incidents mentioned in documents, including conferences, meetings, state visits, policy announcements, agreements, and other notable happenings.
Examples:
- "The 20th National Congress of the Communist Party of China"
- "China-Africa Cooperation Forum Summit"
- "Belt and Road Forum for International Cooperation"
- "China International Import Expo"
- Policy announcement events
- Signing ceremonies
10. Initiative
Policy programmes, legislative acts, regulations, strategic plans, and government-led projects or campaigns. These represent formal efforts to achieve specific policy objectives.
Examples:
- "Belt and Road Initiative"
- "Made in China 2025"
- "Carbon Neutrality Action Plan for 2030"
- Regulatory frameworks
- Government pilot programmes
11. Miscellaneous Organisation
Organisational entities that don't fit into the above categories, primarily including educational institutions, cultural institutions, professional associations, and other structured organisations.
Examples:
- "Southwest University of Political Science & Law"
- "National Museum of China"
- Professional associations
- Cultural institutions
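To see how a document's entities are distributed across these categories, the type field can be tallied directly. This is a minimal sketch: count_by_type is a hypothetical helper, and the type strings in the sample ("Government Body", "GPE") are assumptions about how the values are stored in extracted_entity_type, so check them against your own data.

```python
from collections import Counter


def count_by_type(entities):
    """Tally entity mentions per extracted_entity_type value."""
    return Counter(e["extracted_entity_type"] for e in entities)


# Illustrative sample; the exact type strings are assumed, not confirmed.
sample = [
    {"extracted_entity_text": "Ministry of Civil Affairs", "extracted_entity_type": "Government Body"},
    {"extracted_entity_text": "Chongqing", "extracted_entity_type": "GPE"},
    {"extracted_entity_text": "Beijing", "extracted_entity_type": "GPE"},
]

type_counts = count_by_type(sample)
print(type_counts.most_common())
```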
Working with entity data
Multiple mentions
The same entity may appear multiple times in a document. The occurrence_count
field tracks these sequential appearances, allowing you to analyse the frequency
and distribution of entity mentions.
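Since occurrence_count increases with each appearance, the highest value seen for an entity gives its total number of mentions in the document. The sketch below assumes that mentions are keyed on extracted_entity_text alone; mention_totals is a hypothetical helper name.

```python
def mention_totals(entities):
    """Map each entity text to its highest occurrence_count in the document."""
    totals = {}
    for e in entities:
        # Assumption: the running count is tracked per entity text.
        key = e["extracted_entity_text"]
        totals[key] = max(totals.get(key, 0), e["occurrence_count"])
    return totals


mentions = [
    {"extracted_entity_text": "Chongqing", "occurrence_count": 1},
    {"extracted_entity_text": "Ministry of Civil Affairs", "occurrence_count": 1},
    {"extracted_entity_text": "Chongqing", "occurrence_count": 2},
]

totals = mention_totals(mentions)
print(totals)
```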
Confidence filtering
The score field indicates model confidence. Depending on your use case and
tolerance for false positives, you may want to apply threshold filtering:
- High precision: score > 0.95
- Balanced: score > 0.90
- High recall: score > 0.80
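These thresholds can be applied with a simple filter. The sketch below is one possible approach: the profile names and the filter_by_score helper are illustrative, while the threshold values are the ones listed above.

```python
# Threshold values from the list above; the profile keys are illustrative names.
THRESHOLDS = {
    "high_precision": 0.95,
    "balanced": 0.90,
    "high_recall": 0.80,
}


def filter_by_score(entities, profile="balanced"):
    """Keep entities whose score is strictly above the profile's threshold."""
    threshold = THRESHOLDS[profile]
    return [e for e in entities if e["score"] > threshold]


scored = [
    {"extracted_entity_text": "Chongqing Notary Office", "score": 0.99},
    {"extracted_entity_text": "Jiangbei District", "score": 0.92},
    {"extracted_entity_text": "Yuzhong", "score": 0.85},
    {"extracted_entity_text": "Amity Foundation", "score": 0.50},
]

print(len(filter_by_score(scored, "high_precision")))
print(len(filter_by_score(scored, "balanced")))
print(len(filter_by_score(scored, "high_recall")))
```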
Position information
The start and end fields enable you to extract surrounding context from the
source document for validation or deeper analysis. Use these positions with the
body_en field to retrieve the text around each entity mention.
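Because start is zero-indexed and end is exclusive, the two values map directly onto a Python slice. The following sketch assumes the offsets count characters of the body_en string as Python sees them; mention_context and the window parameter are hypothetical, and the sample text is invented for illustration.

```python
def mention_context(body_en, entity, window=40):
    """Return the mention plus up to `window` characters on each side."""
    left = max(0, entity["start"] - window)
    right = min(len(body_en), entity["end"] + window)
    return body_en[left:right]


# Invented sample text; offsets are computed here rather than taken from real data.
body_en = "The notice was issued by the Ministry of Civil Affairs on Monday."
text = "Ministry of Civil Affairs"
start = body_en.index(text)
mention = {"extracted_entity_text": text, "start": start, "end": start + len(text)}

# Validation: the slice should reproduce the entity text exactly.
assert body_en[mention["start"]:mention["end"]] == mention["extracted_entity_text"]

print(mention_context(body_en, mention, window=10))
```

If the slice does not reproduce extracted_entity_text, the offsets may be measured in a different unit (for example bytes), which is worth confirming before relying on them.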
Entity overlap
In some cases, entity spans may overlap. For example, "Bank of Communications" might be classified as both a Company and a Government Body if it's a state-owned bank. The model assigns the most contextually appropriate type based on the document context.
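If overlapping spans appear in your data and you need at most one entity per stretch of text, one option is to keep the highest-scoring mention in each overlapping group. This is a downstream heuristic sketched under that assumption, not a description of how the model resolves types; spans_overlap and drop_overlaps are hypothetical helpers.

```python
def spans_overlap(a, b):
    """True when two half-open [start, end) spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def drop_overlaps(entities):
    """Greedily keep the highest-scoring entity among overlapping spans."""
    kept = []
    for candidate in sorted(entities, key=lambda e: e["score"], reverse=True):
        if not any(spans_overlap(candidate, k) for k in kept):
            kept.append(candidate)
    # Return the survivors in document order.
    return sorted(kept, key=lambda e: e["start"])


overlapping = [
    {"extracted_entity_type": "Company", "start": 10, "end": 32, "score": 0.97},
    {"extracted_entity_type": "Government Body", "start": 10, "end": 32, "score": 0.91},
    {"extracted_entity_type": "GPE", "start": 40, "end": 47, "score": 0.88},
]

resolved = drop_overlaps(overlapping)
print([e["extracted_entity_type"] for e in resolved])
```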