Example Usage

This notebook provides a quick start guide to understanding and exploring the Bilby dataset.

Overview

The dataset contains full text government data, enriched with named entities (people, organizations, companies, etc.) and policy life cycle (PLC) identified using our machine learning models.

Each article includes:

  • Article metadata (URL, publication date, newspaper, etc.)
  • Article content (title and body in English and original language)
  • Extracted entities with confidence scores and character offsets
  • Policy life cycle (PLC) classification with confidence scores

Table of Contents

  1. Load and Inspect Data
  2. Dataset Overview
  3. Temporal Analysis
  4. Source Analysis
  5. Working with Extracted Entities
  6. Entity Statistics
  7. Sample Entity Exploration for One Article
  8. Policy Label Classification (PLC)

1. Load and Inspect Data

First, let's load the required libraries and read the dataset.

In [1]:python
# Import required libraries
import pandas as pd
import json
from collections import Counter
from datetime import datetime
import warnings
import glob

warnings.filterwarnings("ignore")

print("✅ Libraries loaded successfully")
✅ Libraries loaded successfully
In [2]:python
print("📊 LOADING DATASET")
print("=" * 70)

# REPLACE WITH YOUR DATASET PATH BELOW
data_path = "./data"
parquet_files = glob.glob(f"{data_path}/*.parquet")

print(f"Found {len(parquet_files)} parquet files")
print(f"Files: {[f.split('/')[-1] for f in parquet_files[:5]]}...")

# Read and concatenate all files
bilby_df = []
for f in parquet_files:
    try:
        df = pd.read_parquet(f)
        bilby_df.append(df)
    except Exception as e:
        print(f"❌ Error reading {f}: {e}")

bilby_df = pd.concat(bilby_df, ignore_index=True)

print(f"\n✅ Dataset loaded: {len(bilby_df):,} rows × {len(bilby_df.columns)} columns")
print(f"   Total records: {len(bilby_df):,}")
print(f"   Total columns: {len(bilby_df.columns)}")

print(f"Memory usage: {bilby_df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
📊 LOADING DATASET
======================================================================
Found 2 parquet files
Files: ['daily_2025-06-03.parquet', 'daily_2025-06-04.parquet']...

✅ Dataset loaded: 10,088 rows × 27 columns
   Total records: 10,088
   Total columns: 27
Memory usage: 168.8 MB

2. Dataset Overview

Let's examine the structure and content of our dataset.

In [4]:python
# Display basic information
print("📊 Dataset Shape:")
print(f"   Rows: {bilby_df.shape[0]:,}")
print(f"   Columns: {bilby_df.shape[1]}")
print()

print("📋 Column Names:")
for i, col in enumerate(bilby_df.columns, 1):
    print(f"   {i:2d}. {col}")
📊 Dataset Shape:
   Rows: 10,088
   Columns: 27

📋 Column Names:
    1. uuid
    2. branch_id
    3. published_at
    4. news_line
    5. newspaper
    6. author
    7. article_url
    8. title
    9. body
   10. title_en
   11. subhead_en
   12. body_en
   13. summary
   14. translated_summary
   15. inserted_at
   16. country
   17. language
   18. subhead
   19. first_date
   20. is_original
   21. copies
   22. copy_sources
   23. copy_urls
   24. extracted_entities_en
   25. extracted_entities_en_count
   26. plc_label
   27. plc_label_scores
In [5]:python
# Display column data types
print("🔍 Column Data Types:")
print()
bilby_df.dtypes
🔍 Column Data Types:
Out[5]:
uuid                                        object
branch_id                                    Int64
published_at                   datetime64[us, UTC]
news_line                                   object
newspaper                                   object
author                                      object
article_url                                 object
title                                       object
body                                        object
title_en                                    object
subhead_en                                  object
body_en                                     object
summary                                     object
translated_summary                          object
inserted_at                    datetime64[us, UTC]
country                                     object
language                                    object
subhead                                     object
first_date                     datetime64[us, UTC]
is_original                                  Int64
copies                                       Int64
copy_sources                                object
copy_urls                                   object
extracted_entities_en                       object
extracted_entities_en_count                float64
plc_label                                   object
plc_label_scores                            object
dtype: object
In [6]:python
pd.set_option("display.max_columns", None)
# Display first few rows
print("👀 First 5 Rows of Data:")
print()
bilby_df.head(5)
👀 First 5 Rows of Data:
Out[6]:
uuid  branch_id  \
0  5d3fb794-8631-ade6-8337-238377c0b637ee557291-8...       <NA>   
1  ecd511a5-fbce-8027-6149-75e9837d24716130ac82-a...       <NA>   
2  7b644162-8959-314a-8fad-bc0e03ba36868d6c0888-7...       <NA>   
3  37b28e26-885e-9203-88f6-199ee9911cfe9ab6ca56-1...       <NA>   
4  f61ae2f1-a18e-32d7-bfae-2e1348bfdb84836476b7-3...       <NA>   

               published_at      news_line      newspaper author  \
0 2025-06-03 16:00:00+00:00  official_line  Sichuan Daily          
1 2025-06-03 16:00:00+00:00  official_line  Sichuan Daily          
2 2025-06-03 16:00:00+00:00  official_line  Sichuan Daily          
3 2025-06-03 16:00:00+00:00  official_line  Sichuan Daily          
4 2025-06-03 16:00:00+00:00  official_line  Sichuan Daily          

                                         article_url           title  \
0  https://epaper.scdaily.cn/shtml/scrb/20250604/...  新“铝”程上,广元如何提速?   
1  https://epaper.scdaily.cn/shtml/scrb/20250604/...       游客服务保障更高效   
2  https://epaper.scdaily.cn/shtml/scrb/20250604/...   民宿长成村落,避暑更有滋味   
3  https://epaper.scdaily.cn/shtml/scrb/20250604/...   区域人才,未来可“云共享”   
4  https://epaper.scdaily.cn/shtml/scrb/20250604/...    共计发放奖励260余万元   

                                                body  \
0  □四川日报全媒体记者 张敏\n近日,广元市铝基新材料产业投资推介会暨经济合作项目签约仪式举行...   
1  本报讯\n 5月27日,剑门关景区北门,几名身着红马甲的党员志愿者站在入口位置,看到一位游客...   
2  □四川日报全媒体记者 张敏 文/图\n5月30日,2025四川(曾家山)公路自行车赛暨“骑遍...   
3  本报讯\n 5月29日,记者从广元市利州区委组织部获悉,截至目前,今年利州区11家重点工业、...   
4  本报讯\n 近日,广元市民马文华发现利州区南河街道接官亭社区大一污水处理厂外的马路有塌陷,影...   

                                            title_en  \
0  How can Guangyuan speed up on the new "aluminu...   
1  Tourist services are more efficient and guaran...   
2  Homestays have become villages, and summer vac...   
3  Regional talents, can be "cloud shared" in the...   
4  A total of more than 2.6 million yuan in rewar...   

                                          subhead_en  \
0  A total of 31 aluminum-based new material indu...   
1  Thousands of party members and cadres in Jiang...   
2  Zengjiashan Mountain Resort transforms into a ...   
3  11 key enterprises in Lizhou District recruit ...   
4  More than 100,000 people have joined the "Guan...   

                                             body_en  \
0  Sichuan Daily All-Media Reporter Zhang Min\nRe...   
1  News from this newspaper\nOn May 27th, at the ...   
2  Sichuan Daily All-Media Reporter Zhang Min Tex...   
3  News from this newspaper\nOn May 29, it was le...   
4  News from this newspaper\nRecently, a resident...   

                                             summary  \
0  广元市铝基新材料产业投资推介会签约31个项目,总金额超210亿元。广元铝产业发展迅速,全产业...   
1  剑阁县党员干部志愿者在景区提供服务,缓解游客激增压力,为60万人次游客提供帮助。志愿者接受培...   
2  2025四川(曾家山)公路自行车赛在曾家山举行,展示了曾家山的生态美景。曾家山发展民宿,提升...   
3  广元市利州区与陕西省宁强县、甘肃省文县联合举办招聘会,吸纳125人到企业就业。未来还有100...   
4  广元市民通过“广元安全隐患随手拍”系统成功反映道路塌陷问题,得到1000元举报奖励。该系统运...   

                                  translated_summary  \
0  Guangyuan City signed contracts for 31 project...   
1  Party members, cadres, and volunteers in Jiang...   
2  The 2025 Sichuan (Zengjiashan) road cycling ra...   
3  Lizhou District of Guangyuan City, in conjunct...   
4  Guangyuan citizens successfully reported road ...   

                inserted_at country language                      subhead  \
0 2025-06-03 23:41:31+00:00   China  Chinese  一次签约31个铝基新材料产业相关项目,金额超210亿元   
1 2025-06-03 23:41:31+00:00   China  Chinese             剑阁县上千名党员干部“驻点下沉”   
2 2025-06-03 23:41:31+00:00   China  Chinese              曾家山朝全景避暑康养旅游地转型   
3 2025-06-03 23:41:31+00:00   China  Chinese             利州区11家重点企业跨省“招贤”   
4 2025-06-03 23:41:31+00:00   China  Chinese           10万余人加入“广元安全隐患随手拍”   

                 first_date  is_original  copies copy_sources copy_urls  \
0 0010-01-19 15:54:17+00:00            1       1           []        []   
1 0010-01-19 15:54:17+00:00            1       1           []        []   
2 0010-01-19 15:54:17+00:00            1       1           []        []   
3 0010-01-19 15:54:17+00:00            1       1           []        []   
4 0010-01-19 15:54:17+00:00            1       1           []        []   

                               extracted_entities_en  \
0  [{"extracted_entity_text": "Sichuan Daily All-...   
1  [{"extracted_entity_text": "Jianmen Pass", "ex...   
2  [{"extracted_entity_text": "Sichuan Daily All-...   
3  [{"extracted_entity_text": "Organization Depar...   
4  [{"extracted_entity_text": "Guangyuan City", "...   

   extracted_entities_en_count   plc_label                  plc_label_scores  
0                         51.0  NOT_POLICY     [0.9998, 0.0001, 0.0, 0.0001]  
1                          4.0  NOT_POLICY  [0.9995, 0.0002, 0.0002, 0.0001]  
2                         67.0  NOT_POLICY     [0.9998, 0.0001, 0.0001, 0.0]  
3                         26.0  NOT_POLICY  [0.9928, 0.0008, 0.0061, 0.0003]  
4                         15.0  NOT_POLICY  [0.9996, 0.0002, 0.0001, 0.0001]

3. Temporal Analysis

Understanding the time range in the dataset.

In [7]:python
# Convert inserted_at to datetime if needed
bilby_df["inserted_at"] = pd.to_datetime(bilby_df["inserted_at"])

print("📅 Temporal Coverage:")
print()
print(f"   Earliest article: {bilby_df['inserted_at'].min()}")
print(f"   Latest article:   {bilby_df['inserted_at'].max()}")
print(
    f"   Date range:       {(bilby_df['inserted_at'].max() - bilby_df['inserted_at'].min()).days} days"
)
print()

# Extract date component
bilby_df["date_only"] = bilby_df["inserted_at"].dt.date

print(f"   Unique dates:     {bilby_df['date_only'].nunique()}")
📅 Temporal Coverage:

   Earliest article: 2025-06-03 00:01:05+00:00
   Latest article:   2025-06-04 23:17:53+00:00
   Date range:       1 days

   Unique dates:     2

4. Source Analysis

Analyzing the sources of articles (newspapers and news lines).

In [8]:python
# Newspaper analysis
print("📰 Newspaper Analysis:")
print()

if "newspaper" in bilby_df.columns:
    newspaper_counts = bilby_df["newspaper"].value_counts()

    print(f"   Total unique newspapers: {bilby_df['newspaper'].nunique()}")
    print()
    print("   Top 10 newspapers by article count:")
    for newspaper, count in newspaper_counts.head(10).items():
        percentage = (count / len(bilby_df)) * 100
        print(f"      {newspaper}: {count:,} articles ({percentage:.1f}%)")
else:
    print("   ⚠️  'newspaper' column not found in dataset")
📰 Newspaper Analysis:

   Total unique newspapers: 399

   Top 10 newspapers by article count:
      Securities Daily: 856 articles (8.5%)
      ChinaNationalPharmaceuticalPackagingAssociation: 540 articles (5.4%)
      Hikvision: 349 articles (3.5%)
      Wen Wei Po: 260 articles (2.6%)
      Ta Kung Pao: 230 articles (2.3%)
      People's Daily: 186 articles (1.8%)
      Tianjin daily: 173 articles (1.7%)
      Guizhou Daily: 169 articles (1.7%)
      Xinhua Daily: 159 articles (1.6%)
      Procuratorate daily: 145 articles (1.4%)
In [9]:python
# News line analysis
print("📡 News Line Analysis:")
print()

if "news_line" in bilby_df.columns:
    news_line_counts = bilby_df["news_line"].value_counts()

    print("   Total unique news lines: {bilby_df['news_line'].nunique()}")
    print()
    print("   Top 10 news lines by article count:")
    for news_line, count in news_line_counts.head(10).items():
        percentage = (count / len(bilby_df)) * 100
        print(f"      {news_line}: {count:,} articles ({percentage:.1f}%)")
else:
    print("   ⚠️  'news_line' column not found in dataset")
📡 News Line Analysis:

   Total unique news lines: {bilby_df['news_line'].nunique()}

   Top 10 news lines by article count:
      official_line: 6,990 articles (69.3%)
      IndustryAssociation: 1,682 articles (16.7%)
      private_enterprise: 588 articles (5.8%)
      ministry: 361 articles (3.6%)
      SOE: 240 articles (2.4%)
      private_line: 149 articles (1.5%)
      party: 45 articles (0.4%)
      stockexchange: 25 articles (0.2%)
      bank: 8 articles (0.1%)

5. Working with Extracted Entities

The extracted_entities_en column contains JSON data with detailed information about each extracted entity.

Entity JSON Structure

Each entity in the JSON array has the following fields:

  • extracted_entity_text: The actual text of the entity
  • extracted_entity_type: Type/category (see below for complete list)
  • score: Confidence score (0-1)
  • start: Starting character position in the text
  • end: Ending character position in the text
  • occurrence_count: How many times this entity appears
  • model: The model used for extraction
  • timestamp: When the extraction was performed
  • source_document_uuid: Reference to the source article

Entity Types Extracted

Our model identifies the following entity types:

  • Person: Individual people (e.g., "Xi Jinping", "Elon Musk")
  • Company: Commercial organizations and businesses (e.g., "Apple Inc.", "Alibaba Group")
  • Government Body: Government agencies and departments (e.g., "Ministry of Finance", "SEC")
  • Party Body: Political party organizations (e.g., "Communist Party", "Democratic Party")
  • NGO (Non-Governmental Organization): Non-profit and civil society organizations (e.g., "Red Cross", "WWF")
  • IGO (Intergovernmental Organization): International governmental organizations (e.g., "United Nations", "World Bank")
  • GPE (Geo-Political Entity): Geographic locations with political significance (e.g., "China", "California", "Beijing")
  • Currency Mention: References to monetary amounts (e.g., "$1 billion", "100 million yuan")
  • Event: Named events and occurrences (e.g., "World Cup", "Spring Festival")
  • Initiative: Programs, policies, and strategic initiatives (e.g., "Belt and Road Initiative", "Green New Deal")
  • Miscellaneous Organization: Other organizational entities not covered by the above categories
In [10]:python
# Example: Parse and display entities from one article
print("🔍 Example: Parsing Extracted Entities")
print()

# Find an article with entities
sample_idx = df[df["extracted_entities_en"].notna()].index[0]
sample_entities_str = df.loc[sample_idx, "extracted_entities_en"]

# Parse JSON
entities = json.loads(sample_entities_str)

print(f"   Article index: {sample_idx}")
print(f"   Total entities: {len(entities)}")
print()
print("   First 3 entities:")
print()

for i, entity in enumerate(entities[:3], 1):
    print(f"   Entity {i}:")
    print(f"      Text:  '{entity['extracted_entity_text']}'")
    print(f"      Type:  {entity['extracted_entity_type']}")
    print(f"      Score: {entity['score']:.4f}")
    print(f"      Position: [{entity['start']}:{entity['end']}]")
    print()
🔍 Example: Parsing Extracted Entities

   Article index: 0
   Total entities: 11

   First 3 entities:

   Entity 1:
      Text:  'China'
      Type:  GPE
      Score: 0.9863
      Position: [11:16]

   Entity 2:
      Text:  'Sun Tzu'
      Type:  Person
      Score: 0.9788
      Position: [258:265]

   Entity 3:
      Text:  'China'
      Type:  GPE
      Score: 0.9857
      Position: [1270:1275]
In [11]:python
# Helper function to parse all entities from an article
def parse_entities(entities_json_str):
    """
    Parse entities from JSON string.

    Args:
        entities_json_str: JSON string containing entity data

    Returns:
        List of entity dictionaries, or empty list if parsing fails
    """
    if pd.isna(entities_json_str):
        return []

    try:
        entities = json.loads(entities_json_str)
        return entities if isinstance(entities, list) else []
    except (json.JSONDecodeError, TypeError):
        return []


# Helper function to extract entities by type
def get_entities_by_type(entities_json_str, entity_type):
    """
    Extract entities of a specific type from an article.

    Args:
        entities_json_str: JSON string containing entity data
        entity_type: Type to filter (e.g., 'Person', 'Company', 'GPE')

    Returns:
        List of entities matching the specified type
    """
    entities = parse_entities(entities_json_str)
    return [e for e in entities if e.get("extracted_entity_type") == entity_type]


print("✅ Helper functions defined:")
print("   - parse_entities(entities_json_str)")
print("   - get_entities_by_type(entities_json_str, entity_type)")
✅ Helper functions defined:
   - parse_entities(entities_json_str)
   - get_entities_by_type(entities_json_str, entity_type)
In [12]:python
# Example: Filter dataframe to get only articles mentioning companies
print("📝 Filtering Articles by Entity Type (Company)")
print("=" * 80)
print()


# Create a function to check if an article has company entities
def has_company_entities(entities_str):
    """Check if an article contains any Company entities"""
    companies = get_entities_by_type(entities_str, "Company")
    return len(companies) > 0


# Apply filter (on first 500 for performance)
sample_df = bilby_df.head(500)
company_articles = sample_df[
    sample_df["extracted_entities_en"].apply(has_company_entities)
]

print(f"Total articles analyzed:     {len(sample_df):,}")
print(
    f"Articles with companies:     {len(company_articles):,} ({len(company_articles) / len(sample_df) * 100:.1f}%)"
)
print()

# Show sample articles
print("Sample articles mentioning companies:")
print()
for idx in company_articles.head(5).index:
    article = df.loc[idx]
    companies = get_entities_by_type(article["extracted_entities_en"], "Company")
    company_names = [c["extracted_entity_text"] for c in companies[:3]]

    print(f"Article: {article['title_en'][:70]}...")
    print(f"   Companies: {', '.join(company_names)}")
    if len(companies) > 3:
        print(f"   ... and {len(companies) - 3} more")
    print()
📝 Filtering Articles by Entity Type (Company)
================================================================================

Total articles analyzed:     500
Articles with companies:     92 (18.4%)

Sample articles mentioning companies:

Article: Climb the ladder first to break through the enemy's pass....
   Companies: 

Article: The story of the vase in the depths of time...
   Companies: 

Article: Shizong County strengthens the urban flood control and drainage networ...
   Companies: 

Article: The misty rain is beautiful in a different way, welcoming guests with ...
   Companies: 

Article: Spring City Science π released...
   Companies:
In [ ]:python
# Example: Find all unique companies in the dataset
print("🔍 Finding All Unique Companies in Dataset")
print("=" * 80)
print()

all_companies = set()
articles_with_companies = 0

# Analyze first 500 articles for performance
sample_size = min(500, len(bilby_df))
print(f"Analyzing first {sample_size} articles...")
print()

for idx, entities_str in enumerate(
    bilby_df["extracted_entities_en"].dropna().head(sample_size)
):
    companies = get_entities_by_type(entities_str, "Company")
    if companies:
        articles_with_companies += 1
        all_companies.update([c["extracted_entity_text"] for c in companies])

print("📊 Results:")
print(f"   Articles analyzed:         {sample_size:,}")
print(
    f"   Articles with companies:   {articles_with_companies:,} ({articles_with_companies / sample_size * 100:.1f}%)"
)
print(f"   Unique companies found:    {len(all_companies):,}")
print()

# Show sample companies
print("Sample companies (showing first 20):")
print()
for i, company in enumerate(sorted(all_companies)[:20], 1):
    print(f"   {i:2d}. {company}")

if len(all_companies) > 20:
    print(f"   ... and {len(all_companies) - 20} more")
🔍 Finding All Unique Companies in Dataset
================================================================================

Analyzing first 500 articles...

📊 Results:
   Articles analyzed:         500
   Articles with companies:   140 (28.0%)
   Unique companies found:    476

Sample companies (showing first 20):

    1. "Germanwatch,"
    2. "Innovation Investment Consortium"
    3. 48 Group Club
    4. @Visual China
    5. ANTA Sports
    6. AYDO
    7. AbbVie
    8. Accenture
    9. Agence France-Presse
   10. Al-Dawaa Medical Services Company
   11. Alipay
   12. Alumni Seed Fund
   13. Amazon
   14. Anyi
   15. Asia Pacific Aviation
   16. AstraZeneca
   17. Aurobindo Pharma
   18. BGI Genomics
   19. BYD
   20. Baidu
   ... and 456 more

6. Entity Statistics

Let's analyze entity extraction patterns across the entire dataset.

In [ ]:python
# Overall entity statistics
print("📊 Entity Extraction Statistics:")
print()

if "extracted_entities_en_count" in bilby_df.columns:
    entity_counts = df["extracted_entities_en_count"]

    print(f"   Total entities extracted:    {entity_counts.sum():,}")
    print(f"   Average per article:         {entity_counts.mean():.1f}")
    print(f"   Median per article:          {entity_counts.median():.1f}")
    print(f"   Max entities in one article: {entity_counts.max()}")
    print(f"   Articles with 0 entities:    {(entity_counts == 0).sum():,}")
    print()

    # Distribution
    print("   Entity count distribution:")
    print(f"      0 entities:    {(entity_counts == 0).sum():,} articles")
    print(
        f"      1-10 entities: {((entity_counts >= 1) & (entity_counts <= 10)).sum():,} articles"
    )
    print(
        f"      11-25 entities: {((entity_counts >= 11) & (entity_counts <= 25)).sum():,} articles"
    )
    print(
        f"      26-50 entities: {((entity_counts >= 26) & (entity_counts <= 50)).sum():,} articles"
    )
    print(f"      50+ entities:   {(entity_counts > 50).sum():,} articles")
else:
    print("   ⚠️  'extracted_entities_en_count' column not found")
📊 Entity Extraction Statistics:

   Total entities extracted:    124,991.0
   Average per article:         25.6
   Median per article:          17.0
   Max entities in one article: 375.0
   Articles with 0 entities:    502

   Entity count distribution:
      0 entities:    502 articles
      1-10 entities: 1,136 articles
      11-25 entities: 1,565 articles
      26-50 entities: 1,022 articles
      50+ entities:   663 articles
In [ ]:python
# Analyze entity types across all articles
print("🏷️  Entity Type Distribution:")
print()
print("   Analyzing entity types across all articles...")

all_entity_types = []

# Sample first 1000 articles for performance
sample_size = min(1000, len(bilby_df))
sample_df = bilby_df.head(sample_size)

for entities_str in sample_df["extracted_entities_en"].dropna():
    entities = parse_entities(entities_str)
    all_entity_types.extend(
        [e.get("extracted_entity_type", "Unknown") for e in entities]
    )

type_distribution = Counter(all_entity_types)
total_entities_analyzed = len(all_entity_types)

print(f"   Analyzed {sample_size:,} articles")
print(f"   Total entities found: {total_entities_analyzed:,}")
print()
print("   Top 10 entity types:")
print()

for entity_type, count in type_distribution.most_common(10):
    percentage = (count / total_entities_analyzed) * 100
    print(f"      {entity_type:30s}: {count:6,} ({percentage:5.1f}%)")
🏷️  Entity Type Distribution:

   Analyzing entity types across all articles...
   Analyzed 1,000 articles
   Total entities found: 19,397

   Top 10 entity types:

      GPE                           :  7,839 ( 40.4%)
      Person                        :  3,680 ( 19.0%)
      Government Body               :  2,000 ( 10.3%)
      Event                         :  1,762 (  9.1%)
      Initiative                    :    890 (  4.6%)
      Party Body                    :    853 (  4.4%)
      Company                       :    847 (  4.4%)
      Currency Mention              :    781 (  4.0%)
      Miscellaneous Organization    :    603 (  3.1%)
      IGO                           :     83 (  0.4%)
In [ ]:python
# Confidence score distribution
print("🎯 Confidence Score Analysis:")
print()

all_scores = []

for entities_str in sample_df["extracted_entities_en"].dropna():
    entities = parse_entities(entities_str)
    all_scores.extend([e.get("score", 0) for e in entities])

scores_series = pd.Series(all_scores)

print(f"   Total scores analyzed: {len(all_scores):,}")
print()
print(f"   Average score:    {scores_series.mean():.4f}")
print(f"   Median score:     {scores_series.median():.4f}")
print(f"   Min score:        {scores_series.min():.4f}")
print(f"   Max score:        {scores_series.max():.4f}")
print()
print("   Score distribution:")
print(
    f"      0.90 - 1.00:  {((scores_series >= 0.90) & (scores_series <= 1.00)).sum():,} entities ({((scores_series >= 0.90) & (scores_series <= 1.00)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"      0.80 - 0.90:  {((scores_series >= 0.80) & (scores_series < 0.90)).sum():,} entities ({((scores_series >= 0.80) & (scores_series < 0.90)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"      0.70 - 0.80:  {((scores_series >= 0.70) & (scores_series < 0.80)).sum():,} entities ({((scores_series >= 0.70) & (scores_series < 0.80)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"      0.60 - 0.70:  {((scores_series >= 0.60) & (scores_series < 0.70)).sum():,} entities ({((scores_series >= 0.60) & (scores_series < 0.70)).sum() / len(all_scores) * 100:.1f}%)"
)
print(
    f"      < 0.60:       {(scores_series < 0.60).sum():,} entities ({(scores_series < 0.60).sum() / len(all_scores) * 100:.1f}%)"
)
🎯 Confidence Score Analysis:

   Total scores analyzed: 19,397

   Average score:    0.9399
   Median score:     0.9930
   Min score:        0.2041
   Max score:        0.9991

   Score distribution:
      0.90 - 1.00:  16,265 entities (83.9%)
      0.80 - 0.90:  1,000 entities (5.2%)
      0.70 - 0.80:  692 entities (3.6%)
      0.60 - 0.70:  550 entities (2.8%)
      < 0.60:       890 entities (4.6%)

7. Sample Entity Exploration for One Article

Let's look at a complete example showing how entities relate to the article text.

In [ ]:python
# Find an article with a good number of entities
print("📖 Sample Article with Extracted Entities")
print("=" * 80)

# Find article with 15-30 entities for a good example
if "extracted_entities_en_count" in bilby_df.columns:
    sample_articles = bilby_df[
        (bilby_df["extracted_entities_en_count"] >= 15)
        & (bilby_df["extracted_entities_en_count"] <= 30)
    ]
    if len(sample_articles) > 0:
        sample_idx = sample_articles.index[0]
    else:
        sample_idx = bilby_df[bilby_df["extracted_entities_en"].notna()].index[0]
else:
    sample_idx = bilby_df[bilby_df["extracted_entities_en"].notna()].index[0]

sample_article = bilby_df.loc[sample_idx]

print()
print(f"Article Title: {sample_article['title_en'][:100]}...")
print()
print(f"Published: {sample_article['inserted_at']}")
if "newspaper" in sample_article:
    print(f"Source: {sample_article['newspaper']}")
print()
print("Body (first 500 characters):")
print(sample_article["body_en"][:500] + "...")
print()
📖 Sample Article with Extracted Entities
================================================================================

Article Title: True love crosses mountains and seas, blessing warms the border....

Published: 2025-01-22 23:29:38+00:00
Source: China National Defense News

Body (first 500 characters):
This report is from Zhang Feiran and Xing Dong: In mid-January, the border and coastal defense forces stationed in Xinjiang, Tibet, and other places gradually received New Year's gift packages from Lanzhou City, Gansu Province. Various delicious foods and heartfelt New Year greeting cards made the officers and soldiers of a border defense company feel warm.

"As the Spring Festival approaches, the soldiers are stationed on the snowy plateau and border islands, guarding the country's border and p...
In [ ]:python
# Display entities from the sample article
entities = parse_entities(sample_article["extracted_entities_en"])

print(f"🏷️  Extracted Entities: {len(entities)}")
print("=" * 80)
print()

# Group by type
entities_by_type = {}
for entity in entities:
    entity_type = entity.get("extracted_entity_type", "Unknown")
    if entity_type not in entities_by_type:
        entities_by_type[entity_type] = []
    entities_by_type[entity_type].append(entity)

# Display by type
for entity_type, type_entities in sorted(entities_by_type.items()):
    print(f"\n{entity_type} ({len(type_entities)} entities):")
    print("-" * 60)

    for i, entity in enumerate(type_entities[:5], 1):  # Show max 5 per type
        text = entity["extracted_entity_text"]
        score = entity["score"]
        start = entity["start"]
        end = entity["end"]

        # Verify the entity matches the text at the specified position
        body_text = sample_article["body_en"]
        extracted_text = (
            body_text[start:end]
            if start < len(body_text) and end <= len(body_text)
            else "[OUT OF RANGE]"
        )

        match_status = "✓" if extracted_text == text else "✗"

        print(f"   [{i}] '{text}'")
        print(
            f"       Score: {score:.4f} | Position: [{start}:{end}] | Match: {match_status}"
        )

    if len(type_entities) > 5:
        print(f"   ... and {len(type_entities) - 5} more")
🏷️  Extracted Entities: 17
================================================================================


Event (1 entities):
------------------------------------------------------------
   [1] 'Spring Festival'
       Score: 0.9858 | Position: [369:384] | Match: ✓

GPE (10 entities):
------------------------------------------------------------
   [1] 'Xinjiang'
       Score: 0.9969 | Position: [115:123] | Match: ✓
   [2] 'Tibet'
       Score: 0.9960 | Position: [125:130] | Match: ✓
   [3] 'Lanzhou City'
       Score: 0.9972 | Position: [198:210] | Match: ✓
   [4] 'Gansu Province'
       Score: 0.9972 | Position: [212:226] | Match: ✓
   [5] 'Lanzhou'
       Score: 0.9981 | Position: [957:964] | Match: ✓
   ... and 5 more

Government Body (2 entities):
------------------------------------------------------------
   [1] 'People's Liberation Army'
       Score: 0.8645 | Position: [2111:2135] | Match: ✓
   [2] 'City military-civilian support office'
       Score: 0.6770 | Position: [2673:2710] | Match: ✓

Person (4 entities):
------------------------------------------------------------
   [1] 'Zhang Feiran'
       Score: 0.9986 | Position: [20:32] | Match: ✓
   [2] 'Xing Dong'
       Score: 0.9978 | Position: [37:46] | Match: ✓
   [3] 'Luo Na'
       Score: 0.9983 | Position: [640:646] | Match: ✓
   [4] 'Li Haoran'
       Score: 0.9987 | Position: [1774:1783] | Match: ✓

8. Policy Label Classification (PLC)

The dataset includes policy label classifications that categorize articles based on their content type. Each article has been classified by a machine learning model into one of four categories:

  • NOT_POLICY: Articles that do not discuss policy matters
  • INFORMING: Articles that inform about existing policies, regulations, or government actions
  • PLANNING: Articles discussing future policy plans or proposals
  • IMPLEMENTING: Articles about the execution or implementation of policies

The plc_label column contains the predicted label (the category with the highest confidence), and plc_label_scores contains the confidence scores for all four categories as a list: [NOT_POLICY, INFORMING, PLANNING, IMPLEMENTING].

In [114]:python
# PLC Label Distribution
print("🏷️  Policy Label Classification (PLC) Distribution:")
print()

if "plc_label" in bilby_df.columns:
    plc_counts = bilby_df["plc_label"].value_counts()

    print(f"   Total articles with PLC labels: {bilby_df['plc_label'].notna().sum():,}")
    print()
    print("   Label distribution:")
    for label, count in plc_counts.items():
        percentage = (count / len(bilby_df)) * 100
        print(f"      {label:20s}: {count:6,} ({percentage:5.1f}%)")
else:
    print("   ⚠️  'plc_label' column not found")
🏷️  Policy Label Classification (PLC) Distribution:

   Total articles with PLC labels: 92,504

   Label distribution:
      NOT_POLICY          : 79,663 ( 86.1%)
      INFORMING           :  8,961 (  9.7%)
      DECIDING            :  3,478 (  3.8%)
      IMPLEMENTING        :    402 (  0.4%)
In [115]:python
# Explore articles in the INFORMING category
print("🔍 Exploring INFORMING Category Articles:")
print("=" * 80 + "\n")


# Helper function to parse PLC scores
def parse_plc_scores(scores):
    """Parse PLC scores - handles both JSON strings and numpy arrays"""
    try:
        if isinstance(scores, str):
            scores = json.loads(scores)
        return [float(s) for s in scores]
    except:
        return None


# Get INFORMING articles
plc_filter = "INFORMING"
plc_filtered_articles = bilby_df[bilby_df["plc_label"] == plc_filter]

print(f"Total {plc_filter} articles: {len(plc_filtered_articles):,}\n")

# Show 3 sample articles with their scores
print(f"Sample {plc_filter} articles:\n")

for idx, (_, article) in enumerate(plc_filtered_articles.head(3).iterrows(), 1):
    scores = parse_plc_scores(article["plc_label_scores"])

    print(f"[{idx}] {article['title_en'][:80]}...")
    print(f"    Newspaper: {article['newspaper']}")

    if scores:
        print(f"    Scores: {scores}")
        print(f"    Label confidence: {scores[1]:.3%} ({plc_filter})")
    else:
        print(f"    Scores: N/A")

    print(f"    Summary: {article['translated_summary'][:150]}...")
    print()

print("(Scores represent: [NOT_POLICY, INFORMING, PLANNING, IMPLEMENTING])")
🔍 Exploring INFORMING Category Articles:
================================================================================

Total INFORMING articles: 8,961

Sample INFORMING articles:

[1] The National Special Equipment Safety Supervision Work Symposium will be held in...
    Newspaper: China Quality Daily
    Scores: [0.0008, 0.9988, 0.0002, 0.0002]
    Label confidence: 99.880% (INFORMING)
    Summary: The National Symposium on the Safety Supervision of Special Equipment in 2025 summarized the work of 2024, emphasizing the need to strictly adhere to ...

[2] The State Administration for Market Regulation held a meeting on the constructio...
    Newspaper: China Quality Daily
    Scores: [0.4614, 0.5308, 0.0065, 0.0013]
    Label confidence: 53.080% (INFORMING)
    Summary: The Market Supervision Administration held a meeting on party conduct and clean governance construction, summarizing the work of 2024, analyzing the s...

[3] Shandong: Anchor the target and take on the main beam, leading the way and makin...
    Newspaper: China Quality Daily
    Scores: [0.0003, 0.9992, 0.0003, 0.0002]
    Label confidence: 99.920% (INFORMING)
    Summary: The market supervision work conference in Shandong Province summarized the work of 2024 and arranged the key tasks for 2025. It emphasized the need to...

(Scores represent: [NOT_POLICY, INFORMING, PLANNING, IMPLEMENTING])

Summary

This notebook covered:

✅ Loading and inspecting the dataset structure
✅ Understanding temporal coverage and article distribution
✅ Analyzing article sources (newspapers and news lines)
✅ Working with the JSON entity data structure
✅ Computing entity statistics and distributions
✅ Exploring sample entities in context
✅ Understanding policy label classifications (PLC)

You now have the foundation to perform custom analyses on this entity extraction dataset!


For questions or support, please contact our support team.