Basic Documents — Text Fields
The text fields are the most fundamental fields in the dataset. They contain the raw textual content of documents and serve as the foundation for derived fields such as sentiment analysis, entity extraction, and topic modeling.
Overview
Field Name | Type | Example/Possible Values | Description |
---|---|---|---|
title_english | text | "China Post Insurance launches new products for military personnel" | The title of the document in English |
title_source_language | text | "中邮保险与中国融通保险联合推出现役退役军人专属保险产品" | The title of the document in the source language |
body_english | text | "Recently, China Post Insurance and China Rongtong Insurance jointly..." | The body of the document in English |
body_source_language | text | "日前,中邮保险与中国融通保险联合推出针对现役和退役军人的专属保险产品..." | The body of the document in the source language |
subhead_english | text | "New insurance product offers affordable healthcare options" | The subhead of the document in English |
subhead_source_language | text | "新保险产品提供经济实惠的医疗保健选择" | The subhead of the document in the source language |
summary_english | text | "China Post Insurance launched a new affordable insurance product..." | The summary of the document in English |
summary_source_language | text | "中邮保险推出新的经济实惠保险产品..." | The summary of the document in the source language |
Working with Text Fields
Common Use Cases
Here are some common use cases for working with text fields:
- Cross-Field Analysis - Compare English and source language fields to identify nuances lost in translation
- Contextual Search - Use both title and body fields for comprehensive content searches
- Efficient Processing - Use summaries for initial screening before analyzing full body text
- Language-Specific Analysis - Leverage source language fields when working with language-specific models
- Sentiment Analysis - Extract sentiment from body_english or body_source_language
- Topic Modeling - Identify key themes across document collections
- Entity Extraction - Identify people, organizations, and locations mentioned in documents
- Trend Analysis - Track how specific topics evolve over time in the dataset
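As a rough illustration of the contextual search use case above, the sketch below checks both title_english and body_english for a keyword. It assumes each document is available as a plain Python dict keyed by the field names in the overview table; the function name search_documents and the sample records are hypothetical and not part of the dataset's tooling.

```python
from typing import Iterable


def search_documents(docs: Iterable[dict], keyword: str) -> list[dict]:
    """Return documents whose English title or body contains the keyword."""
    needle = keyword.casefold()
    matches = []
    for doc in docs:
        # Fields may be missing or null, so fall back to an empty string.
        title = doc.get("title_english") or ""
        body = doc.get("body_english") or ""
        if needle in title.casefold() or needle in body.casefold():
            matches.append(doc)
    return matches


# Hypothetical records that reuse the example values from the overview table.
docs = [
    {
        "title_english": "China Post Insurance launches new products for military personnel",
        "body_english": "Recently, China Post Insurance and China Rongtong Insurance jointly...",
    },
    {"title_english": "Unrelated headline", "body_english": None},
]

hits = search_documents(docs, "rongtong")
```

Here the keyword appears only in the body of the first record, so searching the title alone would have missed it.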
Field Descriptions
title_english
Definition: The title of the document translated into English from its original language.
Format: Plain text string, typically 5 to 20 words.
Usage Context: Titles are crucial for document identification and quick content assessment. They often contain key entities, topics, or themes that are central to the document.
Data Quality Notes:
- Titles are machine-translated when the original document is not in English
- Titles maintain proper nouns and entity names as accurately as possible
- Special characters and formatting from the original title are preserved when relevant
title_source_language
Definition: The original title of the document in its source language.
Format: Plain text string in the document's original language encoding.
Usage Context: Useful for researchers who want to analyze the original language nuances or verify translation accuracy. Often paired with title_english for comparative analysis.
Supported Languages: Primarily Chinese, with some documents in English and other languages.
Data Quality Notes:
- Character encoding is UTF-8 to support all language scripts
- Original formatting and punctuation are preserved
- May contain language-specific idioms or expressions not fully captured in the English translation
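One practical consequence of UTF-8 encoding is that character counts and byte counts differ for non-Latin scripts, which matters when applying length limits to source language fields. The following minimal sketch uses the first four characters of the example title; the helper name utf8_sizes is hypothetical.

```python
def utf8_sizes(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


# First four characters of the example title_source_language value.
sizes = utf8_sizes("中邮保险")  # (4, 12): each of these characters takes 3 bytes
```

Length checks on source language fields should therefore state whether they count characters or bytes.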
body_english
Definition: The main content of the document translated into English.
Format: Plain text string, typically ranging from several paragraphs to multiple pages.
Usage Context: The body contains the detailed information of the document and is the primary text used for content analysis, sentiment extraction, and topic modeling.
Data Processing:
- Paragraphs are preserved with appropriate breaks
- Bullet points and numbered lists are maintained where possible
- Tables and structured data are converted to text format
- Footnotes are appended at the end of the body with reference markers
- All text is preserved without truncation, so some bodies are very long
Limitations:
- Some formatting from the original document may be lost
- Embedded images, charts, or multimedia content are not included
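Because bodies can be very long, downstream tools with input limits may need the text split into smaller pieces. The sketch below groups paragraphs into chunks under a character budget. It assumes paragraphs are separated by newline characters, which this documentation does not specify; the function name chunk_body and the default max_chars value are hypothetical.

```python
def chunk_body(body: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    # Assumption: paragraphs are newline-separated; blank lines are dropped.
    paragraphs = [p.strip() for p in body.split("\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this paragraph would exceed the budget: start a new chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


example_chunks = chunk_body("aaaa\nbbbb\n\ncccc", max_chars=9)
```

Splitting on paragraph boundaries rather than at fixed offsets keeps sentences intact, at the cost of uneven chunk sizes.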
body_source_language
Definition: The original main content of the document in its source language.
Format: Plain text string in the document's original language encoding.
Usage Context: Valuable for researchers conducting native language analysis or verifying translation accuracy. Essential for linguistic studies and cultural context preservation.
Data Quality Notes:
- Maintains original language idioms, expressions, and cultural references
- Preserves source-specific formatting where possible, sometimes including special characters only used by the source language
- UTF-8 encoded to support all character sets
Relationship to Other Fields:
- Paired with body_english for translation verification
- Used as input for language-specific sentiment analysis
- Serves as source material for entity extraction in original language
subhead_english
Definition: Secondary headlines or subtitles of the document translated into English.
Format: Plain text string, typically shorter than the main title.
Usage Context: Subheads often provide additional context or highlight specific aspects of the document. They can contain important thematic elements not present in the main title.
Data Availability:
- Not all documents contain subheads
- When absent, this field will be null
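Since the subhead may be null, code that combines headline fields should handle the missing case explicitly. A minimal sketch, assuming documents are plain dicts in which a null subhead appears as None; the function name headline_text and the separator are hypothetical choices.

```python
def headline_text(doc: dict, separator: str = " / ") -> str:
    """Combine the English title and subhead, skipping a null subhead."""
    title = doc.get("title_english") or ""
    subhead = doc.get("subhead_english")  # None when the document has no subhead
    parts = [title]
    if subhead:
        parts.append(subhead)
    return separator.join(p for p in parts if p)


with_subhead = headline_text(
    {
        "title_english": "China Post Insurance launches new products for military personnel",
        "subhead_english": "New insurance product offers affordable healthcare options",
    }
)
without_subhead = headline_text({"title_english": "Title only", "subhead_english": None})
```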
subhead_source_language
Definition: Original secondary headlines or subtitles in the document's source language.
Format: Plain text string in the document's original language encoding.
Usage Context: Provides the authentic subheading content for language specialists and researchers interested in original phrasing and emphasis.
Data Quality Notes:
- Maintains original formatting and punctuation
- May contain language-specific terms not fully captured in translation
- UTF-8 encoded to support all character sets
summary_english
Definition: A concise overview of the document's key points translated into English.
Format: Plain text string, typically 1-3 paragraphs.
Creation Method:
- Machine-generated with LLMs
- Prioritizes key facts, entities, and central themes
- Maintains factual accuracy while condensing content
Limitations:
- For extremely long documents, summaries may not capture all nuances
Usage Context: Summaries provide a quick understanding of document content without reading the full text. They're particularly valuable for initial document screening and rapid information extraction.
Best Practices:
- Use summaries for initial content assessment
- Refer to full body text for detailed analysis
- Consider summaries as complementary to, not replacements for, the complete document
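The best practices above amount to a two-stage workflow: screen on the summary, then analyze the full body only for documents that pass. The sketch below shows one way this could look, assuming documents are plain dicts; screen_then_analyze and the word-count analysis are hypothetical placeholders for whatever screening rule and analysis a project actually uses.

```python
from typing import Callable, Iterable


def screen_then_analyze(
    docs: Iterable[dict],
    keyword: str,
    analyze: Callable[[str], object],
) -> list[object]:
    """Run analyze() on body_english only for documents whose summary matches."""
    needle = keyword.casefold()
    results = []
    for doc in docs:
        summary = doc.get("summary_english") or ""
        if needle not in summary.casefold():
            continue  # Stage 1: cheap screening on the short summary.
        body = doc.get("body_english") or ""
        results.append(analyze(body))  # Stage 2: detailed work on the full text.
    return results


# Hypothetical records and a stand-in analysis (word count of the body).
sample_docs = [
    {"summary_english": "A new affordable insurance product", "body_english": "one two three"},
    {"summary_english": "Quarterly shipping volumes", "body_english": "four five"},
]
word_counts = screen_then_analyze(sample_docs, "insurance", lambda body: len(body.split()))
```

A summary-based filter can miss documents whose summary omits the keyword, so it suits initial triage rather than exhaustive retrieval.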
summary_source_language
Definition: A concise overview of the document's key points in its original language.
Format: Plain text string in the document's original language encoding.
Usage Context: Provides researchers with a condensed version of the original content while preserving native language nuances and terminology.
Data Quality Notes:
- Machine-generated with LLMs
- Attempts to maintain key terminology and phrasing from the source
- Preserves cultural context and language-specific expressions
- UTF-8 encoded to support all character sets