Basic Documents — Text Fields

The text fields are the most fundamental fields in the dataset. They contain the raw textual content of each document and serve as the input for derived fields, such as those produced by sentiment analysis, entity extraction, and topic modeling.

Overview

Field Name	Type	Example/Possible Values	Description
`title_english`	text	"China Post Insurance launches new products for military personnel"	The title of the document in English
`title_source_language`	text	"中邮保险与中国融通保险联合推出现役退役军人专属保险产品"	The title of the document in the source language
`body_english`	text	"Recently, China Post Insurance and China Rongtong Insurance jointly..."	The body of the document in English
`body_source_language`	text	"日前，中邮保险与中国融通保险联合推出针对现役和退役军人的专属保险产品..."	The body of the document in the source language
`subhead_english`	text	"New insurance product offers affordable healthcare options"	The subhead of the document in English
`subhead_source_language`	text	"新保险产品提供经济实惠的医疗保健选择"	The subhead of the document in the source language
`summary_english`	text	"China Post Insurance launched a new affordable insurance product..."	The summary of the document in English
`summary_source_language`	text	"中邮保险推出新的经济实惠保险产品..."	The summary of the document in the source language

Working with Text Fields

Common Use Cases

Here are some common use cases for working with text fields:

Cross-Field Analysis - Compare English and source language fields to identify nuances lost in translation
Contextual Search - Use both title and body fields for comprehensive content searches
Efficient Processing - Use summaries for initial screening before analyzing full body text
Language-Specific Analysis - Leverage source language fields when working with language-specific models
Sentiment Analysis - Extract sentiment from `body_english` or `body_source_language`
Topic Modeling - Identify key themes across document collections
Entity Extraction - Identify people, organizations, and locations mentioned in documents
Trend Analysis - Track how specific topics evolve over time in the dataset
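
As an illustration of the Contextual Search use case, the sketch below checks a keyword against both the title and the body of each record. It assumes documents are available as plain Python dicts keyed by the field names on this page; the sample records, the `matches` helper, and the defensive handling of missing values are hypothetical and not part of the dataset tooling.

```python
from typing import Iterable, Optional

# Hypothetical sample records; only the field names come from this page.
SAMPLE_DOCS = [
    {
        "title_english": "China Post Insurance launches new products for military personnel",
        "body_english": "Recently, China Post Insurance and China Rongtong Insurance jointly...",
    },
    {
        "title_english": "Quarterly results announced",
        "body_english": None,  # assumption: a text field may be missing
    },
]


def matches(
    doc: dict,
    keyword: str,
    fields: Iterable[str] = ("title_english", "body_english"),
) -> bool:
    """Return True if keyword appears (case-insensitively) in any listed field."""
    needle = keyword.casefold()
    for name in fields:
        value: Optional[str] = doc.get(name)
        if value and needle in value.casefold():
            return True
    return False


# "rongtong" occurs only in the body of the first record, so a
# title-only search would miss it.
hits = [d["title_english"] for d in SAMPLE_DOCS if matches(d, "rongtong")]
print(hits)
```

Passing a different `fields` tuple (for example the `*_source_language` fields) reuses the same helper for source-language searches.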

Field Descriptions

`title_english`

Definition: The title of the document translated into English from its original language.

Format: Plain text string, typically 5 to 20 words.

Usage Context: Titles are crucial for document identification and quick content assessment. They often contain key entities, topics, or themes that are central to the document.

Data Quality Notes:

Titles are machine-translated when the original document is not in English
Titles maintain proper nouns and entity names as accurately as possible
Special characters and formatting from the original title are preserved when relevant

`title_source_language`

Definition: The original title of the document in its source language.

Format: Plain text string in the document's original language encoding.

Usage Context: Useful for researchers who want to analyze original-language nuances or verify translation accuracy. Often paired with `title_english` for comparative analysis.

Supported Languages: Primarily Chinese; the dataset also includes documents in English and other languages.

Data Quality Notes:

Character encoding is UTF-8 to support all language scripts
Original formatting and punctuation are preserved
May contain language-specific idioms or expressions not fully captured in the English translation
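
Because the source language varies from document to document, it can help to check the script of a title before routing it to a language-specific model. The following is a rough sketch only: the `looks_chinese` helper and its 0.3 threshold are hypothetical, and the check covers just the main CJK Unified Ideographs block, so it is a heuristic rather than real language detection.

```python
import re
from typing import Optional

# Main CJK Unified Ideographs block (U+4E00 to U+9FFF). Extension blocks and
# other scripts are deliberately ignored in this sketch.
_HAN = re.compile(r"[\u4e00-\u9fff]")


def looks_chinese(text: Optional[str], threshold: float = 0.3) -> bool:
    """Return True if at least `threshold` of the characters are Han ideographs."""
    if not text:
        return False
    han_count = len(_HAN.findall(text))
    return han_count / len(text) >= threshold


title_source = "中邮保险与中国融通保险联合推出现役退役军人专属保险产品"
title_english = "China Post Insurance launches new products for military personnel"

print(looks_chinese(title_source))
print(looks_chinese(title_english))
```

Han characters are also used in Japanese text, so a positive result indicates the script, not necessarily the language.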

`body_english`

Definition: The main content of the document translated into English.

Format: Plain text string, typically ranging from several paragraphs to multiple pages.

Usage Context: The body contains the detailed information of the document and is the primary text used for content analysis, sentiment extraction, and topic modeling.

Data Processing:

Paragraphs are preserved with appropriate breaks
Bullet points and numbered lists are maintained where possible
Tables and structured data are converted to text format
Footnotes are appended at the end of the body with reference markers
The full text is preserved, so the body of a long document can be very large

Limitations:

Some formatting from the original document may be lost
Embedded images, charts, or multimedia content are not included
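
Since bodies can be very long, downstream tools with input-size limits may need the text in smaller pieces. The sketch below groups paragraphs into chunks; it assumes paragraphs are separated by blank lines, which this page does not specify, and the `chunk_body` helper and its `max_chars` default are hypothetical.

```python
import re


def chunk_body(body: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


example_body = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
example_chunks = chunk_body(example_body, max_chars=25)
print(len(example_chunks))
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps sentences intact, which tends to matter for sentiment and entity models.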

`body_source_language`

Definition: The original main content of the document in its source language.

Format: Plain text string in the document's original language encoding.

Usage Context: Valuable for researchers conducting native language analysis or verifying translation accuracy. Essential for linguistic studies and cultural context preservation.

Data Quality Notes:

Maintains original language idioms, expressions, and cultural references
Preserves source-specific formatting where possible, sometimes including special characters only used by the source language
UTF-8 encoded to support all character sets

Relationship to Other Fields:

Paired with `body_english` for translation verification
Used as input for language-specific sentiment analysis
Serves as source material for entity extraction in original language
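
One inexpensive way to use the `body_english` / `body_source_language` pairing for translation verification is to compare the numbers that appear on each side. The sketch below is a heuristic under stated assumptions, not the dataset's own validation method: the `numbers_in` and `missing_numbers` helpers are hypothetical, records are assumed to be plain dicts, and only numbers written with digits are compared.

```python
import re
import unicodedata
from typing import Optional

# ASCII digit runs with an optional decimal part, e.g. "200" or "3.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?", re.ASCII)


def numbers_in(text: Optional[str]) -> set[str]:
    """Collect digit-based numbers, folding full-width digits to ASCII first."""
    normalized = unicodedata.normalize("NFKC", text or "")
    return set(_NUMBER.findall(normalized))


def missing_numbers(doc: dict) -> set[str]:
    """Numbers present in the source body but absent from the English body."""
    return numbers_in(doc.get("body_source_language")) - numbers_in(doc.get("body_english"))


# Hypothetical record for illustration only.
example_doc = {
    "body_source_language": "保费每年２００元，覆盖3类人员",
    "body_english": "The premium is 200 yuan per year.",
}
print(missing_numbers(example_doc))
```

A non-empty result is a prompt for manual review rather than proof of an error: a translation may legitimately spell a number out ("three") or format it differently ("1,000" versus "1000").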

`subhead_english`

Definition: Secondary headlines or subtitles of the document translated into English.

Format: Plain text string, typically shorter than the main title.

Usage Context: Subheads often provide additional context or highlight specific aspects of the document. They can contain important thematic elements not present in the main title.

Data Availability:

Not all documents contain subheads
When absent, this field will be null
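
Code that reads this field should therefore handle null explicitly. A minimal sketch, assuming a record is a plain dict in which a null field appears as `None`; the `display_heading` helper is hypothetical.

```python
def display_heading(doc: dict) -> str:
    """Combine title and subhead, falling back to the title alone."""
    title = doc.get("title_english") or ""
    subhead = doc.get("subhead_english")  # None when the document has no subhead
    if subhead:
        return f"{title}: {subhead}"
    return title


with_subhead = {
    "title_english": "China Post Insurance launches new products for military personnel",
    "subhead_english": "New insurance product offers affordable healthcare options",
}
without_subhead = {
    "title_english": "China Post Insurance launches new products for military personnel",
    "subhead_english": None,
}

print(display_heading(with_subhead))
print(display_heading(without_subhead))
```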

`subhead_source_language`

Definition: Original secondary headlines or subtitles in the document's source language.

Format: Plain text string in the document's original language encoding.

Usage Context: Provides the authentic subheading content for language specialists and researchers interested in original phrasing and emphasis.

Data Quality Notes:

Maintains original formatting and punctuation
May contain language-specific terms not fully captured in translation
UTF-8 encoded to support all character sets

`summary_english`

Definition: A concise overview of the document's key points translated into English.

Format: Plain text string, typically 1 to 3 paragraphs.

Creation Method:

Machine-generated with LLMs
Prioritizes key facts, entities, and central themes
Aims to maintain factual accuracy while condensing content

Limitations:

For extremely long documents, summaries may not capture all nuances

Usage Context: Summaries provide a quick understanding of document content without reading the full text. They're particularly valuable for initial document screening and rapid information extraction.

Best Practices:

Use summaries for initial content assessment
Refer to full body text for detailed analysis
Consider summaries as complementary to, not replacements for, the complete document
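
These practices suggest a two-stage workflow: screen on the summary, then read the full body only for documents that pass. The sketch below illustrates the idea with simple keyword matching; the helper names, the sample records, and the dict-based record layout are assumptions made for this example.

```python
def screen_by_summary(docs: list[dict], keywords: list[str]) -> list[dict]:
    """Stage 1: keep documents whose English summary mentions any keyword."""
    wanted = [k.casefold() for k in keywords]
    kept = []
    for doc in docs:
        summary = (doc.get("summary_english") or "").casefold()
        if any(k in summary for k in wanted):
            kept.append(doc)
    return kept


def count_in_body(doc: dict, keyword: str) -> int:
    """Stage 2: count keyword occurrences in the full English body."""
    body = (doc.get("body_english") or "").casefold()
    return body.count(keyword.casefold())


# Hypothetical records for illustration only.
docs = [
    {
        "summary_english": "China Post Insurance launched a new affordable insurance product...",
        "body_english": "Recently, China Post Insurance and China Rongtong Insurance jointly...",
    },
    {
        "summary_english": "A regional sports event concluded this week.",
        "body_english": "The event drew large crowds.",
    },
]

shortlisted = screen_by_summary(docs, ["insurance"])
body_counts = [count_in_body(d, "insurance") for d in shortlisted]
print(len(shortlisted), body_counts)
```

Only the shortlisted documents reach the more expensive body-level step, while the body remains the reference for any detailed conclusion.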

`summary_source_language`

Definition: A concise overview of the document's key points in its original language.

Format: Plain text string in the document's original language encoding.

Usage Context: Provides researchers with a condensed version of the original content while preserving native language nuances and terminology.

Data Quality Notes:

Machine-generated with LLMs
Attempts to maintain key terminology and phrasing from the source
Preserves cultural context and language-specific expressions
UTF-8 encoded to support all character sets