Basic Documents — Text Fields

The text fields are the most fundamental fields in the dataset. They contain the raw textual content of documents and serve as the foundation for derived fields produced by sentiment analysis, entity extraction, and topic modeling.

Overview

Field Name | Type | Example/Possible Values | Description
--- | --- | --- | ---
title_english | text | "China Post Insurance launches new products for military personnel" | The title of the document in English
title_source_language | text | "中邮保险与中国融通保险联合推出现役退役军人专属保险产品" | The title of the document in the source language
body_english | text | "Recently, China Post Insurance and China Rongtong Insurance jointly..." | The body of the document in English
body_source_language | text | "日前,中邮保险与中国融通保险联合推出针对现役和退役军人的专属保险产品..." | The body of the document in the source language
subhead_english | text | "New insurance product offers affordable healthcare options" | The subhead of the document in English
subhead_source_language | text | "新保险产品提供经济实惠的医疗保健选择" | The subhead of the document in the source language
summary_english | text | "China Post Insurance launched a new affordable insurance product..." | The summary of the document in English
summary_source_language | text | "中邮保险推出新的经济实惠保险产品..." | The summary of the document in the source language

Working with Text Fields

Common Use Cases

Here are some common use cases for working with text fields:

  1. Cross-Field Analysis - Compare English and source language fields to identify nuances lost in translation
  2. Contextual Search - Use both title and body fields for comprehensive content searches
  3. Efficient Processing - Use summaries for initial screening before analyzing full body text
  4. Language-Specific Analysis - Leverage source language fields when working with language-specific models
  5. Sentiment Analysis - Extract sentiment from body_english or body_source_language
  6. Topic Modeling - Identify key themes across document collections
  7. Entity Extraction - Identify people, organizations, and locations mentioned in documents
  8. Trend Analysis - Track how specific topics evolve over time in the dataset
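
As a minimal sketch of the contextual search use case, the snippet below checks both the title and the body for a query string. It assumes each document is available as a Python dict keyed by the field names in the overview table; that record shape and the helper name search_documents are illustrative assumptions, not part of the dataset's API.

```python
def search_documents(records, query, fields=("title_english", "body_english")):
    """Return records where any selected text field contains the query."""
    needle = query.casefold()
    hits = []
    for record in records:
        # Missing or null fields are treated as empty strings (assumed record shape).
        if any(needle in (record.get(name) or "").casefold() for name in fields):
            hits.append(record)
    return hits


# Hypothetical records built from the example values in the overview table.
records = [
    {
        "title_english": "China Post Insurance launches new products for military personnel",
        "body_english": "Recently, China Post Insurance and China Rongtong Insurance jointly...",
    },
    {"title_english": "Quarterly shipping update", "body_english": None},
]

matches = search_documents(records, "rongtong")
```

Here the query appears only in the body of the first record, so searching the title alone would miss it. Each field is checked separately so that a match cannot span the boundary between title and body.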

Field Descriptions

title_english

Definition: The title of the document translated into English from its original language.

Format: Plain text string, typically 5 to 20 words.

Usage Context: Titles are crucial for document identification and quick content assessment. They often contain key entities, topics, or themes that are central to the document.

Data Quality Notes:

  • Titles are machine-translated when the original document is not in English
  • Titles maintain proper nouns and entity names as accurately as possible
  • Special characters and formatting from the original title are preserved when relevant

title_source_language

Definition: The original title of the document in its source language.

Format: Plain text string in the document's original language encoding.

Usage Context: Useful for researchers who want to analyze the original language nuances or verify translation accuracy. Often paired with title_english for comparative analysis.

Supported Languages: Primarily Chinese; the dataset also includes documents in English and other languages.

Data Quality Notes:

  • Character encoding is UTF-8 to support all language scripts
  • Original formatting and punctuation are preserved
  • May contain language-specific idioms or expressions not fully captured in the English translation
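
One practical consequence of UTF-8 storage is that character counts and byte counts differ for non-Latin scripts, which matters when a downstream system imposes a byte limit. The sketch below illustrates this with the first four characters of the example title; the helper truncate_utf8 is a hypothetical name introduced here for illustration.

```python
def truncate_utf8(text, max_bytes):
    """Trim text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial character left by the byte slice.
    return encoded.decode("utf-8", errors="ignore")


title = "中邮保险"  # first four characters of the example source-language title
char_count = len(title)                  # number of Unicode code points
byte_count = len(title.encode("utf-8"))  # number of encoded bytes
shortened = truncate_utf8(title, 7)      # 7 bytes cannot hold a third character
```

Each of these characters occupies three bytes in UTF-8, so the four-character string takes twelve bytes, and a seven-byte limit keeps only the first two characters.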

body_english

Definition: The main content of the document translated into English.

Format: Plain text string, typically ranging from several paragraphs to multiple pages.

Usage Context: The body contains the detailed information of the document and is the primary text used for content analysis, sentiment extraction, and topic modeling.

Data Processing:

  • Paragraphs are preserved with appropriate breaks
  • Bullet points and numbered lists are maintained where possible
  • Tables and structured data are converted to text format
  • Footnotes are appended at the end of the body with reference markers
  • All text is preserved, so some bodies are very long

Limitations:

  • Some formatting from the original document may be lost
  • Embedded images, charts, or multimedia content are not included
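
Because bodies are preserved in full and can be very long, it is often convenient to process them in paragraph-sized pieces. The sketch below groups paragraphs into chunks under a character budget. It assumes paragraph breaks appear as newline characters in the stored text, which is not confirmed by this documentation; chunk_paragraphs and max_chars are illustrative names.

```python
def chunk_paragraphs(body, max_chars=2000):
    """Group paragraphs into chunks of at most max_chars characters where possible."""
    # Assumption: paragraphs are separated by one or more newline characters.
    paragraphs = [p.strip() for p in (body or "").split("\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Keep growing the chunk; a single oversized paragraph stays whole.
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


example_body = "aaaa\nbbbb\n\ncccc"
example_chunks = chunk_paragraphs(example_body, max_chars=9)
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps sentences intact, at the cost of occasionally exceeding the budget when one paragraph is longer than max_chars.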

body_source_language

Definition: The original main content of the document in its source language.

Format: Plain text string in the document's original language encoding.

Usage Context: Valuable for researchers conducting native language analysis or verifying translation accuracy. Essential for linguistic studies and cultural context preservation.

Data Quality Notes:

  • Maintains original language idioms, expressions, and cultural references
  • Preserves source-specific formatting where possible, including special characters used only in the source language
  • UTF-8 encoded to support all character sets

Relationship to Other Fields:

  • Paired with body_english for translation verification
  • Used as input for language-specific sentiment analysis
  • Serves as source material for entity extraction in original language
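
As a rough illustration of pairing the two body fields, the sketch below flags documents whose English and source-language bodies have different paragraph counts. This is only a crude heuristic for spotting candidates for manual review, not a translation-quality measure. It assumes dict-shaped records, newline-separated paragraphs, and a translation process that keeps paragraph boundaries; none of these is confirmed by this documentation.

```python
def paragraph_count(text):
    """Count non-empty lines, treating each as one paragraph (assumed convention)."""
    return sum(1 for line in (text or "").split("\n") if line.strip())


def paragraph_mismatch(record):
    """Return True when the English and source bodies differ in paragraph count."""
    english = paragraph_count(record.get("body_english"))
    source = paragraph_count(record.get("body_source_language"))
    return english != source


# Hypothetical records for illustration only.
aligned = {"body_english": "First.\nSecond.", "body_source_language": "第一段。\n第二段。"}
misaligned = {"body_english": "First.\nSecond.", "body_source_language": "第一段。"}
flags = [paragraph_mismatch(aligned), paragraph_mismatch(misaligned)]
```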

subhead_english

Definition: Secondary headlines or subtitles of the document translated into English.

Format: Plain text string, typically shorter than the main title.

Usage Context: Subheads often provide additional context or highlight specific aspects of the document. They can contain important thematic elements not present in the main title.

Data Availability:

  • Not all documents contain subheads
  • When absent, this field will be null
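
Since the subhead can be null, code that combines it with other fields should guard against the missing value. A minimal sketch, assuming dict-shaped records in which a null subhead is represented as None (the helper name headline_text is hypothetical):

```python
def headline_text(record):
    """Join title and subhead, skipping the subhead when it is null or empty."""
    parts = [record.get("title_english"), record.get("subhead_english")]
    return " | ".join(part for part in parts if part)


with_subhead = headline_text({
    "title_english": "China Post Insurance launches new products for military personnel",
    "subhead_english": "New insurance product offers affordable healthcare options",
})
without_subhead = headline_text({
    "title_english": "China Post Insurance launches new products for military personnel",
    "subhead_english": None,
})
```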

subhead_source_language

Definition: Original secondary headlines or subtitles in the document's source language.

Format: Plain text string in the document's original language encoding.

Usage Context: Provides the authentic subheading content for language specialists and researchers interested in original phrasing and emphasis.

Data Quality Notes:

  • Maintains original formatting and punctuation
  • May contain language-specific terms not fully captured in translation
  • UTF-8 encoded to support all character sets

summary_english

Definition: A concise overview of the document's key points translated into English.

Format: Plain text string, typically 1-3 paragraphs.

Creation Method:

  • Machine-generated with LLMs
  • Prioritizes key facts, entities, and central themes
  • Maintains factual accuracy while condensing content

Limitations:

  • For extremely long documents, summaries may not capture all nuances

Usage Context: Summaries provide a quick understanding of document content without reading the full text. They're particularly valuable for initial document screening and rapid information extraction.

Best Practices:

  • Use summaries for initial content assessment
  • Refer to full body text for detailed analysis
  • Consider summaries as complementary to, not replacements for, the complete document
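
The practices above can be sketched as a two-stage workflow: shortlist documents by keyword in the summary, then read the full body only for the shortlisted ones. The record shape (a dict keyed by field name) and the helper screen_by_summary are assumptions made for illustration.

```python
def screen_by_summary(records, keywords):
    """Return records whose English summary mentions any of the keywords."""
    lowered = [keyword.casefold() for keyword in keywords]
    shortlisted = []
    for record in records:
        # A missing or null summary is treated as empty, so it never matches.
        summary = (record.get("summary_english") or "").casefold()
        if any(keyword in summary for keyword in lowered):
            shortlisted.append(record)
    return shortlisted


# Hypothetical records for illustration only.
records = [
    {
        "summary_english": "China Post Insurance launched a new affordable insurance product...",
        "body_english": "Recently, China Post Insurance and China Rongtong Insurance jointly...",
    },
    {"summary_english": "A port authority published its annual report.", "body_english": "..."},
]

shortlist = screen_by_summary(records, ["insurance"])
# Detailed analysis then works from the full body, not the summary.
bodies_to_analyze = [record["body_english"] for record in shortlist]
```

Screening on summaries keeps the first pass cheap; because summaries are machine-generated and condensed, a keyword absent from the summary may still occur in the body, so this filter trades some recall for speed.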

summary_source_language

Definition: A concise overview of the document's key points in its original language.

Format: Plain text string in the document's original language encoding.

Usage Context: Provides researchers with a condensed version of the original content while preserving native language nuances and terminology.

Data Quality Notes:

  • Machine-generated with LLMs
  • Attempts to maintain key terminology and phrasing from the source
  • Preserves cultural context and language-specific expressions
  • UTF-8 encoded to support all character sets