File Formats

MarketCheck Data Feeds are delivered in CSV, JSONL, or Parquet formats, with compression and splitting options for large datasets.

Overview

MarketCheck Data Feeds are available in three formats — CSV, JSONL (JSON Lines), and Parquet — each optimized for different use cases. All formats are UTF-8 encoded and may be compressed to reduce transfer size. Choose the format that best fits your processing tools, data complexity, and performance needs.

CSV

Generated via BigQuery and/or Apache Spark using the following defaults:

  • Delimiter: Comma (,), configurable to tab (\t) or pipe (|)
  • Quote Character: Double quotes (")
  • Escape Character: Double quote ("); embedded quotes are escaped by doubling
  • Line Ending: LF (\n)
  • Header Row: Included with column names
  • Null Values: Empty string (unquoted)
  • Compression: Gzip (default)

Pros:

  • Widely supported in tools & spreadsheets
  • Compact compared to JSONL
  • Simple import/export workflows

Cons:

  • Limited support for nested data
  • Multi-valued fields concatenated with pipe (|), or quoted if pipe is the delimiter

CSV output follows RFC 4180 with the above deviations for null and multi-valued fields.
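As a minimal sketch, the defaults above can be parsed with Python's standard csv module. The column names (vin, dealer_name, features) and sample values here are hypothetical, used only to illustrate quote doubling, empty-string nulls, and pipe-separated multi-valued fields:

```python
import csv
import io

# Sample rows using the feed's CSV defaults: double-quote quoting with
# quote doubling for escapes, empty string for null, and a pipe-separated
# multi-valued "features" column (all field names are illustrative).
raw = (
    'vin,dealer_name,features\n'
    '1HGCM82633A004352,"Bob""s Autos",sunroof|leather\n'
    'WDBUF56X48B123456,,\n'
)

reader = csv.DictReader(io.StringIO(raw), quotechar='"', doublequote=True)
rows = []
for row in reader:
    # Treat empty strings as null, and split the multi-valued field on pipe.
    rows.append({
        k: (None if v == '' else v.split('|') if k == 'features' else v)
        for k, v in row.items()
    })

print(rows[0]['dealer_name'])  # Bob"s Autos
print(rows[0]['features'])     # ['sunroof', 'leather']
print(rows[1]['dealer_name'])  # None
```

The same dialect settings apply if the delimiter is switched to tab or pipe; only the `delimiter` argument changes.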

JSONL

Structured data format where each line is a valid JSON object:

  • Line Ending: LF (\n)
  • Null Values: Empty string ("") instead of JSON null
  • Compression: Gzip (default)

Pros:

  • Supports nested structures
  • Easy to process line-by-line for streaming

Cons:

  • Larger files than CSV
  • Less efficient for very large datasets

Null handling deviates from standard JSON; downstream applications should treat empty strings as null-equivalent where applicable.
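A minimal sketch of processing JSONL line by line while normalizing the feed's empty-string nulls back into real nulls; the record fields shown (vin, price, trim) are hypothetical:

```python
import json

# Two sample JSONL records; the feed emits "" in place of JSON null.
lines = [
    '{"vin": "1HGCM82633A004352", "price": 18500, "trim": "EX"}',
    '{"vin": "WDBUF56X48B123456", "price": 9900, "trim": ""}',
]

def normalize(obj):
    # Recursively map empty strings to None so downstream code sees real nulls.
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return None if obj == "" else obj

records = [normalize(json.loads(line)) for line in lines]
print(records[1]["trim"])  # None
```

Because each line is an independent JSON object, the same loop works for streaming a gzip-compressed feed one line at a time without loading the whole file.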

Parquet

Columnar storage format optimized for big data processing:

  • Encoding: UTF-8
  • Null Values: Standard Parquet null
  • Compression: Snappy (default), Gzip optional

Pros:

  • Highly efficient for analytics
  • Supports complex types & nested structures

Cons:

  • Requires compatible big data tools
  • Large datasets delivered as multiple files; cannot be coalesced into a single file

Choosing a Format

  • CSV — Best for simplicity, tool compatibility, and moderate to large file sizes
  • JSONL — Best for complex or nested data, and small to medium file sizes
  • Parquet — Best for large-scale analytics in Spark, Hive, or similar platforms

File Splitting

Large datasets may be split into smaller files for more efficient transfer and processing:

  • By size — e.g., 1 GB per file
  • By row count — e.g., 1 million rows
  • By time — e.g., monthly or yearly (for historical feeds)
  • By key — e.g., by state or dealer ID

Split files are named sequentially or by the split key (e.g., us_used_20250831_CA.csv).

Historical data feeds are always split; this cannot be disabled. For daily feeds, we coalesce CSV and JSONL output into a single file when possible, since those formats can still be read in chunks.
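Split files can be consumed as one logical dataset by iterating over them in order. A minimal sketch using the standard library, assuming gzip-compressed CSV splits; the us_used_20250831_*.csv.gz pattern is illustrative:

```python
import csv
import glob
import gzip

def iter_rows(pattern):
    """Yield rows from a set of split, gzip-compressed CSV files in
    sorted filename order, as if they were a single file."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, mode="rt", encoding="utf-8", newline="") as fh:
            yield from csv.DictReader(fh)

# Usage (pattern is hypothetical):
# for row in iter_rows("us_used_20250831_*.csv.gz"):
#     process(row)
```

Because each split file carries its own header row, DictReader is re-created per file and the headers never leak into the data.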

See Also