File Formats

MarketCheck Data Feeds are delivered in CSV, JSONL, or Parquet formats, with compression and splitting options for large datasets.

Overview

MarketCheck Data Feeds are available in three formats — CSV, JSONL (JSON Lines), and Parquet — each optimized for different use cases. All formats are UTF-8 encoded and may be compressed to reduce transfer size. Choose the format that best fits your processing tools, data complexity, and performance needs.

CSV

Generated via BigQuery and/or Apache Spark using the following defaults:

  • Delimiter: Comma (,), configurable to tab (\t) or pipe (|)
  • Quote Character: Double quotes (")
  • Escape Character: Double quote ("); embedded quotes are escaped by doubling
  • Line Ending: LF (\n)
  • Header Row: Included with column names
  • Null Values: Empty string (unquoted)
  • Compression: Gzip (default)

Pros:

  • Widely supported in tools & spreadsheets
  • Compact compared to JSONL
  • Simple import/export workflows

Cons:

  • Limited support for nested data
  • Multi-valued fields concatenated with pipe (|), or quoted if pipe is the delimiter

CSV output follows RFC 4180 with the above deviations for null and multi-valued fields.
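As a minimal sketch, the defaults above can be parsed with Python's standard csv module. The column names (vin, dealer_name, features) and sample values here are hypothetical, used only to illustrate quote doubling, empty-string nulls, and pipe-separated multi-valued fields:

```python
import csv
import io

# Sample rows using the feed's CSV defaults: double-quote quoting with
# quote doubling for escapes, empty string for null, and a pipe-separated
# multi-valued "features" column (all field names are illustrative).
raw = (
    'vin,dealer_name,features\n'
    '1HGCM82633A004352,"Bob""s Autos",sunroof|leather\n'
    'WDBUF56X48B123456,,\n'
)

reader = csv.DictReader(io.StringIO(raw), quotechar='"', doublequote=True)
rows = []
for row in reader:
    # Treat empty strings as null, and split the multi-valued field on pipe.
    rows.append({
        k: (None if v == '' else v.split('|') if k == 'features' else v)
        for k, v in row.items()
    })

print(rows[0]['dealer_name'])  # Bob"s Autos
print(rows[0]['features'])     # ['sunroof', 'leather']
print(rows[1]['dealer_name'])  # None
```

The same dialect settings apply if the delimiter is switched to tab or pipe; only the `delimiter` argument changes.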

JSONL

Structured data format where each line is a valid JSON object:

  • Line Ending: LF (\n)
  • Null Values: Empty string ("") instead of JSON null
  • Compression: Gzip (default)

Pros:

  • Supports nested structures
  • Easy to process line-by-line for streaming

Cons:

  • Larger files than CSV
  • Less efficient for very large datasets

Null handling deviates from standard JSON; downstream applications should treat empty strings as null-equivalent where applicable.
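A minimal sketch of processing JSONL line by line while normalizing the feed's empty-string nulls back into real nulls; the record fields shown (vin, price, trim) are hypothetical:

```python
import json

# Two sample JSONL records; the feed emits "" in place of JSON null.
lines = [
    '{"vin": "1HGCM82633A004352", "price": 18500, "trim": "EX"}',
    '{"vin": "WDBUF56X48B123456", "price": 9900, "trim": ""}',
]

def normalize(obj):
    # Recursively map empty strings to None so downstream code sees real nulls.
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return None if obj == "" else obj

records = [normalize(json.loads(line)) for line in lines]
print(records[1]["trim"])  # None
```

Because each line is an independent JSON object, the same loop works for streaming a gzip-compressed feed one line at a time without loading the whole file.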

Parquet

Columnar storage format optimized for big data processing:

  • Encoding: UTF-8
  • Null Values: Standard Parquet null
  • Compression: Snappy (default), Gzip optional

Pros:

  • Highly efficient for analytics
  • Supports complex types & nested structures

Cons:

  • Requires compatible big data tools
  • Large datasets delivered as multiple files; cannot be coalesced into a single file

Choosing a Format

  • CSV — Best for simplicity, tool compatibility, and moderate to large file sizes
  • JSONL — Best for complex or nested data, and small to medium file sizes
  • Parquet — Best for large-scale analytics in Spark, Hive, or similar platforms

File Splitting

Large datasets may be split into smaller files for more efficient transfer and processing:

  • By size — e.g., 1 GB per file
  • By row count — e.g., 1 million rows
  • By time — e.g., monthly or yearly (for historical feeds)
  • By key — e.g., by state or dealer ID

Split files are named sequentially or by the split key (e.g., us_used_20250831_CA.csv).

Historical data feeds are always split; this cannot be disabled. For daily feeds, we coalesce CSV and JSONL output into a single file when possible, since those formats can still be read in chunks.
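Split files can be consumed as one logical dataset by iterating over them in order. A minimal sketch using the standard library, assuming gzip-compressed CSV splits; the us_used_20250831_*.csv.gz pattern is illustrative:

```python
import csv
import glob
import gzip

def iter_rows(pattern):
    """Yield rows from a set of split, gzip-compressed CSV files in
    sorted filename order, as if they were a single file."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, mode="rt", encoding="utf-8", newline="") as fh:
            yield from csv.DictReader(fh)

# Usage (pattern is hypothetical):
# for row in iter_rows("us_used_20250831_*.csv.gz"):
#     process(row)
```

Because each split file carries its own header row, DictReader is re-created per file and the headers never leak into the data.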

See Also