MarketCheck Data Feeds are available in three formats — CSV, JSONL (JSON Lines), and Parquet — each optimized for different use cases. All formats are UTF-8 encoded and may be compressed to reduce transfer size. Choose the format that best fits your processing tools, data complexity, and performance needs.
CSV feeds are generated via BigQuery and/or Apache Spark using the following defaults:
- Delimiter: Comma (`,`), configurable to tab (`\t`) or pipe (`|`)
- Quote Character: Double quotes (`"`)
- Escape Character: Double quotes (`"`), using quote doubling for escapes
- Line Ending: LF (`\n`)
- Header Row: Included with column names
- Null Values: Empty string (unquoted)
- Compression: Gzip (default)
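As a minimal sketch, a gzipped feed with these defaults can be read using Python's standard `csv` module; the file name is hypothetical, and the delimiter should be adjusted if your feed is configured for tab or pipe:

```python
import csv
import gzip

# Read a gzip-compressed CSV feed using the defaults above.
with gzip.open("us_used_20250831.csv.gz", "rt", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(
        f,
        delimiter=",",      # or "\t" / "|" if configured
        quotechar='"',
        doublequote=True,   # escapes use quote doubling ("")
    )
    for row in reader:
        # Unquoted empty strings represent nulls in these feeds.
        record = {k: (v or None) for k, v in row.items()}
```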
Pros:
- Widely supported in tools & spreadsheets
- Compact compared to JSONL
- Simple import/export workflows
Cons:
- Limited support for nested data
- Multi-valued fields concatenated with pipe (`|`), or quoted if pipe is the delimiter
CSV output follows RFC 4180 with the above deviations for null and multi-valued fields.
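Downstream, a pipe-concatenated multi-valued field can be split back into a list; the helper and example values here are hypothetical:

```python
def split_multivalued(value: str | None, sep: str = "|") -> list[str]:
    """Split a pipe-concatenated multi-valued CSV field into a list."""
    return value.split(sep) if value else []

split_multivalued("Sunroof|Bluetooth|Navigation")
# -> ['Sunroof', 'Bluetooth', 'Navigation']
```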
JSONL is a structured data format where each line is a valid JSON object:
- Line Ending: LF (`\n`)
- Null Values: Empty string (`""`) instead of JSON null
- Compression: Gzip (default)
Pros:
- Supports nested structures
- Easy to process line-by-line for streaming
Cons:
- Larger files than CSV
- Less storage- and scan-efficient than Parquet for very large datasets
Null handling deviates from standard JSON; downstream applications should treat empty strings as null-equivalent where applicable.
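A minimal streaming reader that applies this convention might look like the following; the file name is hypothetical:

```python
import gzip
import json

def normalize(obj):
    """Recursively map empty strings to None, per the feed's null convention."""
    if obj == "":
        return None
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj

# Process one JSON object per line without loading the whole file.
with gzip.open("us_used_20250831.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = normalize(json.loads(line))
```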
Parquet is a columnar storage format optimized for big data processing:
- Encoding: UTF-8
- Null Values: Standard Parquet `null`
- Compression: Snappy (default), Gzip optional
Pros:
- Highly efficient for analytics
- Supports complex types & nested structures
Cons:
- Requires compatible big data tools
- Large datasets delivered as multiple files; cannot be coalesced into a single file
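With pyarrow, a directory of split Parquet files can be read as a single logical dataset; the directory name here is hypothetical:

```python
import pyarrow.dataset as ds

# All split files in the delivery directory are exposed as one dataset;
# Snappy (or gzip) decompression is handled transparently.
dataset = ds.dataset("us_used_20250831_parquet/", format="parquet")
df = dataset.to_table().to_pandas()  # nulls arrive as native Parquet nulls
```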
- CSV — Best for simplicity, tool compatibility, and moderate to large file sizes
- JSONL — Best for complex or nested data, and small to medium file sizes
- Parquet — Best for large-scale analytics in Spark, Hive, or similar platforms
Large datasets may be split into smaller files for efficiency:
- By size — e.g., 1 GB per file
- By row count — e.g., 1 million rows
- By time — e.g., monthly or yearly (for historical feeds)
- By key — e.g., by state or dealer ID
Split files are named sequentially or by the split key (e.g., `us_used_20250831_CA.csv`).
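Split deliveries can be globbed and processed file by file; the pattern below assumes the key-split naming shown above:

```python
import glob

# Iterate over key-split files, e.g. one per US state.
for path in sorted(glob.glob("us_used_20250831_*.csv")):
    ...  # parse each file with the CSV settings described earlier
```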
Historical data feeds are always delivered as split files; this cannot be avoided. For daily feeds, however, we try to coalesce CSV and JSONL output into a single file where possible, since those formats can be read in chunks.
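For example, pandas can stream a large single-file gzipped CSV incrementally; the file name and chunk size here are illustrative:

```python
import pandas as pd

# Stream the daily feed one million rows at a time instead of loading it whole.
for chunk in pd.read_csv("us_used_daily.csv.gz", compression="gzip", chunksize=1_000_000):
    ...  # process each DataFrame chunk
```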