CSV looks like the simplest file format ever invented: values separated by commas, one record per line. That impression is exactly why it breaks so often in production. The format has no single standard, and every spreadsheet, database export, and library makes slightly different assumptions.

This guide covers what actually matters when you read or write CSV: how delimiters and quoting interact, how embedded commas and newlines are handled, why encoding bites you, the specific things Excel does, and how to convert CSV to JSON without corrupting your data.

There is no single CSV standard

The closest thing to a spec is RFC 4180, published in 2005 as an informational document. It describes the common case but explicitly notes that implementations vary. Real-world CSV predates the RFC by decades, so plenty of files in the wild ignore it.

The practical consequence: you cannot assume anything about a CSV file until you have inspected it. Treat every incoming file as a dialect you need to detect, not a known quantity.

Delimiters: the C in CSV lies

The comma is conventional, not required. You will frequently encounter:

  • Semicolons (
    text
    ;
    ) — standard in locales where the comma is the decimal separator (most of continental Europe). Excel in a German or French locale writes semicolons by default.
  • Tabs (
    text
    \t
    ) — TSV files, common in bioinformatics and database dumps. Tabs rarely appear inside real data, which makes them safer than commas.
  • Pipes (
    text
    |
    ) — popular in finance and legacy ETL pipelines for the same reason.

Because the delimiter varies, sniffing it is a real step. A simple heuristic: count candidate delimiters in the first few lines and pick the one that yields a consistent column count across rows.

python
import csv with open("data.csv", newline="") as f: sample = f.read(4096) dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|") f.seek(0) reader = csv.reader(f, dialect) for row in reader: print(row)

The sniffer is a heuristic and will occasionally guess wrong, especially on files with very few rows or sparse data. When you control both ends of the pipe, fix the delimiter explicitly instead of relying on detection.

Quoting: where most bugs live

The entire reason CSV needs quoting is that a value might contain the delimiter itself. The convention from RFC 4180:

  • Fields may be wrapped in double quotes.
  • Fields that contain the delimiter, a double quote, or a line break must be wrapped in double quotes.
  • A literal double quote inside a quoted field is escaped by doubling it (
    text
    ""
    ), not with a backslash.

So a product name like

text
12" Monitor, Black
becomes:

text
"12"" Monitor, Black"

That one line trips up naive parsers constantly. If you split on commas with a regex or

text
line.split(",")
, you will break this field into pieces. Do not parse CSV by splitting on the delimiter. Use a real parser that understands quoting state. Every mainstream language ships one — Python's
text
csv
, JavaScript needs a library or careful hand-rolled state machine, Go's
text
encoding/csv
, and so on.

Quoting style also varies on the writing side. Some writers quote every field (

text
QUOTE_ALL
), some only quote when necessary (
text
QUOTE_MINIMAL
), and some quote all non-numeric fields. None of these is wrong, but consumers must handle quoted and unquoted fields interchangeably.

If you just need to clean up or normalize a messy file's quoting and spacing before handing it to another system, a tool like the CSV formatter can re-emit it with consistent quoting so downstream parsers stop choking.

Embedded commas and newlines

Two gotchas deserve their own callout because they cause silent data corruption rather than loud errors.

Embedded commas

Address and description fields are the usual offenders:

text
id,name,address 1,Acme,"123 Main St, Suite 400"

A correct parser reads three fields. A comma-splitter reads four and shifts every column after it. The damage is silent — no exception, just wrong data in your database.

Embedded newlines

This is the one that surprises people most: a single CSV record can span multiple physical lines. A quoted field is allowed to contain

text
\n
:

text
id,note 1,"Line one Line two"

That is two rows of data (header plus one record), not three. Any code that processes a CSV file line by line —

text
for line in file
,
text
readLines()
,
text
wc -l
to count records — is wrong the moment a field contains a newline. You must let the parser consume the file and emit logical records, never assume one line equals one record.

This also means counting records with line-based tools gives you the wrong number. Count rows by iterating the parser, not the lines.

Encoding: the invisible failure

CSV files carry no encoding declaration. The bytes are just bytes, and the reader has to guess. The common failure modes:

  • UTF-8 vs. legacy code pages. A file saved as Windows-1252 read as UTF-8 turns
    text
    é
    ,
    text
    , and smart quotes into mojibake or raises decode errors.
  • The UTF-8 BOM. Excel on Windows often prepends a byte-order mark (
    text
    EF BB BF
    ). If you read with plain UTF-8, the BOM shows up as
    text
    \ufeff
    glued to your first header name, so a lookup for the column
    text
    id
    fails because the actual key is
    text
    \ufeffid
    . In Python, open with
    text
    encoding="utf-8-sig"
    to strip it transparently.
  • Line endings. Files move between Windows (
    text
    \r\n
    ), Unix (
    text
    \n
    ), and old Mac (
    text
    \r
    ). Always open files in a mode that handles this — in Python that means passing
    text
    newline=""
    to the file and letting the
    text
    csv
    module deal with line endings, rather than translating them yourself.

When you can, standardize on UTF-8 without a BOM for files you produce, and document it. When you consume files, detect encoding defensively and fail loudly on decode errors rather than silently substituting characters.

Excel quirks worth knowing

Excel is the most common producer and consumer of CSV, and it has opinions.

  • Locale-driven delimiters. As noted, Excel uses the list separator from the OS regional settings — comma in the US, semicolon in much of Europe. A file that opens cleanly on your machine may split into one column on a colleague's.
  • Leading-zero destruction. Open a CSV with a ZIP code
    text
    01234
    or a product code
    text
    007
    in Excel and it helpfully converts them to numbers, dropping the leading zeros. The data in the file was fine; Excel mangled it on display and will save the mangled version.
  • Scientific notation. Long numeric strings like order IDs or phone numbers (
    text
    1234567890123
    ) can render as
    text
    1.23457E+12
    .
  • Date coercion. Strings that look like dates get reformatted to the local date format. The infamous case is gene names being turned into dates, which led researchers to rename genes rather than fight the tool.

The takeaway for developers: Excel is a lossy CSV round-trip. If a file passed through Excel, treat numeric-looking identifiers and dates as suspect. When you need values preserved exactly, deliver data in a format that carries types — which is a good segue to JSON.

Converting CSV to JSON cleanly

JSON is the usual destination for CSV data in an API or app, and the conversion is where subtle bugs creep in. A clean conversion has to make several decisions explicitly.

1. Map rows to objects, not arrays. The first row is almost always a header. Turn each subsequent row into an object keyed by the header names:

python
import csv, json with open("data.csv", newline="", encoding="utf-8-sig") as f: rows = list(csv.DictReader(f)) print(json.dumps(rows, ensure_ascii=False, indent=2))

2. Decide how to handle types. Everything in CSV is a string. JSON distinguishes strings, numbers, booleans, and null. Blindly casting every numeric-looking field to a number is how you destroy ZIP codes and IDs — the same mistake Excel makes. Cast deliberately, per column, based on what the field actually means. When in doubt, keep it a string; a consumer can always parse a known-numeric string, but cannot recover a lost leading zero.

3. Handle empty fields. Is an empty cell an empty string

text
""
, a
text
null
, or a missing key? Pick one rule and apply it consistently. Mixing them breaks downstream code that checks for one but not the other.

4. Watch for duplicate and blank headers. Real exports ship columns named

text
""
or two columns both called
text
name
. A dict-based reader silently drops one of the duplicates. Detect this and rename or reject rather than losing a column quietly.

5. Preserve the embedded structure. Because a proper parser already handled quoting and embedded newlines, your JSON strings will correctly contain those commas and line breaks. This is the payoff for not splitting on commas — the structural data survives the trip.

If you just need the conversion done without writing a script, the JSON to CSV converter handles the round trip in the browser, and the CSV formatter is useful for normalizing the source file first so the parse is clean.

A short checklist

Before you trust a CSV file, confirm:

  • Delimiter — comma, semicolon, tab, or pipe? Detect or pin it.
  • Quoting — does the data contain the delimiter, quotes, or newlines? Use a real parser, never a split.
  • Encoding — UTF-8? Is there a BOM? Open accordingly.
  • Records vs. lines — count rows with the parser, not by counting
    text
    \n
    .
  • Types — cast per column on purpose; keep identifiers as strings.
  • Excel exposure — if it passed through a spreadsheet, recheck IDs and dates.

CSV is not hard once you stop treating it as plain text and start treating it as a quoted, delimited format with no fixed dialect. The bugs almost all come from the shortcut of splitting on a character. Skip the shortcut, use a parser that respects quoting, and most of the gotchas disappear.