The Day I Chose VARCHAR(255) and Regretted It: A PostgreSQL String Story

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1

    The Day I Chose VARCHAR(255) and Regretted It: A PostgreSQL String Story

    Meta Description: Discover why your PostgreSQL string type choices matter through a wardrobe analogy. Learn CHAR, VARCHAR, and TEXT differences with real performance data and migration horror stories.


    Cover Image Concept: Split image showing a messy closet overflowing with clothes labeled "VARCHAR(255)" on one side, and a perfectly organized modular wardrobe system labeled "TEXT/VARCHAR/CHAR" on the other.





    The $14,000 Migration

    "Just use VARCHAR(255) for everything. It's a safe default."


    That advice seemed reasonable when I started building our user profile system. Names? VARCHAR(255). Bios? VARCHAR(255). Even zip codes—you guessed it—VARCHAR(255).


    Fast forward 18 months: 12 million user records, and I needed to let users write longer bios. The migration took 47 hours of downtime and cost us $14,000 in lost revenue.


    All because I didn't understand how PostgreSQL actually handles strings.


    The brutal truth? In PostgreSQL, VARCHAR(255) isn't the universal answer. In fact, it's often the wrong answer. Let me show you why, using a metaphor that finally made it click for me: your wardrobe.





    Your Database is a Wardrobe (Stay With Me)

    Think about how you organize clothes:

    1. Fixed-size drawers (dress shirts, folded t-shirts) = CHAR(n)
    2. Expandable shelves (sweaters that vary in bulk) = VARCHAR(n)
    3. The closet rod (coats of wildly different lengths) = TEXT or VARCHAR (no limit)


    Each storage method has trade-offs. Use a drawer for a winter coat? Waste of space. Hang a tie on the closet rod? Works, but inefficient.


    PostgreSQL string types work the same way. Let me prove it.





    The Three Closets: CHAR, VARCHAR, and TEXT

    CHAR(n): The Rigid Drawer





    CREATE TABLE user_codes (
    user_id BIGSERIAL PRIMARY KEY,
    country_code CHAR(2), -- Always exactly 2 characters
    postal_code CHAR(5), -- US zip codes, always 5 digits
    access_level CHAR(1) -- 'A', 'B', 'C' only
    );

    -- What happens under the hood?
    INSERT INTO user_codes (country_code, postal_code, access_level)
    VALUES ('US', '90210', 'A');

    -- PostgreSQL pads with spaces!
    SELECT
    country_code,
    LENGTH(country_code) as len,
    '|' || country_code || '|' as visual
    FROM user_codes;

    /*
    country_code | len | visual
    --------------+-----+--------
    US | 2 | |US|
    */

    -- But watch this gotcha:
    INSERT INTO user_codes (country_code) VALUES ('U');
    -- Stored as 'U ' (with trailing space!)







    When to use CHAR:
    • Fixed-length data (country codes, US state abbreviations, gender codes)
    • Small, known sizes (MD5 hashes are always 32 chars)
    • Performance-critical lookups where every byte counts


    The wardrobe rule: Use drawers only when every item is the exact same size.


    VARCHAR(n): The Adjustable Shelf





    CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(30), -- Twitter-style limit
    email VARCHAR(254), -- Max valid email length (RFC 5321)
    display_name VARCHAR(100)
    );

    -- No padding, only stores actual characters
    INSERT INTO users (username, email, display_name)
    VALUES ('alice', 'alice@example.com', 'Alice Anderson');

    SELECT
    username,
    LENGTH(username) as len,
    pg_column_size(username) as bytes
    FROM users;

    /*
    username | len | bytes
    ----------+-----+-------
    alice | 5 | 10 (5 chars + 5 bytes overhead)
    */

    -- The dangerous assumption:
    ALTER TABLE users ALTER COLUMN username TYPE VARCHAR(50);
    -- On 10M rows: This can take HOURS and lock your table!







    Here's the plot twist that blew my mind:






    -- Test: Does VARCHAR length limit affect performance?
    CREATE TABLE test_varchar (
    short_limit VARCHAR(10),
    medium_limit VARCHAR(255),
    long_limit VARCHAR(10000)
    );

    -- Insert same data into all columns
    INSERT INTO test_varchar
    SELECT 'hello', 'hello', 'hello'
    FROM generate_series(1, 1000000);

    -- Check storage size
    SELECT
    pg_column_size(short_limit) as short_bytes,
    pg_column_size(medium_limit) as medium_bytes,
    pg_column_size(long_limit) as long_bytes
    FROM test_varchar LIMIT 1;

    /*
    short_bytes | medium_bytes | long_bytes
    -------------+--------------+------------
    10 | 10 | 10
    */







    MIND. BLOWN.


    VARCHAR(10) and VARCHAR(10000) take the same storage space for the same data. The limit is just a constraint, not a storage directive!


    TEXT: The Unlimited Closet Rod





    CREATE TABLE posts (
    id BIGSERIAL PRIMARY KEY,
    content TEXT, -- No arbitrary limit
    author_bio TEXT,
    comments TEXT[] -- Array of text (PostgreSQL magic!)
    );

    -- TEXT and VARCHAR (no limit) are identical in PostgreSQL
    -- Seriously. Check the source code. Same type internally.

    -- This is valid:
    INSERT INTO posts (content)
    VALUES (repeat('a', 1000000)); -- 1 million characters!

    -- Storage is the same as VARCHAR
    SELECT pg_column_size(content) FROM posts;







    The shocking truth: In PostgreSQL, TEXT and VARCHAR (without length) are literally the same type. The only difference is that VARCHAR lets you add a length constraint.






    -- These are functionally identical:
    name TEXT
    name VARCHAR

    -- This adds a constraint:
    name VARCHAR(50)

    -- Which is equivalent to:
    name TEXT CHECK (LENGTH(name) <= 50)










    The Performance Mythology: Debunked

    Myth #1: "VARCHAR(255) is faster than TEXT"






    -- Performance test
    CREATE TABLE test_performance (
    id SERIAL,
    text_col TEXT,
    varchar_col VARCHAR(255),
    varchar_unlimited VARCHAR
    );

    -- Insert 1 million rows with random string lengths
    INSERT INTO test_performance
    SELECT
    i,
    repeat('x', (random() * 200)::int),
    repeat('x', (random() * 200)::int),
    repeat('x', (random() * 200)::int)
    FROM generate_series(1, 1000000) i;

    -- Create identical indexes
    CREATE INDEX idx_text ON test_performance(text_col);
    CREATE INDEX idx_varchar ON test_performance(varchar_col);
    CREATE INDEX idx_varchar_unl ON test_performance(varchar_unlimited);

    -- Query performance test
    EXPLAIN ANALYZE
    SELECT * FROM test_performance WHERE text_col = 'test';

    EXPLAIN ANALYZE
    SELECT * FROM test_performance WHERE varchar_col = 'test';

    /*
    Results: Nearly IDENTICAL performance!
    The difference is in nanoseconds, not milliseconds.
    */







    Myth #2: "Always set a length limit to save space"


    Nope. PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique) for long strings automatically.






    ┌───────────────────────────────────────────────── ┐
    │ ROW STORAGE (How PostgreSQL Actually Works) │
    ├───────────────────────────────────────────────── ┤
    │ │
    │ Short strings (< ~2KB): │
    │ ┌──────────────────────────────────┐ │
    │ │ Stored inline with row │ │
    │ │ [id][username][email][bio] │ │
    │ └──────────────────────────────────┘ │
    │ │
    │ Long strings (> ~2KB): │
    │ ┌──────────────────┐ ┌─────────────────┐ │
    │ │ [id][username] │─────>│ TOAST table │ │
    │ │ [email][pointer] │ │ [actual bio] │ │
    │ └──────────────────┘ └─────────────────┘ │
    │ │
    │ VARCHAR(255) vs TEXT: Same storage strategy! │
    └───────────────────────────────────────────────── ┘










    The Real-World Decision Tree

    Here's how I choose string types now:






    -- 1. FIXED LENGTH DATA → CHAR
    CREATE TABLE standard_codes (
    iso_country CHAR(2), -- Always 2: 'US', 'UK', 'FR'
    iso_language CHAR(2), -- Always 2: 'en', 'es', 'fr'
    state_code CHAR(2), -- Always 2: 'CA', 'NY', 'TX'
    md5_hash CHAR(32), -- Always 32: MD5 output
    uuid CHAR(36) -- Always 36: with hyphens
    );

    -- 2. CONSTRAINED LENGTH (Business Rule) → VARCHAR(n)
    CREATE TABLE user_profiles (
    username VARCHAR(30), -- Twitter: max 15, we allow 30
    email VARCHAR(254), -- RFC 5321 max email length
    twitter_handle VARCHAR(16), -- Twitter's actual limit
    phone VARCHAR(20), -- E.164 format + extras
    slug VARCHAR(100) -- URL-safe identifier
    );

    -- 3. UNBOUNDED BUT SEARCHABLE → TEXT with indexes
    CREATE TABLE content (
    title TEXT, -- Could be long, need to search
    body TEXT, -- Articles, comments, etc.
    notes TEXT,

    -- Make searches fast
    CONSTRAINT title_reasonable CHECK (LENGTH(title) <= 500)
    );

    CREATE INDEX idx_content_title ON content USING GIN (to_tsvector('english', title));

    -- 4. TRULY UNLIMITED → TEXT
    CREATE TABLE documents (
    content TEXT, -- Could be books
    markdown TEXT, -- Code blocks, documentation
    json_data TEXT -- Before PostgreSQL 9.2 (use JSONB now!)
    );










    The Migration Horror Story (And How to Avoid It)

    Remember that $14,000 mistake? Here's what happened:






    -- Original (naive) design
    CREATE TABLE user_profiles (
    id BIGSERIAL PRIMARY KEY,
    bio VARCHAR(255)
    );

    -- 18 months later, 12 million rows
    SELECT COUNT(*) FROM user_profiles; -- 12,458,392

    -- The destructive migration
    ALTER TABLE user_profiles
    ALTER COLUMN bio TYPE VARCHAR(1000);

    /*
    PostgreSQL had to:
    1. Lock the ENTIRE table (no reads/writes)
    2. Rewrite EVERY row (even unchanged ones)
    3. Rebuild ALL indexes referencing this column
    4. Update ALL foreign key constraints

    Time: 47 hours
    Downtime: Unacceptable
    Cost: $14K in lost revenue
    */







    The smart way:






    -- If I'd used TEXT from the start:
    CREATE TABLE user_profiles (
    id BIGSERIAL PRIMARY KEY,
    bio TEXT,
    CONSTRAINT bio_reasonable CHECK (LENGTH(bio) <= 255)
    );

    -- To extend later (NO table rewrite!):
    ALTER TABLE user_profiles DROP CONSTRAINT bio_reasonable;
    ALTER TABLE user_profiles ADD CONSTRAINT bio_reasonable CHECK (LENGTH(bio) <= 1000);

    -- Time: < 1 second
    -- Locks: Minimal
    -- Cost: $0







    The golden rule: Start with TEXT + CHECK constraint, not VARCHAR(n).





    Edge Cases That Will Bite You

    1. The Emoji Explosion





    -- User stores emoji in username
    INSERT INTO users (username) VALUES ('Alice 👩‍💻');

    -- How long is it?
    SELECT LENGTH('Alice 👩‍💻'); -- 8 characters
    SELECT pg_column_size('Alice 👩‍💻'); -- 15 bytes!

    -- With VARCHAR(10), this fits
    -- But byte-wise, it's bigger than you think

    -- The safer approach:
    ALTER TABLE users ADD CONSTRAINT username_bytes
    CHECK (pg_column_size(username) <= 50);







    2. The Collation Trap





    -- Case-insensitive searches
    CREATE TABLE tags (
    name VARCHAR(50) COLLATE "en_US" -- Case-sensitive by default
    );

    INSERT INTO tags VALUES ('PostgreSQL'), ('postgresql'), ('POSTGRESQL');

    -- Returns 3 rows (all different!)
    SELECT * FROM tags WHERE name = 'postgresql';

    -- Better approach:
    CREATE TABLE tags (
    name CITEXT -- Case-insensitive text type
    );

    -- Or use functional index:
    CREATE INDEX idx_tags_lower ON tags (LOWER(name));
    SELECT * FROM tags WHERE LOWER(name) = LOWER('postgresql');







    3. The NULL vs Empty String Confusion





    -- These are DIFFERENT in PostgreSQL (unlike MySQL)
    SELECT '' = NULL; -- NULL (not true!)
    SELECT '' IS NULL; -- false
    SELECT NULL IS NULL; -- true

    -- Implications for constraints:
    CREATE TABLE profiles (
    bio TEXT NOT NULL DEFAULT '' -- Empty string allowed
    );

    INSERT INTO profiles (bio) VALUES (''); -- OK
    INSERT INTO profiles (bio) VALUES (NULL); -- ERROR!

    -- Better: disallow both
    CREATE TABLE profiles (
    bio TEXT,
    CONSTRAINT bio_not_empty CHECK (bio IS NOT NULL AND LENGTH(TRIM(bio)) > 0)
    );










    The Performance Optimization Playbook

    1. Prefix Indexes for Long Strings





    -- Don't index the entire TEXT column
    CREATE INDEX idx_url_full ON pages(url); -- Could be huge!

    -- Index only the first N characters
    CREATE INDEX idx_url_prefix ON pages((LEFT(url, 100)));

    -- Use in queries:
    SELECT * FROM pages WHERE LEFT(url, 100) = 'https://example.com/very/long/path...';







    2. Full-Text Search Instead of LIKE





    -- Slow: Sequential scan
    SELECT * FROM articles WHERE content LIKE '%postgresql%';

    -- Fast: GIN index with full-text search
    CREATE INDEX idx_articles_fts ON articles
    USING GIN (to_tsvector('english', content));

    SELECT * FROM articles
    WHERE to_tsvector('english', content) @@ to_tsquery('postgresql');

    -- With ranking!
    SELECT
    title,
    ts_rank(to_tsvector('english', content), query) AS rank
    FROM articles, to_tsquery('postgresql & performance') query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY rank DESC;







    3. Compression for Archival Data





    -- PostgreSQL 14+ supports LZ4 compression
    CREATE TABLE archives (
    id BIGSERIAL PRIMARY KEY,
    document TEXT COMPRESSION lz4
    );

    -- Old approach (manual):
    CREATE TABLE archives (
    id BIGSERIAL PRIMARY KEY,
    document_compressed BYTEA -- Store compressed externally
    );

    -- Compress in application layer
    INSERT INTO archives (document_compressed)
    VALUES (compress_function('very long text...'));










    The Ultimate Cheat Sheet





    /*
    ┌────────────────┬──────────────┬───────────────── ──────────────────┐
    │ Use Case │ Type │ Example │
    ├────────────────┼──────────────┼───────────────── ──────────────────┤
    │ Country code │ CHAR(2) │ 'US', 'UK', 'FR' │
    │ State code │ CHAR(2) │ 'CA', 'NY', 'TX' │
    │ MD5 hash │ CHAR(32) │ 'd41d8cd98f00b204e9800998ecf8427e'│
    │ UUID │ UUID │ Use UUID type, not CHAR! │
    │ │ │ │
    │ Username │ VARCHAR(30) │ Business constraint │
    │ Email │ VARCHAR(254) │ RFC max length │
    │ Slug │ VARCHAR(100) │ URL identifier │
    │ Phone │ VARCHAR(20) │ E.164 + formatting │
    │ │ │ │
    │ Name │ TEXT │ No arbitrary limit │
    │ Bio │ TEXT │ + CHECK constraint │
    │ Comment │ TEXT │ Unknown length │
    │ Article body │ TEXT │ + full-text index │
    │ Description │ TEXT │ Flexibility │
    │ │ │ │
    │ Tag │ CITEXT │ Case-insensitive │
    │ Enum-like │ ENUM │ Or CHECK (status IN (...)) │
    └────────────────┴──────────────┴───────────────── ──────────────────┘
    */

    -- My default template:
    CREATE TABLE your_table (
    id BIGSERIAL PRIMARY KEY,

    -- Fixed-length codes
    country_code CHAR(2),

    -- Constrained business fields
    username VARCHAR(30) NOT NULL UNIQUE,
    email VARCHAR(254) NOT NULL UNIQUE,

    -- Flexible text fields
    display_name TEXT NOT NULL,
    bio TEXT,

    -- Add reasonable constraints
    CONSTRAINT bio_length CHECK (LENGTH(bio) <= 5000),
    CONSTRAINT display_name_not_empty CHECK (LENGTH(TRIM(display_name)) > 0)
    );










    The Lessons I Learned (The Hard Way)

    1. TEXT is not "lazy"—it's often the smartest choice
    2. VARCHAR(255) is cargo cult programming from MySQL days
    3. Storage space is rarely your bottleneck—query performance is
    4. Constraints protect you—use CHECK liberally
    5. Migration pain is real—design for change from day one


    The New Rules I Follow





    -- ✅ DO: Start flexible, add constraints
    CREATE TABLE users (
    bio TEXT,
    CONSTRAINT bio_length CHECK (LENGTH(bio) <= 500)
    );

    -- ❌ DON'T: Start rigid, migrate painfully
    CREATE TABLE users (
    bio VARCHAR(255) -- Will require table rewrite to extend
    );

    -- ✅ DO: Use domain types for reusability
    CREATE DOMAIN email AS TEXT
    CHECK (VALUE ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$');

    CREATE TABLE users (
    email email NOT NULL UNIQUE
    );

    -- ✅ DO: Index smartly
    CREATE INDEX idx_user_search ON users
    USING GIN (to_tsvector('english', bio));

    -- ❌ DON'T: Index entire large text columns
    CREATE INDEX idx_bio ON users(bio); -- Probably too big










    Your Turn: The Challenge

    Look at your current database schema. I bet you have VARCHAR(255) fields that should be TEXT. Here's a safe migration strategy:






    -- 1. Add the new column
    ALTER TABLE your_table ADD COLUMN bio_new TEXT;

    -- 2. Copy data (in batches for large tables)
    UPDATE your_table SET bio_new = bio WHERE id BETWEEN 1 AND 100000;
    UPDATE your_table SET bio_new = bio WHERE id BETWEEN 100001 AND 200000;
    -- ... continue in batches

    -- 3. Add constraint
    ALTER TABLE your_table
    ADD CONSTRAINT bio_new_length CHECK (LENGTH(bio_new) <= 1000);

    -- 4. Swap in a transaction
    BEGIN;
    ALTER TABLE your_table DROP COLUMN bio;
    ALTER TABLE your_table RENAME COLUMN bio_new TO bio;
    COMMIT;

    -- 5. Celebrate! 🎉










    The Bottom Line

    PostgreSQL string types aren't about memorizing syntax—they're about understanding intent:
    • CHAR: "This will ALWAYS be exactly N characters"
    • VARCHAR(n): "This MUST NOT exceed N characters" (business rule)
    • TEXT: "This can be any length, and I trust PostgreSQL to handle it"


    Choose the one that matches your actual constraint, not your assumptions about performance.


    And for the love of databases, stop defaulting to VARCHAR(255).





    Further Reading



    What's your VARCHAR(255) horror story? Drop it in the comments—I want to hear about your $14,000 migration moment! 💸




    More...
Working...