Building a Fast File Transfer Tool, Part 2: Beating rsync by 58% with kTLS
In Part 1, I built uring-sync, a file copier that uses io_uring to run 4.2x faster than cp for local copies. Now I've added network transfer with kernel TLS encryption, which finishes large transfers in 58% less time than rsync.
The Problem: SSH is the Bottleneck
When transferring ML datasets between machines, rsync over SSH is the go-to tool:
rsync -az /data/ml_dataset user@server:/backup/
It works, but it's slow. For a 9.7GB dataset (100K files), rsync took 390 seconds, a throughput of just 25 MB/s.
The bottleneck isn't the network. It's encryption in userspace.
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ File │────▶│ rsync │────▶│ SSH │────▶│ Network │
│ Read │ │ (delta) │ │ encrypt │ │ Send │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
│
Context switches,
userspace copies,
CPU-bound AES
Every byte passes through the SSH process, which encrypts it using OpenSSL in userspace. This involves:
- Multiple context switches between kernel and userspace
- Copying data between kernel buffers and userspace buffers
- CPU time for AES encryption (even with AES-NI)
The Solution: kTLS (Kernel TLS)
Linux supports kTLS, TLS record encryption handled directly in the kernel, since 4.13 for sending and 4.17 for receiving. Once you hand the kernel the session keys, it encrypts data as it flows through the socket.
┌─────────┐ ┌─────────┐ ┌──────────────────┐
│ File │────▶│ read │────▶│ Socket (kTLS) │
│ │ │ │ │ encrypt + send │
└─────────┘ └─────────┘ └──────────────────┘
│
One kernel operation,
no userspace copies,
AES-NI in kernel
Benefits:
- No userspace encryption process - the kernel handles it directly
- Fewer copies - data makes one trip through a userspace buffer instead of passing through both rsync and SSH
- AES-NI in kernel - hardware acceleration without extra context switches
Implementation
Setting up kTLS requires:
- TLS handshake - Exchange keys (we use a pre-shared secret + HKDF)
- Configure kernel - attach the "tls" ULP with setsockopt(SOL_TCP, TCP_ULP, ...), then pass the cipher keys with setsockopt(SOL_TLS, TLS_TX, ...)
- Send data - Regular send() calls, kernel encrypts automatically
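#include <linux/tls.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

// Attach the TLS upper-layer protocol to the connected TCP socket first.
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

// After deriving key, iv, and salt from the shared secret...
struct tls12_crypto_info_aes_gcm_128 crypto_info = {
    .info.version = TLS_1_2_VERSION,
    .info.cipher_type = TLS_CIPHER_AES_GCM_128,
};
memcpy(crypto_info.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);    // 16 bytes
memcpy(crypto_info.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);       // 8 bytes
memcpy(crypto_info.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE); // 4 bytes
setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info));
// From here on, the kernel encrypts every send() on this socket.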
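A fuller sketch of the same setup, wrapped in a helper that checks errors and also covers the receive direction. The helper name ktls_enable and its signature are my own for illustration and are not taken from the uring-sync source. The fallback constant values match the kernel headers for the case where an older libc does not define them.

```c
#include <assert.h>
#include <errno.h>
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Older libc headers may lack these; the values match the kernel UAPI. */
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TCP_ULP
#define TCP_ULP 31
#endif

/* Hypothetical helper: install AES-128-GCM keys for one direction.
 * direction is TLS_TX on the sender or TLS_RX on the receiver.
 * Returns 0 on success, -1 with errno set on failure. */
int ktls_enable(int sock, int direction,
                const unsigned char key[16],
                const unsigned char iv[8],
                const unsigned char salt[4])
{
    assert(key && iv && salt);
    assert(direction == TLS_TX || direction == TLS_RX);

    /* Attach the "tls" upper-layer protocol. EEXIST means it is already
     * attached, e.g. when TX was enabled before RX on the same socket. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0 &&
        errno != EEXIST)
        return -1;

    struct tls12_crypto_info_aes_gcm_128 ci;
    memset(&ci, 0, sizeof ci);            /* rec_seq starts at zero */
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key, key, sizeof ci.key);
    memcpy(ci.iv, iv, sizeof ci.iv);
    memcpy(ci.salt, salt, sizeof ci.salt);

    return setsockopt(sock, SOL_TLS, direction, &ci, sizeof ci);
}
```

If the first setsockopt() fails with ENOENT, the tls kernel module is most likely not loaded (modprobe tls).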
Benchmark Results
Tested over a real network: laptop → GCP VM across the public internet.
The Headline Number
| Dataset | uring-sync + kTLS | rsync (SSH) | Result |
|---|---|---|---|
| ml_small (60MB, 10K files) | 2.98s | 2.63s | ~equal |
| ml_large (589MB, 100K files) | 16.4s | 24.8s | 34% faster |
| ml_images (9.7GB, 100K files) | 165s | 390s | 58% faster |
The Pattern
Data size: 60MB → 589MB → 9.7GB
Time saved: ~0% → 34% → 58%
The larger the transfer, the bigger the kTLS advantage. On the smallest dataset rsync was actually slightly ahead (2.63s vs 2.98s).
Why? Per-connection overhead (handshake, key derivation) is amortized over more data, while the cost of SSH's userspace encryption grows linearly with data size.
Throughput Comparison
| Method | Throughput | CPU usage |
|---|---|---|
| rsync (SSH) | 25 MB/s | High (userspace encryption) |
| uring-sync + kTLS | 60 MB/s | Low (kernel encryption) |
kTLS achieves 2.4x the throughput of rsync while using less CPU.
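The percentages and the throughput ratio describe the same measurements in two different ways, which is easy to mix up. Here is a small sketch of the arithmetic, with function names of my own choosing:

```c
#include <assert.h>

/* Percentage of wall-clock time saved relative to the baseline run. */
double time_saved_pct(double baseline_s, double candidate_s)
{
    assert(baseline_s > 0.0);
    return 100.0 * (baseline_s - candidate_s) / baseline_s;
}

/* How many times faster the candidate moves the same amount of data. */
double speedup(double baseline_s, double candidate_s)
{
    assert(candidate_s > 0.0);
    return baseline_s / candidate_s;
}
```

For ml_images, time_saved_pct(390, 165) is about 57.7 and speedup(390, 165) is about 2.36, so "58% less time" and "roughly 2.4x the throughput" are the same result. For ml_small, time_saved_pct(2.63, 2.98) is negative, which is the case where rsync came out ahead.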
Why Not Zero-Copy Splice?
In theory, kTLS supports splice() for true zero-copy transfers:
File → Pipe → kTLS Socket (no userspace copies!)
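I implemented this and expected it to be fastest. Instead, it was 2.9x slower than the plaintext read/send baseline (428s vs 146s) and 2.6x slower than kTLS with read/send (165s).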
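For reference, the File → Pipe → Socket path can be sketched roughly as follows. This is a minimal illustration rather than the uring-sync implementation. The helper name splice_copy is hypothetical, and it works on any pair of descriptors that splice() accepts, not only kTLS sockets.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from in_fd to out_fd through
 * a pipe, without copying them into a userspace buffer.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    assert(in_fd >= 0 && out_fd >= 0);

    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t total = 0;
    while ((size_t)total < len) {
        /* Stage 1: file -> pipe (bounded by the pipe capacity). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL,
                           len - (size_t)total, SPLICE_F_MOVE);
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            total = -1;
            break;
        }

        /* Stage 2: pipe -> destination; drain everything just staged. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                break;
            left -= m;
        }
        if (left > 0) {
            total = -1;
            break;
        }
        total += n;
    }

    close(p[0]);
    close(p[1]);
    return total;
}
```

Every chunk needs two system calls, and the second one cannot return until the destination has accepted the data.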
The Investigation
Using strace, I found the problem:
splice(file→pipe): 27μs ← instant
splice(pipe→socket): 33ms ← 1000x slower!
The splice(pipe → kTLS socket) call blocks waiting for TCP ACKs. The kernel can't buffer aggressively like it does with regular send() calls.
The Lesson
Zero-copy isn't always faster. For many-file workloads:
- read/send: Kernel manages buffering efficiently
- splice: Blocks on each chunk, killing throughput
Splice might help for single huge files, but for ML datasets (many small files), stick with read/send.
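The read/send path is the plain loop below. It is a sketch under my own assumptions (the helper name read_send_copy and the 64 KiB buffer size are illustrative, not taken from the uring-sync source), but it shows why this path is forgiving: send() returns as soon as the data sits in the socket buffer, so the next read() can overlap with transmission.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: stream in_fd to a connected socket through a
 * userspace buffer. Returns total bytes sent, or -1 on error. */
ssize_t read_send_copy(int in_fd, int sock)
{
    assert(in_fd >= 0 && sock >= 0);

    char buf[64 * 1024];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return total;               /* end of file */

        /* send() may accept fewer bytes than requested; loop until done. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t m = send(sock, buf + off, (size_t)(n - off), MSG_NOSIGNAL);
            if (m < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += m;
        }
        total += n;
    }
}
```

With kTLS enabled on the socket, this loop needs no changes, because the encryption happens inside send().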
When to Use This
Use kTLS file transfer when:
- Transferring large datasets (>500MB)
- Network has bandwidth to spare
- You control both endpoints
- Security is required (not just over VPN)
Stick with rsync when:
- You need delta sync (only changed bytes)
- Destination already has partial data
- SSH infrastructure is already in place
- Simplicity matters more than speed
The Protocol
Our wire protocol is minimal:
HELLO (secret hash) ──────────────────▶ Verify
◀────────────────── HELLO_OK (+ enable kTLS)
FILE_HDR (path, size, mode) ──────────▶ Create file
FILE_DATA (chunks) ────────────────────▶ Write data
FILE_END ──────────────────────────────▶ Close file
(repeat for all files)
ALL_DONE ──────────────────────────────▶ Complete
No delta encoding, no checksums (kTLS provides integrity via GCM). Just raw file transfer with authentication and encryption.
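One way to frame these messages on the wire is a fixed header ahead of each payload. The layout below (a 1-byte type followed by a 4-byte big-endian payload length) and the numeric type values are assumptions for illustration only. The message names come from the diagram, but the actual uring-sync encoding may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Message names from the protocol diagram; numeric values are assumed. */
enum pkt_type {
    PKT_HELLO = 1,
    PKT_HELLO_OK,
    PKT_FILE_HDR,
    PKT_FILE_DATA,
    PKT_FILE_END,
    PKT_ALL_DONE
};

#define HDR_SIZE 5  /* 1-byte type + 4-byte big-endian payload length */

/* Serialize a header into out[0..4]. */
void hdr_encode(uint8_t out[HDR_SIZE], enum pkt_type type, uint32_t len)
{
    assert(out != NULL);
    out[0] = (uint8_t)type;
    out[1] = (uint8_t)(len >> 24);
    out[2] = (uint8_t)(len >> 16);
    out[3] = (uint8_t)(len >> 8);
    out[4] = (uint8_t)len;
}

/* Parse a header. Returns 0 on success, -1 for an unknown message type. */
int hdr_decode(const uint8_t in[HDR_SIZE], enum pkt_type *type, uint32_t *len)
{
    assert(in && type && len);
    if (in[0] < PKT_HELLO || in[0] > PKT_ALL_DONE)
        return -1;
    *type = (enum pkt_type)in[0];
    *len = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16) |
           ((uint32_t)in[3] << 8) | (uint32_t)in[4];
    return 0;
}
```

A length prefix lets the receiver know exactly how many bytes to read for each FILE_DATA chunk before the next header arrives.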
Code
Usage:
# Receiver (on remote host)
uring-sync recv /backup --listen 9999 --secret mykey --tls
# Sender (on local host)
uring-sync send /data remote-host:9999 --secret mykey --tls
The implementation uses:
- HKDF for key derivation from shared secret
- AES-128-GCM via kTLS
- Simple TCP protocol (no HTTP, no gRPC)
Full source: github.com/VincentDu2021/uring_sync
Conclusion
By moving encryption from userspace SSH to kernel kTLS, we achieved:
- 58% less transfer time than rsync on the largest dataset (165s vs 390s)
- 2.4x throughput (60 MB/s vs 25 MB/s)
- Lower CPU usage (kernel AES-NI vs userspace OpenSSL)
The key insight: for bulk data transfer, SSH's flexibility is overhead. A purpose-built tool with kernel encryption wins.
Appendix: Full Benchmark Data
Test Environment
- Sender: Ubuntu laptop, local NVMe
- Receiver: GCP VM (us-central1-a)
- Network: Public internet
- All tests with cold cache (echo 3 > /proc/sys/vm/drop_caches)
Raw Results
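| Dataset | Files | Size | kTLS time | kTLS throughput | rsync time | rsync throughput |
|---|---|---|---|---|---|---|
| ml_small | 10K | 60MB | 2.98s | 20 MB/s | 2.63s | 23 MB/s |
| ml_large | 100K | 589MB | 16.4s | 36 MB/s | 24.8s | 24 MB/s |
| ml_images | 100K | 9.7GB | 165s | 60 MB/s | 390s | 25 MB/s |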
Splice Investigation (ml_images)
| Mode | Time | Throughput | Notes |
|---|---|---|---|
| Plaintext + read/send | 146s | 68 MB/s | Fastest (no encryption) |
| Plaintext + splice | 157s | 63 MB/s | +8% overhead |
| kTLS + read/send | 165s | 60 MB/s | +13% (encryption cost) |
| kTLS + splice | 428s | 23 MB/s | 2.9x slower (broken) |
Benchmarks run January 2026. Your mileage may vary depending on network conditions and hardware.
Tags: #linux #ktls #tls #rsync #performance #networking #encryption