CppNet

A high-performance C++17 deep learning library for building and training neural networks from scratch. Built on Eigen for fast tensor operations, OpenMP for CPU parallelism, and CUDA for GPU acceleration.

⚡

High Performance

Eigen SIMD vectorization, OpenMP CPU parallelism, and 41 CUDA GPU kernels for end-to-end training.

🧩

Rich Layer Library

Linear, Conv2D, MaxPool2D, RNN, LSTM, GRU, Multi-Head Attention, BatchNorm, Dropout, Embedding, Residual, and more.

🔄

Multiple Backends

Per-layer backend selection: cpu-eigen, cpu (OpenMP), or gpu (CUDA).

🔧

Training Utilities

DataLoader, LR schedulers, early stopping, gradient clipping, and model serialization out of the box.

🚀

Full CUDA Coverage

41 CUDA kernels covering all layers, activations, losses, and optimizers — up to 56x GPU speedup.

Quick Start

Get CppNet up and running in minutes.

Prerequisites
  • C++17 compiler (GCC, Clang, or MSVC)
  • CMake ≥ 3.18
  • Eigen3 ≥ 3.3
  • OpenMP (optional — CPU parallelism)
  • CUDA Toolkit (optional — GPU acceleration)
Step 1

Clone & Build

git clone https://github.com/LoqmanSamani/CppNet.git
cd CppNet
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
Step 2

Install System-Wide (optional)

sudo make install
Step 3

Use in Your CMake Project

find_package(CppNet REQUIRED)
target_link_libraries(your_target PRIVATE CppNet::CppNet)
Minimal Example

Binary Classification in ~30 Lines

#include <CppNet/CppNet.hpp>
#include <iostream>

int main() {
    // Placeholder data: X_train (N, 30) features, Y_train (N, 1) binary labels.
    // Swap in your own dataset; this just makes the example self-contained.
    Eigen::Tensor<float, 2> X_train(256, 30);
    Eigen::Tensor<float, 2> Y_train(256, 1);
    X_train.setRandom();
    for (int i = 0; i < 256; ++i) Y_train(i, 0) = (i % 2 == 0) ? 1.0f : 0.0f;

    // Define layers
    CppNet::Layers::Linear layer1(30, 64, "fc1", true, true, "cpu-eigen", "xavier");
    CppNet::Layers::Linear layer2(64, 1,  "fc2", true, true, "cpu-eigen", "xavier");
    CppNet::Activations::ReLU relu("cpu-eigen");
    CppNet::Activations::Sigmoid sigmoid;

    // Loss & optimizer
    CppNet::Losses::BinaryCrossEntropy loss_fn("mean");
    CppNet::Optimizers::Adam optimizer;
    float lr = 0.001f;

    // Training loop
    for (int epoch = 0; epoch < 100; ++epoch) {
        auto h = relu.forward(layer1.forward(X_train));
        auto pred = sigmoid.forward(layer2.forward(h));

        float loss = loss_fn.forward(pred, Y_train);
        auto grad = loss_fn.backward(pred, Y_train);

        grad = layer2.backward(sigmoid.backward(grad));
        layer1.backward(relu.backward(grad));

        layer2.step(optimizer, lr);
        layer1.step(optimizer, lr);

        std::cout << "Epoch " << epoch << " — Loss: " << loss << std::endl;
    }
    return 0;
}
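The `loss_fn.forward`/`loss_fn.backward` pair above computes binary cross-entropy and its gradient. For reference, here is a minimal standalone sketch of that math using plain `std::vector` (illustrative only, not the CppNet implementation):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Mean binary cross-entropy: L = -(1/N) * sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ]
float bce_forward(const std::vector<float>& p, const std::vector<float>& y) {
    const float eps = 1e-7f;  // clamp predictions to avoid log(0)
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i) {
        float pi = std::clamp(p[i], eps, 1.0f - eps);
        sum += y[i] * std::log(pi) + (1.0f - y[i]) * std::log(1.0f - pi);
    }
    return -sum / static_cast<float>(p.size());
}

// Gradient w.r.t. predictions: dL/dp_i = (p_i - y_i) / (N * p_i * (1 - p_i))
std::vector<float> bce_backward(const std::vector<float>& p, const std::vector<float>& y) {
    const float eps = 1e-7f;
    std::vector<float> grad(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
        float pi = std::clamp(p[i], eps, 1.0f - eps);
        grad[i] = (pi - y[i]) / (static_cast<float>(p.size()) * pi * (1.0f - pi));
    }
    return grad;
}
```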

Examples

Complete, self-contained programs that train on synthetic data — no downloads required.

MLP Classification

Multi-layer perceptron for 3-class spiral classification using SoftmaxCrossEntropy and Adam optimizer.

// Architecture: Linear(2,64) -> ReLU -> Linear(64,32) -> ReLU -> Linear(32,3)
auto linear1 = std::make_shared<Layers::Linear>(2, 64, "fc1", true, true, "cpu-eigen");
auto linear2 = std::make_shared<Layers::Linear>(64, 32, "fc2", true, true, "cpu-eigen");
auto linear3 = std::make_shared<Layers::Linear>(32, 3, "fc3", true, true, "cpu-eigen");

Activations::ReLU relu1("cpu-eigen");
Activations::ReLU relu2("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;
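The spiral dataset itself is easy to synthesize. A sketch of a 3-class spiral generator (illustrative; the example's actual generator may use different noise and scaling):

```cpp
#include <array>
#include <cmath>
#include <random>
#include <vector>

struct SpiralData {
    std::vector<std::array<float, 2>> X;  // 2-D points
    std::vector<int> y;                   // class labels 0..2
};

// n points per class; each class traces one spiral arm rotated by 2*pi*c/3
SpiralData make_spiral(int n_per_class, unsigned seed = 42) {
    const float pi = 3.14159265f;
    SpiralData d;
    std::mt19937 rng(seed);
    std::normal_distribution<float> noise(0.0f, 0.2f);  // angular jitter
    for (int c = 0; c < 3; ++c) {
        for (int i = 0; i < n_per_class; ++i) {
            float r = static_cast<float>(i) / n_per_class;           // radius 0..1
            float t = 4.0f * r + 2.0f * pi * c / 3.0f + noise(rng);  // angle
            d.X.push_back({r * std::sin(t), r * std::cos(t)});
            d.y.push_back(c);
        }
    }
    return d;
}
```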

CNN Image Classification

Convolutional neural network classifying 8×8 synthetic stripe images (horizontal vs vertical) with Conv2D, MaxPool, and Flatten layers.

// Conv2D(1,4,3,pad=1) -> ReLU -> MaxPool2D(2) -> Flatten -> Linear(64,2)
auto conv1   = std::make_shared<Layers::Conv2D>(1, 4, 3, 1, 1, true, "cpu-eigen");
auto pool1   = std::make_shared<Layers::MaxPool2D>(2, -1, "cpu-eigen");
auto flatten = std::make_shared<Layers::Flatten>();
auto fc      = std::make_shared<Layers::Linear>(64, 2, "classifier",
                                                  true, true, "cpu-eigen");
Activations::ReLU relu1("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;
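The stripe images are equally simple to generate. A sketch of one way to produce the two classes as flat row-major buffers (illustrative, not the example's exact generator):

```cpp
#include <vector>

// 8x8 image with alternating stripes: label 0 = horizontal (rows alternate),
// label 1 = vertical (columns alternate).
std::vector<float> make_stripes(int label, int size = 8) {
    std::vector<float> img(size * size, 0.0f);
    for (int r = 0; r < size; ++r)
        for (int c = 0; c < size; ++c) {
            int band = (label == 0) ? r : c;   // stripe direction chosen by label
            img[r * size + c] = (band % 2 == 0) ? 1.0f : 0.0f;
        }
    return img;
}
```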

LSTM Sequence Prediction

LSTM-based recurrent network for sine-wave regression — predicting the next value in a discretised sine wave sequence.

// LSTM(1,16,return_sequences=true) -> last timestep -> Linear(16,1)
auto lstm   = std::make_shared<Layers::LSTM>(1, hidden_size, true, "cpu-eigen");
auto linear = std::make_shared<Layers::Linear>(hidden_size, 1, "output",
                                                  true, true, "cpu-eigen");
Losses::MSE loss;
Optimizers::Adam optimizer;

auto lstm_out = lstm->forward(x_batch);
// Take last timestep hidden state (equivalently: lstm_out.chip(seq_len - 1, 1))
Eigen::Tensor<float, 2> last_hidden(batch_size, hidden_size);
for (int b = 0; b < batch_size; ++b)
    for (int d = 0; d < hidden_size; ++d)
        last_hidden(b, d) = lstm_out(b, seq_len - 1, d);

GRU Sequence Prediction

GRU-based sine-wave regression demonstrating the GRU layer with MAE loss and Momentum optimizer.

// GRU(1,16,return_sequences=true) -> last timestep -> Linear(16,1)
auto gru    = std::make_shared<Layers::GRU>(1, hidden_size, true, "cpu-eigen");
auto linear = std::make_shared<Layers::Linear>(hidden_size, 1, "output",
                                                  true, true, "cpu-eigen");
Losses::MAE loss;
Optimizers::Momentum optimizer(0.9f);

auto gru_out = gru->forward(x_batch);
// Take last timestep
Eigen::Tensor<float, 2> last_hidden(batch_size, hidden_size);
for (int b = 0; b < batch_size; ++b)
    for (int d = 0; d < hidden_size; ++d)
        last_hidden(b, d) = gru_out(b, seq_len - 1, d);

Transformer Classifier

Token-sequence classifier using Embedding, Multi-Head Self-Attention with skip connections, and mean pooling.

// Embedding(vocab,embed_dim) -> Self-Attention -> Mean-pool -> ReLU -> Linear
auto embedding = std::make_shared<Layers::Embedding>(vocab_size, embed_dim, "cpu-eigen");
auto attention = std::make_shared<Layers::MultiHeadAttention>(embed_dim, num_heads, "cpu-eigen");
auto fc        = std::make_shared<Layers::Linear>(embed_dim, num_classes, "classifier",
                                                     true, true, "cpu-eigen");
Activations::ReLU relu("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;
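The MultiHeadAttention layer is built on scaled dot-product attention. A single-head sketch with plain `std::vector` matrices (illustrative of the math only, not the CppNet kernel):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<float>>;  // (rows, cols)

// Numerically stable row-wise softmax
static void softmax_rows(Mat& m) {
    for (auto& row : m) {
        float mx = row[0];
        for (float v : row) mx = std::max(mx, v);
        float sum = 0.0f;
        for (float& v : row) { v = std::exp(v - mx); sum += v; }
        for (float& v : row) v /= sum;
    }
}

// attention(Q,K,V) = softmax(Q K^T / sqrt(d)) V, with Q,K,V of shape (seq, d)
Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
    std::size_t seq = Q.size(), d = Q[0].size();
    Mat scores(seq, std::vector<float>(seq, 0.0f));
    for (std::size_t i = 0; i < seq; ++i)
        for (std::size_t j = 0; j < seq; ++j) {
            for (std::size_t k = 0; k < d; ++k) scores[i][j] += Q[i][k] * K[j][k];
            scores[i][j] /= std::sqrt(static_cast<float>(d));  // scale by sqrt(d)
        }
    softmax_rows(scores);                                      // attention weights
    Mat out(seq, std::vector<float>(V[0].size(), 0.0f));
    for (std::size_t i = 0; i < seq; ++i)
        for (std::size_t j = 0; j < seq; ++j)
            for (std::size_t k = 0; k < V[0].size(); ++k)
                out[i][k] += scores[i][j] * V[j][k];           // weighted sum of values
    return out;
}
```

Multi-head attention runs several such heads on learned projections of the input and concatenates the results.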

ResNet Classifier

ResNet-style binary classifier with skip connections for concentric circles dataset, using gradient clipping and He initialization.

// Linear(2,H) -> ReLU -> [ResBlock(H,H)] + skip -> ReLU -> Linear(H,1) -> Sigmoid
Layers::Linear proj(2, H, "proj", true, true, "cpu-eigen", "he");
Layers::Linear b1a(H, H, "b1a", true, true, "cpu-eigen", "he");
Layers::Linear b1b(H, H, "b1b", true, true, "cpu-eigen", "he");
Layers::Linear head(H, 1, "head", true, true, "cpu-eigen", "xavier");
Activations::ReLU relu1("cpu-eigen"), r1a("cpu-eigen");

auto h = relu1.forward(proj.forward(X));  // project input to width H
auto skip = h;                    // save for skip connection
auto blk = b1a.forward(h);
blk = r1a.forward(blk);
blk = b1b.forward(blk);
h = add_tensors(blk, skip);       // residual add (element-wise sum helper)
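The example also uses gradient clipping. A standalone sketch of clip-by-global-norm, one common scheme (CppNet's exact clipping variant may differ):

```cpp
#include <cmath>
#include <vector>

// Rescale the gradient so its L2 norm does not exceed max_norm;
// gradients already within the threshold are left untouched.
void clip_by_global_norm(std::vector<float>& grad, float max_norm) {
    float sq = 0.0f;
    for (float g : grad) sq += g * g;
    float norm = std::sqrt(sq);
    if (norm > max_norm) {
        float scale = max_norm / norm;
        for (float& g : grad) g *= scale;
    }
}
```

Clipping keeps the residual blocks' stacked gradients from blowing up early in training.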

Regularized CNN

CNN with BatchNorm, Dropout, and LeakyReLU on 3-class synthetic image data using CategoricalCrossEntropy and Adagrad.

// Conv2D(1,8,3,pad=1) -> LeakyReLU -> Pool -> Flatten -> BatchNorm -> Dropout -> FC
auto conv1    = std::make_shared<Layers::Conv2D>(1, 8, 3, 1, 1, true, "cpu-eigen");
auto pool1    = std::make_shared<Layers::MaxPool2D>(2, -1, "cpu-eigen");
auto flatten  = std::make_shared<Layers::Flatten>();
auto fc       = std::make_shared<Layers::Linear>(128, 3, "classifier",
                                                    true, true, "cpu-eigen");
Activations::LeakyReLU leaky_relu(0.01f, "cpu-eigen");
Layers::BatchNorm      batch_norm(128);
Layers::Dropout        dropout(0.3f);
Losses::CategoricalCrossEntropy loss("mean", true);
Optimizers::Adagrad optimizer;
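`Dropout(0.3f)` zeroes activations at random during training. A sketch of inverted dropout, the common formulation in which kept units are scaled by 1/(1-p) so inference needs no rescaling (illustrative, not necessarily CppNet's implementation):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Inverted dropout: drop each unit with probability p, scale survivors by 1/(1-p)
std::vector<float> dropout_forward(const std::vector<float>& x, float p,
                                   std::mt19937& rng, bool training) {
    if (!training || p <= 0.0f) return x;   // identity at inference time
    std::bernoulli_distribution keep(1.0 - p);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = keep(rng) ? x[i] / (1.0f - p) : 0.0f;
    return out;
}
```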

Optimizer Comparison

Compares SGD, Momentum, Adagrad, RMSProp, and Adam on the same MLP and regression dataset with Tanh activation and Huber loss.

// Linear(2,32) -> Tanh -> Linear(32,16) -> Tanh -> Linear(16,1)
Optimizers::SGD      opt_sgd;
Optimizers::Momentum opt_momentum(0.9f);
Optimizers::Adagrad  opt_adagrad;
Optimizers::RMSProp  opt_rmsprop;
Optimizers::Adam     opt_adam;

Activations::Tanh tanh1;
Activations::Tanh tanh2;
Losses::Huber loss(1.0f);

// Each optimizer trains an identical fresh model
auto losses = train_with_optimizer(X, Y, *e.opt, e.name,
                                   epochs, batch_size, learning_rate);
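The update rules being compared differ only in how the raw gradient is transformed before the weight update. A sketch of the two simplest, SGD and Momentum, with plain vectors (Adagrad, RMSProp, and Adam add per-parameter scaling on top; this is not the CppNet API):

```cpp
#include <cmath>
#include <vector>

// Vanilla SGD: w -= lr * g
void sgd_step(std::vector<float>& w, const std::vector<float>& g, float lr) {
    for (std::size_t i = 0; i < w.size(); ++i) w[i] -= lr * g[i];
}

// Momentum: v = mu*v + g; w -= lr * v  (velocity smooths the descent direction)
void momentum_step(std::vector<float>& w, std::vector<float>& v,
                   const std::vector<float>& g, float mu, float lr) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        v[i] = mu * v[i] + g[i];
        w[i] -= lr * v[i];
    }
}
```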

Benchmarks

Performance comparisons across three compute backends — cpu-eigen, cpu (OpenMP), and gpu (CUDA) — on five architectures.

Test Environment

NVIDIA GeForce GTX 1650 (4 GB), multi-core CPU, GCC 13.3, CUDA 12.0, Release build (-O2). OpenMP: 4 threads. All results are averages over full training runs with fixed random seeds for reproducibility.

📈 Average GPU Speedups by Architecture

| Architecture | Avg GPU Speedup | Best GPU Speedup | Best Configuration |
|---|---|---|---|
| MLP | 12.1x | 25.3x | XLarge (2.6M params) |
| CNN | 35.4x | 42.0x | Medium (Conv32→64→FC128) |
| Sequence (RNN/LSTM/GRU) | 13.7x | 56.4x | GRU Large (H=256) |
| Transformer | 0.9x | 1.2x | Large (d=128, h=8) |
| ResNet | 5.8x | 9.0x | Large (W=256, D=6) |

🧠 MLP — Spiral Classification

5-class 2D spiral, 15,000 samples. Adam optimizer, ReLU activations, Xavier init.

| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | 2→64→64→5 | 7.4 s | 17.9 s | 3.6 s | 2.03x |
| Medium | 2→128→256→128→5 | 61.6 s | 241.9 s | 8.9 s | 6.91x |
| Large | 2→256→512→512→256→5 | 114.5 s | 669.9 s | 8.0 s | 14.31x |
| XLarge | 2→512→1024→1024→512→5 | 473.1 s | 3,027.0 s | 18.7 s | 25.28x |

📸 CNN — Image Classification

Synthetic CIFAR-10-shaped images (3×32×32), 1,000 samples, 10 classes. ReLU + MaxPool2D.

| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | Conv16→Conv32→FC10 | 84.8 s | 91.0 s | 2.9 s | 28.82x |
| Medium | Conv32→Conv64→FC128→FC10 | 252.3 s | 277.4 s | 6.0 s | 41.97x |

🔁 Sequence Layers — Sine-Wave Regression

RNN, LSTM, and GRU compared across three sizes. Momentum optimizer, MSE loss.

| Config | Layer | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small (H=64) | RNN | 0.87 s | 0.76 s | 0.40 s | 2.2x |
| | LSTM | 2.70 s | 2.89 s | 0.79 s | 3.4x |
| | GRU | 3.98 s | 3.92 s | 0.76 s | 5.2x |
| Medium (H=128) | RNN | 2.53 s | 2.84 s | 0.54 s | 4.7x |
| | LSTM | 11.41 s | 11.73 s | 1.57 s | 7.3x |
| | GRU | 21.59 s | 21.69 s | 1.39 s | 15.5x |
| Large (H=256) | RNN | 22.93 s | 24.24 s | 1.88 s | 12.2x |
| | LSTM | 100.40 s | 101.28 s | 6.16 s | 16.3x |
| | GRU | 318.00 s | 318.07 s | 5.64 s | 56.4x |

🤖 Transformer — Token Classification

Embedding → Self-Attention → Mean Pool → ReLU → Linear. 4 classes, Adam optimizer.

| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | Emb(200,32)→Attn(h=2)→FC(32,4) | 0.79 s | 0.83 s | 1.51 s | 0.52x |
| Medium | Emb(500,64)→Attn(h=4)→FC(64,4) | 2.52 s | 3.04 s | 2.51 s | 1.00x |
| Large | Emb(1000,128)→Attn(h=8)→FC(128,4) | 9.89 s | 22.36 s | 8.51 s | 1.16x |

🏗 ResNet — Spiral Classification

Deep residual networks with stacked skip-connection blocks. Adam optimizer, He init.

| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | 2→64→[ResBlock×2]→5 | 6.35 s | 24.85 s | 4.06 s | 1.56x |
| Medium | 2→128→[ResBlock×4]→5 | 26.60 s | 77.50 s | 3.99 s | 6.68x |
| Large | 2→256→[ResBlock×6]→5 | 125.27 s | 517.33 s | 13.88 s | 9.03x |

GPU scales with model size

Across all architectures, larger models see dramatically higher GPU speedups as matrix sizes better saturate GPU cores.

CNNs & RNNs benefit most

Convolution achieves up to 42x speedup; GRU achieves up to 56.4x — the highest across all benchmarks.

Eigen beats OpenMP on CPU

Eigen’s SIMD vectorization and cache-optimal layouts outperform manual OpenMP loops in nearly every configuration, with the gap widening as models grow.

Numerical consistency verified

All devices converge to equivalent loss and accuracy values, confirming correctness across all 41 CUDA kernels.

Full per-epoch results and methodology: benchmarks/benchmarks.md

Loghman Samani

Computational biologist with a focus on generative AI, molecular dynamics, and machine learning. M.Sc. Computational Biology, University of Stuttgart.