CppNet

Quick Start

Get CppNet up and running in minutes.

Prerequisites

C++17 compiler (GCC, Clang, or MSVC)
CMake ≥ 3.18
Eigen3 ≥ 3.3
OpenMP (optional — CPU parallelism)
CUDA Toolkit (optional — GPU acceleration)

Step 1

Clone & Build

git clone https://github.com/LoqmanSamani/CppNet.git
cd CppNet
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Step 2

Install System-Wide (optional)

sudo make install

Step 3

Use in Your CMake Project

find_package(CppNet REQUIRED)
target_link_libraries(your_target PRIVATE CppNet::CppNet)

Minimal Example

Binary Classification in ~20 Lines

#include <CppNet/CppNet.hpp>
#include <iostream>

int main() {
    // Define layers
    CppNet::Layers::Linear layer1(30, 64, "fc1", true, true, "cpu-eigen", "xavier");
    CppNet::Layers::Linear layer2(64, 1,  "fc2", true, true, "cpu-eigen", "xavier");
    CppNet::Activations::ReLU relu("cpu-eigen");
    CppNet::Activations::Sigmoid sigmoid;

    // Loss & optimizer
    CppNet::Losses::BinaryCrossEntropy loss_fn("mean");
    CppNet::Optimizers::Adam optimizer;
    float lr = 0.001;

    // Training loop
    for (int epoch = 0; epoch < 100; ++epoch) {
        auto h = relu.forward(layer1.forward(X_train));
        auto pred = sigmoid.forward(layer2.forward(h));

        float loss = loss_fn.forward(pred, Y_train);
        auto grad = loss_fn.backward(pred, Y_train);

        grad = layer2.backward(sigmoid.backward(grad));
        layer1.backward(relu.backward(grad));

        layer2.step(optimizer, lr);
        layer1.step(optimizer, lr);

        std::cout << "Epoch " << epoch << " — Loss: " << loss << std::endl;
    }
    return 0;
}

Examples

Complete, self-contained programs that train on synthetic data — no downloads required.

MLP Classification

Multi-layer perceptron for 3-class spiral classification using SoftmaxCrossEntropy and Adam optimizer.

// Architecture: Linear(2,64) -> ReLU -> Linear(64,32) -> ReLU -> Linear(32,3)
auto linear1 = std::make_shared<Layers::Linear>(2, 64, "fc1", true, true, "cpu-eigen");
auto linear2 = std::make_shared<Layers::Linear>(64, 32, "fc2", true, true, "cpu-eigen");
auto linear3 = std::make_shared<Layers::Linear>(32, 3, "fc3", true, true, "cpu-eigen");

Activations::ReLU relu1("cpu-eigen");
Activations::ReLU relu2("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;

CNN Image Classification

Convolutional neural network classifying 8×8 synthetic stripe images (horizontal vs vertical) with Conv2D, MaxPool, and Flatten layers.

// Conv2D(1,4,3,pad=1) -> ReLU -> MaxPool2D(2) -> Flatten -> Linear(64,2)
auto conv1   = std::make_shared<Layers::Conv2D>(1, 4, 3, 1, 1, true, "cpu-eigen");
auto pool1   = std::make_shared<Layers::MaxPool2D>(2, -1, "cpu-eigen");
auto flatten = std::make_shared<Layers::Flatten>();
auto fc      = std::make_shared<Layers::Linear>(64, 2, "classifier",
                                                  true, true, "cpu-eigen");
Activations::ReLU relu1("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;

LSTM Sequence Prediction

LSTM-based recurrent network for sine-wave regression — predicting the next value in a discretised sine wave sequence.

// LSTM(1,16,return_sequences=true) -> last timestep -> Linear(16,1)
auto lstm   = std::make_shared<Layers::LSTM>(1, hidden_size, true, "cpu-eigen");
auto linear = std::make_shared<Layers::Linear>(hidden_size, 1, "output",
                                                  true, true, "cpu-eigen");
Losses::MSE loss;
Optimizers::Adam optimizer;

auto lstm_out = lstm->forward(x_batch);
// Take last timestep hidden state
Eigen::Tensor<float, 2> last_hidden(batch_size, hidden_size);
for (int b = 0; b < batch_size; ++b)
    for (int d = 0; d < hidden_size; ++d)
        last_hidden(b, d) = lstm_out(b, seq_len - 1, d);

GRU Sequence Prediction

GRU-based sine-wave regression demonstrating the GRU layer with MAE loss and Momentum optimizer.

// GRU(1,16,return_sequences=true) -> last timestep -> Linear(16,1)
auto gru    = std::make_shared<Layers::GRU>(1, hidden_size, true, "cpu-eigen");
auto linear = std::make_shared<Layers::Linear>(hidden_size, 1, "output",
                                                  true, true, "cpu-eigen");
Losses::MAE loss;
Optimizers::Momentum optimizer(0.9f);

auto gru_out = gru->forward(x_batch);
// Take last timestep
Eigen::Tensor<float, 2> last_hidden(batch_size, hidden_size);
for (int b = 0; b < batch_size; ++b)
    for (int d = 0; d < hidden_size; ++d)
        last_hidden(b, d) = gru_out(b, seq_len - 1, d);

Transformer Classifier

Token-sequence classifier using Embedding, Multi-Head Self-Attention with skip connections, and mean pooling.

// Embedding(vocab,embed_dim) -> Self-Attention -> Mean-pool -> ReLU -> Linear
auto embedding = std::make_shared<Layers::Embedding>(vocab_size, embed_dim, "cpu-eigen");
auto attention = std::make_shared<Layers::MultiHeadAttention>(embed_dim, num_heads, "cpu-eigen");
auto fc        = std::make_shared<Layers::Linear>(embed_dim, num_classes, "classifier",
                                                     true, true, "cpu-eigen");
Activations::ReLU relu("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;

ResNet Classifier

ResNet-style binary classifier with skip connections for concentric circles dataset, using gradient clipping and He initialization.

// Linear(2,H) -> ReLU -> [ResBlock(H,H)] + skip -> ReLU -> Linear(H,1) -> Sigmoid
Layers::Linear proj(2, H, "proj", true, true, "cpu-eigen", "he");
Layers::Linear b1a(H, H, "b1a", true, true, "cpu-eigen", "he");
Layers::Linear b1b(H, H, "b1b", true, true, "cpu-eigen", "he");
Layers::Linear head(H, 1, "head", true, true, "cpu-eigen", "xavier");

auto skip = h;                    // save for skip connection
auto blk = b1a.forward(h);
blk = r1a.forward(blk);
blk = b1b.forward(blk);
h = add_tensors(blk, skip);       // residual add

Regularized CNN

CNN with BatchNorm, Dropout, and LeakyReLU on 3-class synthetic image data using CategoricalCrossEntropy and Adagrad.

// Conv2D(1,8,3,pad=1) -> LeakyReLU -> Pool -> Flatten -> BatchNorm -> Dropout -> FC
auto conv1    = std::make_shared<Layers::Conv2D>(1, 8, 3, 1, 1, true, "cpu-eigen");
auto pool1    = std::make_shared<Layers::MaxPool2D>(2, -1, "cpu-eigen");
auto flatten  = std::make_shared<Layers::Flatten>();
auto fc       = std::make_shared<Layers::Linear>(128, 3, "classifier",
                                                    true, true, "cpu-eigen");
Activations::LeakyReLU leaky_relu(0.01f, "cpu-eigen");
Layers::BatchNorm      batch_norm(128);
Layers::Dropout        dropout(0.3f);
Losses::CategoricalCrossEntropy loss("mean", true);
Optimizers::Adagrad optimizer;

Optimizer Comparison

Compares SGD, Momentum, Adagrad, RMSProp, and Adam on the same MLP and regression dataset with Tanh activation and Huber loss.

// Linear(2,32) -> Tanh -> Linear(32,16) -> Tanh -> Linear(16,1)
Optimizers::SGD      opt_sgd;
Optimizers::Momentum opt_momentum(0.9f);
Optimizers::Adagrad  opt_adagrad;
Optimizers::RMSProp  opt_rmsprop;
Optimizers::Adam     opt_adam;

Activations::Tanh tanh1;
Activations::Tanh tanh2;
Losses::Huber loss(1.0f);

// Each optimizer trains an identical fresh model
auto losses = train_with_optimizer(X, Y, *e.opt, e.name,
                                   epochs, batch_size, learning_rate);

Benchmarks

Performance comparisons across three compute backends — cpu-eigen, cpu (OpenMP), and gpu (CUDA) — on five architectures.

Test Environment

NVIDIA GeForce GTX 1650 (4 GB), multi-core CPU, GCC 13.3, CUDA 12.0, Release build (-O2). OpenMP: 4 threads. All results are averages over full training runs with fixed random seeds for reproducibility.

📈 Average GPU Speedups by Architecture

Architecture	Avg GPU Speedup	Best GPU Speedup	Best Configuration
MLP	12.1x	25.3x	XLarge (2.6M params)
CNN	35.4x	42.0x	Medium (Conv32→64→FC128)
Sequence (RNN/LSTM/GRU)	13.7x	56.4x	GRU Large (H=256)
Transformer	0.9x	1.2x	Large (d=128, h=8)
ResNet	5.8x	9.0x	Large (W=256, D=6)

🧠 MLP — Spiral Classification

5-class 2D spiral, 15,000 samples. Adam optimizer, ReLU activations, Xavier init.

Config	Architecture	cpu-eigen	cpu (OpenMP)	gpu (CUDA)	GPU Speedup
Small	2→64→64→5	7.4 s	17.9 s	3.6 s	2.03x
Medium	2→128→256→128→5	61.6 s	241.9 s	8.9 s	6.91x
Large	2→256→512→512→256→5	114.5 s	669.9 s	8.0 s	14.31x
XLarge	2→512→1024→1024→512→5	473.1 s	3,027.0 s	18.7 s	25.28x

📸 CNN — Image Classification

Synthetic CIFAR-10-shaped images (3×32×32), 1,000 samples, 10 classes. ReLU + MaxPool2D.

Config	Architecture	cpu-eigen	cpu (OpenMP)	gpu (CUDA)	GPU Speedup
Small	Conv16→Conv32→FC10	84.8 s	91.0 s	2.9 s	28.82x
Medium	Conv32→Conv64→FC128→FC10	252.3 s	277.4 s	6.0 s	41.97x

🔁 Sequence Layers — Sine-Wave Regression

RNN, LSTM, and GRU compared across three sizes. Momentum optimizer, MSE loss.

Config	Layer	cpu-eigen	cpu (OpenMP)	gpu (CUDA)	GPU Speedup
Small (H=64)	RNN	0.87 s	0.76 s	0.40 s	2.2x
	LSTM	2.70 s	2.89 s	0.79 s	3.4x
	GRU	3.98 s	3.92 s	0.76 s	5.2x
Medium (H=128)	RNN	2.53 s	2.84 s	0.54 s	4.7x
	LSTM	11.41 s	11.73 s	1.57 s	7.3x
	GRU	21.59 s	21.69 s	1.39 s	15.5x
Large (H=256)	RNN	22.93 s	24.24 s	1.88 s	12.2x
	LSTM	100.40 s	101.28 s	6.16 s	16.3x
	GRU	318.00 s	318.07 s	5.64 s	56.4x

🤖 Transformer — Token Classification

Embedding → Self-Attention → Mean Pool → ReLU → Linear. 4 classes, Adam optimizer.

Config	Architecture	cpu-eigen	cpu (OpenMP)	gpu (CUDA)	GPU Speedup
Small	Emb(200,32)→Attn(h=2)→FC(32,4)	0.79 s	0.83 s	1.51 s	0.52x
Medium	Emb(500,64)→Attn(h=4)→FC(64,4)	2.52 s	3.04 s	2.51 s	1.00x
Large	Emb(1000,128)→Attn(h=8)→FC(128,4)	9.89 s	22.36 s	8.51 s	1.16x

🏗 ResNet — Spiral Classification

Deep residual networks with stacked skip-connection blocks. Adam optimizer, He init.

Config	Architecture	cpu-eigen	cpu (OpenMP)	gpu (CUDA)	GPU Speedup
Small	2→64→[ResBlock×2]→5	6.35 s	24.85 s	4.06 s	1.56x
Medium	2→128→[ResBlock×4]→5	26.60 s	77.50 s	3.99 s	6.68x
Large	2→256→[ResBlock×6]→5	125.27 s	517.33 s	13.88 s	9.03x

GPU scales with model size

Across all architectures, larger models see dramatically higher GPU speedups as matrix sizes better saturate GPU cores.

CNNs & RNNs benefit most

Convolution achieves up to 42x speedup; GRU achieves up to 56.4x — the highest across all benchmarks.

Eigen beats OpenMP on CPU

Eigen’s SIMD vectorization and cache-optimal layouts consistently outperform manual OpenMP loops for all architectures.

Numerical consistency verified

All devices converge to equivalent loss and accuracy values, confirming correctness across all 41 CUDA kernels.

Full per-epoch results and methodology: benchmarks/benchmarks.md

High Performance

Rich Layer Library

Multiple Backends

Training Utilities

Full CUDA Coverage

Quick Start

Clone & Build

Install System-Wide (optional)

Use in Your CMake Project

Binary Classification in ~20 Lines

Examples

MLP Classification

CNN Image Classification

LSTM Sequence Prediction

GRU Sequence Prediction

Transformer Classifier

ResNet Classifier

Regularized CNN

Optimizer Comparison

Benchmarks

📈 Average GPU Speedups by Architecture

🧠 MLP — Spiral Classification

📸 CNN — Image Classification

🔁 Sequence Layers — Sine-Wave Regression

🤖 Transformer — Token Classification

🏗 ResNet — Spiral Classification

GPU scales with model size

CNNs & RNNs benefit most

Eigen beats OpenMP on CPU

Numerical consistency verified

Loghman Samani