A high-performance C++17 deep learning library for building and training neural networks from scratch. Built on Eigen for fast tensor operations, OpenMP for CPU parallelism, and CUDA for GPU acceleration.
- Eigen SIMD vectorization, OpenMP CPU parallelism, and 41 CUDA GPU kernels for end-to-end training.
- Linear, Conv2D, MaxPool2D, RNN, LSTM, GRU, Multi-Head Attention, BatchNorm, Dropout, Embedding, Residual, and more.
- Per-layer backend selection: `cpu-eigen`, `cpu` (OpenMP), or `gpu` (CUDA).
- DataLoader, LR schedulers, early stopping, gradient clipping, and model serialization out of the box.
- 41 CUDA kernels covering all layers, activations, losses, and optimizers — up to 56x GPU speedup.
Get CppNet up and running in minutes.
```bash
git clone https://github.com/LoqmanSamani/CppNet.git
cd CppNet
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
sudo make install
```
Then consume the installed package from your own CMakeLists.txt:

```cmake
find_package(CppNet REQUIRED)
target_link_libraries(your_target PRIVATE CppNet::CppNet)
```
```cpp
#include <CppNet/CppNet.hpp>
#include <iostream>

int main() {
    // Define layers
    CppNet::Layers::Linear layer1(30, 64, "fc1", true, true, "cpu-eigen", "xavier");
    CppNet::Layers::Linear layer2(64, 1, "fc2", true, true, "cpu-eigen", "xavier");
    CppNet::Activations::ReLU relu("cpu-eigen");
    CppNet::Activations::Sigmoid sigmoid;

    // Loss & optimizer
    CppNet::Losses::BinaryCrossEntropy loss_fn("mean");
    CppNet::Optimizers::Adam optimizer;
    float lr = 0.001f;

    // X_train / Y_train: your feature and label tensors, loaded beforehand

    // Training loop
    for (int epoch = 0; epoch < 100; ++epoch) {
        // Forward pass
        auto h = relu.forward(layer1.forward(X_train));
        auto pred = sigmoid.forward(layer2.forward(h));
        float loss = loss_fn.forward(pred, Y_train);

        // Backward pass
        auto grad = loss_fn.backward(pred, Y_train);
        grad = layer2.backward(sigmoid.backward(grad));
        layer1.backward(relu.backward(grad));

        // Parameter updates
        layer2.step(optimizer, lr);
        layer1.step(optimizer, lr);

        std::cout << "Epoch " << epoch << " — Loss: " << loss << std::endl;
    }
    return 0;
}
```
Complete, self-contained programs that train on synthetic data — no downloads required.
Multi-layer perceptron for 3-class spiral classification using SoftmaxCrossEntropy and Adam optimizer.
```cpp
// Architecture: Linear(2,64) -> ReLU -> Linear(64,32) -> ReLU -> Linear(32,3)
auto linear1 = std::make_shared<Layers::Linear>(2, 64, "fc1", true, true, "cpu-eigen");
auto linear2 = std::make_shared<Layers::Linear>(64, 32, "fc2", true, true, "cpu-eigen");
auto linear3 = std::make_shared<Layers::Linear>(32, 3, "fc3", true, true, "cpu-eigen");
Activations::ReLU relu1("cpu-eigen");
Activations::ReLU relu2("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;
```
Convolutional neural network classifying 8×8 synthetic stripe images (horizontal vs vertical) with Conv2D, MaxPool, and Flatten layers.
```cpp
// Conv2D(1,4,3,pad=1) -> ReLU -> MaxPool2D(2) -> Flatten -> Linear(64,2)
auto conv1 = std::make_shared<Layers::Conv2D>(1, 4, 3, 1, 1, true, "cpu-eigen");
auto pool1 = std::make_shared<Layers::MaxPool2D>(2, -1, "cpu-eigen");
auto flatten = std::make_shared<Layers::Flatten>();
auto fc = std::make_shared<Layers::Linear>(64, 2, "classifier",
                                           true, true, "cpu-eigen");
Activations::ReLU relu1("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;
```
LSTM-based recurrent network for sine-wave regression — predicting the next value in a discretised sine wave sequence.
```cpp
// LSTM(1,16,return_sequences=true) -> last timestep -> Linear(16,1)
auto lstm = std::make_shared<Layers::LSTM>(1, hidden_size, true, "cpu-eigen");
auto linear = std::make_shared<Layers::Linear>(hidden_size, 1, "output",
                                               true, true, "cpu-eigen");
Losses::MSE loss;
Optimizers::Adam optimizer;

auto lstm_out = lstm->forward(x_batch);

// Take last timestep hidden state
Eigen::Tensor<float, 2> last_hidden(batch_size, hidden_size);
for (int b = 0; b < batch_size; ++b)
    for (int d = 0; d < hidden_size; ++d)
        last_hidden(b, d) = lstm_out(b, seq_len - 1, d);
```
GRU-based sine-wave regression demonstrating the GRU layer with MAE loss and Momentum optimizer.
```cpp
// GRU(1,16,return_sequences=true) -> last timestep -> Linear(16,1)
auto gru = std::make_shared<Layers::GRU>(1, hidden_size, true, "cpu-eigen");
auto linear = std::make_shared<Layers::Linear>(hidden_size, 1, "output",
                                               true, true, "cpu-eigen");
Losses::MAE loss;
Optimizers::Momentum optimizer(0.9f);

auto gru_out = gru->forward(x_batch);

// Take last timestep
Eigen::Tensor<float, 2> last_hidden(batch_size, hidden_size);
for (int b = 0; b < batch_size; ++b)
    for (int d = 0; d < hidden_size; ++d)
        last_hidden(b, d) = gru_out(b, seq_len - 1, d);
```
Token-sequence classifier using Embedding, Multi-Head Self-Attention with skip connections, and mean pooling.
```cpp
// Embedding(vocab,embed_dim) -> Self-Attention -> Mean-pool -> ReLU -> Linear
auto embedding = std::make_shared<Layers::Embedding>(vocab_size, embed_dim, "cpu-eigen");
auto attention = std::make_shared<Layers::MultiHeadAttention>(embed_dim, num_heads, "cpu-eigen");
auto fc = std::make_shared<Layers::Linear>(embed_dim, num_classes, "classifier",
                                           true, true, "cpu-eigen");
Activations::ReLU relu("cpu-eigen");
Losses::SoftmaxCrossEntropy loss;
Optimizers::Adam optimizer;
```
ResNet-style binary classifier with skip connections for concentric circles dataset, using gradient clipping and He initialization.
```cpp
// Linear(2,H) -> ReLU -> [ResBlock(H,H)] + skip -> ReLU -> Linear(H,1) -> Sigmoid
Layers::Linear proj(2, H, "proj", true, true, "cpu-eigen", "he");
Layers::Linear b1a(H, H, "b1a", true, true, "cpu-eigen", "he");
Layers::Linear b1b(H, H, "b1b", true, true, "cpu-eigen", "he");
Layers::Linear head(H, 1, "head", true, true, "cpu-eigen", "xavier");
Activations::ReLU r1a("cpu-eigen");

auto h = proj.forward(X);   // project the 2-D input up to width H
auto skip = h;              // save for skip connection
auto blk = b1a.forward(h);
blk = r1a.forward(blk);
blk = b1b.forward(blk);
h = add_tensors(blk, skip); // residual add
```
CNN with BatchNorm, Dropout, and LeakyReLU on 3-class synthetic image data using CategoricalCrossEntropy and Adagrad.
```cpp
// Conv2D(1,8,3,pad=1) -> LeakyReLU -> Pool -> Flatten -> BatchNorm -> Dropout -> FC
auto conv1 = std::make_shared<Layers::Conv2D>(1, 8, 3, 1, 1, true, "cpu-eigen");
auto pool1 = std::make_shared<Layers::MaxPool2D>(2, -1, "cpu-eigen");
auto flatten = std::make_shared<Layers::Flatten>();
auto fc = std::make_shared<Layers::Linear>(128, 3, "classifier",
                                           true, true, "cpu-eigen");
Activations::LeakyReLU leaky_relu(0.01f, "cpu-eigen");
Layers::BatchNorm batch_norm(128);
Layers::Dropout dropout(0.3f);
Losses::CategoricalCrossEntropy loss("mean", true);
Optimizers::Adagrad optimizer;
```
Compares SGD, Momentum, Adagrad, RMSProp, and Adam on the same MLP and regression dataset with Tanh activation and Huber loss.
```cpp
// Linear(2,32) -> Tanh -> Linear(32,16) -> Tanh -> Linear(16,1)
Optimizers::SGD opt_sgd;
Optimizers::Momentum opt_momentum(0.9f);
Optimizers::Adagrad opt_adagrad;
Optimizers::RMSProp opt_rmsprop;
Optimizers::Adam opt_adam;

Activations::Tanh tanh1;
Activations::Tanh tanh2;
Losses::Huber loss(1.0f);

// Each optimizer trains an identical fresh model
auto losses = train_with_optimizer(X, Y, *e.opt, e.name,
                                   epochs, batch_size, learning_rate);
Performance comparisons across three compute backends — cpu-eigen, cpu (OpenMP), and gpu (CUDA) — on five architectures.
NVIDIA GeForce GTX 1650 (4 GB), multi-core CPU, GCC 13.3, CUDA 12.0, Release build (-O2). OpenMP: 4 threads. All results are averages over full training runs with fixed random seeds for reproducibility.
| Architecture | Avg GPU Speedup | Best GPU Speedup | Best Configuration |
|---|---|---|---|
| MLP | 12.1x | 25.3x | XLarge (2.6M params) |
| CNN | 35.4x | 42.0x | Medium (Conv32→64→FC128) |
| Sequence (RNN/LSTM/GRU) | 13.7x | 56.4x | GRU Large (H=256) |
| Transformer | 0.9x | 1.2x | Large (d=128, h=8) |
| ResNet | 5.8x | 9.0x | Large (W=256, D=6) |
5-class 2D spiral, 15,000 samples. Adam optimizer, ReLU activations, Xavier init.
| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | 2→64→64→5 | 7.4 s | 17.9 s | 3.6 s | 2.03x |
| Medium | 2→128→256→128→5 | 61.6 s | 241.9 s | 8.9 s | 6.91x |
| Large | 2→256→512→512→256→5 | 114.5 s | 669.9 s | 8.0 s | 14.31x |
| XLarge | 2→512→1024→1024→512→5 | 473.1 s | 3,027.0 s | 18.7 s | 25.28x |
Synthetic CIFAR-10-shaped images (3×32×32), 1,000 samples, 10 classes. ReLU + MaxPool2D.
| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | Conv16→Conv32→FC10 | 84.8 s | 91.0 s | 2.9 s | 28.82x |
| Medium | Conv32→Conv64→FC128→FC10 | 252.3 s | 277.4 s | 6.0 s | 41.97x |
RNN, LSTM, and GRU compared across three sizes. Momentum optimizer, MSE loss.
| Config | Layer | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small (H=64) | RNN | 0.87 s | 0.76 s | 0.40 s | 2.2x |
| | LSTM | 2.70 s | 2.89 s | 0.79 s | 3.4x |
| | GRU | 3.98 s | 3.92 s | 0.76 s | 5.2x |
| Medium (H=128) | RNN | 2.53 s | 2.84 s | 0.54 s | 4.7x |
| | LSTM | 11.41 s | 11.73 s | 1.57 s | 7.3x |
| | GRU | 21.59 s | 21.69 s | 1.39 s | 15.5x |
| Large (H=256) | RNN | 22.93 s | 24.24 s | 1.88 s | 12.2x |
| | LSTM | 100.40 s | 101.28 s | 6.16 s | 16.3x |
| | GRU | 318.00 s | 318.07 s | 5.64 s | 56.4x |
Embedding → Self-Attention → Mean Pool → ReLU → Linear. 4 classes, Adam optimizer.
| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | Emb(200,32)→Attn(h=2)→FC(32,4) | 0.79 s | 0.83 s | 1.51 s | 0.52x |
| Medium | Emb(500,64)→Attn(h=4)→FC(64,4) | 2.52 s | 3.04 s | 2.51 s | 1.00x |
| Large | Emb(1000,128)→Attn(h=8)→FC(128,4) | 9.89 s | 22.36 s | 8.51 s | 1.16x |
Deep residual networks with stacked skip-connection blocks. Adam optimizer, He init.
| Config | Architecture | cpu-eigen | cpu (OpenMP) | gpu (CUDA) | GPU Speedup |
|---|---|---|---|---|---|
| Small | 2→64→[ResBlock×2]→5 | 6.35 s | 24.85 s | 4.06 s | 1.56x |
| Medium | 2→128→[ResBlock×4]→5 | 26.60 s | 77.50 s | 3.99 s | 6.68x |
| Large | 2→256→[ResBlock×6]→5 | 125.27 s | 517.33 s | 13.88 s | 9.03x |
Across all architectures, larger models see dramatically higher GPU speedups as matrix sizes better saturate GPU cores.
Convolution achieves up to 42x speedup; GRU achieves up to 56.4x — the highest across all benchmarks.
Eigen’s SIMD vectorization and cache-optimal layouts outperform the manual OpenMP loops in nearly every configuration, making cpu-eigen the faster CPU backend across the board.
All devices converge to equivalent loss and accuracy values, confirming correctness across all 41 CUDA kernels.
Full per-epoch results and methodology: benchmarks/benchmarks.md