Learning, Solving and Optimizing PDEs with TensorGalerkin: an Efficient High-Performance Galerkin Assembly Algorithm

1ETH Zurich    2Imperial College London    3Northeastern University    4Renmin University of China
*Indicates Equal Contribution
Overview of TensorGalerkin: Map-Reduce Galerkin Assembly

Overview of TensorGalerkin. Stage I (Batch-Map) computes element-wise operators via a fully tensorized einsum kernel; Stage II (Sparse-Reduce) assembles global sparse values via routing matrices and a single SpMM. For comparison, the white box illustrates traditional FEM assembly via per-element loops and scatter-add (atomics) into the global system. The same assembly engine powers TensorMesh, TensorPils, and TensorOpt.

Abstract

We present a unified algorithmic framework for the numerical solution, constrained optimization, and physics-informed learning of PDEs with a variational structure. Our framework is based on a Galerkin discretization of the underlying variational forms, and its high efficiency stems from TensorGalerkin, a novel, highly optimized, GPU-friendly algorithm for linear-system assembly (stiffness matrices and load vectors). TensorGalerkin tensorizes element-wise operations in a Python-level Map stage and then carries out the global reduction with a single sparse matrix multiplication, which realizes message passing on the mesh-induced sparsity graph. It can be seamlessly employed downstream as i) a highly efficient numerical PDE solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. Across multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant gains in computational efficiency and accuracy over a variety of baselines in all targeted downstream applications.
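The two-stage Map-Reduce assembly can be illustrated with a minimal sketch for 1D P1 elements on the Poisson problem. This is not the TensorGalerkin API, only an illustration of the idea under assumed names: Stage I computes all local stiffness matrices in one batched einsum; Stage II routes the flattened local entries to the global sparsity pattern with a single sparse product, instead of a per-element scatter-add loop.

```python
import numpy as np
import scipy.sparse as sp

def assemble_map_reduce(nodes, conn):
    """Two-stage Galerkin assembly for 1D P1 Poisson (illustrative sketch)."""
    n = len(nodes)
    h = nodes[conn[:, 1]] - nodes[conn[:, 0]]        # element lengths
    # Stage I (Batch-Map): all local stiffness matrices via one einsum.
    Kref = np.array([[1.0, -1.0], [-1.0, 1.0]])      # reference stiffness
    Ke = np.einsum('e,ij->eij', 1.0 / h, Kref)       # shape (n_elem, 2, 2)
    # Stage II (Sparse-Reduce): route local entries to global slots with
    # one sparse matrix product instead of atomic scatter-add.
    rows = np.repeat(conn, 2, axis=1).reshape(-1)    # global row per local entry
    cols = np.tile(conn, (1, 2)).reshape(-1)         # global col per local entry
    lin = rows * n + cols                            # linearized (i, j) index
    uniq, inv = np.unique(lin, return_inverse=True)  # global sparsity pattern
    R = sp.coo_matrix((np.ones(inv.size), (inv, np.arange(inv.size))),
                      shape=(uniq.size, inv.size))   # 0/1 routing matrix
    vals = R @ Ke.reshape(-1)                        # single sparse reduce
    return sp.coo_matrix((vals, (uniq // n, uniq % n)), shape=(n, n)).tocsr()

nodes = np.linspace(0.0, 1.0, 5)                     # 4 uniform elements
conn = np.stack([np.arange(4), np.arange(1, 5)], axis=1)
A = assemble_map_reduce(nodes, conn)
```

Both stages are dense, batched array operations plus one sparse product, which is what makes the pattern GPU-friendly.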

Numerical PDE Solver

To demonstrate the capability of TensorMesh as a numerical PDE solver, we consider two PDE benchmarks: the Poisson equation and the linear elasticity equations. In the following figures, we present runtimes for TensorMesh as the number of degrees of freedom increases, for both the CPU and GPU versions. As baselines, we report runtimes for the widely used FEniCS framework on CPU, the scikit-fem (SKFEM) framework, JAX-FEM on CPU and GPU, and a PINN.
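As a self-contained reference for what such a benchmark solves, here is a minimal 1D Poisson problem with P1 elements (an illustration, not the TensorMesh or baseline code): assemble the tridiagonal stiffness matrix, impose homogeneous Dirichlet conditions, and solve -u'' = 1 on (0, 1), whose exact solution is u(x) = x(1 - x)/2.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

# Minimal 1D Poisson benchmark: -u'' = 1, u(0) = u(1) = 0, P1 elements.
n = 101
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
main = 2.0 / h * np.ones(n)
off = -1.0 / h * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1]).tolil()
b = h * np.ones(n)                 # consistent load for f = 1
b[0] = b[-1] = h / 2.0             # boundary hat functions have half mass
# Dirichlet boundary conditions by row replacement.
A[0, :] = 0.0;  A[0, 0] = 1.0;   b[0] = 0.0
A[-1, :] = 0.0; A[-1, -1] = 1.0; b[-1] = 0.0
u = spsolve(A.tocsc(), b)
err = np.max(np.abs(u - x * (1 - x) / 2))   # nodally exact in 1D
```

For 1D Poisson with exact load integration, the P1 Galerkin solution interpolates the exact solution at the nodes, so `err` is at machine precision; the 3D benchmarks in the figures measure the same pipeline at scale.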

Poisson Equation (3D)

Poisson Equation 3D Timing Comparison

Linear Elasticity (3D)

Linear Elasticity 3D Timing Comparison

Poisson Equation — Solution Visualization

Solution visualizations from TensorMesh, JAX-FEM, scikit-fem, FEniCS, and PINN.

Linear Elasticity — Solution Visualization

Solution visualizations from TensorMesh, JAX-FEM, scikit-fem, FEniCS, and PINN.

Neural PDE Solver

We consider a two-dimensional Poisson equation with checkerboard forcing terms at different scales to investigate the performance of the proposed TensorPils framework as a stand-alone neural PDE solver. We compare it against three neural PDE solvers (PINNs, VPINNs, and the Deep Ritz method) and report errors and visualizations below.
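The exact form of the forcing is not reproduced on this page; one plausible checkerboard pattern at frequency K, shown purely as an assumed illustration, is the sign of a product of sines:

```python
import numpy as np

# Assumed checkerboard forcing at frequency K (illustrative only; the
# paper's exact definition may differ): a K x K grid of alternating signs.
def checkerboard(x, y, K):
    return np.sign(np.sin(K * np.pi * x) * np.sin(K * np.pi * y))

xs, ys = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
f2 = checkerboard(xs, ys, 2)   # K = 2: a 2x2 alternating sign pattern
```

Larger K produces finer-scale forcing, which is what makes the high-frequency cases (K = 8) hard for pointwise-residual methods like PINNs.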

We report relative L2 error (%) for different frequencies K and training throughput (it/s). All models were trained for 10,000 Adam + 200 LBFGS steps.

Method       Rel. L2 Error (%)          Speed (it/s)
             K=2     K=4     K=8        Adam    LBFGS
PINN         0.90    6.30    34.77      20.1    1.0
VPINN        11.58   36.88   154.10     54.9    50.5
Deep Ritz    3.34    4.50    10.60      58.7    55.7
TensorPils   0.56    2.24    10.05      117.8   99.7
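The reported metric is the standard relative L2 error, expressed as a percentage; a minimal sketch of this definition (not the paper's evaluation script) is:

```python
import numpy as np

# Relative L2 error (%): ||pred - ref||_2 / ||ref||_2, scaled by 100.
def rel_l2_percent(pred, ref):
    pred, ref = np.asarray(pred), np.asarray(ref)
    return 100.0 * np.linalg.norm(pred - ref) / np.linalg.norm(ref)

e = rel_l2_percent([1.1, 2.0], [1.0, 2.0])   # ~4.47 %
```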

Solution Comparison

Solution comparisons for K = 2, K = 4, and K = 8.

Physics-informed Neural Operator

In the next set of experiments, we test the full TensorPils pipeline for physics-informed operator learning. To this end, we consider two time-dependent PDEs: the linear acoustic wave equation on a circular domain and the non-linear Allen-Cahn equation on an L-shaped domain.

We report the relative L2 errors on both In-distribution (ID) and Out-of-distribution (OOD) test sets.

Method         Wave                        AC
               ID           OOD            ID           OOD
Data-Driven    0.089±0.013  0.230±0.017    0.135±0.042  0.152±0.080
PI-DeepONet    0.626±0.033  0.863±0.018    0.743±0.163  8.536±6.306
TensorPils     0.085±0.010  0.090±0.006    0.110±0.014  0.083±0.013

Wave Equation

Allen-Cahn Equation

Topology Optimization

We consider a challenging inverse-design problem from the field of topology optimization: compliance minimization for a 2D cantilever beam governed by the elasticity equations. The design domain is a rectangle with fixed support on one edge and a point load applied at a corner. We compare with JAX-FEM using the same setup.
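The core of compliance minimization can be sketched on a toy problem. Assuming a standard SIMP-style parameterization K(ρ) = Σ_e ρ_e^p K_e with compliance c = fᵀu and K u = f, the problem is self-adjoint and the sensitivity is dc/dρ_e = -p ρ_e^{p-1} u_eᵀ K_e u_e. The sketch below verifies this on a 1D bar of springs (an illustration, not the TensorMesh/TensorOpt implementation):

```python
import numpy as np

# SIMP-style compliance and its analytic sensitivity on a clamped 1D bar
# with a unit end load (illustrative toy model).
def compliance_and_grad(rho, p=3.0, k=1.0, load=1.0):
    ne = rho.size
    n = ne + 1
    Ke = k * np.array([[1.0, -1.0], [-1.0, 1.0]])     # element stiffness
    K = np.zeros((n, n))
    for e in range(ne):
        K[e:e + 2, e:e + 2] += rho[e]**p * Ke          # SIMP interpolation
    f = np.zeros(n); f[-1] = load
    free = np.arange(1, n)                             # node 0 is clamped
    u = np.zeros(n)
    u[free] = np.linalg.solve(K[np.ix_(free, free)], f[free])
    c = f @ u                                          # compliance f^T u
    # Self-adjoint sensitivity: dc/drho_e = -p rho_e^(p-1) u_e^T K_e u_e.
    grad = np.array([-p * rho[e]**(p - 1) * (u[e:e + 2] @ Ke @ u[e:e + 2])
                     for e in range(ne)])
    return c, grad

rho = np.array([0.5, 0.8, 1.0, 0.6])
c, g = compliance_and_grad(rho)
```

In a differentiable-assembly framework the same gradient comes from automatic differentiation through the assembled system, which is what makes the optimization loop end-to-end.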

Performance comparison on the 2D Cantilever Beam task (51 iterations). Time is measured in seconds.

Stage              JAX-FEM   TensorMesh (Ours)   Speedup
Setup Time         2.62 s    0.58 s              4.5×
Optimization Loop  28.51 s   7.77 s              3.7×
Total Time         31.13 s   8.35 s              3.7×

BibTeX

@article{wen2026tensorgalerkin,
  title={Learning, Solving and Optimizing PDEs with TensorGalerkin: 
         an Efficient High-Performance Galerkin Assembly Algorithm},
  author={Wen, Shizheng and Chi, Mingyuan and Yu, Tianwei and Moseley, Ben and Michelis, Mike Yan and Ren, Pu and Sun, Hao and Mishra, Siddhartha},
  journal={arXiv preprint arXiv:2602.05052},
  year={2026}
}