Golang AI applications have incredible potential: the language offers exceptional speed, easy debugging, built-in concurrency, and excellent libraries for ML, deep learning, and reinforcement learning.
Benchmark
ADeLe: ADeLe v1.0 is a comprehensive AI evaluation framework that combines explanatory analysis and predictive modeling capabilities to systematically assess AI system performance across multiple dimensions.
SWELancer: The SWE-Lancer-Benchmark is designed to evaluate the capabilities of frontier LLMs in solving real-world freelance software engineering tasks, exploring their potential to generate economic value through complex software development scenarios.
BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.
BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
GPQA: GPQA: A Graduate-Level Google-Proof Q&A Benchmark.
HellaSwag: Can a Machine Really Finish Your Sentence?
IFEval: IFEval is designed to systematically evaluate the instruction-following capabilities of large language models by incorporating 25 verifiable instruction types (e.g., format constraints, keyword inclusion) and applying dual strict-loose metrics for automated, objective assessment of model compliance.
LiveBench: A Challenging, Contamination-Free LLM Benchmark.
MMLU: Measuring Massive Multitask Language Understanding ICLR 2021.
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark.
MMLU-Pro: [NeurIPS 2024] A More Robust and Challenging Multi-Task Language Understanding Benchmark.
PIQA: PIQA is a dataset for commonsense reasoning, and was created to investigate the physical knowledge of existing models in NLP.
WinoGrande: An Adversarial Winograd Schema Challenge at Scale.
Chinese
C-Eval: [NeurIPS 2023] A Chinese evaluation suite for foundation models.
CMMLU: Measuring massive multitask language understanding in Chinese.
C-SimpleQA: A Chinese Factuality Evaluation for Large Language Models.
Math
AIME: Evaluation of LLMs on the latest math competitions.
grade-school-math: The GSM8K dataset contains 8.5K grade school math word problems designed to evaluate multi-step reasoning capabilities in language models, revealing that even large transformers struggle with these conceptually simple yet procedurally complex tasks.
MATH: The MATH dataset (NeurIPS 2021) is a benchmark for evaluating mathematical problem-solving capabilities, offering dataset loaders, evaluation code, and pre-training data.
MathVista: MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts.
Omni-MATH: Omni-MATH is a comprehensive and challenging benchmark specifically designed to assess LLMs’ mathematical reasoning at the Olympiad level.
TAU-bench: τ-bench emulates dynamic conversations between a simulated user and a language agent equipped with domain-specific API tools and policy guidelines, evaluating tool-agent-user interaction in real-world domains.
Code
AIDER: The leaderboards page of aider presents a performance comparison of various LLMs in programming-related tasks, such as code writing and editing.
BFCL: BFCL aims to provide a thorough study of the function-calling capability of different LLMs.
BigCodeBench: [ICLR’25] BigCodeBench: Benchmarking Code Generation Towards AGI.
Code4Bench: A Multidimensional Benchmark of Codeforces Data for Different Program Analysis Techniques.
CRUXEval: Code Reasoning, Understanding, and Execution Evaluation.
HumanEval: Code for the paper “Evaluating Large Language Models Trained on Code”.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.
MBPP: The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and more.
MultiPL-E: A multi-programming language benchmark for LLMs.
multi-swe-bench: The Multi-SWE-bench project, developed by ByteDance’s Doubao team, is the first open-source multilingual dataset for evaluating and enhancing large language models’ ability to automatically debug code, covering 7 major programming languages (e.g., Java, C++, JavaScript) with real-world GitHub issues to benchmark “full-stack engineering” capabilities.
SWE-bench: SWE-bench is a benchmark suite designed to evaluate the capabilities of large language models (LLMs) in solving real-world software engineering tasks, focusing on actual software bug-fixing challenges extracted from open-source projects.
Tool Use
BFCL: Training and Evaluating LLMs for Function Calls (Tool Calls).
T-Eval: [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step.
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users.
Open ended
Arena-Hard: Arena-Hard-Auto: An automatic LLM benchmark.
Safety
False refusal
Xstest: Röttger et al. (NAACL 2024): “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models”.
Multi-modal
DPG-Bench: The DPG benchmark tests a model’s ability to follow complex image generation prompts.
geneval: GenEval: An object-focused framework for evaluating text-to-image alignment.
LongVideoBench: [NeurIPS 2024 D&B] Official Dataloader and Evaluation Scripts for LongVideoBench.
MLVU: Multi-task Long Video Understanding Benchmark.
perception_test: A diagnostic benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models.
TempCompass: A benchmark to evaluate the temporal perception ability of Video LLMs.
VBench: VBench is an open-source project aiming to build a comprehensive evaluation benchmark for video generation models.
Video-MME: [CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis.
Model Context Protocol (MCP)
mcp-go: A Go implementation of the Model Context Protocol (MCP), enabling seamless integration between LLM applications and external data sources and tools (see the sketch after this list).
mcp-golang: Write Model Context Protocol servers in a few lines of Go code.
gateway: Universal MCP-Server for your Databases optimized for LLMs and AI-Agents.
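As a quick taste of what an MCP server looks like in Go, here is a minimal sketch modelled on the example in the mcp-go README. The tool name, argument, and greeting text are illustrative, and argument-access helpers have changed across mcp-go releases, so treat this as a sketch rather than the definitive API.

```go
package main

import (
	"context"
	"fmt"

	"github.com/mark3labs/mcp-go/mcp"
	"github.com/mark3labs/mcp-go/server"
)

func main() {
	// Create an MCP server with a name and version.
	s := server.NewMCPServer("demo-server", "1.0.0")

	// Declare a tool with a single required string argument.
	tool := mcp.NewTool("hello",
		mcp.WithDescription("Say hello to someone"),
		mcp.WithString("name",
			mcp.Required(),
			mcp.Description("Name of the person to greet"),
		),
	)

	// Register the tool with its handler. (Helpers for reading tool arguments
	// have varied across mcp-go releases; consult the project README.)
	s.AddTool(tool, func(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
		return mcp.NewToolResultText("Hello from the Go MCP server!"), nil
	})

	// Serve the MCP protocol over stdio so an LLM client can connect.
	if err := server.ServeStdio(s); err != nil {
		fmt.Printf("server error: %v\n", err)
	}
}
```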
Large Language Model
GPT
gpt-go: Tiny GPT implemented from scratch in pure Go. Trained on Jules Verne books.
ChatGPT Apps
feishu-openai: Feishu (Lark) integrated with (GPT-4 + GPT-4V + DALL·E-3 + Whisper) delivers an extraordinary work experience.
chatgpt-telegram: Run your own ChatGPT Telegram bot with a single command.
SDKs
openai-go: The official Go library for the OpenAI API.
go-openai: OpenAI ChatGPT, GPT-3, GPT-4, DALL·E, Whisper API wrapper for Go (see the sketch after this list).
anthropic-sdk-go: Access to Anthropic’s safety-first language model APIs via Go.
go-anthropic: Anthropic Claude API wrapper for Go.
deepseek-go: A DeepSeek client for Go supporting R1, Chat V3, and Coder; also supports external providers such as Azure, OpenRouter, and local Ollama.
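To show the flavour of these SDKs, here is a minimal chat-completion sketch using go-openai; the API key placeholder, model choice, and prompt are illustrative, and the other SDKs in this list expose similar request/response shapes.

```go
package main

import (
	"context"
	"fmt"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	// Create a client with your API key (placeholder shown here).
	client := openai.NewClient("your-api-key")

	// Send a single-turn chat completion request.
	resp, err := client.CreateChatCompletion(
		context.Background(),
		openai.ChatCompletionRequest{
			Model: openai.GPT4, // any model name string also works here
			Messages: []openai.ChatCompletionMessage{
				{Role: openai.ChatMessageRoleUser, Content: "Say hello in one sentence."},
			},
		},
	)
	if err != nil {
		fmt.Printf("chat completion error: %v\n", err)
		return
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```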
DevTools
ollama: Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
go-attention: A full attention mechanism and transformer in pure go.
langchaingo: LangChain for Go, the easiest way to write LLM-based programs in Go (see the sketch after this list).
gpt4all-bindings: GPT4All Language Bindings provide cross-language interfaces to easily integrate and interact with GPT4All’s local LLMs, simplifying model loading and inference for developers.
llama.go: llama.go is like llama.cpp in pure Golang.
eino: The ultimate LLM/AI application development framework in Golang.
fabric: fabric is an open-source framework for augmenting humans using AI. It provides a modular framework for solving specific problems using a crowdsourced set of AI prompts that can be used anywhere.
genkit: An open source framework for building AI-powered apps with familiar code-centric patterns. Genkit makes it easy to develop, integrate, and test AI features with observability and evaluations. Genkit works with various models and platforms.
swarmgo: SwarmGo (agents-sdk-go) is a Go package that allows you to create AI agents capable of interacting, coordinating, and executing tasks.
orra: The orra-dev/orra project offers resilience for AI agent workflows.
core: A fast, agnostic, and powerful Go AI framework for one-shot workflows, building autonomous agents, and working with LLM providers.
gollm: Unified Go interface for Language Model (LLM) providers. Simplifies LLM integration with flexible prompt management and common task functions.
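As a taste of langchaingo, here is a minimal single-prompt sketch; it assumes OPENAI_API_KEY is set in the environment and the prompt text is illustrative. Beyond this one-shot helper, langchaingo also provides chains, agents, memory, and tool abstractions.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/openai"
)

func main() {
	ctx := context.Background()

	// Create an LLM client; by default this reads OPENAI_API_KEY from the environment.
	llm, err := openai.New()
	if err != nil {
		log.Fatal(err)
	}

	// One-shot prompt helper; swap in other providers or use chains/agents as needed.
	completion, err := llms.GenerateFromSinglePrompt(ctx, llm, "Explain goroutines in one sentence.")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(completion)
}
```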
Vector Database
milvus: Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search.
weaviate: Weaviate is an open-source vector database that stores both objects and vectors, allowing for the combination of vector search with structured filtering with the fault tolerance and scalability of a cloud-native database.
tidb: TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
Pipeline and Data Version
pachyderm: Data-Centric Pipelines and Data Versioning.
Embedding Benchmark
MTEB: MTEB (Massive Text Embedding Benchmark) is an open-source benchmarking framework for evaluating and comparing text embedding models across 8 tasks (e.g., classification, retrieval, clustering) using 58 datasets in 112 languages, providing standardized performance metrics for model selection.
BRIGHT: BRIGHT is a realistic, challenging benchmark for reasoning-intensive retrieval, featuring 12 diverse datasets (math, code, biology, etc.) to evaluate retrieval models across complex, context-rich queries requiring logical inference.
General Machine Learning libraries
goml: On-line Machine Learning in Go (and so much more).
golearn: A simple and customizable, batteries-included ML library in Go (see the sketch after this list).
gonum: Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more.
gorgonia: Gorgonia is a library that helps facilitate machine learning in Go.
spago: Self-contained Machine Learning and Natural Language Processing library in Go.
goro: A High-level Machine Learning Library for Go.
go-perceptron-go: A single / multi layer / recurrent neural network written in Golang.
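To illustrate the flavour of these libraries, here is a minimal golearn sketch adapted from its README: load a CSV dataset, train a k-nearest-neighbours classifier, and print evaluation metrics. The file name iris.csv is illustrative.

```go
package main

import (
	"fmt"
	"log"

	"github.com/sjwhitworth/golearn/base"
	"github.com/sjwhitworth/golearn/evaluation"
	"github.com/sjwhitworth/golearn/knn"
)

func main() {
	// Load a CSV dataset (path is illustrative); the last column is the class label.
	rawData, err := base.ParseCSVToInstances("iris.csv", true)
	if err != nil {
		log.Fatal(err)
	}

	// k-nearest-neighbours classifier with Euclidean distance and k=2.
	cls := knn.NewKnnClassifier("euclidean", "linear", 2)

	// Split into training and test sets, fit, and predict.
	trainData, testData := base.InstancesTrainTestSplit(rawData, 0.50)
	cls.Fit(trainData)
	predictions, err := cls.Predict(testData)
	if err != nil {
		log.Fatal(err)
	}

	// Summarise precision/recall per class via a confusion matrix.
	confusionMat, err := evaluation.GetConfusionMatrix(testData, predictions)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(evaluation.GetSummary(confusionMat))
}
```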
Linear Algebra
gosl: Linear algebra, eigenvalues, FFT, Bessel, elliptic, orthogonal polys, geometry, NURBS, numerical quadrature, 3D transfinite interpolation, random numbers, Mersenne twister, probability distributions, optimisation, differential equations.
sparse: Sparse matrix formats for linear algebra supporting scientific and machine learning applications.
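Below is a minimal sketch of sparse matrix work in Go using the sparse package together with gonum's mat (listed above under General Machine Learning libraries); the dimensions and values are illustrative, and it assumes the sparse formats interoperate with gonum's mat.Matrix interface as the sparse README describes.

```go
package main

import (
	"fmt"

	"github.com/james-bowman/sparse"
	"gonum.org/v1/gonum/mat"
)

func main() {
	// Build a 3x2 sparse matrix in DOK (dictionary-of-keys) format.
	dok := sparse.NewDOK(3, 2)
	dok.Set(0, 0, 5)
	dok.Set(2, 1, 7)

	// Convert to CSR, a format better suited to arithmetic.
	csr := dok.ToCSR()

	// Sparse types implement gonum's mat.Matrix, so they mix with dense matrices.
	dense := mat.NewDense(2, 3, []float64{1, 2, 3, 4, 5, 6})

	var product mat.Dense
	product.Mul(csr, dense) // (3x2) * (2x3) = 3x3

	fmt.Println(mat.Formatted(&product))
}
```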
Probability Distributions
godist: Probability distributions and associated methods in Go.
Decision Trees
CloudForest: CloudForest is a fast, flexible Go library for multi-threaded decision tree ensembles (Random Forest, Gradient Boosting, etc.) designed for high-dimensional heterogeneous data with missing values, emphasizing speed and robustness for real-world machine learning tasks.
Regression
regression: Multivariable regression library in Go.
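A minimal sketch using the regression package is shown below; the variable names and data points are illustrative: fit a model with one predictor, inspect the formula, and make a prediction.

```go
package main

import (
	"fmt"

	"github.com/sajari/regression"
)

func main() {
	// Set up a regression with one observed variable and one predictor.
	r := new(regression.Regression)
	r.SetObserved("house price")
	r.SetVar(0, "square metres")

	// Train on a few illustrative data points: (observed value, predictors...).
	r.Train(
		regression.DataPoint(120000, []float64{50}),
		regression.DataPoint(150000, []float64{65}),
		regression.DataPoint(200000, []float64{90}),
		regression.DataPoint(220000, []float64{100}),
	)
	r.Run()

	// Inspect the fitted formula and predict for a new input.
	fmt.Printf("Formula: %v\n", r.Formula)
	prediction, err := r.Predict([]float64{80})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("Predicted price for 80 sqm: %.0f\n", prediction)
}
```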