Garfield
System support for Byzantine machine learning
ML models are now commonly trained in a distributed fashion, because large models and huge datasets do not scale on a single machine. This distribution inevitably raises the probability of a failure somewhere in the network. Garfield is a library that ensures training remains correct and converges despite such failures. It can be used with a variety of ML applications and architectures.
Garfield is a library for building Byzantine-resilient machine learning (ML) applications on top of popular frameworks such as TensorFlow and PyTorch. We show how to use Garfield to build ML applications on different architectures: single server, multiple workers (SSMW); multiple servers, multiple workers (MSMW); and fully decentralized.
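To make the idea of Byzantine resilience concrete, here is a minimal sketch of one server-side aggregation step in a single server, multiple workers setting. It is not Garfield's API: the function names `average` and `coordinate_wise_median` and the toy gradients are hypothetical, and the coordinate-wise median is used only as one well-known example of a robust aggregation rule, assuming a minority of workers may send arbitrary values.

```python
from statistics import mean, median


def average(gradients):
    """Plain averaging: a single corrupted gradient can shift the result arbitrarily."""
    return [mean(coords) for coords in zip(*gradients)]


def coordinate_wise_median(gradients):
    """Take the median of each coordinate across workers.

    The result stays within the range of honest values as long as
    fewer than half of the workers send arbitrary gradients.
    """
    return [median(coords) for coords in zip(*gradients)]


# Hypothetical gradients from four honest workers (illustrative values only).
honest = [[1.0, -2.0], [1.2, -1.8], [0.8, -2.2], [1.0, -2.0]]
# One Byzantine worker sends an arbitrary, very large gradient.
byzantine = [[1e6, 1e6]]
gradients = honest + byzantine

avg = average(gradients)
med = coordinate_wise_median(gradients)

print("average:", avg)  # dominated by the Byzantine gradient
print("median: ", med)  # stays close to the honest gradients
```

With plain averaging, the single faulty worker dominates the update; with the median, the aggregated gradient stays among the honest values. A real deployment would apply such a rule to framework tensors (for example in TensorFlow or PyTorch) rather than Python lists.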
inactive
—
entered showcase: 2021-01-20
—
entry updated: 2024-03-22
Started in spring 2021; demo in autumn.
Prototype
Library
Python, CUDA, C++
MIT