Garfield

System support for Byzantine machine learning

Garfield is a library for building Byzantine machine learning (ML) applications on top of popular frameworks such as TensorFlow and PyTorch. We show how to use Garfield to build different architectures for ML applications: single server, multiple workers (SSMW); multiple servers, multiple workers (MSMW); and a fully decentralized architecture.

Byzantine Resilience · Decentralized · Distributed Learning · Federated Learning · PyTorch · TensorFlow
Key facts
  • Maturity: Prototype / Intermediate / Mature
  • Support (C4DT): Retired
  • Lab: Unknown

Introduction

Garfield is a framework for writing Machine Learning (ML) applications. It is built on top of PyTorch and TensorFlow, two of the most widely used libraries in the field.

More specifically, Garfield allows one to write distributed, Byzantine fault tolerant ML applications. These terms are explained in the following paragraphs.

Distributed Machine Learning

Garfield focuses on Distributed ML, that is, the use of multiple machines to collaboratively train a model on data.

One reason to use many machines can be efficiency: distributing the work and executing it in parallel can dramatically reduce the computation time.

Another reason is privacy: in some situations, the data on which we wish to train a model belongs to different entities, who either cannot (for legal reasons) or do not want to share their data with the other participants. In this case, decentralized learning can be used, allowing each participant to keep its data secret and only share partial models.

These different goals lead to various distributed architectures, illustrated in the following figure. Garfield can work with any of these architectures, and provides examples for all of them.
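To make the idea concrete, the simplest of these architectures (one server, multiple workers) can be sketched as follows: each worker computes a gradient on its local data, and a central server averages them. This is a minimal illustration of non-robust distributed training, not Garfield's actual API; all names here are invented for the example.

```python
# Sketch of one aggregation round in a single-server, multiple-workers
# (SSMW) setup. Hypothetical helper, not part of Garfield's API.

def average_gradients(worker_gradients):
    """Coordinate-wise mean of the gradients sent by all workers."""
    n = len(worker_gradients)
    return [sum(coords) / n for coords in zip(*worker_gradients)]

# Three honest workers, each sending a 2-dimensional gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_gradients(grads))  # -> [3.0, 4.0]
```

The server would then apply the averaged gradient to the shared model and broadcast the updated parameters back to the workers. As the next section shows, this plain average is exactly what breaks down when some workers misbehave.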

Byzantine Faults

Computer systems can exhibit many kinds of failures, ranging from one-off glitches to catastrophic breakdowns. In distributed systems, the most general class of failures is called Byzantine faults, and characterizes conditions where components can behave in completely arbitrary ways. For example, a faulty component can behave normally toward one peer while presenting errors to another, or send inconsistent information to different peers. Such behavior can be the result of an attack, or simply of a combination of software or hardware errors.

A system is said to be Byzantine fault tolerant when the components that operate correctly can reach a correct result despite the presence of these faulty elements.

Garfield provides tools to create Byzantine fault tolerant ML applications, thereby allowing the training of models even if some of the actors do not behave correctly.
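In the averaging setup above, a single Byzantine worker can send an arbitrarily large gradient and corrupt the mean without bound. A standard family of defenses, used in Byzantine ML generally, replaces the mean with a robust aggregation rule. The sketch below uses the coordinate-wise median as one such rule; this is a generic illustration of the technique, not Garfield's specific implementation.

```python
# Robust aggregation sketch: the coordinate-wise median bounds the
# influence of a minority of Byzantine workers. Hypothetical helper,
# not Garfield's actual API.
from statistics import median

def median_aggregate(worker_gradients):
    """Coordinate-wise median of the gradients sent by all workers."""
    return [median(coords) for coords in zip(*worker_gradients)]

honest = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9]]
byzantine = [[1e9, -1e9]]  # arbitrarily corrupted gradient

# The median stays close to the honest gradients, whereas a plain
# average would be dragged to around 2.5e8 on the first coordinate.
print(median_aggregate(honest + byzantine))
```

Other robust rules from the literature, such as Krum or the trimmed mean, follow the same pattern: they tolerate up to a bounded fraction of Byzantine inputs at the cost of some statistical efficiency.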

Distributed Computing Lab

Prof. Rachid Guerraoui

The Distributed Computing Lab currently focuses on scalable implementations of cryptocurrencies, on Byzantine fault tolerance and privacy in distributed machine learning, and on distributed algorithms that make use of RDMA and NVRAM.

This page was last edited on 2024-03-22.