SyftBox Computation Model
By now, you should have already installed SyftBox as covered in the introduction. If not, check out the SyftBox intro.
Once installed, you'll find a SyftBox folder on your system. Inside this folder, the two main components are APIs and Datasites.
APIs allow you to propose and participate in computations within SyftBox. Datasites, on the other hand, represent individual entities in the Syft network (individuals, organizations, etc.) and contain their private and public data.
Now, let's dive into how you can install an API on your SyftBox and start being part of a computation.
Workflow overview
Applying computations on private data works through a workflow that involves two parties:
- Data Owners -> the party that owns part of the private data taking part in the computation
- API Developer -> the party that wants to apply the computation on the private data; we can also see them as research proposers
Since the API developer is the one proposing the study, they will be in charge of writing both the API designed to run on the data owner's Datasite (preparing the data for aggregation) and the API designed to run on their Datasite.
Each party will manage their own Datasite and SyftBox APIs. For the purpose of these tutorials we'll label data owner Datasites with A, B, etc. and developer Datasites with X, Y, etc.
Data owner setup
A data owner takes part in a computation by installing a SyftBox API following a developer's proposal. The Data Owner API is designed to apply a computation on the private data on their Datasite and write a publicly available result.
A Data Owner is always in control of their private data and what code runs on it.
- Datasite A - a data owner's Datasite
- data - the private data stored on the data owner's Datasite
- data_owner_api - the API running on the data owner's Datasite
- public result - a result computed from the private data, which can be used for aggregations on other Datasites
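To make the data owner's side concrete, here is a minimal sketch of what a `data_owner_api` could look like. The paths and field names are hypothetical (a real SyftBox API resolves its Datasite locations from the client configuration); the point is the pattern: read private data, compute a summary, and write only that summary to a public location.

```python
import json
from pathlib import Path

# Hypothetical paths -- a real SyftBox API resolves these from the client config.
DATASITE = Path("SyftBox/datasites/A")
PRIVATE_DATA = DATASITE / "private" / "data.json"
PUBLIC_RESULT = DATASITE / "public" / "result.json"

def run() -> dict:
    """Compute a summary over the private data and publish only the result."""
    records = json.loads(PRIVATE_DATA.read_text())
    # Only the aggregate leaves the private folder, never the raw records.
    result = {"count": len(records), "total": sum(r["value"] for r in records)}
    PUBLIC_RESULT.parent.mkdir(parents=True, exist_ok=True)
    PUBLIC_RESULT.write_text(json.dumps(result))
    return result
```

The raw records never leave the Datasite; only the derived result lands in the public folder that SyftBox syncs to other participants.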
Developer setup
A developer installs another API on their own Datasite, which aggregates the public results computed on the data owners' Datasites.
- Datasite A & Datasite B - Datasites belonging to data owners taking part in the computation
- aggregator_api - the API running on the proposing developer's Datasite
- final result - the final result of the aggregation
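The aggregator side can be sketched the same way. Again, the folder layout and result schema below are assumptions for illustration: the aggregator walks the public result files that SyftBox has synced from each data owner's Datasite and combines them into a final result.

```python
import json
from pathlib import Path

# Hypothetical layout: each data owner's Datasite exposes a public result file
# that SyftBox syncs to the developer's machine.
DATASITES = Path("SyftBox/datasites")

def aggregate(peers: list[str]) -> dict:
    """Combine the public results synced from each data owner's Datasite."""
    totals = []
    for peer in peers:
        result_file = DATASITES / peer / "public" / "result.json"
        if result_file.exists():  # a peer may not have synced yet
            totals.append(json.loads(result_file.read_text())["total"])
    final = {"participants": len(totals), "sum": sum(totals)}
    out = DATASITES / "X" / "public" / "final_result.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(final))
    return final
```

Skipping peers whose result file is not present yet is a deliberate choice here: sync is asynchronous, so the aggregator should tolerate partial participation rather than fail.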
Step by step
The whole workflow looks something like this:
- data owners (A and B) each prepare their private data (CSV, JSON, or any other format supported by the APIs)
- data owners (A and B) each install the data_owner_api API (developed by the API developer) on their Datasites
- the developer (X) installs the aggregator_api API on their Datasite
- the developer (X) will soon see the aggregation result on their Datasite
That's it! SyftBox takes care of syncing the intermediary public results between datasites so the APIs can do their job.
For this workflow to work, every party needs to have their client running and connected to the same Syft network.
Example: CPU Tracker API
The CPU Tracker API is a simple example of what could be built on SyftBox: it's an application that gathers CPU usage data from participating Datasites, aggregates it, and displays it in a chart.
Click here to see a live example of the CPU Tracker API running on the main Syft network.
Installation
To install the CPU Tracker, follow these steps:
- Make sure your SyftBox client is running
- Click on the "Install CPU_tracker_member" button on the top-left of the page
Another way to install the API is to clone this repo and move the cpu_tracker_member folder to your syftbox/apis folder.
That's it! You're now part of the computation! Your CPU load will be included in the aggregation, helping to calculate the average CPU load across the Syft network. Your Datasite will appear in the "Active Peers" list participating in the computation.
How does the API work?
After installing the CPU Tracker API, you'll notice a new API called cpu_tracker_member in the APIs folder of your SyftBox. This API is defined by two key files, main.py and run.sh, which work together to perform a computation using data from your Datasite.
SyftBox
├── datasites
│ └── ...
└── apis
├── ...
└── cpu_tracker_member
├── main.py
├── run.sh
└── ...
A quick glance at the main.py script shows that the API collects 50 data points of your CPU usage at specific intervals and averages them (adding noise to ensure a degree of privacy). The processed result is then placed in a public folder to make it available for aggregation.
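The sample-then-add-noise step can be sketched as follows. This is not the API's actual code: the sampler is left pluggable (the real main.py reads CPU load from a system library), and uniform noise is used as a simple stand-in for whatever noise mechanism the real API applies.

```python
import random
import statistics
import time

def average_cpu(sample, n_samples: int = 50, interval: float = 0.0,
                noise_scale: float = 1.0) -> float:
    """Average n_samples CPU readings, then add noise for a degree of privacy.

    `sample` is any zero-argument callable returning the current CPU load;
    the real API reads this from the operating system.
    """
    readings = []
    for _ in range(n_samples):
        readings.append(sample())
        time.sleep(interval)  # space the samples out in time
    mean = statistics.fmean(readings)
    # Uniform noise in [-noise_scale, noise_scale] as a simple stand-in.
    return mean + random.uniform(-noise_scale, noise_scale)
```

Averaging many samples and perturbing the result means no single instantaneous reading of your machine is ever published exactly.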
The cpu_tracker API also creates a file on your Datasite, located in the api_data folder, which contains your average CPU usage data for a specific time frame.
SyftBox
├── apis
│ └── cpu_tracker_member
└── datasites
└── YOUR_DATASITE
└── api_data
└── cpu_tracker
└── cpu_tracker.json
The cpu_tracker_member API automatically manages the loading and processing of your data so it can be included in the global aggregation.
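Writing that api_data file could look something like the sketch below. The path and JSON fields are assumptions based on the folder layout shown above; the real API resolves your Datasite root from the client configuration.

```python
import json
import time
from pathlib import Path

# Hypothetical path mirroring the layout above; the real API derives it
# from the SyftBox client config.
API_DATA = Path("SyftBox/datasites/YOUR_DATASITE/api_data/cpu_tracker")

def record_average(avg_cpu: float):
    """Persist the averaged (noised) CPU reading for this time window."""
    API_DATA.mkdir(parents=True, exist_ok=True)
    out = API_DATA / "cpu_tracker.json"
    out.write_text(json.dumps({"cpu_avg": avg_cpu, "timestamp": time.time()}))
    return out
```

Because the file lives under your Datasite, SyftBox handles syncing it; the API itself only needs to keep the file up to date.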
In the next tutorial, we'll dive deeper and learn how to build an API like this from scratch!