
Syncing in SyftBox

Overview

To move file changes between datasites, SyftBox uses a syncing component that uploads changes from one datasite to a caching server and downloads them from that server to another datasite. These changes are subject to permissions, which are checked on both the client and the server.

The main value of the syncing component is that it enables asynchronous collaboration. People can collaborate with each other even if they are not online at the same time, as the caching server stores and forwards changes between datasites.

Components & Flow

From a high level, the clients implement a producer and a consumer within the syncing component:

  1. The producer compares the hashes of local datasite files against the hashes of the files on the server, and if there is a difference, it pushes this change to the consumer.

  2. The consumer compares three hashes:

    • The hash of the local datasite file the last time it was synced
    • The current hash of the local datasite file
    • The current hash of the remote file on the server

Based on this information, the client determines:

  • The location of the change (local vs. remote)
  • The type of modification (create/delete/modify)
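The three-hash comparison can be sketched as follows. This is a minimal illustration, not SyftBox's actual implementation: the names `Side`, `Change`, and `classify` are hypothetical, a hash of `None` is assumed to mean "file does not exist", and raising on a two-sided change is a placeholder because the text above does not describe how conflicts are resolved.

```python
from enum import Enum
from typing import Optional, Tuple


class Side(Enum):
    LOCAL = "local"
    REMOTE = "remote"


class Change(Enum):
    CREATE = "create"
    MODIFY = "modify"
    DELETE = "delete"


def classify(
    last_synced: Optional[str], local: Optional[str], remote: Optional[str]
) -> Optional[Tuple[Side, Change]]:
    """Return (side, change), or None when nothing needs syncing.

    Each argument is a file hash; None means the file is absent there.
    """
    if local == remote:
        return None  # both sides already agree
    local_changed = local != last_synced
    remote_changed = remote != last_synced
    if local_changed and remote_changed:
        # Assumption: conflict handling is not covered here, so just signal it.
        raise ValueError("conflict: both sides changed since the last sync")
    side = Side.LOCAL if local_changed else Side.REMOTE
    new_hash = local if local_changed else remote
    if last_synced is None:
        return side, Change.CREATE  # file did not exist at the last sync
    if new_hash is None:
        return side, Change.DELETE  # file disappeared on the changed side
    return side, Change.MODIFY
```

For instance, `classify("a", "b", "a")` reports a local modification, while `classify("a", "a", None)` reports a remote deletion.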

The consumer then takes action to sync the file. Syncing may involve:

  • Downloading a file
  • Uploading a file
  • Requesting to apply a diff
  • Removing a local file
  • Requesting removal on the server

The logic on the server is very lightweight: it simply checks whether the permissions allow the user to make the change and, if so, applies it.
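The producer step described earlier in this section, comparing local hashes against server hashes, could look roughly like the sketch below. The function names are hypothetical, SHA-256 is an assumption (the hash algorithm is not stated here), and `server_hashes` stands in for whatever the server actually returns.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterator


def file_hash(path: Path) -> str:
    """Hash a file's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_paths(root: Path, server_hashes: Dict[str, str]) -> Iterator[str]:
    """Yield relative paths whose local and server hashes differ."""
    local = {
        p.relative_to(root).as_posix(): file_hash(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    # A path that is missing on either side also counts as a difference.
    for rel in sorted(local.keys() | server_hashes.keys()):
        if local.get(rel) != server_hashes.get(rel):
            yield rel
```

Each yielded path would then be handed to the consumer, which decides what kind of change it is.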

Initial Sync

When you start a new datasite, there might be a lot of files to sync. Therefore, the initial set of files is downloaded in batches, so that the datasite quickly becomes operational with all the necessary files from the caching server.
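As a rough illustration of batching, the helper below splits a list of file paths into fixed-size groups. The name `batched` and the batch size are hypothetical; the actual batch size and download mechanism used by SyftBox are not specified here.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths."""
    it = iter(paths)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        yield batch
```

A client could then request one batch at a time instead of issuing one request per file.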

Monitoring Sync Status

You can check which files have been synced in a local dashboard available at:

http://localhost:<client_port>/sync

You can find the client_port:

  • In your config.json file
  • At the top of the SyftBox client log when you start the client

This dashboard provides a real-time view of the syncing process, allowing you to monitor which files have been synchronized and which are pending.
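If you prefer to build the dashboard URL programmatically, a small helper along these lines may work. It assumes that config.json is a flat JSON object with a top-level `client_port` key; both the key name and the file layout are assumptions, and you have to supply the path to your own config file.

```python
import json
from pathlib import Path


def sync_dashboard_url(config_path: Path) -> str:
    """Build the local sync dashboard URL from a SyftBox config file."""
    config = json.loads(config_path.read_text())
    # Assumption: the port is stored under a top-level "client_port" key.
    port = int(config["client_port"])
    return f"http://localhost:{port}/sync"
```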

Sync Scope and Exclusions

Sync Scope

Only files located under the Syftbox/datasites directory are considered for syncing. Any files or directories outside this path are not included in the synchronization process.

The _.syftignore File

SyftBox provides a way to exclude specific files from syncing using a _.syftignore file, which works similarly to a .gitignore file in Git. This file lives under the Syftbox/datasites directory and allows you to specify patterns for files and directories that should be excluded from synchronization.

Here's an example of a _.syftignore file:

# Syft
/_.syftignore
/.syft*
/apps
/staging
/syft_changelog

# Python
.ipynb_checkpoints/
__pycache__/
*.py[cod]
.venv/

# OS-specific
.DS_Store
Icon

# IDE/Editor-specific
*.swp
*.swo
.vscode/
.idea/
*.iml

# General excludes
*.tmp

# excluded datasites
# example:
# /user_to_exclude@example.com/

You can customize this file to exclude:

  • System files (like .DS_Store)
  • Development environments (like .venv)
  • IDE/editor specific files
  • Temporary files
  • Specific datasites by user email (e.g., /user_to_exclude@example.com/)
  • Any other files or directories you don't want to synchronize
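To get a feel for how such patterns behave, here is a deliberately simplified matcher. It is not SyftBox's matcher and covers only part of the .gitignore rules: a leading `/` anchors the pattern to the datasites directory, a trailing `/` restricts it to directories, and everything else matches any single path component. Negation (`!`), `**`, and unanchored multi-segment patterns are not handled. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def parse_patterns(text: str) -> List[str]:
    """Drop blank lines and comments from ignore-file text."""
    lines = (ln.strip() for ln in text.splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


def is_ignored(rel_path: str, patterns: Iterable[str]) -> bool:
    """Check a path (relative to the datasites directory) against patterns."""
    parts = rel_path.strip("/").split("/")
    for pattern in patterns:
        dir_only = pattern.endswith("/")
        anchored = pattern.startswith("/")
        segments = pattern.strip("/").split("/")
        # Directory-only patterns never match the final (file) component.
        candidates = parts[:-1] if dir_only else parts
        if anchored:
            head = candidates[: len(segments)]
            if len(head) == len(segments) and all(
                fnmatchcase(p, s) for p, s in zip(head, segments)
            ):
                return True
        elif len(segments) == 1 and any(
            fnmatchcase(p, segments[0]) for p in candidates
        ):
            return True
    return False
```

With the patterns `/apps`, `.venv/`, and `*.py[cod]`, the path `apps/tool/run.sh` is ignored, while `alice@example.com/apps/x` is not, because `/apps` only matches at the top level.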

Limitations

The syncing process currently has one limitation: files larger than 10 MB are skipped and will not be synchronized between datasites. We are working to address this limitation.
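To spot files that would be skipped before you rely on them being synced, a check like the one below can help. It is only a sketch: the constant and function names are hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption, since the exact threshold used by SyftBox is not stated here.

```python
from pathlib import Path

# Assumption: "10 MB" is interpreted as 10 MiB.
MAX_SYNC_BYTES = 10 * 1024 * 1024


def is_too_large(path: Path, limit: int = MAX_SYNC_BYTES) -> bool:
    """Return True if the file exceeds the assumed sync size limit."""
    return path.stat().st_size > limit
```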