How to: Use Others' Private Data

This guide will show you how to extract insights from private data while preserving confidentiality. In particular, we show how to discover datasets on the network and compute results without accessing raw data.

Discovering Private Datasets on the Network

Before you can use private data, you first need to discover what's available on the network. SyftBox allows dataset owners to make their private datasets discoverable through metadata descriptors. Look for dataset.yaml files in the public directories of other Datasites:

version: "0.1.0"

datasets:
  - name: "Netflix Data"
    path: "~/SyftBox/datasites/owner@example.org/datasets/netflix/NetflixViewingHistory"
    dataset_loader: "SyftBox/datasites/aggregator@openmined.org/public/data_loader/json_loader.py"
    description: "Primary dataset for user behavior analysis."
    format: "CSV"

These descriptors provide essential information about available datasets (name, description, format) without exposing the actual data. You can browse these descriptors in the public folders of other Datasites to identify datasets that might be valuable for your analysis. Once you've found a dataset of interest, you can proceed to write an App that will work with that specific dataset.
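Browsing the public folders by hand works, but the search can also be scripted. The following is a minimal sketch, assuming that Datasites are synced to a local folder such as ~/SyftBox/datasites (as the path in the descriptor above suggests) and that each descriptor is a file named dataset.yaml somewhere under a Datasite's public folder. The function find_dataset_descriptors is a hypothetical helper, not part of SyftBox.

```python
from pathlib import Path


def find_dataset_descriptors(datasites_root: Path) -> dict[str, list[Path]]:
    """Map each Datasite folder name to the dataset.yaml files under its public folder."""
    found: dict[str, list[Path]] = {}
    if not datasites_root.is_dir():
        return found
    for datasite in sorted(datasites_root.iterdir()):
        public_dir = datasite / "public"
        if not public_dir.is_dir():
            continue
        # Only the public folder is searched; private folders are never touched.
        descriptors = sorted(public_dir.rglob("dataset.yaml"))
        if descriptors:
            found[datasite.name] = descriptors
    return found


if __name__ == "__main__":
    # Assumed location of the local SyftBox folder; adjust to your setup.
    root = Path.home() / "SyftBox" / "datasites"
    for owner, paths in find_dataset_descriptors(root).items():
        for path in paths:
            print(f"{owner}: {path}")
```

Each file found this way can then be opened with any YAML parser to read the name, description, and format fields.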

Workflow

No one has direct access to anyone's private data, so the only way to extract meaningful insights from another Datasite is to submit an App that computes a public result.

note

Datasite owners are in complete control of their private data, so there is no way to access it without their explicit approval.

Write an App

The first step is to write an App that processes private data and outputs the result in a public file.

Here is an example that takes a private incomes.csv file and outputs the result of a computation (the mean) in a public file, income_mean.json:

from syftbox.lib import Client
import pandas as pd
import json


def main():
    client = Client.load()

    # load private data
    incomes_filepath = client.my_datasite / "private" / "incomes.csv"
    df = pd.read_csv(incomes_filepath)

    # compute result
    result = df["annual_income"].mean()

    # write result
    result_filepath = client.my_datasite / "public" / "income_mean.json"
    with open(result_filepath, "w") as f:
        json.dump({"result": result}, f, indent=2)


if __name__ == "__main__":
    main()

The CSV file used for the example:

incomes.csv
name,occupation,annual_income
John Smith,Software Engineer,105000
Sarah Johnson,Teacher,52000
Michael Brown,Doctor,215000
Emily Davis,Retail Sales Associate,28500
James Wilson,Marketing Manager,88000
Jessica Thompson,Nurse,75000
Robert Anderson,Construction Worker,45000
Amanda Martinez,Accountant,72000
David Taylor,Chef,58000
Jennifer Garcia,Lawyer,125000
Thomas Rodriguez,Bus Driver,42000
Lisa Hernandez,Graphic Designer,62000
Daniel Moore,Electrician,68000
Michelle Lewis,Social Worker,52000
Christopher Lee,Financial Analyst,95000
Stephanie Walker,Administrative Assistant,38000
Matthew Hall,Professor,92000
Nicole Allen,Journalist,58000
Andrew Young,Mechanical Engineer,92000
Rebecca King,Dental Hygienist,76000
William Wright,Police Officer,64000
Ashley Scott,Event Planner,55000
Joseph Hill,Pharmacist,128000
Melissa Green,Interior Designer,70000
Brandon Adams,Plumber,56000
Lauren Baker,Human Resources Manager,84000
Kevin Nelson,Architect,88000
Amber Mitchell,Physical Therapist,85000
Justin Phillips,Truck Driver,55000
Rachel Campbell,Real Estate Agent,72000
Mark Carter,Welder,52000
Victoria Parker,Veterinarian,110000
Steven Evans,Insurance Agent,61000
Heather Edwards,Elementary School Principal,98000
Ryan Collins,IT Support Specialist,65000
Samantha Stewart,Flight Attendant,56000
Patrick Sanchez,Paralegal,49000
Brittany Morris,Research Scientist,88000
Gregory Rogers,Auto Mechanic,48000
Megan Reed,Public Relations Specialist,65000
Jonathan Cook,Firefighter,58000
Olivia Morgan,Speech Pathologist,82000
Charles Cooper,Web Developer,80000
Kayla Peterson,Occupational Therapist,84000
Timothy Bailey,Carpenter,52000
Christina Richardson,Bank Teller,34000
Sean Cox,Civil Engineer,94000
Tiffany Howard,Dental Assistant,42000
Keith Ward,Sales Manager,78000
Natalie Torres,Psychologist,92000
Dustin Powell,Landscaper,35000
Erin Butler,Librarian,55000
Kyle Coleman,Electrical Engineer,96000
Alicia Barnes,Customer Service Representative,36000
Todd Jenkins,College Professor,96000
Shannon Perry,Hairstylist,42000
Troy Long,Data Scientist,115000
Kelly Hughes,Physician Assistant,110000
Aaron Price,Line Cook,32000
April Sanders,Marketing Coordinator,55000
Derrick Bennett,Security Guard,38000
Erica Wood,Registered Nurse,82000
Shane Gray,Hotel Manager,68000
Jenna James,Elementary School Teacher,54000
Ian Brooks,Software Developer,98000
Kristin Watson,Office Manager,62000
Corey Hayes,HVAC Technician,58000
Danielle Reynolds,Social Media Manager,60000
Johnny Foster,Personal Trainer,45000
Chelsea Morgan,Pharmacist,126000
Scott Price,Commercial Pilot,140000
Melanie Sullivan,Preschool Teacher,38000
Tyler Ross,Network Administrator,78000
Katrina Ortiz,Occupational Health Nurse,76000
Brett Spencer,Restaurant Manager,52000
Krystal Gardner,Certified Public Accountant,75000
Adam Sullivan,Aerospace Engineer,120000
Vanessa Reed,Legal Assistant,48000
Derek Nguyen,Physical Education Teacher,56000
Brooke Murphy,Human Resources Specialist,65000
Cameron Russell,Fitness Instructor,40000
Candice Bryant,UX Designer,92000
Blake Myers,Financial Advisor,105000
Veronica Butler,Medical Laboratory Technician,52000
Jared Coleman,Bartender,34000
Kimberly Hudson,Product Manager,108000
Trevor Bell,Landscape Architect,72000
Sabrina Dixon,Radiologic Technologist,66000
Joel Freeman,Technical Writer,68000
Kristen Elliott,Healthcare Administrator,88000
Tony Griffin,Warehouse Manager,62000
Regina Diaz,High School Teacher,58000
Lance Simmons,Civil Engineering Technician,58000
Felicia Harrison,Copywriter,60000
Chris Palmer,IT Project Manager,105000
Bethany Andrews,Dental Receptionist,38000
Wesley Graham,Construction Manager,82000
Jacqueline Berry,Social Media Coordinator,48000
Shaun Johnston,Biomedical Engineer,98000

tip

Make sure your App references the private data by the path it has on the other Datasite. If the path or the column names do not match, the code will fail when the other party runs it and no result will be produced.
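Since you cannot see the private file, it can help to dry-run the computation locally on a mock file with the same columns before submitting. The sketch below is one way to do that. It uses the standard-library csv and statistics modules instead of pandas so that it runs without extra dependencies. The function compute_mean_income and the mock rows (the first three rows of the example CSV) are illustrative assumptions, not part of SyftBox.

```python
import csv
import json
import statistics
import tempfile
from pathlib import Path


def compute_mean_income(csv_path: Path, result_path: Path) -> float:
    """Read the annual_income column from csv_path and write its mean to result_path as JSON."""
    with open(csv_path, newline="") as f:
        incomes = [float(row["annual_income"]) for row in csv.DictReader(f)]
    result = statistics.mean(incomes)  # raises StatisticsError if the file has no rows
    result_path.parent.mkdir(parents=True, exist_ok=True)
    with open(result_path, "w") as f:
        json.dump({"result": result}, f, indent=2)
    return result


if __name__ == "__main__":
    # Dry run against a mock file that mirrors the private file's header.
    with tempfile.TemporaryDirectory() as tmp:
        mock_csv = Path(tmp) / "private" / "incomes.csv"
        mock_csv.parent.mkdir(parents=True)
        mock_csv.write_text(
            "name,occupation,annual_income\n"
            "John Smith,Software Engineer,105000\n"
            "Sarah Johnson,Teacher,52000\n"
            "Michael Brown,Doctor,215000\n"
        )
        print(compute_mean_income(mock_csv, Path(tmp) / "public" / "income_mean.json"))
        # prints 124000.0
```

Keeping the computation in a function that takes both paths as arguments means the same logic can be pointed at the mock file locally and at the real paths inside the App.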

Submit the App

Next, submit (send) your App for review. This can be done in multiple ways, including:

compress it in a .zip file and send it
push it to a repository to which the other party has access
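For the first option, the archive can be created with Python's standard library. This is a minimal sketch, assuming the App lives in its own folder; package_app and the folder names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def package_app(app_dir: Path, output_dir: Path) -> Path:
    """Compress app_dir into <output_dir>/<app folder name>.zip and return the archive path."""
    app_dir = app_dir.resolve()
    if not app_dir.is_dir():
        raise NotADirectoryError(f"App folder not found: {app_dir}")
    output_dir = output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)
    archive = shutil.make_archive(
        base_name=str(output_dir / app_dir.name),  # ".zip" is appended automatically
        format="zip",
        root_dir=app_dir.parent,  # paths inside the archive are relative to this folder
        base_dir=app_dir.name,    # so the archive contains "<app folder name>/..."
    )
    return Path(archive)
```

For example, package_app(Path("income_app"), Path("dist")) would produce dist/income_app.zip. Keep the output folder outside the App folder so the archive does not end up inside itself.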

There is not much else to do from this point forward. You just need to wait for the other party to review your App and install it. Take a coffee break! ☕️

Check the results

After you get the green light and the other party installs your App, you should soon be able to see the results in the format specified by your App.

For the example above, you'll see a public file, income_mean.json, appear under the other party's Datasite:

income_mean.json
{
  "result": 72005.0505050505
}
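Checking for the result can be scripted as well. The sketch below assumes that the other party's public folder is synced to your machine under a local datasites folder (for instance ~/SyftBox/datasites/owner@example.org/public) and that the file may not exist yet. The helper read_public_result is hypothetical, not a SyftBox API.

```python
import json
from pathlib import Path
from typing import Optional


def read_public_result(datasites_root: Path, owner: str, filename: str) -> Optional[dict]:
    """Return the parsed JSON result from the owner's public folder, or None if it is not there yet."""
    result_path = datasites_root / owner / "public" / filename
    if not result_path.is_file():
        return None  # the App has not been installed or has not run yet
    with open(result_path) as f:
        return json.load(f)
```

A call such as read_public_result(Path.home() / "SyftBox" / "datasites", "owner@example.org", "income_mean.json") returns None until the file appears, and the parsed dictionary afterwards.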

Conclusion

Extracting information from other Datasites can be achieved by asking other parties to install Apps that access their private data and publish only the computed result.

While the example above is simple, your App can include more steps (accessing external services, requiring additional configuration, etc.), depending on your use case.
