How can we learn high quality models when data is inherently distributed across sites and cannot be shared or pooled? In federated learning, the solution is to iteratively train models locally at each site and share these models with the server to be aggregated to a global model. As only models are shared, data usually remains undisclosed. This process, however, requires sufficient data to be available at each site in order for the locally trained models to achieve a minimum quality – even a single bad model can render aggregation arbitrarily bad.
In healthcare settings, however, we often have as little as a few dozens of samples per hospital. How can we still collaboratively train a model from a federation of hospitals, without infringing on patient privacy?
At this year’s ICLR, my colleagues Jonas Fischer, Jilles Vreeken and me presented an novel building block for federated learning called daisy-chaining. This approach trains models consecutively on local datasets, much like a daisy chain. Daisy-chaining alone, however, violates privacy, since a client can infer from a model upon the data of the client it received it from. Moreover, performing daisy-chaining naively would lead to overfitting which can cause learning to diverge. In our paper “Federated Learning from Small Datasets“, we propose to combine daisy-chaining of local datasets with aggregation of models, both orchestrated by the server, and term this method Federated Daisy-Chaining (FedDC).
This approach allows us to train models successfully from as little as 2 samples per client. Our results on image data (Table 1) show that FedDC not only outperforms standard federated avering (FedAvg), but also state-of-the-art federated learning approaches, achieving a test accuracy close to centralized training.
Discovering causal relationships enables us to build more reliable, robust, and ultimately trustworthy models. It requires large amounts of observational data, though. In healthcare, for most diseases the amount of available data is large, but this data is scattered over thousands of hospitals worldwide. Since this data in most cases mustn’t be pooled for privacy reasons, we need a way to learn a structural causal model in a federated fashion.
At this year’s AISTATS, my co-authors Osman Mian, David Kaltenpoth, Jilles Vreeken and me presented the paper “Nothing but Regrets – Privacy-Preserving Federated Causal Discovery” in which we show that you can discover causal relationships by sharing only regret values with a server: The server sends a candidate causal model to each client and the clients reply with how much worse single-edge extensions of this global model are compared to the original global model. From this information alone, the server can compute the best extension of the current global model.
In practice, the environments at the local clients are not the same. We should expect local differences that could be modeled by interventions into the global causal structure. In our AAAI paper “Information-Theoretic Causal Discovery and Intervention Detection over Multiple Environments” we have shown how to discover a global causal structure as well as local interventions in a centralized setting. Our current goal is to combine these two works to provide an approach to federated causal discovery from heterogeneous environments.
For medical data this has an enormous impact: Being able to reliably detect causal relationships in medical data, such as gene expressions or patient records, allows us not only to build more reliable and trustworthy models, but also to detect novel insights on diseases and risk factors.
Reliably detecting causal relationships requires large amounts of observational data, though. Therefore, it is paramount to develop privacy-preserving methods to tap into the large, but inherently distributed medical datasets in hospitals all over the world. What we need is federated causal discovery.