Most datasets in current genomic research and in quantitative biomedicine can be represented in
matrix form, with entries recording the measured molecular features (typically thousands)
across the available samples. For example, the expression of all genes in the human genome can be
reported for samples extracted from different patients with a specific disease. In this case, the goal
could be to identify experimentally accessible molecular signatures able to discriminate
disease subtypes with different characteristics and prognoses, in order to design ad hoc therapies.
However, the development of this “personalized medicine” program is hindered by the complexity
and the high dimensionality of the system. In fact, the data result from a complex interplay
between the stochastic variability inherent in molecular processes, the experimental variability
(due, for example, to sequencing), and the relevant “signal” we would like to extract and characterize.
Moreover, given the high dimensionality of the problem, typical datasets are under-sampled. This
complex inference problem would be greatly facilitated by a clear understanding of the statistical
properties of these datasets, which is still lacking.
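The under-sampled regime described above can be made concrete with a minimal sketch (the matrix sizes and the lognormal noise model are illustrative assumptions, not project data): when the number of genes far exceeds the number of samples, the sample covariance matrix is necessarily rank-deficient, so most of its eigenvalues carry no information.

```python
import numpy as np

# Hypothetical example: p genes measured in n samples, with p >> n
# (the under-sampled regime typical of genomic datasets).
rng = np.random.default_rng(0)
p_genes, n_samples = 1000, 20
X = rng.lognormal(mean=0.0, sigma=1.0, size=(p_genes, n_samples))

# The p x p sample covariance across genes has rank at most n - 1,
# so the vast majority of its eigenvalues are exactly zero.
cov = np.cov(X)  # rows are genes, so this is p x p
rank = np.linalg.matrix_rank(cov)
print(cov.shape, rank)
```

Any inference method applied to such a matrix must therefore separate the few informative directions from a large degenerate subspace, which is one way to state the challenge of high dimensionality.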
The goal of this project is precisely to develop data analysis techniques using statistical physics
concepts and tools to shed new light on this timely problem. In fact, the characterization of scaling
laws, the identification of universal system properties and the design of analytical null models are
examples of methods well developed in the physics of complex and disordered systems that could
guide the discovery of new inference methods in genomics. The results of this project will be crucial
to boost the collaborations that the Department of Physics has with several experimental and clinical
research groups in Torino (e.g., with the Department of Oncology and with the Italian Institute for
Genomic Medicine), whose experimental data present precisely these inference challenges.
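As a toy illustration of the kind of analytical null model mentioned above (not the project's actual method; sizes and the planted signal are hypothetical), one can compare the eigenvalue spectrum of an empirical correlation matrix against the Marchenko–Pastur bulk expected for pure independent noise: eigenvalues above the upper edge are candidate "signal" components.

```python
import numpy as np

# Synthetic data: independent noise plus one planted common mode.
rng = np.random.default_rng(1)
n_samples, p_features = 200, 50
noise = rng.standard_normal((n_samples, p_features))
common_mode = np.outer(rng.standard_normal(n_samples), np.ones(p_features))
data = noise + 0.5 * common_mode

# Spectrum of the empirical p x p correlation matrix.
corr = np.corrcoef(data, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)

# Marchenko-Pastur upper edge for pure-noise correlations,
# with aspect ratio q = p / n.
q = p_features / n_samples
lambda_max = (1 + np.sqrt(q)) ** 2

# Eigenvalues beyond the null-model bulk flag putative signal.
n_signal = int(np.sum(eigvals > lambda_max))
print(f"eigenvalues above the MP edge: {n_signal}")
```

Here the planted common mode produces at least one eigenvalue well above the noise bulk, illustrating how a random-matrix null model can separate structure from sampling noise.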