– Machine learning of large graphs based on tensor networks

01/11/2021 – 02/11/2021 all-day

Offre en lien avec l’Action/le Réseau : – — –/– — –

Laboratoire/Entreprise : CRIStAL/DATING/SigMA team
Durée : 3 ans
Contact : remy.boyer@univ-lille.fr
Date limite de publication : 2021-11-01

Contexte :
Thesis context
The recruited student will be integrated into the SigMA team within the DATING WG located in the CRIStAL laboratory. It will be able to benefit from the current dynamic activities related to AI and in particular CPER CORNELIA. The recruited student could come and strengthen the existing team composed of Jérémie Boulanger (MdC UdL) and Ouafae Karmouda (2nd year PhD student in AI). This funding request would be consistent with the Franco-German ANR project involving Martin Haardt from the University of Ilemenau. This project is in its submission phase and is scientifically centered around distributed and multi-modal AI. In addition, this thesis work could count on 2 additional partners: Laurent Albera from the SESAME team (Epileptogenic Systems: SignAux and Models) from the LTCI laboratory at the University of Rennes and Arthur Tenenhaus from the Institut du Cerveau et de la Moelle Epinière (ICM). Rémy Boyer has collaborated with the SESAME team and the ICM as part of the thesis of a student (Abdelhak Boudhenane, University of Paris-Saclay) to defend his thesis in November 2021.

Sujet :
Graphs are commonly used to describe the knowledge we have on given phenomena. Social networks can be modeled in the form of graphs whose nodes are users and the edges are the relationships between individuals. Biologists use graphs to describe protein interactions, while communication networks are themselves graphs. The interest of machine learning for graphs has recently become a strategic issue. The objective can be, for example, the prediction of new affinities in social networks, etc. In order to handle and perform operations on graphs, it is important to have an adequate associated mathematical representation (“embedding” in English) [1,2]. The main challenge in this area is to represent the structure of a graph in order to be easily exploited by machine learning methods.
Until now, it is therefore common to associate a graph with a vector or a set of vectors, that is to say, in the form of a matrix. For example, Word2vec is an embedding method which associates a vector with each word. Similar words must have close vector embeddings in the sense of a metric. This embedding step should capture the topology of the graph, the vertex-to-vertex relationships, and any other relevant and available information about the graph, subgraphs, and vertices. Here, we list two of the properties that an embedding method should have:
– Fidelity to topology. The embedding method must describe as closely as possible the topology of the graph (connections and neighborhood). The learning performance will of course be strongly dependent on the choice of the embedding.

– Efficiency on large graphs. Graphs are generally large. For example, we must imagine social networks composed of millions/billions users or the multitude of objects communicating in a home internet network (IoT). Each elementary entity is a node and the edges indicate a knowledge shared between nodes.

The greater the quantity of information captured by the embedding method, the more efficient the learning step will be, but the greater the resources in terms of storage and calculation will be necessary. There is a natural increase in the dimensionality of the embedding method. And in this case, the quantity of information then grows exponentially with the dimension. This trend is known as the “curse of dimensionality”. The methods of the state of the art generally learn representations of explicit nodes from the spectral decomposition of the adjacency matrix or the Laplacian matrix [3] constructed as the difference between the matrix of degrees (a diagonal matrix which contains information on the number of edges attached to each node) and the ajdacency matrix (indicating whether or not nodes are adjacent in the graph). In this thesis work, we want to generalize these approaches to a multi-view graph conext. By multi-views, we understand the broadening of the information base on which we build the embedding [4]. More precisely, it is a question of studying the learning of graphs based on tensor embeddings and the decompositions which are attached to it [5]. In the latter case, we will call on the theory of tensor networks [6,7], which consists of factorizing a tensor of high order into a collection of coupled tensors of order at most 3. The number of edges indicates the order/dimension.

By resorting to tensor networks, the initial massive problem is replaced by a collection of distributed problems involving a reduced amount of data. These approaches, in addition to the reduction of computational and storage complexities, make it possible to gain in interpretability of raw data. The challenge here is to develop this new tensor tool in the context of learning large graphs (large number of nodes). The applications of this work are multi-fold, such as for example machine learning for IoT, 5G telecoms and beyond, social networks or even health (see [8] dealing with the COVID-19 pandemic), …

[1] Cai, H., Zheng, V. W., & Chang, K. C. C. (2018). A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9), 1616-1637.
[2] Yan, S., Xu, D., Zhang, B., Zhang, H. J., Yang, Q., & Lin, S. (2006). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE transactions on pattern analysis and machine intelligence, 29(1), 40-51.
[3] Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems (pp. 585-591).
[4] Al-Sayouri, S., Gujral, E., Koutra, D. et al. (2020) t-PINE: tensor-based predictable and interpretable node embeddings. Soc. Netw. Anal. Min. 10, 46. https://doi.org/10.1007/s13278-020-00649-4
[5] Kolda, T. G., Bader, B. W. (2009). Tensor decompositions and applications. SIAM review, 51(3), 455-500.
[6] Cichocki, A., Lee, N., Oseledets, I., Phan, A. H., Zhao, Q., Mandic, D. P. (2016). Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5), 249-429.
[7] Zniyed, Y., Boyer, R., de Almeida, A. L., & Favier, G. (2020). High-order tensor estimation via trains of coupled third-order CP and Tucker decompositions. Linear Algebra and its Applications, 588, 304-337
[8] Kanatsoulis, C. I., Sidiropoulos, N. D. (2020). TeX-Graph: Coupled tensor-matrix knowledge-graph embedding for COVID-19 drug repurposing. arXiv preprint arXiv:2010.11367.

Profil du candidat :
Candidate requirements
The recruited student must have a solid theoretical background in linear algebra and statistics. Training geared towards machine learning will be a definite plus. The work envisaged is of a methodological and prospective nature. There are many possible fields of application such as for example IoT, 5G telecoms and beyond, social networks, health … In order to validate and apply the developed methods, it is necessary to master tools programming such as MatLab or Python, for example.

Formation et compétences requises :
The first 6 months will be dedicated to the acquisition of fundamental mathematical notions for this work. It is about graph theory and the representation of knowledge on graphs. The concept of multilinearity in algebra, including classical tensor factorizations but also more advanced tensor networks should be studied in detail. The following year and a half will be devoted to proposing new tensor representations for a multi-view approach to graphs. Then, a reflection should be carried out in order to determine the ad-hoc topology for the tensor network to be used in the factorization phase and to cope with a distributed architecture for the estimator. The last year will be devoted to the application of the developed algorithms on real data, particularly biomedical data. This last year will also be the occasion for the scientific promotion of the results in the form of a thesis manuscript and publications in the best journals and conferences in the field.

Adresse d’emploi :
Univ Lille and Federal University of Ceará