Sybrandt/AGATHA_2015

machine learning graph
Name AGATHA_2015
Group Sybrandt
Matrix ID 2893
Num Rows 183,964,077
Num Cols 183,964,077
Nonzeros 11,588,725,964
Pattern Entries 11,588,725,964
Kind Undirected Graph
Symmetric Yes
Date 2020
Author J. Sybrandt, I. Tyagin, M. Shtutman, I. Safro
Editor T. Davis
Structural Rank
Structural Rank Full
Num Dmperm Blocks
Strongly Connected Components
Num Explicit Zeros 0
Pattern Symmetry 100%
Numeric Symmetry 100%
Cholesky Candidate no
Positive Definite no
Type binary
Download MATLAB, Rutherford Boeing, Matrix Market
Notes
Sybrandt/AGATHA_2015: deep-learning graph                              
                                                                       
AGATHA: Automatic Graph-mining And Transformer based Hypothesis        
Generation Approach                                                    
                                                                       
Justin Sybrandt, Ilya Tyagin, Michael Shtutman, Ilya Safro             
Clemson University                                                     
paper: https://arxiv.org/abs/2002.05635
                                                                       
abstract: Medical research is risky and expensive. Drug discovery, as  
an example, requires that researchers efficiently winnow thousands of  
potential targets to a small candidate set for more thorough           
evaluation. However, research groups spend significant time and money  
to perform the experiments necessary to determine this candidate set   
long before seeing intermediate results. Hypothesis generation systems 
address this challenge by mining the wealth of publicly available      
scientific information to predict plausible research directions. We    
present AGATHA, a deep-learning hypothesis generation system that can  
introduce data-driven insights earlier in the discovery process.       
Through a learned ranking criteria, this system quickly prioritizes    
plausible term-pairs among entity sets, allowing us to recommend new   
research directions. We massively validate our system with a temporal  
holdout wherein we predict connections first introduced after 2015     
using data published beforehand. We additionally explore biomedical    
sub-domains, and demonstrate AGATHA's predictive capacity across the   
twenty most popular relationship types. This system achieves           
best-in-class performance on an established benchmark, and demonstrates
high recommendation scores across subdomains. Reproducibility: All     
code, experimental data, and pre-trained models are available online:  
http://sybrandt.com/2020/agatha .                                      
                                                                       
Appears in the 29th ACM Intl. Conf. on Information and Knowledge
Management (CIKM), Oct 2020.  https://www.cikm2020.org/
                                                                       
Details of this matrix:                                                
                                                                       
This matrix represents the semantic graph associated with the AGATHA
hypothesis generation system: https://arxiv.org/abs/2002.05635
                                                                       
The matrix was created by selecting all biomedical papers published
prior to January 1st, 2015. For each paper, we extracted terms,
phrases, entities, and author-supplied metadata keywords. In addition,
we identified all SemRep predicate arguments. Each sentence, term,
phrase, entity, and predicate is represented by a node. Edges follow a
particular schema, described in detail in the paper. Most edges
indicate that two nodes co-occur. For instance, if a sentence contains
a term, then an edge exists between the two nodes. We also introduced
a set of edges between sentences based on the nearest-neighbors
network of sentence embeddings.
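
The co-occurrence idea can be illustrated with a minimal sketch. The
five nodes and their edges below are hypothetical and are not taken
from the AGATHA data; the actual edge schema is the one described in
the paper. The sketch only shows how such edges give a symmetric
binary matrix of the kind stored here.

```matlab
% Toy sketch with hypothetical data: nodes 1-2 are sentences and
% nodes 3-5 are terms.
n = 5 ;
% Each row is one co-occurrence edge (sentence, term).
edges = [1 3 ; 1 4 ; 2 4 ; 2 5] ;
% Assumed nearest-neighbor edge between the two sentences.
edges = [edges ; 1 2] ;
% Build a sparse matrix with a 1 for each listed edge, then mirror it
% so that the graph is undirected (pattern and values both symmetric).
A = sparse (edges (:,1), edges (:,2), 1, n, n) ;
A = spones (A + A') ;
fprintf ('nodes: %d, nonzeros: %d\n', n, nnz (A)) ;
```

Each undirected edge in this toy matrix contributes two nonzeros, one
on each side of the diagonal.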
                                                                       
The graph is very large, consisting of 183,964,077 nodes and
11,588,725,964 edges.  Each node has a label, consisting of a
character string, with a length ranging from 0 to 782 characters.
The mean string length is 17.4 characters.  A single node
(171,649,101 in 1-based notation) has a label of length zero, which
is an artifact of how the original data was processed.  All other
node labels range in length from 3 to 782 characters.  The longest
string comes from the following paper:
https://pubmed.ncbi.nlm.nih.gov/183954/ which states: "The primary     
structure of the enzyme was determined: Ac-Met-Glu-...--Ala-Leu-Lys."  
                                                                       
To save space in the MATLAB representation, the node labels are held
in a single character array, Problem.aux.names, where each label is
terminated with a newline character (ASCII code 10). To extract the
label of node i (1-based), do the following:

   names = Problem.aux.names ;
   % p(i) is the position of the first character of label i
   p = [1 find(names==10)+1] ;
   % label i ends at p(i+1)-2, just before its terminating newline
   label = names (p(i):p(i+1)-2) ;
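
The indexing can be checked on a small array before loading the full
problem. The sketch below packs three hypothetical labels (one of them
empty, like the single empty label in this matrix) in the same
newline-terminated layout and unpacks them again. The variable
`labels` is introduced only for illustration.

```matlab
% Hypothetical labels packed the way Problem.aux.names is described:
% one row of characters, each label terminated by a newline (char 10).
nl = char (10) ;
names = ['heart' nl nl 'p53 protein' nl] ;  % node 2 has an empty label
% p(i) is the start of label i; p(end) is one past the last newline.
p = [1 find(names==10)+1] ;
n = numel (p) - 1 ;                         % number of labels
labels = cell (1, n) ;
for i = 1:n
    % Skip the terminating newline at position p(i+1)-1.
    labels {i} = names (p(i):p(i+1)-2) ;
end
disp (labels) ;
```

For the empty label the index range p(i):p(i+1)-2 is empty, so the
extracted string has length zero, with no special case needed.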
                                                                       
For example, to list all nodes with labels of length 0 to 4:

   names = Problem.aux.names ;
   p = [1 find(names==10)+1] ;
   d = diff (p) ;      % d(i) is the length of label i plus its newline
   for len = 0:4
       fprintf ('\nnodes with labels of length %d:\n', len) ;
       nodes = find (d == len+1) ;
       for i = nodes
           fprintf ('%12d: [%s]\n', i, names (p(i):p(i+1)-2)) ;
       end
   end
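
The same offset vector also gives the label-length statistics quoted
above (minimum, maximum, and mean). The following is a minimal sketch
on hypothetical labels; for the real data, `names` would be
Problem.aux.names instead of the toy array built here.

```matlab
% Hypothetical labels in the newline-terminated layout.
nl = char (10) ;
names = ['heart' nl nl 'p53 protein' nl 'tp53' nl] ;
p = [1 find(names==10)+1] ;
% Label lengths, excluding the terminating newline of each label.
len = diff (p) - 1 ;
fprintf ('min %d, max %d, mean %.2f\n', min (len), max (len), mean (len)) ;
% Node with the longest label, and the label itself.
[~, longest] = max (len) ;
longest_label = names (p(longest):p(longest+1)-2) ;
fprintf ('longest label is node %d: [%s]\n', longest, longest_label) ;
```

On the toy array the lengths are 5, 0, 11, and 4, so node 3 holds the
longest label.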