GAP/GAP-twitter

GAP benchmark: twitter
Name GAP-twitter
Group GAP
Matrix ID 2852
Num Rows 61,578,415
Num Cols 61,578,415
Nonzeros 1,468,364,884
Pattern Entries 1,468,364,884
Kind Directed Weighted Graph
Symmetric No
Date 2017
Author H. Kwak, C. Lee, H. Park, S. Moon
Editor S. Beamer, K. Asanovic, D. Patterson
Structural Rank
Structural Rank Full
Num Dmperm Blocks
Strongly Connect Components 27,970,913
Num Explicit Zeros 0
Pattern Symmetry 36.2%
Numeric Symmetry 0.1%
Cholesky Candidate no
Positive Definite no
Type integer
Download MATLAB Rutherford Boeing Matrix Market
Notes
GAP/GAP-twitter matrix:  this is the 'twitter' graph from the GAP    
benchmark suite, as described in "The GAP Benchmark Suite", by Scott 
Beamer, Krste Asanovic', and David Patterson,                        
https://arxiv.org/abs/1508.03619 .                                   
                                                                     
In that paper, the Twitter graph is described as follows:            
                                                                     
    Twitter (|V|=61.6M, |E|=1,468.4M, directed) is an example of a   
    social network topology [18]. This particular crawl of Twitter   
    has been commonly used by researchers and thus eases comparisons 
    with prior work. By virtue of it coming from real-world data, it 
    has interesting irregularities and the skew in its degree        
    distribution can be a challenge for some implementations.        
                                                                     
    [18] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.     
    What is Twitter, a social network or a news media? International 
    World Wide Web Conference (WWW), 2010.                           
                                                                     
The graph already appears in the SuiteSparse Matrix Collection as    
SNAP/twitter7, but with different numbers of nodes and edges.  The   
GAP/GAP-twitter graph has 61,578,415 and 1,468,364,884 edges.        
The SNAP/twitter7 graph has 41,652,230 nodes and 1,468,365,182 edges.
The edge counts differ by 298, which are 298 self-edges in           
SNAP/twitter7 that do not appear in the GAP-twitter graph.           
                                                                     
The node counts differ because of how nodes with no incident         
edges are treated.  Every node in SNAP/twitter is incident on        
at least one edge (incoming or outgoing).  The GAP/GAP-twitter7      
matrix has 19,926,185 nodes that have no incident edges at all       
(no incoming nor outgoing edges, nor self edges).                    
                                                                     
The original data at http://an.kaist.ac.kr/traces/WWW2010.html       
lists edges with node ids that correspond to active user ids.        
User ids that did not exist at the time the data was collected,      
do not appear in the edge lists.  These are the nodes that appear    
in GAP-twitter but not in SNAP/twitter7.                             
                                                                     
The other difference between GAP-twitter and SNAP/twitter7           
is that the former has been given random integer edge weights (in    
range 1 to 255), whereas SNAP/twitter7 is a binary matrix.           
                                                                     
The following shows the relationship between the two versions of     
this graph in this collection:                                       
                                                                     
    Prob1 = ssget ('SNAP/twitter7')                                  
    Prob2 = ssget ('GAP/GAP-twitter')                                
    A1 = GrB.offdiag (Prob1.A) ;    % remove 298 diagonal entries    
    id = Prob1.aux.nodeid + 1 ;                                      
    A2 = Prob2.A ;                                                   
    S2 = spones (A2 (id, id)) ;     % pattern of submatrix of A2     
    assert (nnz (S2) == nnz (A2))   % S2 has all entries from A2     
    assert (isequal (A1, S2)) ;     % now they are equal             
                                                                     
The GAP breadth-first-search (BFS) benchmark generates 64 random     
source nodes and evaluates the time to compute the BFS from each of  
the 64 sources.  The betweenness-centrality (BC) runs 16 trials with 
4 source nodes each.  These source nodes are the same, in the same   
order.  That is, the first 4 BFS source nodes are the same 4 source  
nodes for the first trial of BC.  In one-based notation (where nodes 
are numbered 1 to n), the 64 source nodes are:                       
                                                                     
    12441073    54488258    25451916    57714474                     
    14839495    32081105    52957358    50444381                     
    49590702    20127817    34939334    48251002                     
    19524254    43676727    33055509    15244688                     
    24946739     6479473    26077683    22023876                     
    22081916    40034163    49496015    42847508                     
    52409558    55445389    22028098    48766649                     
    44521242    60135543    28528672     9678013                     
    40020307    31625736    37446893    51788953                     
    52584256    20346697    48387910    37337428                     
    50501085    30130062    41185894    56495704                     
    45663306    33359461    48143059    33291514                     
    53461446    29340611    34148499    49171807                     
    35550697    14521508    51633219    46823383                     
    19396274    19871751    36862678    49539127                     
    34016453    36567396    55487794    14391371                     
                                                                     
The first row are the 4 source nodes for the first BC trial,         
the 2nd row is for the second BC trial, and so on.                   
                                                                     
These node ids also appear as Problem.aux.sources, in one-based      
notation.  One-based notation is used because the Matrix Market      
format is one-based, as is MATLAB.  If you use a zero-based method   
(such as the GAP benchmark itself, or the C API to GraphBLAS), be    
sure to subtract one from the node ids above (and in                 
Problem.aux.sources) to obtain the right source nodes.               
                                                                     
The original twitter problem was currated by                         
Haewoon Kwak (http://an.kaist.ac.kr/~haewoon),                       
Changhyun Lee (http://an.kaist.ac.kr/~chlee),                        
Hosung Park (http://an.kaist.ac.kr/~hosung),                         
and Sue Moon (http://an.kaist.ac.kr/~sbmoon).                        
                                                                     
It was added to the GAP benchmark by Scott Beamer, Krste Asanovic',  
and David Patternson based on the original data at:                  
http://an.kaist.ac.kr/~haewoon/release/twitter_social_graph/         
twitter_rv.tar.gz.  Random integer edge weights (1 to 255) were      
then added to the graph in the GAP benchmark.  It was then included  
into the SuiteSparse Matrix Collection, from the GAP benchmark       
graph, by Tim Davis.                                                 
                                                                     
Note that the SNAP/twitter7 problem in the SuiteSparse Matrix        
Collection preserves all the meta-data, including information        
about 6,499 celebrities in the graph.  This metadata does not        
appear in this GAP/GAP-twitter version of the graph.