GAP/GAP-twitter

GAP benchmark: twitter

Name	GAP-twitter
Group	GAP
Matrix ID	2852
Num Rows	61,578,415
Num Cols	61,578,415
Nonzeros	1,468,364,884
Pattern Entries	1,468,364,884
Kind	Directed Weighted Graph
Symmetric	No
Date	2017
Author	H. Kwak, C. Lee, H. Park, S. Moon
Editor	S. Beamer, K. Asanovic, D. Patterson

Structural Rank
Structural Rank Full
Num Dmperm Blocks
Strongly Connect Components	27,970,913
Num Explicit Zeros	0
Pattern Symmetry	36.2%
Numeric Symmetry	0.1%
Cholesky Candidate	no
Positive Definite	no
Type	integer

Download	MATLAB Rutherford Boeing Matrix Market
Notes	GAP/GAP-twitter matrix: this is the 'twitter' graph from the GAP benchmark suite, as described in "The GAP Benchmark Suite", by Scott Beamer, Krste Asanovic', and David Patterson, https://arxiv.org/abs/1508.03619 . In that paper, the Twitter graph is described as follows: Twitter (\|V\|=61.6M, \|E\|=1,468.4M, directed) is an example of a social network topology [18]. This particular crawl of Twitter has been commonly used by researchers and thus eases comparisons with prior work. By virtue of it coming from real-world data, it has interesting irregularities and the skew in its degree distribution can be a challenge for some implementations. [18] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? International World Wide Web Conference (WWW), 2010. The graph already appears in the SuiteSparse Matrix Collection as SNAP/twitter7, but with different numbers of nodes and edges. The GAP/GAP-twitter graph has 61,578,415 and 1,468,364,884 edges. The SNAP/twitter7 graph has 41,652,230 nodes and 1,468,365,182 edges. The edge counts differ by 298, which are 298 self-edges in SNAP/twitter7 that do not appear in the GAP-twitter graph. The node counts differ because of how nodes with no incident edges are treated. Every node in SNAP/twitter is incident on at least one edge (incoming or outgoing). The GAP/GAP-twitter7 matrix has 19,926,185 nodes that have no incident edges at all (no incoming nor outgoing edges, nor self edges). The original data at http://an.kaist.ac.kr/traces/WWW2010.html lists edges with node ids that correspond to active user ids. User ids that did not exist at the time the data was collected, do not appear in the edge lists. These are the nodes that appear in GAP-twitter but not in SNAP/twitter7. The other difference between GAP-twitter and SNAP/twitter7 is that the former has been given random integer edge weights (in range 1 to 255), whereas SNAP/twitter7 is a binary matrix. The following shows the relationship between the two versions of this graph in this collection: Prob1 = ssget ('SNAP/twitter7') Prob2 = ssget ('GAP/GAP-twitter') A1 = GrB.offdiag (Prob1.A) ; % remove 298 diagonal entries id = Prob1.aux.nodeid + 1 ; A2 = Prob2.A ; S2 = spones (A2 (id, id)) ; % pattern of submatrix of A2 assert (nnz (S2) == nnz (A2)) % S2 has all entries from A2 assert (isequal (A1, S2)) ; % now they are equal The GAP breadth-first-search (BFS) benchmark generates 64 random source nodes and evaluates the time to compute the BFS from each of the 64 sources. The betweenness-centrality (BC) runs 16 trials with 4 source nodes each. These source nodes are the same, in the same order. That is, the first 4 BFS source nodes are the same 4 source nodes for the first trial of BC. In one-based notation (where nodes are numbered 1 to n), the 64 source nodes are: 12441073 54488258 25451916 57714474 14839495 32081105 52957358 50444381 49590702 20127817 34939334 48251002 19524254 43676727 33055509 15244688 24946739 6479473 26077683 22023876 22081916 40034163 49496015 42847508 52409558 55445389 22028098 48766649 44521242 60135543 28528672 9678013 40020307 31625736 37446893 51788953 52584256 20346697 48387910 37337428 50501085 30130062 41185894 56495704 45663306 33359461 48143059 33291514 53461446 29340611 34148499 49171807 35550697 14521508 51633219 46823383 19396274 19871751 36862678 49539127 34016453 36567396 55487794 14391371 The first row are the 4 source nodes for the first BC trial, the 2nd row is for the second BC trial, and so on. These node ids also appear as Problem.aux.sources, in one-based notation. One-based notation is used because the Matrix Market format is one-based, as is MATLAB. If you use a zero-based method (such as the GAP benchmark itself, or the C API to GraphBLAS), be sure to subtract one from the node ids above (and in Problem.aux.sources) to obtain the right source nodes. The original twitter problem was currated by Haewoon Kwak (http://an.kaist.ac.kr/~haewoon), Changhyun Lee (http://an.kaist.ac.kr/~chlee), Hosung Park (http://an.kaist.ac.kr/~hosung), and Sue Moon (http://an.kaist.ac.kr/~sbmoon). It was added to the GAP benchmark by Scott Beamer, Krste Asanovic', and David Patternson based on the original data at: http://an.kaist.ac.kr/~haewoon/release/twitter_social_graph/ twitter_rv.tar.gz. Random integer edge weights (1 to 255) were then added to the graph in the GAP benchmark. It was then included into the SuiteSparse Matrix Collection, from the GAP benchmark graph, by Tim Davis. Note that the SNAP/twitter7 problem in the SuiteSparse Matrix Collection preserves all the meta-data, including information about 6,499 celebrities in the graph. This metadata does not appear in this GAP/GAP-twitter version of the graph.

Download

MATLAB Rutherford Boeing Matrix Market

Notes

GAP/GAP-twitter matrix:  this is the 'twitter' graph from the GAP    
benchmark suite, as described in "The GAP Benchmark Suite", by Scott 
Beamer, Krste Asanovic', and David Patterson,                        
https://arxiv.org/abs/1508.03619 .                                   
                                                                     
In that paper, the Twitter graph is described as follows:            
                                                                     
    Twitter (|V|=61.6M, |E|=1,468.4M, directed) is an example of a   
    social network topology [18]. This particular crawl of Twitter   
    has been commonly used by researchers and thus eases comparisons 
    with prior work. By virtue of it coming from real-world data, it 
    has interesting irregularities and the skew in its degree        
    distribution can be a challenge for some implementations.        
                                                                     
    [18] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.     
    What is Twitter, a social network or a news media? International 
    World Wide Web Conference (WWW), 2010.                           
                                                                     
The graph already appears in the SuiteSparse Matrix Collection as    
SNAP/twitter7, but with different numbers of nodes and edges.  The   
GAP/GAP-twitter graph has 61,578,415 and 1,468,364,884 edges.        
The SNAP/twitter7 graph has 41,652,230 nodes and 1,468,365,182 edges.
The edge counts differ by 298, which are 298 self-edges in           
SNAP/twitter7 that do not appear in the GAP-twitter graph.           
                                                                     
The node counts differ because of how nodes with no incident         
edges are treated.  Every node in SNAP/twitter is incident on        
at least one edge (incoming or outgoing).  The GAP/GAP-twitter7      
matrix has 19,926,185 nodes that have no incident edges at all       
(no incoming nor outgoing edges, nor self edges).                    
                                                                     
The original data at http://an.kaist.ac.kr/traces/WWW2010.html       
lists edges with node ids that correspond to active user ids.        
User ids that did not exist at the time the data was collected,      
do not appear in the edge lists.  These are the nodes that appear    
in GAP-twitter but not in SNAP/twitter7.                             
                                                                     
The other difference between GAP-twitter and SNAP/twitter7           
is that the former has been given random integer edge weights (in    
range 1 to 255), whereas SNAP/twitter7 is a binary matrix.           
                                                                     
The following shows the relationship between the two versions of     
this graph in this collection:                                       
                                                                     
    Prob1 = ssget ('SNAP/twitter7')                                  
    Prob2 = ssget ('GAP/GAP-twitter')                                
    A1 = GrB.offdiag (Prob1.A) ;    % remove 298 diagonal entries    
    id = Prob1.aux.nodeid + 1 ;                                      
    A2 = Prob2.A ;                                                   
    S2 = spones (A2 (id, id)) ;     % pattern of submatrix of A2     
    assert (nnz (S2) == nnz (A2))   % S2 has all entries from A2     
    assert (isequal (A1, S2)) ;     % now they are equal             
                                                                     
The GAP breadth-first-search (BFS) benchmark generates 64 random     
source nodes and evaluates the time to compute the BFS from each of  
the 64 sources.  The betweenness-centrality (BC) runs 16 trials with 
4 source nodes each.  These source nodes are the same, in the same   
order.  That is, the first 4 BFS source nodes are the same 4 source  
nodes for the first trial of BC.  In one-based notation (where nodes 
are numbered 1 to n), the 64 source nodes are:                       
                                                                     
    12441073    54488258    25451916    57714474                     
    14839495    32081105    52957358    50444381                     
    49590702    20127817    34939334    48251002                     
    19524254    43676727    33055509    15244688                     
    24946739     6479473    26077683    22023876                     
    22081916    40034163    49496015    42847508                     
    52409558    55445389    22028098    48766649                     
    44521242    60135543    28528672     9678013                     
    40020307    31625736    37446893    51788953                     
    52584256    20346697    48387910    37337428                     
    50501085    30130062    41185894    56495704                     
    45663306    33359461    48143059    33291514                     
    53461446    29340611    34148499    49171807                     
    35550697    14521508    51633219    46823383                     
    19396274    19871751    36862678    49539127                     
    34016453    36567396    55487794    14391371                     
                                                                     
The first row are the 4 source nodes for the first BC trial,         
the 2nd row is for the second BC trial, and so on.                   
                                                                     
These node ids also appear as Problem.aux.sources, in one-based      
notation.  One-based notation is used because the Matrix Market      
format is one-based, as is MATLAB.  If you use a zero-based method   
(such as the GAP benchmark itself, or the C API to GraphBLAS), be    
sure to subtract one from the node ids above (and in                 
Problem.aux.sources) to obtain the right source nodes.               
                                                                     
The original twitter problem was currated by                         
Haewoon Kwak (http://an.kaist.ac.kr/~haewoon),                       
Changhyun Lee (http://an.kaist.ac.kr/~chlee),                        
Hosung Park (http://an.kaist.ac.kr/~hosung),                         
and Sue Moon (http://an.kaist.ac.kr/~sbmoon).                        
                                                                     
It was added to the GAP benchmark by Scott Beamer, Krste Asanovic',  
and David Patternson based on the original data at:                  
http://an.kaist.ac.kr/~haewoon/release/twitter_social_graph/         
twitter_rv.tar.gz.  Random integer edge weights (1 to 255) were      
then added to the graph in the GAP benchmark.  It was then included  
into the SuiteSparse Matrix Collection, from the GAP benchmark       
graph, by Tim Davis.                                                 
                                                                     
Note that the SNAP/twitter7 problem in the SuiteSparse Matrix        
Collection preserves all the meta-data, including information        
about 6,499 celebrities in the graph.  This metadata does not        
appear in this GAP/GAP-twitter version of the graph.