SNAP/com-DBLP

SNAP network: DBLP collaboration network and ground-truth communities

Name	com-DBLP
Group	SNAP
Matrix ID	2779
Num Rows	317,080
Num Cols	317,080
Nonzeros	2,099,732
Pattern Entries	2,099,732
Kind	Undirected Graph With Communities
Symmetric	Yes
Date	2012
Author	J. Yang, J. Leskovec
Editor	J. Leskovec

Structural Rank
Structural Rank Full
Num Dmperm Blocks
Strongly Connect Components	1
Num Explicit Zeros	0
Pattern Symmetry	100%
Numeric Symmetry	100%
Cholesky Candidate	no
Positive Definite	no
Type	binary

Download	MATLAB Rutherford Boeing Matrix Market
Notes	SNAP (Stanford Network Analysis Platform) Large Network Dataset Collection, Jure Leskovec and Anrej Krevl, http://snap.stanford.edu/data, June 2014. email: jure at cs.stanford.edu DBLP collaboration network and ground-truth communities https://snap.stanford.edu/data/com-DBLP.html Dataset information The DBLP (http://dblp.uni-trier.de/) computer science bibliography provides a comprehensive list of research papers in computer science. We construct a co-authorship network where two authors are connected if they publish at least one paper together. Publication venue, e.g, journal or conference, defines an individual ground-truth community; authors who published to a certain journal or conference form a community. We regard each connected component in a group as a separate ground-truth community. We remove the ground-truth communities which have less than 3 nodes. We also provide the top 5,000 communities with highest quality which are described in our paper (http://arxiv.org/abs/1205.6233). As for the network, we provide the largest connected component. Dataset statistics Nodes 317080 Edges 1049866 Nodes in largest WCC 317080 (1.000) Edges in largest WCC 1049866 (1.000) Nodes in largest SCC 317080 (1.000) Edges in largest SCC 1049866 (1.000) Average clustering coefficient 0.6324 Number of triangles 2224385 Fraction of closed triangles 0.1283 Diameter (longest shortest path) 21 90-percentile effective diameter 8 Source (citation) J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233 Files File Description com-dblp.ungraph.txt.gz Undirected DBLP co-authorship network com-dblp.all.cmty.txt.gz DBLP communities com-dblp.top5000.cmty.txt.gz DBLP communities (Top 5,000) --------------------------------------------------------------------------- Notes on inclusion into the SuiteSparse Matrix Collection, July 2018: --------------------------------------------------------------------------- The graph in the SNAP data set is 0-based, with nodes numbering 0 to 425,956. In the SuiteSparse Matrix Collection, Problem.A is the undirected collaboration graph, a matrix of size n-by-n with n=317,080, which is the number of unique author id's appearing in any edge. Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if the author with nodeid(i) is a coauthor with the author with nodeid(j). The node id's are the same as the SNAP data set (0-based). C = Problem.aux.Communities_all is a sparse matrix of size n by 13,477, which represents the 13,477 communities in the com-dblp.all.cmty.txt file. The kth line in that file defines the kth community, and is the column C(:,k), where C(i,k)=1 if author nodeid(i) is in the kth community. Row C(i,:) and row/column i of the A matrix thus refer to the same author. There are a few duplicate communities in this list with 13,423 unique communities out of 13,477 total. Ctop = Problem.aux.Communities_top5000 is n-by-5000, with the same structure as the C array above. This list has duplicates, which are preserved here. There are 4,961 unique communities.

Download

MATLAB Rutherford Boeing Matrix Market

Notes

SNAP (Stanford Network Analysis Platform) Large Network Dataset Collection,
Jure Leskovec and Anrej Krevl, http://snap.stanford.edu/data, June 2014.   
email: jure at cs.stanford.edu                                             
                                                                           
DBLP collaboration network and ground-truth communities                    
                                                                           
https://snap.stanford.edu/data/com-DBLP.html                               
                                                                           
Dataset information                                                        
                                                                           
The DBLP (http://dblp.uni-trier.de/) computer science bibliography provides
a comprehensive list of research papers in computer science. We construct a
co-authorship network where two authors are connected if they publish at   
least one paper together.  Publication venue, e.g, journal or conference,  
defines an individual ground-truth community; authors who published to a   
certain journal or conference form a community.                            
                                                                           
We regard each connected component in a group as a separate ground-truth   
community. We remove the ground-truth communities which have less than 3   
nodes.  We also provide the top 5,000 communities with highest quality     
which are described in our paper (http://arxiv.org/abs/1205.6233). As for  
the network, we provide the largest connected component.                   
                                                                           
Dataset statistics                                                         
Nodes   317080                                                             
Edges   1049866                                                            
Nodes in largest WCC    317080 (1.000)                                     
Edges in largest WCC    1049866 (1.000)                                    
Nodes in largest SCC    317080 (1.000)                                     
Edges in largest SCC    1049866 (1.000)                                    
Average clustering coefficient  0.6324                                     
Number of triangles 2224385                                                
Fraction of closed triangles    0.1283                                     
Diameter (longest shortest path)    21                                     
90-percentile effective diameter    8                                      
                                                                           
Source (citation)                                                          
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based 
on Ground-truth. ICDM, 2012.  http://arxiv.org/abs/1205.6233               
                                                                           
Files                                                                      
File    Description                                                        
com-dblp.ungraph.txt.gz Undirected DBLP co-authorship network              
com-dblp.all.cmty.txt.gz    DBLP communities                               
com-dblp.top5000.cmty.txt.gz    DBLP communities (Top 5,000)               
                                                                           
---------------------------------------------------------------------------
Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:      
---------------------------------------------------------------------------
                                                                           
The graph in the SNAP data set is 0-based, with nodes numbering 0 to       
425,956.                                                                   
                                                                           
In the SuiteSparse Matrix Collection, Problem.A is the undirected          
collaboration graph, a matrix of size n-by-n with n=317,080, which is the  
number of unique author id's appearing in any edge.  Problem.aux.nodeid is 
a list of the node id's that appear in the SNAP data set.  A(i,j)=1 if the 
author with nodeid(i) is a coauthor with the author with nodeid(j).        
The node id's are the same as the SNAP data set (0-based).                 
                                                                           
C = Problem.aux.Communities_all is a sparse matrix of size n by 13,477,    
which represents the 13,477 communities in the com-dblp.all.cmty.txt file. 
The kth line in that file defines the kth community, and is the column     
C(:,k), where C(i,k)=1 if author nodeid(i) is in the kth community.  Row   
C(i,:) and row/column i of the A matrix thus refer to the same author.     
There are a few duplicate communities in this list with 13,423 unique      
communities out of 13,477 total.                                           
                                                                           
Ctop = Problem.aux.Communities_top5000 is n-by-5000, with the same         
structure as the C array above.  This list has duplicates, which are       
preserved here.  There are 4,961 unique communities.