SNAP/wiki-topcats

SNAP network: Wikipedia network of top categories

Name	wiki-topcats
Group	SNAP
Matrix ID	2799
Num Rows	1,791,489
Num Cols	1,791,489
Nonzeros	28,511,807
Pattern Entries	28,511,807
Kind	Directed Graph With Communities
Symmetric	No
Date	2011
Author	H. Yin, A. R. Benson, J. Leskovec, D. F. Gleich
Editor	J. Leskovec

Structural Rank
Structural Rank Full
Num Dmperm Blocks
Strongly Connect Components	1
Num Explicit Zeros	0
Pattern Symmetry	21.5%
Numeric Symmetry	21.5%
Cholesky Candidate	no
Positive Definite	no
Type	binary

Download	MATLAB Rutherford Boeing Matrix Market
Notes	SNAP (Stanford Network Analysis Platform) Large Network Dataset Collection, Jure Leskovec and Anrej Krevl, http://snap.stanford.edu/data, June 2014. email: jure at cs.stanford.edu Wikipedia network of top categories https://snap.stanford.edu/data/wiki-topcats.html Dataset information This is a web graph of Wikipedia hyperlinks collected in September 2011. The network was constructed by first taking the largest strongly connected component of Wikipedia, then restricting to pages in the top set of categories (those with at least 100 pages), and finally taking the largest strongly connected component of the restricted graph. In addition to the graph, we also provide the page names of the articles and the categories of the articles. The categories can serve as "ground-truth" communities. The categories are overlapping as each article may be classified into several categories. Dataset statistics Nodes 1,791,489 Edges 28,511,807 Nodes in largest WCC 1791489 (1.000) Edges in largest WCC 28511807 (1.000) Nodes in largest SCC 1791489 (1.000) Edges in largest SCC 28511807 (1.000) Average clustering coefficient 0.2746 Number of triangles 52106893 Fraction of closed triangles 0.00165 Diameter (longest shortest path) 9 90-percentile effective diameter 3.8 Source (citation) Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. "Local Higher-order Graph Clustering." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. Christine Klymko, David F. Gleich, and Tamara G. Kolda. "Using triangles to improve community detection in directed networks." In Proceedings of the ASE BigData Conference. 2014. Files File Description wiki-topcats.txt.gz Hyperlink network of Wikipedia wiki-topcats-categories.txt.gz Which articles are in which of the top categories wiki-topcats-page-names.txt.gz Names of the articles --------------------------------------------------------------------------- Notes on inclusion into the SuiteSparse Matrix Collection, July 2018: --------------------------------------------------------------------------- The SNAP data set is 0-based, with nodes numbered 0 to n-1 with n=1,791,489. It is converted to 1-based in the SuiteSparse Matrix Collection. Problem.A is the directed graph, where A(i,j)=1 if the ith page has a link to the jth page (with i and j in the range 1 to n, so they are 1+ offset from the SNAP node numbers). The number of links is not in the SNAP data set, so if page i has multiple links to page j, it still counts as just a single edge; this is not a multigraph. Problem.aux.pagenames is a char array of size n-by-TODO, with the kth row equal to the name of the kth page, and also the kth line of the text file in the MatrixMarket and Rutherford-Boeing format. In the SNAP data set, the name was prepended with the node number, but that has been removed here since the node numbering has changed from 0-based to 1-based. 81 of the pages have no names; these have been named as "page_#" where # is the node number in the 1-based graph. The wiki-topcats-categories.txt data defines 17,364 categories, and is held in two parts in the SuiteSparse collection. Problem.aux.Category_names is a char array of size 17,364-by-96 with the name of each category (Category_names(k,:) is the name of the kth category). The sparse matrix C = Problem.aux.Categories defines the pages in each category. The kth category is represented as C(:,k), where C(i,k)=1 if page i is in the kth category. All categories in the SNAP data set are preserved, including four empty categories: Category 5207 [Shanghai_Metro_stations] Category 6554 [Disused_railway_stations_in_Pomeranian_Voivodeship] Category 8404 [Colostethus] Category 17358 [Days_in_2004] and thus columns 5207, 6554, 8404, and 17358 are all zero in the matrix C. All top categories consisted of at least 100 pages in the raw data, but only the largest strongly-connected component was kept in the final published SNAP data set.

Download

MATLAB Rutherford Boeing Matrix Market

Notes

SNAP (Stanford Network Analysis Platform) Large Network Dataset Collection,
Jure Leskovec and Anrej Krevl, http://snap.stanford.edu/data, June 2014.   
email: jure at cs.stanford.edu                                             
                                                                           
Wikipedia network of top categories                                        
                                                                           
https://snap.stanford.edu/data/wiki-topcats.html                           
                                                                           
Dataset information                                                        
                                                                           
This is a web graph of Wikipedia hyperlinks collected in September 2011.   
The network was constructed by first taking the largest strongly connected 
component of Wikipedia, then restricting to pages in the top set of        
categories (those with at least 100 pages), and finally taking the largest 
strongly connected component of the restricted graph.                      
                                                                           
In addition to the graph, we also provide the page names of the articles   
and the categories of the articles. The categories can serve as            
"ground-truth" communities. The categories are overlapping as each article 
may be classified into several categories.                                 
                                                                           
Dataset statistics                                                         
Nodes   1,791,489                                                          
Edges   28,511,807                                                         
Nodes in largest WCC    1791489 (1.000)                                    
Edges in largest WCC    28511807 (1.000)                                   
Nodes in largest SCC    1791489 (1.000)                                    
Edges in largest SCC    28511807 (1.000)                                   
Average clustering coefficient  0.2746                                     
Number of triangles 52106893                                               
Fraction of closed triangles    0.00165                                    
Diameter (longest shortest path)    9                                      
90-percentile effective diameter    3.8                                    
                                                                           
Source (citation)                                                          
Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. "Local      
Higher-order Graph Clustering." In Proceedings of the 23rd ACM SIGKDD      
International Conference on Knowledge Discovery and Data Mining. 2017.     
                                                                           
Christine Klymko, David F. Gleich, and Tamara G. Kolda. "Using triangles to
improve community detection in directed networks." In Proceedings of the   
ASE BigData Conference. 2014.                                              
                                                                           
Files                                                                      
File    Description                                                        
wiki-topcats.txt.gz Hyperlink network of Wikipedia                         
wiki-topcats-categories.txt.gz                                             
    Which articles are in which of the top categories                      
wiki-topcats-page-names.txt.gz  Names of the articles                      
                                                                           
---------------------------------------------------------------------------
Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:      
---------------------------------------------------------------------------
                                                                           
The SNAP data set is 0-based, with nodes numbered 0 to n-1 with            
n=1,791,489.  It is converted to 1-based in the SuiteSparse Matrix         
Collection.                                                                
                                                                           
Problem.A is the directed graph, where A(i,j)=1 if the ith page has a link 
to the jth page (with i and j in the range 1 to n, so they are 1+ offset   
from the SNAP node numbers).  The number of links is not in the SNAP data  
set, so if page i has multiple links to page j, it still counts as just a  
single edge; this is not a multigraph.                                     
                                                                           
Problem.aux.pagenames is a char array of size n-by-TODO, with the kth row  
equal to the name of the kth page, and also the kth line of the text file  
in the MatrixMarket and Rutherford-Boeing format.  In the SNAP data set,   
the name was prepended with the node number, but that has been removed here
since the node numbering has changed from 0-based to 1-based.              
                                                                           
81 of the pages have no names; these have been named as "page_#" where # is
the node number in the 1-based graph.                                      
                                                                           
The wiki-topcats-categories.txt data defines 17,364 categories, and is held
in two parts in the SuiteSparse collection.  Problem.aux.Category_names is 
a char array of size 17,364-by-96 with the name of each category           
(Category_names(k,:) is the name of the kth category).  The sparse matrix C
= Problem.aux.Categories defines the pages in each category.  The kth      
category is represented as C(:,k), where C(i,k)=1 if page i is in the kth  
category.                                                                  
                                                                           
All categories in the SNAP data set are preserved, including four empty    
categories:                                                                
                                                                           
    Category 5207 [Shanghai_Metro_stations]                                
    Category 6554 [Disused_railway_stations_in_Pomeranian_Voivodeship]     
    Category 8404 [Colostethus]                                            
    Category 17358 [Days_in_2004]                                          
                                                                           
and thus columns 5207, 6554, 8404, and 17358 are all zero in the matrix C. 
All top categories consisted of at least 100 pages in the raw data, but    
only the largest strongly-connected component was kept in the final        
published SNAP data set.