Representing networks#

Adjacency matrices#

The most common representation of a network in a mathematical sense is an adjacency matrix. For a network with \(n\) nodes, an adjacency matrix is an \(n \times n\) matrix, which we’ll often denote \(A\). If there is an edge between node \(i\) and node \(j\), then there is some nonzero element at the \(i, j\)-th element of \(A\), \(a_{ij}\).

For an unweighted network, the convention is for the matrix to have a “1” anywhere there is an edge, and a “0” otherwise. One example of an unweighted network is a social network, where each node is a person. We could say that there is an edge between two people if they are friends on Facebook, and there is not an edge otherwise. You may also hear the term “binary network” used to mean unweighted, since the elements of the adjacency matrix are 0s and 1s.

For a weighted network, \(a_{ij}\) will have a nonzero value representing the weight associated with that edge. For an example of a weighted network, we could consider another social network formed from email communications. We could say that there is an edge between any two people who emailed each other during a given time period, but we could further say that the edge weight is the number of emails between them.

For an undirected network, we are only interested in whether there is a relationship between node \(i\) and node \(j\). There is no distinction between an edge from \(i\) to \(j\) vs an edge from \(j\) to \(i\). Thus, in the adjacency matrix, we have that \(a_{ij} = a_{ji}\). This makes the matrix symmetric: \(A = A^T\). In our social network analogy, an undirected network could come from a network of Facebook friends, where a friendship from person \(i\) to \(j\) necessarily implies that \(j\) is friends with \(i\).

For a directed network, we do not have the constraint above, and \(a_{ij}\) does not necessarily equal \(a_{ji}\). Continuing our social network analogy, Twitter could be modeled as a network of accounts, with directed edges from \(i\) to \(j\) if person \(i\) follows person \(j\). Since Twitter does not require people to follow each other back, this is a directed network.

Adjacency matrices in code#

There are many ways to represent networks in Python. For instance, we can use the adjacency matrix representation described above. We’ll start with an example of a directed graph on 5 nodes:

import numpy as np

n_nodes = 5

A = np.zeros((n_nodes, n_nodes))  # initialize to a graph with no edges

A[0, 1] = 1  # from node 0 to node 1
A[1, 2] = 1  # from node 1 to node 2...
A[1, 4] = 1  # etc, etc.
A[2, 1] = 1
A[2, 3] = 1
A[4, 1] = 1

A
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

We can check whether the network is directed:

equals = A == A.T  # a matrix comparing each element
print("Is this network undirected?")
print(equals.all())  # is this matrix all true?
Is this network undirected?
False

We can do the same thing using graspologic, as well as force the matrix to be symmetric/undirected:

from graspologic.utils import symmetrize, is_symmetric

print(is_symmetric(A))

A_sym = symmetrize(A)
A_sym
False
array([[0. , 0.5, 0. , 0. , 0. ],
       [0.5, 0. , 1. , 0. , 1. ],
       [0. , 1. , 0. , 0.5, 0. ],
       [0. , 0. , 0.5, 0. , 0. ],
       [0. , 1. , 0. , 0. , 0. ]])

Now, we see that we have a weighted network - not all edge weights are equal to 0 or 1. This is because symmetrize used the average of edge weights between the edge \(i \rightarrow j\) and the edge \(j \rightarrow i\) to make the network symmetric/undirected.

If we wanted to make the network unweighted again, we could use the function binarize.

from graspologic.utils import binarize

binarize(A_sym)
array([[0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

Sparse networks#

Most networks in the real world are sparse - this means that they have far more non-edges than edges. For example, consider the example of a Twitter network from before. From a quick Google (I won’t claim that these results are perfectly accurate, but still), there are 100 million Twitter accounts, and the average account has ~700 followers. So on average, this would mean there are roughly \(700 \times 100,000,000 = 70,000,000,000\) total follow relationships on Twitter (70 billion edges!). While this may seem like a lot, consider the total number of possible edges - this would be \(100,000,000^2\), or \(10,000,000,000,000,000\).

What does this tell us about the adjacency matrix for Twitter? It would have far, far more \(0\)’s in it than non-zeros, i.e. there are far, far more non-edges than edges. If we were to try to store the adjacency matrix for Twitter in a numpy array, we would very quickly run out of memory.

Sparse adjacency matrices in code#

Sparse matrices are a wonderful way to get around this problem when working with sparse networks. In Python, the most common sparse matrix format can be found in SciPy. A sparse matrix format typically saves memory (if the data is sparse) by only explicitly representing any non-zero entries, and then assuming that all other elements of the matrix are zero. The main disadvantage to using a sparse matrix is that they are often more difficult to manipulate and modify than a dense array, but sparse matrices can allow us to store or operate on adjacency matrices which would otherwise be too large to store in memory. There are many different kinds of sparse matrices (see both links earlier in this section), but for now, we’ll just focus on the compressed sparse row (CSR) format.

from scipy.sparse import csr_matrix

A_sparse = csr_matrix(A)

A_sparse
<5x5 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

Warning

I do NOT recommend actually building your sparse adjacency matrix by first creating a dense array, and then making a sparse one from it. This defeats the purpose of trying to save memory since you had to build the dense array. If you want to make a sparse array “by hand,” look at the constructor for csr_matrix to see how it can be instantiated with arrays of nonzero values only. Otherwise, if you are starting from networkx, use nx.to_scipy_sparse_matrix(g, ...).

We can verify that the conversion preserved the data itself by converting back to a dense array.

print("Are all elements the same?")
(A == A_sparse.toarray()).all()
Are all elements the same?
True

Many of the algorithms we’ll talk about in this course can be sped up by making use of sparse matrices where appropriate.

NetworkX graphs#

There are many other ways to represent networks in code - probably the most popular of these in Python is by using the library NetworkX, which was created specifically for working with networks.

import networkx as nx

g = nx.Graph()

g.add_edge(0, 1)
g.add_edge(1, 2)
g.add_edge(1, 4)
g.add_edge(2, 1)
g.add_edge(2, 3)

g
<networkx.classes.graph.Graph at 0x13115ce80>

What happens when we convert to an array representation?

nx.to_numpy_array(g, nodelist=[0, 1, 2, 3, 4])
array([[0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

Note what happened in the above - the array we got back might not be exactly what you expected. For instance, we did not specify an edge from node \(4\) to node \(1\), but element \([4, 1]\) has a \(1\) in it.

The NetworkX Graph object always assumes that your graph is undirected. If you’d like to create a directed graph, you need to do so explicitly.

g_directed = nx.DiGraph()

g_directed.add_edge(0, 1)
g_directed.add_edge(1, 2)
g_directed.add_edge(1, 4)
g_directed.add_edge(2, 1)
g_directed.add_edge(2, 3)

nx.to_numpy_array(g_directed, nodelist=[0, 1, 2, 3, 4])
array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

Question

Why did we have to specify the nodelist parameter?

Here’s a hit: look what happens when we don’t:

nx.to_numpy_array(g)
array([[0., 1., 0., 0., 0.],
       [1., 0., 1., 1., 0.],
       [0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])
nx.to_numpy_array(g, nodelist=[0, 1, 2, 3, 4])
array([[0., 1., 0., 0., 0.],
       [1., 0., 1., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

Here’s another hint - we can look at the .nodes attribute of the graph:

g.nodes
NodeView((0, 1, 2, 4, 3))

Note that these nodes are stored in the order in which they were added by default, and not in any sorted order.

This brings up an important point about adjacency matrices - any permutation of the row/column indices of the adjacency matrix are valid. We’ll revisit this point many times throughout the course.

We can also directly create a sparse array representation.

A_sparse = nx.to_scipy_sparse_matrix(g, nodelist=[0, 1, 2, 3, 4])
A_sparse
<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

Bespoke network formats#

Many libraries for dealing with networks create their own data structure for storing networks under the hood.

The advantage of this approach is that the data structure can be optimized to the kind of operations that a specific library often does. See, for example, the graph classes in stellargraph or graph-tool. These data structures usually are implemented in faster, compiled languages like C, too. The main disadvantage of this approach is that it usually requires the user to learn a new format each time they pick up a new library, and it can become more difficult to chain together different operations from different libraries.

In contrast, the approach in NetworkX is to use pure Python for everything under the hood, which tends to make operations slow. Indeed, many algorithms in NetworkX have to convert to a sparse/dense adjacency matrix each time a particular function is called, and others have to iterate through nodes/edges using Python for loops. Both of which are slow.

Question

What are the advantages and disadvantages of using a dense (numpy) adjacency matrix?

What are the advantages and disadvantages of using a sparse (scipy) adjacency matrix?

What are the advantages and disadvantages of using a networkx graph?

What are the advantages and disadvantages of using a bespoke, custom graph object for a particular set of algorithms?