Counting paths in pandas & networkx#

Welcome back to Cameron’s Corner! It’s the second week of January, and I’m already here to talk about graphs. No, not the kind we make in Matplotlib, but network graphs! This blog post was inspired by a project I’ve been working on: counting the number of indirect connections between two non-adjacent nodes in a bipartite graph.

In graph theory terms, a graph is bipartite if its nodes are segmented into discrete levels, where nodes from one level connect to nodes from another level but never within the same level. Here is an example from Wikipedia of what a complete bipartite graph might look like:

Raw Data#

The raw data for this problem are kept in a relational database, and, thankfully, we know all of the levels of our graph in advance (in this example, we will only have 2 levels). Another criterion for levels > 2 would be that each node in this graph only connects to an adjacent level.

Looking at the data below, our levels are divided into 'source' and 'target'. The question we are trying to answer is, “How many indirect paths exist between a source node and another source node?” We constrain ourselves to only consider paths of length 2, meaning our transitions will move from 'source''target''source'.

from pandas import DataFrame, option_context

topology = DataFrame({
    'source': [*'ABCDEFGHICDDEF'],
    'target': [*'XXXXYYZZZYYZZZ']
})

with option_context('display.max_rows', 8):
    display(topology)
source target
0 A X
1 B X
2 C X
3 D X
... ... ...
10 D Y
11 D Z
12 E Z
13 F Z

14 rows × 2 columns

The data is currently stored as an edgelist with each row in topology corresponding to a single edge between two nodes.

Adjacency Matrix#

Since our data is already in pandas, we can .pivot it to change our view from an edgelist into an adjacency matrix. Note that, since the graph is bipartite, we have disjoint sets of nodes for our 'source' and 'target'.

(
    topology.assign(edge=True)
    .pivot(index='source', columns='target', values='edge', )
    .fillna('') # display purposes
)
target X Y Z
source
A True
B True
C True True
D True True True
E True True
F True True
G True
H True
I True

Counting in pandas#

The naive approach will require us to traverse this tree structure for a given root. Let’s select the starting node root='A' and see how many other 'source' nodes we can reach from two hops.

To ease this process, we can set up a list of possible transitions using pandas.Series objects. This will allow us to perform quick lookups when traversing from one level to the next. We essentially take the nodes from one level to .loc them from the next! Taking advantage of pandas.Index is very useful here.

from itertools import pairwise

transitions = [ # pre-determined transitions
    topology[[left, right]].set_index(left).squeeze()
    for left, right in pairwise(['source', 'target', 'source'])
]

root = 'A'
nodes = [root]
for transit in transitions:
    # remove .unique to get redundant paths
    nodes = transit.loc[nodes].unique()
nodes = nodes[nodes != root] # dont include the root
print(nodes.tolist(), len(nodes))
['B', 'C', 'D'] 3

This means that if we start at node 'A', we can draw a path to nodes ['B', 'C', 'D'] by taking two hops between our bipartite levels.

If we want to repeat this process for all possible starting nodes, we can use a for-loop like so:

for root in topology['source'].unique():
    nodes = [root]
    for transit in transitions:
        # remove .unique to get redundant paths
        nodes = transit.loc[nodes].unique()
    nodes = nodes[nodes != root]
    print(root, len(nodes), nodes)
A 3 ['B' 'C' 'D']
B 3 ['A' 'C' 'D']
C 5 ['A' 'B' 'D' 'E' 'F']
D 8 ['A' 'B' 'C' 'E' 'F' 'G' 'H' 'I']
E 6 ['F' 'C' 'D' 'G' 'H' 'I']
F 6 ['E' 'C' 'D' 'G' 'H' 'I']
G 5 ['H' 'I' 'D' 'E' 'F']
H 5 ['G' 'I' 'D' 'E' 'F']
I 5 ['G' 'H' 'D' 'E' 'F']

Networkx#

While our data is stored in a tabular format, they do not necessarily represent tabular data. We can use networkx to create an undirected Graph so that we can perform these lookups at the Python level.

The nice part about using networkx here is that we do not need to manually specify the transitions as we did with pandas, we only need use the number of transitions we can take (2), and the lookups will take care of the rest.

import networkx as nx
from itertools import pairwise, chain

def unique(iterable):
    seen = set()
    for elem in iterable:
        if elem in seen: continue
        seen.add(elem)
        yield elem

G = nx.from_pandas_edgelist(
    topology, source='source', target='target', create_using=nx.Graph
)

root = 'A'
nodes = [root]
for _ in range(2): # maximum number of transitions
    # remove unique for redundant paths
    nodes = unique(chain.from_iterable(G[n] for n in nodes))
nodes = [n for n in nodes if n != root]
print(root, nodes, len(nodes))
A ['B', 'C', 'D'] 3

In the same fashion as the above pandas example, we can repeat this process for root nodes like so:

for root in topology['source'].unique():
    nodes = [root]
    for _ in range(2): # maximum number of transitions
        # remove unique for redundant paths
        nodes = unique(chain.from_iterable(G[n] for n in nodes))
    nodes = [n for n in nodes if n != root]
    print(root, len(nodes), nodes)
A 3 ['B', 'C', 'D']
B 3 ['A', 'C', 'D']
C 5 ['A', 'B', 'D', 'E', 'F']
D 8 ['A', 'B', 'C', 'E', 'F', 'G', 'H', 'I']
E 6 ['F', 'C', 'D', 'G', 'H', 'I']
F 6 ['E', 'C', 'D', 'G', 'H', 'I']
G 5 ['H', 'I', 'D', 'E', 'F']
H 5 ['G', 'I', 'D', 'E', 'F']
I 5 ['G', 'H', 'D', 'E', 'F']

Wrap-Up#

And there we have it! Two straightforward ways to count the number of paths given our graph topology. While this is all of the path counting we’ll do for today, we do have faster algorithmic approaches we can take to arrive at these results. So, make sure you tune back in next week, when we will find a better way to use the adjacency matrix we made earlier! Talk to you then.