Counting paths in pandas & networkx#
Welcome back to Cameron’s Corner! It’s the second week of January, and I’m already here to talk about graphs. No, not the kind we make in Matplotlib, but network graphs! This blog post was inspired by a project I’ve been working on: counting the number of indirect connections between two non-adjacent nodes in a bipartite graph.
In graph theory terms, a graph is bipartite if its nodes are segmented into discrete levels, where nodes from one level connect to nodes from another level but never within the same level. Here is an example from Wikipedia of what a complete bipartite graph might look like:
Raw Data#
The raw data for this problem are kept in a relational database, and, thankfully, we know all of the levels of our graph in advance (in this example, we will only have 2 levels). Another criterion for levels > 2 would be that each node in this graph only connects to an adjacent level.
Looking at the data below, our levels are divided into 'source'
and 'target'
. The question we are trying to answer is, “How many indirect paths exist between a source node and another source node?” We constrain ourselves to only consider paths of length 2, meaning our transitions will move from 'source'
→
'target'
→ 'source'
.
from pandas import DataFrame, option_context
topology = DataFrame({
'source': [*'ABCDEFGHICDDEF'],
'target': [*'XXXXYYZZZYYZZZ']
})
with option_context('display.max_rows', 8):
display(topology)
source | target | |
---|---|---|
0 | A | X |
1 | B | X |
2 | C | X |
3 | D | X |
... | ... | ... |
10 | D | Y |
11 | D | Z |
12 | E | Z |
13 | F | Z |
14 rows × 2 columns
The data is currently stored as an edgelist with each row in topology
corresponding
to a single edge between two nodes.
Adjacency Matrix#
Since our data is already in pandas, we can .pivot
it to change our view from
an edgelist into an adjacency matrix. Note that, since the graph is bipartite, we have
disjoint sets of nodes for our 'source'
and 'target'
.
(
topology.assign(edge=True)
.pivot(index='source', columns='target', values='edge', )
.fillna('') # display purposes
)
target | X | Y | Z |
---|---|---|---|
source | |||
A | True | ||
B | True | ||
C | True | True | |
D | True | True | True |
E | True | True | |
F | True | True | |
G | True | ||
H | True | ||
I | True |
Counting in pandas#
The naive approach will require us to traverse this tree structure for a given
root
. Let’s select the starting node root='A'
and see how many other 'source'
nodes we can reach from two hops.
To ease this process, we can set up a list of possible transitions using
pandas.Series
objects. This will allow us to perform quick lookups when
traversing from one level to the next. We essentially take the nodes from one
level to .loc
them from the next! Taking advantage of pandas.Index
is very
useful here.
from itertools import pairwise
transitions = [ # pre-determined transitions
topology[[left, right]].set_index(left).squeeze()
for left, right in pairwise(['source', 'target', 'source'])
]
root = 'A'
nodes = [root]
for transit in transitions:
# remove .unique to get redundant paths
nodes = transit.loc[nodes].unique()
nodes = nodes[nodes != root] # dont include the root
print(nodes.tolist(), len(nodes))
['B', 'C', 'D'] 3
This means that if we start at node 'A'
, we can draw a path to nodes ['B', 'C', 'D']
by taking two hops between our bipartite levels.
If we want to repeat this process for all possible starting nodes, we can use a for-loop like so:
for root in topology['source'].unique():
nodes = [root]
for transit in transitions:
# remove .unique to get redundant paths
nodes = transit.loc[nodes].unique()
nodes = nodes[nodes != root]
print(root, len(nodes), nodes)
A 3 ['B' 'C' 'D']
B 3 ['A' 'C' 'D']
C 5 ['A' 'B' 'D' 'E' 'F']
D 8 ['A' 'B' 'C' 'E' 'F' 'G' 'H' 'I']
E 6 ['F' 'C' 'D' 'G' 'H' 'I']
F 6 ['E' 'C' 'D' 'G' 'H' 'I']
G 5 ['H' 'I' 'D' 'E' 'F']
H 5 ['G' 'I' 'D' 'E' 'F']
I 5 ['G' 'H' 'D' 'E' 'F']
Networkx#
While our data is stored in a tabular format, they do not necessarily represent
tabular data. We can use networkx
to create an undirected Graph
so that
we can perform these lookups at the Python level.
The nice part about using networkx
here is that we do not need to manually
specify the transitions as we did with pandas, we only need use the
number of transitions we can take (2), and the lookups will take care of the rest.
import networkx as nx
from itertools import pairwise, chain
def unique(iterable):
seen = set()
for elem in iterable:
if elem in seen: continue
seen.add(elem)
yield elem
G = nx.from_pandas_edgelist(
topology, source='source', target='target', create_using=nx.Graph
)
root = 'A'
nodes = [root]
for _ in range(2): # maximum number of transitions
# remove unique for redundant paths
nodes = unique(chain.from_iterable(G[n] for n in nodes))
nodes = [n for n in nodes if n != root]
print(root, nodes, len(nodes))
A ['B', 'C', 'D'] 3
In the same fashion as the above pandas example, we can repeat this process for root nodes like so:
for root in topology['source'].unique():
nodes = [root]
for _ in range(2): # maximum number of transitions
# remove unique for redundant paths
nodes = unique(chain.from_iterable(G[n] for n in nodes))
nodes = [n for n in nodes if n != root]
print(root, len(nodes), nodes)
A 3 ['B', 'C', 'D']
B 3 ['A', 'C', 'D']
C 5 ['A', 'B', 'D', 'E', 'F']
D 8 ['A', 'B', 'C', 'E', 'F', 'G', 'H', 'I']
E 6 ['F', 'C', 'D', 'G', 'H', 'I']
F 6 ['E', 'C', 'D', 'G', 'H', 'I']
G 5 ['H', 'I', 'D', 'E', 'F']
H 5 ['G', 'I', 'D', 'E', 'F']
I 5 ['G', 'H', 'D', 'E', 'F']
Wrap-Up#
And there we have it! Two straightforward ways to count the number of paths given our graph topology. While this is all of the path counting we’ll do for today, we do have faster algorithmic approaches we can take to arrive at these results. So, make sure you tune back in next week, when we will find a better way to use the adjacency matrix we made earlier! Talk to you then.