Elliptic Cryptocurrency Transactions
Author: Adi Kondepudi
https://www.kaggle.com/datasets/ellipticco/elliptic-data-set?resource=download
The Elliptic data set maps Bitcoin transfers between entities, both licit and illicit.
There are 203,769 nodes and 234,355 edges. Nodes are entities and edges are transactions.
Among the nodes, about 2% (4,545) are labeled illicit and about 21% (42,019) are labeled licit. The rest are unknown.
There are 166 features for each node.
The first feature is the time step for that node. Each time step represents a connected component of nodes whose edges (transactions) all occurred within three hours of each other.
The next 93 features describe the transactions made by that node (e.g. fees, volume, averages).
The last 72 features are aggregates of these features over adjacent nodes.
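As a quick sanity check on the numbers above, a minimal sketch like the following reproduces the node, edge, and class counts (assuming the CSVs sit in ./elliptic/, as in the cells below):
# Hedged sanity check of the counts quoted above.
import pandas as pd
classes = pd.read_csv("./elliptic/elliptic_txs_classes.csv")
edges = pd.read_csv("./elliptic/elliptic_txs_edgelist.csv")
print(len(classes), len(edges))          # ~203,769 nodes, ~234,355 edges
print(classes["class"].value_counts())   # '1' = illicit, '2' = licit, rest 'unknown'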
# Importing general libraries.
# Specific functions will be imported separately in later cells.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import graspologic
# reading data
df_features = pd.read_csv("./elliptic/elliptic_txs_features.csv", header=None)
df_classes = pd.read_csv("./elliptic/elliptic_txs_classes.csv")
df_edgelist = pd.read_csv("./elliptic/elliptic_txs_edgelist.csv")
# mapping class codes to readable labels and renaming columns
df_classes.loc[df_classes['class'] == '1', 'class'] = "illicit"
df_classes.loc[df_classes['class'] == '2', 'class'] = "licit"
df_features.columns = ["id", "time step"] + [f"local_feat_{i}" for i in range(93)] + [f"agg_feat_{i}" for i in range(72)]
df_classes.columns = ["id", "class"]
# adding class data
df = pd.merge(df_features, df_classes, how="inner", on="id")
second_column = df.pop('class')
df.insert(1, 'class', second_column)
df.head()
id | class | time step | local_feat_0 | local_feat_1 | local_feat_2 | local_feat_3 | local_feat_4 | local_feat_5 | local_feat_6 | ... | agg_feat_62 | agg_feat_63 | agg_feat_64 | agg_feat_65 | agg_feat_66 | agg_feat_67 | agg_feat_68 | agg_feat_69 | agg_feat_70 | agg_feat_71 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 230425980 | unknown | 1 | -0.171469 | -0.184668 | -1.201369 | -0.121970 | -0.043875 | -0.113002 | -0.061584 | ... | -0.562153 | -0.600999 | 1.461330 | 1.461369 | 0.018279 | -0.087490 | -0.131155 | -0.097524 | -0.120613 | -0.119792 |
1 | 5530458 | unknown | 1 | -0.171484 | -0.184668 | -1.201369 | -0.121970 | -0.043875 | -0.113002 | -0.061584 | ... | 0.947382 | 0.673103 | -0.979074 | -0.978556 | 0.018279 | -0.087490 | -0.131155 | -0.097524 | -0.120613 | -0.119792 |
2 | 232022460 | unknown | 1 | -0.172107 | -0.184668 | -1.201369 | -0.121970 | -0.043875 | -0.113002 | -0.061584 | ... | 0.670883 | 0.439728 | -0.979074 | -0.978556 | -0.098889 | -0.106715 | -0.131155 | -0.183671 | -0.120613 | -0.119792 |
3 | 232438397 | licit | 1 | 0.163054 | 1.963790 | -0.646376 | 12.409294 | -0.063725 | 9.782742 | 12.414558 | ... | -0.577099 | -0.613614 | 0.241128 | 0.241406 | 1.072793 | 0.085530 | -0.131155 | 0.677799 | -0.120613 | -0.119792 |
4 | 230460314 | unknown | 1 | 1.011523 | -0.081127 | -1.201369 | 1.153668 | 0.333276 | 1.312656 | -0.061584 | ... | -0.511871 | -0.400422 | 0.517257 | 0.579382 | 0.018279 | 0.277775 | 0.326394 | 1.293750 | 0.178136 | 0.179117 |
5 rows × 168 columns
# Grouping dataset by timestep and sorting from largest to smallest.
# Time step 1 is the largest with 7880 entities and time step 27 is the smallest with 1089 entities.
df2 = df.groupby(['time step'])['time step'].count().sort_values(ascending=False)
df2
time step
1 7880
42 7140
5 6803
10 6727
3 6621
36 6393
7 6048
22 5894
4 5693
45 5598
35 5507
41 5342
47 5121
43 5063
9 4996
44 4975
24 4592
2 4544
13 4528
32 4525
40 4481
8 4457
6 4328
11 4296
20 4291
29 4275
23 4165
15 3639
21 3537
46 3519
19 3506
17 3385
37 3306
33 3151
16 2975
48 2954
38 2891
31 2816
39 2760
26 2523
34 2486
30 2483
49 2454
25 2314
12 2047
14 2022
18 1976
28 1653
27 1089
Name: time step, dtype: int64
grouped = df.groupby(['time step', 'class'])['id'].count().reset_index().rename(columns={'id': 'count'})
ax = sns.lineplot(x='time step', y='count', hue='class', data=grouped)
sns.move_legend(ax, "lower center", bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False)
grouped_illicit = grouped[grouped["class"] == "illicit"]
grouped_illicit.sort_values(by="count", ascending=False)
time step | class | count | |
---|---|---|---|
93 | 32 | illicit | 342 |
84 | 29 | illicit | 329 |
36 | 13 | illicit | 291 |
57 | 20 | illicit | 260 |
24 | 9 | illicit | 248 |
123 | 42 | illicit | 239 |
102 | 35 | illicit | 182 |
63 | 22 | illicit | 158 |
42 | 15 | illicit | 147 |
69 | 24 | illicit | 137 |
30 | 11 | illicit | 131 |
45 | 16 | illicit | 128 |
72 | 25 | illicit | 118 |
120 | 41 | illicit | 116 |
117 | 40 | illicit | 112 |
111 | 38 | illicit | 111 |
90 | 31 | illicit | 106 |
18 | 7 | illicit | 102 |
60 | 21 | illicit | 100 |
48 | 17 | illicit | 99 |
75 | 26 | illicit | 96 |
81 | 28 | illicit | 85 |
87 | 30 | illicit | 83 |
114 | 39 | illicit | 81 |
54 | 19 | illicit | 80 |
21 | 8 | illicit | 67 |
144 | 49 | illicit | 56 |
66 | 23 | illicit | 53 |
51 | 18 | illicit | 52 |
39 | 14 | illicit | 43 |
108 | 37 | illicit | 40 |
99 | 34 | illicit | 37 |
141 | 48 | illicit | 36 |
105 | 36 | illicit | 33 |
9 | 4 | illicit | 30 |
78 | 27 | illicit | 24 |
126 | 43 | illicit | 24 |
129 | 44 | illicit | 24 |
96 | 33 | illicit | 23 |
138 | 47 | illicit | 22 |
27 | 10 | illicit | 18 |
3 | 2 | illicit | 18 |
0 | 1 | illicit | 17 |
33 | 12 | illicit | 16 |
6 | 3 | illicit | 11 |
12 | 5 | illicit | 8 |
132 | 45 | illicit | 5 |
15 | 6 | illicit | 5 |
135 | 46 | illicit | 2 |
sns.lineplot(x='time step', y='count', data=grouped_illicit)
<AxesSubplot: xlabel='time step', ylabel='count'>
# Using PyTorch to train a neural network to classify the unknown nodes
import torch
from torch.utils.data import DataLoader, TensorDataset
# df with known labels (licit/illicit only)
df_licit_illicit = df[df["class"] != "unknown"]
y = df_licit_illicit["class"].values
X = df_licit_illicit.drop(["class", "id", "time step"], axis=1).values
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = y.ravel()
y = le.fit_transform(y)
X_tensor = torch.from_numpy(X).float()
y_tensor = torch.from_numpy(y).long()
# Create a TensorDataset to hold the data
data = TensorDataset(X_tensor, y_tensor)
# Split the data into training and test sets
train_data, test_data = torch.utils.data.random_split(data, [int(0.8 * len(data)), len(data) - int(0.8 * len(data))])
# Create data loaders for the training and test sets
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=True)
# Define the neural network
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = torch.nn.Linear(in_features=X.shape[1], out_features=64)
        self.fc2 = torch.nn.Linear(in_features=64, out_features=64)
        self.fc3 = torch.nn.Linear(in_features=64, out_features=len(set(y)))

    def forward(self, x):
        x = torch.nn.functional.relu(self.fc1(x))
        x = torch.nn.functional.relu(self.fc2(x))
        x = self.fc3(x)
        return x
# Create an instance of the network
net = Net()
# Define the loss function and optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
# Train the network
for epoch in range(10):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
# Test the network
with torch.no_grad():
    correct = 0
    total = 0
    for inputs, labels in test_loader:
        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy on test data: {correct / total}')
# creating df of unknown values
df_unknown = df[df["class"] == "unknown"]
df_unknown = df_unknown.drop(['class', 'time step', "id"], axis=1)
df_unknown = df_unknown.values
df_unknown_tensor = torch.from_numpy(df_unknown).float()
# Use the neural network to classify the new data
with torch.no_grad():
    # Run the new data through the network
    output = net(df_unknown_tensor)
    # Get the predicted class for each sample
    _, predicted = torch.max(output.data, 1)
predicted = predicted.cpu()
predicted = le.inverse_transform(predicted)
df_df_unknown = df[df["class"] == "unknown"]
df_df_unknown = df_df_unknown.drop(['class', 'time step', "id"], axis=1)
df_df_unknown['classes'] = predicted.tolist()
data = df_df_unknown["classes"]
sns.histplot(data, bins=20)
plt.title("Distribution of licit and illicit of previously unknown entities")
plt.show()
Accuracy on test data: 0.9815311929560829
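The labeled data is heavily imbalanced (roughly nine licit nodes for every illicit one), so overall accuracy can overstate performance on the rare illicit class. A minimal sketch of a per-class evaluation, reusing net, test_loader, and le from above:
# Hedged sketch: per-class precision/recall, since accuracy alone can hide
# poor recall on the rare illicit class. Reuses net, test_loader, and le from above.
from sklearn.metrics import classification_report
all_preds, all_labels = [], []
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)
        all_preds.extend(predicted.tolist())
        all_labels.extend(labels.tolist())
print(classification_report(all_labels, all_preds, target_names=le.classes_))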
# function to create a networkx graph for a given time step
def plot(i):
    # ids of all nodes in time step i
    time_step_i = df.loc[(df['time step'] == i), 'id']
    # edges whose source node is in that time step
    time_step_i = df_edgelist.loc[df_edgelist['txId1'].isin(time_step_i)]
    g = nx.from_pandas_edgelist(time_step_i, source='txId1', target='txId2', create_using=nx.DiGraph())
    return g
from graspologic.utils import is_fully_connected
for i in range(1, 50):
    time_step_i = df.loc[(df['time step'] == i), 'id']
    time_step_i = df_edgelist.loc[df_edgelist['txId1'].isin(time_step_i)]
    g = nx.from_pandas_edgelist(time_step_i, source='txId1', target='txId2', create_using=nx.DiGraph())
    if is_fully_connected(g):
        # print("Time Step ", i, "forms a single connected component.")
        pass
    else:
        if nx.is_weakly_connected(g):
            print("Time Step ", i, "is weakly connected.")
        else:
            print("Time Step ", i, "isn't connected.")
# Because nothing was printed, every time step forms a single (weakly) connected component.
# This matches what Elliptic says about the data: each time step is a single connected component.
From this point forward, we’ll be doing most of our measures on time step 27. Not only will this reduce our processing time, our measures will also make more sense because all nodes within time step 27 belong to a single connected component. We chose time step 27 because it is the time step with the fewest nodes.
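A quick way to confirm that choice, reusing the df built above:
# Hedged sketch: confirm time step 27 has the fewest nodes.
counts = df['time step'].value_counts()
print(counts.idxmin(), counts.min())  # expected: 27 1089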
# Computing the eigenvector centrality
def map_to_nodes(node_map):
    # utility function to align a {node: value} dict with the global `nodelist`,
    # filling 0 for any node that is missing from the dict
    return np.array([node_map.get(x, 0) for x in nodelist])
g = plot(27)
nodelist = list(g.nodes)
eg_centrality = map_to_nodes(dict(nx.eigenvector_centrality(g, max_iter=1000)))
eg_centrality.sort()
betweenness_centrality = map_to_nodes(nx.betweenness_centrality(g))
betweenness_centrality.sort()
# plotting Time Step 27
import time
from pathlib import Path
import pandas as pd
import networkx as nx
import numpy as np
from scipy.sparse import csr_array
from graspologic.partition import leiden
g = plot(27)
nodelist = list(g.nodes)
adj = nx.to_scipy_sparse_array(g, nodelist=nodelist)
# the function below converts the graph to undirected;
# we can use the networkx conversion directly because our graph is unweighted
def symmetrze_nx(g):
    """Leiden requires a symmetric/undirected graph. This converts a directed graph to
    undirected just for this community detection step"""
    sym_g = g.to_undirected()
    return sym_g
sym_g = symmetrze_nx(g)
out2 = leiden(sym_g)
node_df = pd.Series(out2)
node_df.index.name = "node_id"
node_df.name = "community"
node_df = node_df.to_frame()
nodelist = node_df.index
adj = nx.to_scipy_sparse_array(g, nodelist=nodelist)
node_df["strength"] = adj.sum(axis=1) + adj.sum(axis=0)
node_df['rank_strength'] = node_df['strength'].rank(method='dense')
from graspologic.embed import LaplacianSpectralEmbed
from graspologic.utils import pass_to_ranks
ptr_adj = pass_to_ranks(adj)
lse = LaplacianSpectralEmbed(n_components=32, concat=True)
lse_embedding = lse.fit_transform(adj)
from umap import UMAP
n_components = 32
n_neighbors = 32
min_dist = 0.8
metric = "cosine"
umap = UMAP(
n_components=2,
n_neighbors=n_neighbors,
min_dist=min_dist,
metric=metric,
)
umap_embedding = umap.fit_transform(lse_embedding)
node_df["x"] = umap_embedding[:, 0]
node_df["y"] = umap_embedding[:, 1]
def subsample_edges(adjacency, n_edges_kept=100_000):
    row_inds, col_inds = np.nonzero(adjacency)
    n_edges = len(row_inds)
    if n_edges_kept > n_edges:
        return adjacency
    choice_edge_inds = np.random.choice(n_edges, size=n_edges_kept, replace=False)
    row_inds = row_inds[choice_edge_inds]
    col_inds = col_inds[choice_edge_inds]
    data = adjacency[row_inds, col_inds]
    return csr_array((data, (row_inds, col_inds)), shape=adjacency.shape)
from graspologic.plot import networkplot
# this step is optional; it may not be needed depending on how many edges you have
sub_adj = subsample_edges(adj, 100_000)
ax = networkplot(
sub_adj,
x="x",
y="y",
node_data=node_df,
node_size="rank_strength",
node_sizes=(10, 80),
figsize=(10, 10),
node_hue="community",
edge_linewidth=0.3,
)
ax.axis("off")
(-8.944562411308288,
15.430572485923767,
-3.235624873638153,
16.602758252620696)
node_df_grouped = node_df.groupby(['community'])['community'].count().sort_values(ascending=False)
node_df_grouped
node_df_largest_communities = node_df[node_df["community"].isin([1, 2, 5, 12, 22])].copy()
node_df_largest_communities
community | strength | rank_strength | x | y | |
---|---|---|---|---|---|
node_id | |||||
294468623 | 1 | 1 | 1.0 | -7.083713 | 6.587129 |
294326756 | 2 | 1 | 1.0 | 11.408134 | 14.842425 |
294370619 | 22 | 2 | 2.0 | 7.499983 | 10.402857 |
294372200 | 1 | 1 | 1.0 | -7.002121 | 6.867332 |
294324191 | 1 | 1 | 1.0 | -7.139164 | 7.864409 |
... | ... | ... | ... | ... | ... |
1891081 | 1 | 95 | 25.0 | 8.650565 | 13.803895 |
294374011 | 1 | 1 | 1.0 | -6.657413 | 6.512280 |
294300722 | 22 | 1 | 1.0 | 6.279359 | 2.508360 |
294375201 | 1 | 1 | 1.0 | -6.410471 | 7.060765 |
1757629 | 22 | 1 | 1.0 | 8.746396 | 6.634493 |
228 rows × 5 columns
# plotting five largest communities within Time Step 27
import time
from pathlib import Path
import pandas as pd
import networkx as nx
import numpy as np
from scipy.sparse import csr_array
from graspologic.partition import leiden
# reusing g, node_df, and the Leiden communities computed above
nodelist = node_df_largest_communities.index
adj = nx.to_scipy_sparse_array(g, nodelist=nodelist)
# node_df["strength"] = adj.sum(axis=1) + adj.sum(axis=0)
# node_df['rank_strength'] = node_df['strength'].rank(method='dense')
from graspologic.embed import LaplacianSpectralEmbed
from graspologic.utils import pass_to_ranks
ptr_adj = pass_to_ranks(adj)
lse = LaplacianSpectralEmbed(n_components=32, concat=True)
lse_embedding = lse.fit_transform(adj)
from umap import UMAP
n_components = 32
n_neighbors = 32
min_dist = 0.8
metric = "cosine"
umap = UMAP(
n_components=2,
n_neighbors=n_neighbors,
min_dist=min_dist,
metric=metric,
)
umap_embedding = umap.fit_transform(lse_embedding)
node_df_largest_communities["x"] = umap_embedding[:, 0]
node_df_largest_communities["y"] = umap_embedding[:, 1]
def subsample_edges(adjacency, n_edges_kept=100_000):
    row_inds, col_inds = np.nonzero(adjacency)
    n_edges = len(row_inds)
    if n_edges_kept > n_edges:
        return adjacency
    choice_edge_inds = np.random.choice(n_edges, size=n_edges_kept, replace=False)
    row_inds = row_inds[choice_edge_inds]
    col_inds = col_inds[choice_edge_inds]
    data = adjacency[row_inds, col_inds]
    return csr_array((data, (row_inds, col_inds)), shape=adjacency.shape)
from graspologic.plot import networkplot
# this step is optional; it may not be needed depending on how many edges you have
sub_adj = subsample_edges(adj, 100_000)
ax = networkplot(
sub_adj,
x="x",
y="y",
node_data=node_df_largest_communities,
node_size="rank_strength",
node_sizes=(10, 80),
figsize=(10, 10),
node_hue="community",
edge_linewidth=0.3,
)
ax.axis("off")
/Users/adikondepudi/Library/Python/3.9/lib/python/site-packages/graspologic/embed/base.py:199: UserWarning: Input graph is not fully connected. Results may notbe optimal. You can compute the largest connected component byusing ``graspologic.utils.largest_connected_component``.
warnings.warn(msg, UserWarning)
(-3.685612750053406,
19.581494879722595,
-4.315412318706512,
11.561751401424408)
import pandas as pd
import matplotlib.pyplot as plt
from graspologic.plot import networkplot
import seaborn as sns
from matplotlib import colors
g = plot(27)
nodelist = list(g.nodes)
A = nx.to_numpy_array(g, nodelist=nodelist)
node_data = pd.DataFrame(index=g.nodes())
node_data["degree"] = node_data.index.map(dict(nx.degree(g)))
node_data["eigenvector"] = node_data.index.map(nx.eigenvector_centrality(g, max_iter = 1000))
node_data["pagerank"] = node_data.index.map(nx.pagerank(g))
node_data["betweenness"] = node_data.index.map(nx.betweenness_centrality(g))
pos = nx.kamada_kawai_layout(g)
node_data["x"] = [pos[node][0] for node in node_data.index]
node_data["y"] = [pos[node][1] for node in node_data.index]
sns.set_context("talk", font_scale=1.5)
fig, axs = plt.subplots(1, 4, figsize=(20, 5))
def plot_node_scaled_network(A, node_data, key, ax):
    # REF: https://github.com/mwaskom/seaborn/blob/9425588d3498755abd93960df4ab05ec1a8de3ef/seaborn/_core.py#L215
    levels = list(np.sort(node_data[key].unique()))
    cmap = sns.color_palette("Blues", as_cmap=True)
    vmin = np.min(levels)
    norm = colors.Normalize(vmin=0.3 * vmin)
    palette = dict(zip(levels, cmap(norm(levels))))
    networkplot(
        A,
        node_data=node_data,
        x="x",
        y="y",
        ax=ax,
        edge_linewidth=1.0,
        node_size=key,
        node_hue=key,
        palette=palette,
        node_sizes=(5, 200),
        node_kws=dict(linewidth=1, edgecolor="black"),
        node_alpha=1.0,
        edge_kws=dict(color=sns.color_palette()[0]),
    )
    ax.axis("off")
    ax.set_title(key.capitalize())
ax = axs[0]
plot_node_scaled_network(A, node_data, "degree", ax)
ax = axs[1]
plot_node_scaled_network(A, node_data, "eigenvector", ax)
ax = axs[2]
plot_node_scaled_network(A, node_data, "pagerank", ax)
ax = axs[3]
plot_node_scaled_network(A, node_data, "betweenness", ax)
fig.set_facecolor("w")
sns.pairplot(
node_data, vars=["degree", "eigenvector", "pagerank", "betweenness"], diag_kind="kde", height=4
)
<seaborn.axisgrid.PairGrid at 0x2e6458d90>
Here we try some community detection; in this case we’re attempting to divide the network into two communities. This is mostly just to test the algorithms and learn the techniques, as our network isn’t really meant to be split into two communities.
First, we’ll start with naive optimization.
import networkx as nx
import pandas as pd
g = plot(27)
nodelist = list(g.nodes)
node_df = pd.DataFrame(index=nodelist)
pos = nx.kamada_kawai_layout(g)
xs = []
ys = []
for node in nodelist:
    xs.append(pos[node][0])
    ys.append(pos[node][1])
xs = np.array(xs)
ys = np.array(ys)
node_df["x"] = xs
node_df["y"] = ys
adj = nx.to_numpy_array(g, nodelist=nodelist)
n = adj.shape[0]
node_df["degree"] = adj.sum(axis=0)
# random bernoullis
rng = np.random.default_rng(888)
partition = rng.binomial(1, 0.5, size=n)
partition
array([0, 0, 0, ..., 0, 0, 1])
def modularity_from_adjacency(sym_adj, partition, resolution=1):
    # accept either a {node_index: community} dict or an array of labels
    if isinstance(partition, dict):
        partition_labels = np.vectorize(partition.__getitem__)(
            np.arange(sym_adj.shape[0])
        )
    else:
        partition_labels = partition
    partition_labels = np.array(partition_labels)
    # mask of node pairs that fall in the same community
    in_comm_mask = partition_labels[:, None] == partition_labels[None, :]
    degrees = np.squeeze(np.asarray(sym_adj.sum(axis=0)))
    # expected edge weight under the configuration (null) model
    degree_prod_mat = np.outer(degrees, degrees) / sym_adj.sum()
    mod_mat = sym_adj - resolution * degree_prod_mat
    # sum over within-community pairs, normalized by total edge weight
    return mod_mat[in_comm_mask].sum() / sym_adj.sum()
mod_score = modularity_from_adjacency(adj, partition)
mod_score
-0.014742446988177846
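A random 50/50 split scores essentially zero, as expected. For reference, the quantity the function above computes is the standard modularity (written here for a symmetric adjacency matrix with total edge weight $2m$ and resolution $\gamma$):

$$Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \gamma\,\frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$$

where $k_i$ is the degree of node $i$ and $\delta(c_i, c_j)$ is 1 when nodes $i$ and $j$ share a community and 0 otherwise.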
from graspologic.plot import networkplot, heatmap
node_df["random_partition"] = partition
def plot_network_partition(adj, node_data, partition_key):
    fig, axs = plt.subplots(1, 2, figsize=(15, 7))
    networkplot(
        adj,
        x="x",
        y="y",
        node_data=node_data.reset_index(),
        node_alpha=0.9,
        edge_alpha=0.7,
        edge_linewidth=0.4,
        node_hue=partition_key,
        node_size="degree",
        edge_hue="source",
        ax=axs[0],
    )
    _ = axs[0].axis("off")
    _ = heatmap(
        adj,
        inner_hier_labels=node_data[partition_key],
        ax=axs[1],
        cbar=False,
        cmap="Purples",
        vmin=0,
        center=None,
        sort_nodes=True,
    )
    return fig, axs
plot_network_partition(adj, node_df, "random_partition")
(<Figure size 1500x700 with 4 Axes>,
<AxesSubplot: title={'center': 'Betweenness'}>)
n_trials = 10000
best_scores = []
best_iterations = []
last_mod_score = mod_score
for iteration in range(n_trials):
    # choose a random node to perturb
    index = rng.choice(n)
    # check what group it is currently in, and swap it
    current_group = partition[index]
    if current_group:
        partition[index] = 0
    else:
        partition[index] = 1
    # compute modularity with this slightly modified partition
    mod_score = modularity_from_adjacency(adj, partition)
    # decide whether to keep that change or not
    if mod_score < last_mod_score:
        # swap it back
        if current_group:
            partition[index] = 1
        else:
            partition[index] = 0
    else:
        last_mod_score = mod_score
        best_scores.append(mod_score)
        best_iterations.append(iteration)
mod_score
# took ~95 seconds!
0.401541095890411
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
sns.scatterplot(x=best_iterations, y=best_scores, ax=ax, s=40, linewidth=0)
ax.set(ylabel="Modularity", xlabel="Iteration")
ax.spines.right.set_visible(False)
ax.spines.top.set_visible(False)
node_df["naive_partition"] = partition
plot_network_partition(adj, node_df, "naive_partition")
(<Figure size 1500x700 with 4 Axes>,
<AxesSubplot: xlabel='Iteration', ylabel='Modularity'>)
Both spectral and Leiden optimization require the network to be undirected. The following code converts the directed network into an undirected one.
def symmetrze_nx(g):
    sym_g = g.to_undirected()
    return sym_g
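Because this graph is unweighted, to_undirected() is all we need. As a hedged aside, if the edges carried weights, one could instead symmetrize the adjacency matrix directly, for example by averaging the weights in each direction:
# Minimal sketch (not used below): symmetrize a weighted adjacency matrix (as a
# numpy array) so that w(i, j) == w(j, i); only needed for weighted graphs.
def symmetrize_adjacency(weighted_adj):
    return (weighted_adj + weighted_adj.T) / 2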
Spectral optimization:
g = symmetrze_nx(g)
B = nx.modularity_matrix(g, nodelist=nodelist)
eigenvalues, eigenvectors = np.linalg.eigh(B)
first_eigenvector = np.squeeze(np.asarray(eigenvectors[:, -1]))
eig_partition = first_eigenvector.copy()
eig_partition[eig_partition > 0] = 1
eig_partition[eig_partition <= 0] = 0
eig_partition = eig_partition.astype(int)
modularity_from_adjacency(adj, eig_partition)
<class 'networkx.utils.decorators.argmap'> compilation 25:5: FutureWarning: modularity_matrix will return a numpy array instead of a matrix in NetworkX 3.0.
0.24689053996997573
node_df["eig_partition"] = eig_partition
plot_network_partition(adj, node_df, "eig_partition")
(<Figure size 1500x700 with 4 Axes>,
<AxesSubplot: xlabel='Iteration', ylabel='Modularity'>)
That was pretty bad… let’s try leiden!
from graspologic.partition import leiden, modularity
leiden_partition_map = leiden(g, random_seed=7)
nx.set_edge_attributes(g, 1, "weight")
modularity(g, leiden_partition_map)
0.9048667960686809
node_df["leiden_partition"] = node_df.index.map(leiden_partition_map)
plot_network_partition(adj, node_df, "leiden_partition")
(<Figure size 1500x700 with 4 Axes>,
<AxesSubplot: xlabel='Iteration', ylabel='Modularity'>)
g = plot(27)
in_degree = list(g.in_degree())
df_in_degree = pd.DataFrame(in_degree, columns =['id', 'in degree'])
out_degree = list(g.out_degree())
df_out_degree = pd.DataFrame(out_degree, columns =['id', 'out degree'])
df_in_out_degree = pd.merge(df_in_degree, df_out_degree, how="inner", on="id")
df_in_out_degree_1_1 = df_in_out_degree.loc[(df_in_out_degree['in degree'] == 1) & (df_in_out_degree['out degree'] == 1)]
df_in_out_degree_0_1 = df_in_out_degree.loc[(df_in_out_degree['in degree'] == 0) & (df_in_out_degree['out degree'] == 1)]
df_in_out_degree_1_0 = df_in_out_degree.loc[(df_in_out_degree['in degree'] == 1) & (df_in_out_degree['out degree'] == 0)]
print(df_in_out_degree_1_1.shape)
print(df_in_out_degree_0_1.shape)
print(df_in_out_degree_1_0.shape)
print(df_in_out_degree.shape)
(348, 3)
(302, 3)
(211, 3)
(1089, 3)
348 of the 1089 nodes (32%) have an in degree of 1 and an out degree of 1.
302 of the 1089 nodes (28%) have an in degree of 0 and an out degree of 1.
211 of the 1089 nodes (19%) have an in degree of 1 and an out degree of 0.
If we look at the graph above, we can see there are chains of nodes. It seems that, occasionally, transactions pass through a chain of entities: each entity simply receives the transaction from one entity and transfers it to the next. We can reason that this could be part of a money laundering operation, specifically layering. In this step, money is transferred through multiple accounts to increase the distance between the money and its original source. These entities are likely shell companies, trusts, or other corporate vehicles that aren’t required to reveal their true owner.
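As a hedged illustration of that idea, the sketch below (reusing plot(27) from above; chain_length_from is a hypothetical helper, not part of the original analysis) walks forward from each pass-through node, i.e. a node with in degree 1 and out degree 1, to estimate how long these chains get:
# Hedged sketch: estimate the length of pass-through chains in time step 27.
g = plot(27)
pass_through = {n for n in g.nodes if g.in_degree(n) == 1 and g.out_degree(n) == 1}
def chain_length_from(node):
    # follow successors for as long as they are also pass-through nodes
    length = 0
    seen = set()
    while node in pass_through and node not in seen:
        seen.add(node)
        length += 1
        node = next(iter(g.successors(node)))
    return length
# longest chain that starts right after a non-pass-through node
lengths = [chain_length_from(succ)
           for n in g.nodes if n not in pass_through
           for succ in g.successors(n)]
print(max(lengths, default=0))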
copy_df_in_out_degree = df_in_out_degree
print(copy_df_in_out_degree.shape)
# drop the rows whose 'id' appears in each of the special in/out-degree groups above
copy_df_in_out_degree = copy_df_in_out_degree[~copy_df_in_out_degree['id'].isin(df_in_out_degree_1_1['id'])]
print(copy_df_in_out_degree.shape)
copy_df_in_out_degree = copy_df_in_out_degree[~copy_df_in_out_degree['id'].isin(df_in_out_degree_0_1['id'])]
print(copy_df_in_out_degree.shape)
copy_df_in_out_degree = copy_df_in_out_degree[~copy_df_in_out_degree['id'].isin(df_in_out_degree_1_0['id'])]
print(copy_df_in_out_degree.shape)
(741, 3)
(532, 3)
(288, 3)
(110, 3)