Calculate contract similarity through Bytecode

Recently, the Forta Foundation asked me to help them create a monitoring robot that can determine whether a created contract is similar to any contract in a collection of malicious contracts. I have been reading dozens of papers these days and finally completed this requirement satisfactorily. The whole task feels very interesting and challenging, and the final project can be found here.

What is Forta Network#

Forta Network

The official website of Forta provides the following introduction:

The leading decentralized security & operational monitoring network for wallets, developers, and investors

In simple terms, Forta Network is a distributed computing platform for security monitoring. Developers can first develop a Forta Bot according to their needs, which can handle every transaction on the blockchain. Then developers can upload the Forta Bot to the Forta Network, allowing it to process blockchain information 24/7.

For example, developers can write a Forta Bot to monitor the stability of algorithmic stablecoins in real-time. If they become unstable, a warning will be issued. Another example is developers can write a Forta Bot to monitor whether the ownership of a certain project's contract has changed, which can be used to detect potential attacks in advance.

Core Challenges#

Since most contracts uploaded to the blockchain are not open source, we cannot directly analyze the semantic information of the contract's source code. Instead, we can only analyze the bytecode of the contract. Bytecode lacks a lot of semantic information from the source code, so both academically and industrially, similarity detection of contract bytecode has been rarely explored.

Solution#

We ultimately used function-level similarity detection. In general, the steps of our method are as follows:

Decompiling the contract bytecode into OPcodes and CFGs.
Traversing the CFGs to extract each function of the contract and extracting all instructions in each function in the order of the basic blocks.
Vectorizing all instructions of a function into a dense vector.
Calculating the similarity between contracts based on the similarity between functions.

Decompilation#

As we all know, due to the inevitable loss of a large amount of information during the compilation process, reliable decompilation is impossible. However, it is completely possible and reliable to decompile the contract bytecode into OPcodes. It's like translating binary code into assembly language on x86, just follow the instruction manual line by line. So the first step of our decompilation is to translate the contract bytecode into OPcodes according to the EVM documentation.

After that, with the OPcodes, although it is difficult to directly know what the contract does, we can know the execution order of the code in the contract. For example, when encountering the jump instruction, we can roughly know that there will be a jump here. By analyzing the OPcodes, we can further transform the contract into a Control Flow Graph (CFG).

Control Flow Graph

In the CFG, each block is called a Basicblock, and the official definition is:

In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.

In simple terms, a Basicblock represents a part of the code that can only be entered and run from the starting instruction, and it can only exit at the ending instruction. In other words, once the code in a Basicblock is executed, it will be executed together without jumping to another Basicblock in the middle.

With the CFG, we can find each function contained in the contract and the offsets of each function from dispatcher(), and through static analysis, we can determine which Basicblocks are included in each function. At this point, most of the decompilation is completed.

Instruction Extraction#

Since we are performing similarity calculation between functions, our processing naturally revolves around functions. With the CFG obtained in the previous section, we can traverse all the instructions involved in each function according to the execution order of the CFG. Here, since we are traversing according to the order of the CFG, the resulting instructions naturally have a certain level of semantics. Finally, we convert each instruction into a string, separate different instructions with spaces, and extract the instructions of a contract.

Vectorization#

Since a function may contain thousands or even tens of thousands of instructions, using deep learning-based vectorization methods may be too complex and inefficient. Here, we use doc2vec based on word2vec to perform unsupervised vectorization of the extracted instructions. If possible, readers can also try contrastive learning-based vectorization methods, such as codebert based on contrastive learning.

During training, we used the slither-audited-smart-contracts dataset, which contains approximately 120,000 contract bytecodes. After processing the dataset, we have about 2 million functions available for training. This amount of data should be sufficient for doc2vec, but if readers want to try contrastive learning-based vectorization methods, they can try the recently open-sourced paradigm-data-portal dataset.

Contract Similarity Calculation#

Finally, with the similarity between functions, we define the similarity between contracts $C_1$ and $C_2$ as:

$Sim(C_1, C_2) = \sum_{f_i \in C_1} log \frac{P(f_i, f_2^*)}{P(f_i, \bar{f_2})},$

where $f_i$ represents the $i$-th function of $C_1$, $f_2^$ represents the most similar function in $C_2$ to $f_1$, and $\bar{f_2}$ represents the average similarity between $f_i$ and any function in $C_2$. $P(f_i, f_2^)$ and $P(f_i, \bar{f_2})$ are the probabilities that $f_i$ is similar to $f_2^*$ and $\bar{f_2}$, respectively. $P(\cdot)$ is defined as:

$P(f_i, f_j) = \frac{1}{1 + e^{-k * cos(f_i, f_2)}},$

where $k$ is a hyperparameter, and $cos(\cdot, \cdot)$ is the cosine similarity between two vectors.

Finally, with $Sim(C_1, C_2)$, we calculate the similarity score between $C_1$ and $C_2$ as:

$score = \frac{Sim(C_1, C_2)}{Sim(C_1, C_1)}$

This score can be used to distinguish similar contracts from dissimilar contracts.