TRAPDROID is the implementation of the Bare-metal Android Malware Behavior Analysis Framework research. It runs unknown #android apps on real phones, watches what they do at the kernel level, and decides whether they are malicious.

The core idea is simple. Modern #malware can tell when it is running inside an emulator or under a debugger, so it stays quiet and looks harmless. Running the sample on bare-metal hardware removes that hiding place. The framework then collects low-level signals, turns them into a clean behavior profile, and feeds that to a classifier. The whole pipeline is built around #machine-learning from real runtime behavior rather than static guesswork.

The problem

Static analysis reads the app without running it. It is fast but easy to fool with obfuscation, encryption, and code that gets downloaded at runtime. Dynamic analysis runs the app and watches its behavior, which is harder to fake, but most dynamic tools run inside emulators or rely on strace, which is built on ptrace. Both are detectable. A sample that sees ptrace attached to it can just refuse to misbehave.

TRAPDROID avoids these tells. It uses a kernel module instead of ptrace, and it runs on a physical device instead of an emulator. In a small test with evasive samples from the Ztorg, OBAD, and Hehe families, four of five quit immediately inside the emulator. All five ran fully on the real device.

How it works

The system splits into two parts. A kernel driver on the phone intercepts activity and ships it out, and a server turns that raw stream into a behavior profile and runs the detection model.

The driver

The driver is a loadable kernel module, an #LKM. It hooks a small set of system calls using kretprobes, so it never touches the system call table and never uses ptrace. It watches one app at a time by checking the UID that Android assigns to each install.

It collects four kinds of signal:

• System calls. A compact set of 15, grouped by intent: file (open, close, rename, unlink, mkdir, access, getdents), process (fork, execve, clone, ptrace), network (socket, connect), and privilege (chown, chmod).
• Binder traffic. It filters ioctl calls on the binder driver to catch every IPC message, which is how Android apps talk to system services.
• PMU events. Hardware performance counters like cache access, branch misprediction, and instruction counts, read straight from the perf subsystem.
• Process stats. Accounting fields from task_struct such as min_flt, stime, utime, read_char, and write_bytes.
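The syscall grouping above can be written down as a small lookup table. This is a Python sketch for the analysis side, not the driver's actual C data structure; the names SYSCALL_GROUPS, GROUP_OF, and intent_of are made up for illustration.

```python
# Syscall groups exactly as listed in the text; the dict layout is illustrative.
SYSCALL_GROUPS = {
    "file": ["open", "close", "rename", "unlink", "mkdir", "access", "getdents"],
    "process": ["fork", "execve", "clone", "ptrace"],
    "network": ["socket", "connect"],
    "privilege": ["chown", "chmod"],
}

# Reverse index: syscall name -> intent group.
GROUP_OF = {
    name: group
    for group, names in SYSCALL_GROUPS.items()
    for name in names
}


def intent_of(syscall: str) -> str:
    """Return the intent group of a monitored syscall, or 'unmonitored'."""
    return GROUP_OF.get(syscall, "unmonitored")
```

Keeping the reverse index next to the grouped form makes it easy to check that the set really has 15 entries and that no syscall sits in two groups.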

Events go into a circular queue, and a separate kernel worker sends them to the server over a custom UDP protocol using netpoll.
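To make the queue-then-ship idea concrete, here is a server-side Python sketch. Everything about the wire format is an assumption: the real protocol is custom and its layout is not given here, so the fixed 18-byte record (uid, pid, timestamp, syscall number) is purely hypothetical. The bounded deque stands in for the circular queue and overwrites the oldest entry when full, which is also an assumption.

```python
import struct
from collections import deque
from typing import List, NamedTuple

# HYPOTHETICAL record layout, little-endian, no padding:
# uid (u32), pid (u32), timestamp in ns (u64), syscall number (u16).
RECORD = struct.Struct("<IIQH")


class Event(NamedTuple):
    uid: int
    pid: int
    ts_ns: int
    syscall_nr: int


def encode(event: Event) -> bytes:
    """Pack one event into the hypothetical fixed-size record."""
    return RECORD.pack(*event)


def decode_datagram(payload: bytes) -> List[Event]:
    """Split one datagram into records, ignoring a trailing partial record."""
    usable = len(payload) - len(payload) % RECORD.size
    return [
        Event(*RECORD.unpack_from(payload, offset))
        for offset in range(0, usable, RECORD.size)
    ]


def make_queue(capacity: int) -> deque:
    """Bounded FIFO: once full, appending drops the oldest entry (assumed policy)."""
    return deque(maxlen=capacity)
```

On a real receiver the payload would come from `socket.recvfrom` on a UDP socket, and events whose uid does not match the app under analysis could be dropped right after decoding.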

The box

The server side runs on #python with #flask, #mongodb, and #elasticsearch. The hardware is TBOX, a #raspberry-pi running #arch-linux ARM. TBOX gives the phone a protected WiFi network, charges it and talks to it over USB 3.0 and ADB, captures all network packets with tcpdump, and parses them with #scapy. It also serves the dashboard where an analyst starts a run and reads the report.

The test device was a Sony Xperia XZ 2 on Android 10. Targeting a recent Android version mattered, because a lot of new malware only runs there.

Automatic state restoration

Every run starts from a clean phone. The framework keeps pristine copies of the system and userdata partitions as sparse images. The app under analysis can write to userdata, but system stays protected by #selinux policy. After each run the device boots into fastboot, wipes the cache, and reflashes the original partitions. The full cycle of restore plus stimulation takes about four minutes.
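The restore cycle can be sketched as a fixed sequence of adb and fastboot commands. This is a minimal sketch, not the framework's actual script: the image file names are placeholders, and the exact partition names and slot handling depend on the device.

```python
import subprocess
from typing import List


def restore_commands(system_img: str, userdata_img: str) -> List[List[str]]:
    """Build the command sequence for one restore cycle.

    Image paths are placeholders supplied by the caller.
    """
    return [
        ["adb", "reboot", "bootloader"],                    # boot into fastboot mode
        ["fastboot", "erase", "cache"],                     # wipe the cache
        ["fastboot", "flash", "system", system_img],        # reflash pristine system
        ["fastboot", "flash", "userdata", userdata_img],    # reflash pristine userdata
        ["fastboot", "reboot"],                             # boot back into Android
    ]


def run_restore(system_img: str, userdata_img: str) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in restore_commands(system_img, userdata_img):
        subprocess.run(cmd, check=True)
```

Separating the command list from its execution keeps the sequence easy to inspect and test without a phone attached.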

Stimulation

Malware often sits idle until something happens, like an incoming SMS or a screen unlock. Stimulation is the act of poking the app to draw out hidden behavior. TRAPDROID uses a modified DroidBot that broadcasts intents, automatically enables accessibility and notification services, and drives the UI. When the automatic engine is not enough, a human analyst steps in, though that was rarely needed. On average, an expert analyst surfaced about 22.5% more distinct actions than the automated monkey tool.

UPL and UBF

The raw logs are messy. Two pieces turn them into something a model can learn from.

UPL, the UBF Processing Language, is a small DSL for describing how to read the logs. It is not Turing complete, but it has a grammar and it can embed #python, so you get pattern matching and ML libraries for free. A UPL script is built from blocks: behavior, annotation, context, auxiliary, and network.

UBF, the Unified Behavior Format, is the output, a tidy profile of what the app did. The Analysis Engine combines events and system wide measurements into a UBF stream. Formally, given observable events

$$E = \{e_1, e_2, \dots, e_n\}$$

and system-wide signals like cache use and memory patterns

$$S = \{s_1, s_2, \dots, s_n\}$$

the engine maps both into a set of UBF events $\{u_1, u_2, \dots, u_n\}$.

UBF does a few smart things. It folds similar calls into one behavior, so open and openat both become file_open. It tracks file descriptors across calls, so closing a socket and closing a file look different. It reads call arguments and return values, so O_RDONLY becomes a read intent and an ENOENT error becomes a no_such_file_or_directory annotation. These small details are what bridge the gap between a raw syscall and an actual behavior.
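The three normalizations described here (folding, descriptor tracking, and reading arguments and return values) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Analysis Engine itself: the raw record shape, the names socket_create, file_close, and socket_close, and the "intent" field are hypothetical. Only file_open and no_such_file_or_directory come from the text. Flag and errno values are the Linux ones.

```python
from typing import Dict, List

# Illustrative tables. Values are Linux constants.
FOLD = {"open": "file_open", "openat": "file_open", "socket": "socket_create"}
ACCESS_INTENT = {0: "read", 1: "write", 2: "read_write"}  # O_RDONLY, O_WRONLY, O_RDWR
O_ACCMODE = 0o3
ERRNO_ANNOTATION = {2: "no_such_file_or_directory"}  # ENOENT


def to_ubf(raw_events: List[dict]) -> List[dict]:
    """Fold raw syscall records into UBF-like events, tracking descriptors."""
    fd_kind: Dict[int, str] = {}  # open descriptor -> "file" or "socket"
    out = []
    for ev in raw_events:
        name, ret = ev["call"], ev["ret"]
        ubf = {"behavior": FOLD.get(name, name), "intent": None, "annotations": []}
        if ret < 0:
            # Kernel convention: a negative return value is -errno.
            ubf["annotations"].append(ERRNO_ANNOTATION.get(-ret, f"errno_{-ret}"))
        if name in ("open", "openat"):
            ubf["intent"] = ACCESS_INTENT.get(ev["flags"] & O_ACCMODE)
            if ret >= 0:
                fd_kind[ret] = "file"
        elif name == "socket" and ret >= 0:
            fd_kind[ret] = "socket"
        elif name == "close":
            # The same syscall becomes a different behavior per descriptor kind.
            kind = fd_kind.pop(ev["fd"], "unknown")
            ubf["behavior"] = f"{kind}_close"
        out.append(ubf)
    return out
```

For example, an `openat` with flags 0 that returns 3 becomes a file_open with a read intent, and a later `close` on descriptor 3 becomes file_close rather than socket_close.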

A short UPL example, mapping a behavior to the permission it should depend on:

text

behavior api_sms {
    where   := args.call == "com.android.internal.telephony.ISms"
    meta    := "SMS: {args.code}"
    depends := android.permission.RECEIVE_SMS, android.permission.SEND_SMS
}
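For readers more at home in Python than in UPL, the rule above can be mirrored roughly as follows. This is a hedged sketch of what such a rule expresses, not how the UPL interpreter is implemented. The BehaviorRule class, its apply method, and the sample transaction code 5 are hypothetical.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class BehaviorRule:
    name: str
    where: Callable[[SimpleNamespace], bool]  # predicate over the event's args
    meta: str                                 # template filled from args
    depends: Tuple[str, ...]                  # permissions the behavior depends on

    def apply(self, args: SimpleNamespace) -> Optional[dict]:
        """Return a behavior record if the predicate matches, else None."""
        if not self.where(args):
            return None
        return {
            "behavior": self.name,
            "meta": self.meta.format(args=args),
            "depends": self.depends,
        }


# Python counterpart of the api_sms block.
API_SMS = BehaviorRule(
    name="api_sms",
    where=lambda a: a.call == "com.android.internal.telephony.ISms",
    meta="SMS: {args.code}",
    depends=(
        "android.permission.RECEIVE_SMS",
        "android.permission.SEND_SMS",
    ),
)

# Usage: a Binder event whose interface is ISms, with a made-up transaction code.
hit = API_SMS.apply(
    SimpleNamespace(call="com.android.internal.telephony.ISms", code=5)
)
```

`str.format` supports attribute access in replacement fields, so the `{args.code}` template can be reused unchanged.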

Behavior coverage

Behavior coverage, written $\lambda$, measures how much of an app's expected behavior the stimulation actually triggered. It maps the permissions declared in the manifest to the behaviors seen at runtime, using PScout for the permission to API mapping:

$$\lambda = \frac{\text{behaviors observed that link to a declared permission}}{\text{number of declared permissions}}$$

If an app declares the camera permission, the engine expects to see a camera API call. Low coverage means the app stayed quiet and the sample is noisy. Testing showed $\lambda = 0.2$ is a reasonable floor: enough signal to learn from without throwing away too much of the dataset.
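A small worked example helps pin down the ratio: observed behaviors that link to a declared permission, divided by the number of declared permissions. The numerator can be read in more than one way. This sketch assumes each declared permission counts at most once, so $\lambda$ stays between 0 and 1. The `depends` mapping is a stand-in for the PScout-derived table, and the function name is hypothetical.

```python
from typing import Dict, Iterable, Set


def behavior_coverage(
    declared: Set[str],
    observed: Iterable[str],
    depends: Dict[str, Set[str]],
) -> float:
    """Fraction of declared permissions with at least one linked behavior seen."""
    if not declared:
        return 0.0  # zero-permission apps need separate handling
    exercised: Set[str] = set()
    for behavior in observed:
        # Count only permissions the app actually declared.
        exercised |= depends.get(behavior, set()) & declared
    return len(exercised) / len(declared)


# Stand-in mapping from behaviors to the permissions they depend on.
DEPENDS = {
    "api_sms": {"android.permission.RECEIVE_SMS", "android.permission.SEND_SMS"},
    "api_camera": {"android.permission.CAMERA"},
}

DECLARED = {
    "android.permission.CAMERA",
    "android.permission.SEND_SMS",
    "android.permission.RECEIVE_SMS",
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
}

coverage = behavior_coverage(DECLARED, ["api_sms"], DEPENDS)  # 2 of 5 permissions
keep_sample = coverage >= 0.2
```

With five declared permissions and only the SMS behavior observed, two permissions are exercised, so the sample clears the 0.2 floor.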

Detection

UBF is tokenized into text, and detection is then treated as a plain text classification problem, with no hand-crafted feature engineering. Three encodings were tested:

• UBF-P uses the behavior name only.
• UBF-R adds the return value.
• UBF-A adds well defined annotations.
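The difference between the three encodings is easiest to see on a single event. The sketch below is an assumption-heavy illustration: the token separator, the event dictionary shape, and the choice that UBF-R and UBF-A each extend the bare behavior name (rather than UBF-A stacking on top of UBF-R) are all guesses, not the documented format.

```python
from typing import List


def tokenize(events: List[dict], encoding: str) -> List[str]:
    """Turn UBF-like events into one token per event.

    UBF-P: behavior name only.
    UBF-R: behavior name plus return value.
    UBF-A: behavior name plus annotations (assumed not to include the raw return).
    """
    if encoding not in ("UBF-P", "UBF-R", "UBF-A"):
        raise ValueError(f"unknown encoding: {encoding}")
    tokens = []
    for ev in events:
        parts = [ev["behavior"]]
        if encoding == "UBF-R":
            parts.append(str(ev["ret"]))
        elif encoding == "UBF-A":
            parts.extend(ev.get("annotations", []))
        tokens.append(":".join(parts))
    return tokens


sample = [
    {"behavior": "file_open", "ret": -2, "annotations": ["no_such_file_or_directory"]}
]
```

Here `tokenize(sample, "UBF-P")` yields a bare `file_open` token, while the other two encodings attach the return value or the annotation to it.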

Classic ML with TF-IDF

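The tokens are turned into vectors with TF-IDF over n-grams ($n=3$ worked best). For a UBF event $u_i$ in app $a$:

$$\text{TF}_i = \frac{f_i}{n}, \qquad \text{IDF}_i = \log\left(\frac{N}{df_i}\right), \qquad \text{TF-IDF}_i = \text{TF}_i \times \text{IDF}_i$$

Read in the usual TF-IDF sense, $f_i$ is the number of times $u_i$ occurs in $a$, $n$ is the total number of events in $a$ (not the n-gram size), $N$ is the number of apps, and $df_i$ is the number of apps that contain $u_i$.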
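A minimal pure-Python sketch of TF-IDF over token n-grams, using the common definition: term frequency is the n-gram's count divided by the number of n-grams in the app, and inverse document frequency is the log of the number of apps over the number of apps containing that n-gram. The function names are hypothetical, and a production pipeline would more likely use a library vectorizer with its own smoothing and normalization, which this sketch omits.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(tokens: List[str], n: int = 3) -> List[Tuple[str, ...]]:
    """Sliding window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(apps: List[List[str]], n: int = 3) -> List[Dict[Tuple[str, ...], float]]:
    """Return one {n-gram: weight} dict per app, with no smoothing."""
    grams = [ngrams(tokens, n) for tokens in apps]
    # Document frequency: in how many apps does each n-gram appear?
    df = Counter(g for doc in grams for g in set(doc))
    total_apps = len(apps)
    weights = []
    for doc in grams:
        counts = Counter(doc)
        weights.append({
            g: (c / len(doc)) * math.log(total_apps / df[g])
            for g, c in counts.items()
        })
    return weights


apps = [
    ["file_open", "socket_create", "connect", "file_close"],
    ["file_open", "socket_create", "connect", "execve"],
]
w = tf_idf(apps, n=3)
```

The trigram shared by both apps gets weight 0 because its IDF is log(2/2), while the trigram unique to one app gets 0.5 × log 2. Sequences that every app produces therefore carry no signal.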

#svm, #random-forest, and #gradient-boosting were tested. Gradient Boosting won. After tuning and feature selection on a balanced set of 6000 samples ($\lambda = 0.2$, MWR = 0.5):

Encoding	Classifier	F-1	Accuracy
UBF-P	GB	97.75%	97.75%
UBF-R	GB	97.92%	97.92%
UBF-A	GB	98.08%	98.08%

Deep learning with CNN

For the #deep-learning side, a text classification #CNN was adapted to the malware domain. The single embedding version reached its best with a [50, 100, 50] dense stack, an embedding dimension of 200, and a sequence length of 4500, hitting 98.31% F-1. A second version with separate embedding layers for the behavior and annotation streams pushed accuracy to 98.41% on the balanced 6000 sample set.

Dataset

Older datasets like Drebin only show malware up to their collection date, so the dataset was rebuilt: 3638 benign and 3701 malicious samples drawn from Drebin, Maldroid 2020, Google Play, F-Droid, and private feeds from security researchers. It also includes 0-day and 1-day samples that surfaced during the research. Samples were filtered on VirusTotal detections, behavior coverage, permission balance, and category, and about 3.92% were dropped for being dormant or broken.

What it showed

• Low level features really do carry enough signal to detect malware, which closes part of the semantic gap between a syscall and a real action.
• A bare-metal device beats an emulator for evasive samples, plainly.
• PMU and hardware counters are a workable, novel feature source for this problem.
• It even handled edge cases like Zpware, a zero permission sample that does damage without declaring anything in its manifest.
