Benchmark for BAAI/bge-m3 on an Nvidia A800 / CPU / Mac M1

Comparing CPU and GPU performance on the same server.

Device information

GPU device information

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          Off | 00000000:3D:00.0 Off |                    0 |
| N/A   35C    P0              63W / 400W |  47848MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

CPU information

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
Stepping:            6
CPU MHz:             800.000
CPU max MHz:         2601.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-31, 64-95
NUMA node1 CPU(s):   32-63, 96-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single ssbd mba rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities

Memory information

Type   Total  Used  Free  Shared  Buff/Cache  Available
Mem    2.0T   302G  37G   8.1G    1.6T        1.7T
Swap   0B     0B    0B    -       -           -

Test setup: direct invocation in code

Test code

import time

import sentence_transformers
import torch

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    embedding = sentence_transformers.SentenceTransformer(
        model_name_or_path="/Volumes/SD/huggingface-models/bge-m3",
        cache_folder="/Volumes/SD/huggingface-models",
        device=device
    )
    total = 10000
    batch_size = 100
    start_time = time.time()
    sentences = ["I am AnCopilot, nice to meet you!"]
    for i in range(total // batch_size):
        embedding.encode(sentences * batch_size, normalize_embeddings=True)
        print(f"{i + 1} / {total // batch_size}")
    end_time = time.time()
    total_time = end_time - start_time
    average_time = total_time / total    # seconds per sentence
    throughput = total / total_time      # sentences per second
    print(f"Device: {device}")
    print(f"Total: {total} sentences")
    print(f"Batch size: {batch_size}")
    print(f"Total time: {total_time:.4f} seconds")
    print(f"Average time per sentence: {average_time:.4f} seconds")
    print(f"Throughput: {throughput:.2f} sentences per second")
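The timing loop above can be factored into a reusable helper so the same measurement runs against any encoder. A minimal sketch — the `benchmark` helper name and the stub encoder are illustrative, not part of the original script; with a real model you would pass its bound `encode` method:

```python
import time


def benchmark(encode, total: int, batch_size: int, sentence: str):
    """Time `total` encodings issued in chunks of `batch_size`.

    `encode` is any callable taking a list of strings (e.g. a bound
    SentenceTransformer.encode). Returns (total_time, per_sentence, throughput).
    """
    batch = [sentence] * batch_size
    start = time.time()
    for _ in range(total // batch_size):
        encode(batch)
    total_time = time.time() - start
    return total_time, total_time / total, total / total_time


# Example with a stub encoder (no model needed):
elapsed, per_sentence, throughput = benchmark(
    encode=lambda batch: [[0.0] * 8 for _ in batch],  # stand-in for model.encode
    total=1000,
    batch_size=100,
    sentence="I am AnCopilot, nice to meet you!",
)
print(f"{elapsed:.6f}s total, {per_sentence:.8f}s/sentence, {throughput:.0f} sentences/s")
```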

The benchmark results are compared below:

Device  Batch Size  Total Sentences  Total Time (s)  Avg Time per Sentence (s)  Throughput (sentences/s)
CUDA    100         10000            13.6438         0.0014                     732.93
CUDA    200         10000            12.1587         0.0012                     822.46
CPU     100         10000            77.3202         0.0077                     129.33
CPU     200         10000            72.6335         0.0073                     137.68
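As a sanity check, the derived columns follow from total time and sentence count alone; a quick recomputation (total times copied from the table above):

```python
# (device, batch size) -> measured total time in seconds, from the table
rows = {
    ("CUDA", 100): 13.6438,
    ("CUDA", 200): 12.1587,
    ("CPU", 100): 77.3202,
    ("CPU", 200): 72.6335,
}
total = 10000  # sentences per run

for (device, batch), total_time in rows.items():
    per_sentence = round(total_time / total, 4)   # avg time per sentence (s)
    throughput = round(total / total_time, 2)     # throughput (sentences/s)
    print(device, batch, per_sentence, throughput)
```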

Test setup: FastAPI wrapper

Test code

# Write 100 texts into the request payload
echo '{"inputs":["I am AnCopilot, nice to meet you!","I am AnCopilot, nice to meet you!",...],"model":"bge-m3"}' > data.json
# Send 100 requests (10,000 sentences in total) with concurrency 10
ab -n 100 -c 10 -p data.json -T application/json http://10.1.251.228:58123/v1/embeddings/bulk
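Hand-writing 100 repeated strings into the echo line is error-prone; the same payload file can be generated with a short script (the payload shape and file name mirror the ab invocation above):

```python
import json

sentence = "I am AnCopilot, nice to meet you!"

# 100 copies of the sentence, matching the "inputs" array in the echo line
payload = {"inputs": [sentence] * 100, "model": "bge-m3"}

with open("data.json", "w") as f:
    json.dump(payload, f)
```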

The benchmark results are compared below:

Device  Batch Size  Total Sentences  Concurrency  Total Time (s)  Avg Time per Request (ms)  Throughput (sentences/s)
CPU     100         10000            10           182.555         1825.549                   55
CPU     100         10000            50           timed out       -                          -
CUDA    100         10000            10           23.903          239.031                    418
CUDA    100         10000            50           23.345          233.448                    428
CUDA    100         10000            100          21.885          218.850                    457
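The derived columns for the ab runs follow from the total wall-clock time, the request count (100), and the sentence count (10,000); a quick recomputation (numbers copied from the table above):

```python
requests = 100     # ab -n 100
sentences = 10000  # 100 sentences per request

for device, concurrency, total_time in [
    ("CPU", 10, 182.555),
    ("CUDA", 10, 23.903),
    ("CUDA", 50, 23.345),
    ("CUDA", 100, 21.885),
]:
    avg_ms = total_time / requests * 1000  # mean latency per request (ms)
    throughput = sentences / total_time    # sentences per second
    print(f"{device} c={concurrency}: {avg_ms:.3f} ms/request, "
          f"{throughput:.0f} sentences/s")
```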

Summary

From the benchmark results we can draw the following conclusions:

  1. Device performance

    • The CUDA (GPU) device clearly outperforms the CPU. At both batch sizes (100 and 200), total time and per-sentence time on CUDA are far lower; the CPU takes roughly 5-6x as long overall.
  2. Effect of batch size

    • A larger batch size helps on CUDA: going from 100 to 200, total time drops slightly (from 13.6438 s to 12.1587 s), per-sentence time falls (from 0.0014 s to 0.0012 s), and throughput rises (from 732.93 to 822.46 sentences/s).
    • A larger batch size also helps on the CPU: going from 100 to 200, total time drops (from 77.3202 s to 72.6335 s), per-sentence time falls slightly (from 0.0077 s to 0.0073 s), and throughput rises from 129.33 to 137.68 sentences/s.
  3. FastAPI wrapper

    • Wrapping the model behind FastAPI costs roughly 40% of the throughput on CUDA: ~733 sentences/s with direct calls versus ~418-457 sentences/s through the HTTP endpoint (the gap includes the serialization and network overhead that ab measures).
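The size of the FastAPI overhead can be quantified directly from the two result tables; a quick check (throughputs in sentences per second, copied from the tables above):

```python
direct = 732.93                # direct call, CUDA, batch size 100 (first table)
via_fastapi = [418, 428, 457]  # FastAPI + ab on CUDA, c=10/50/100 (second table)

for t in via_fastapi:
    drop = 1 - t / direct  # fraction of throughput lost behind the HTTP layer
    print(f"{t} sentences/s -> {drop:.0%} drop vs direct calls")
```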