Benchmark for BAAI/bge-m3 on an Nvidia A800 / CPU / Mac M1

Comparing CPU and GPU performance on the same server.

Device information

GPU device information

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          Off | 00000000:3D:00.0 Off |                    0 |
| N/A   35C    P0              63W / 400W |  47848MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

CPU information

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
Stepping:            6
CPU MHz:             800.000
CPU max MHz:         2601.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-31, 64-95
NUMA node1 CPU(s):   32-63, 96-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single ssbd mba rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities

Memory information

Type   Total  Used  Free  Shared  Buff/Cache  Available
Mem    2.0T   302G  37G   8.1G    1.6T        1.7T
Swap   0B     0B    0B    -       -           -

Test setup: direct invocation in code

Test code

import time

import sentence_transformers
import torch

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    embedding = sentence_transformers.SentenceTransformer(
        model_name_or_path="/Volumes/SD/huggingface-models/bge-m3",
        cache_folder="/Volumes/SD/huggingface-models",
        device=device
    )
    total = 10000
    batch_size = 100
    start_time = time.time()
    sentences = ["I am AnCopilot, nice to meet you!"]
    for i in range(total // batch_size):
        embedding.encode(sentences * batch_size, normalize_embeddings=True)
        print(f"{i + 1} / {total // batch_size}")
    end_time = time.time()
    total_time = end_time - start_time
    average_time = total_time / total    # seconds per sentence
    throughput = total / total_time      # sentences per second
    print(f"Device: {device}")
    print(f"Total: {total} sentences")
    print(f"Batch size: {batch_size}")
    print(f"Total time: {total_time:.4f} seconds")
    print(f"Average time per sentence: {average_time:.4f} seconds")
    print(f"Throughput: {throughput:.2f} sentences per second")
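The timing loop above can be factored into a reusable helper so the same measurement runs against any encoder. A minimal sketch — the `benchmark` helper name and the stub encoder are illustrative, not part of the original script; with a real model you would pass its bound `encode` method:

```python
import time


def benchmark(encode, total: int, batch_size: int, sentence: str):
    """Time `total` encodings issued in chunks of `batch_size`.

    `encode` is any callable taking a list of strings (e.g. a bound
    SentenceTransformer.encode). Returns (total_time, per_sentence, throughput).
    """
    batch = [sentence] * batch_size
    start = time.time()
    for _ in range(total // batch_size):
        encode(batch)
    total_time = time.time() - start
    return total_time, total_time / total, total / total_time


# Example with a stub encoder (no model needed):
elapsed, per_sentence, throughput = benchmark(
    encode=lambda batch: [[0.0] * 8 for _ in batch],  # stand-in for model.encode
    total=1000,
    batch_size=100,
    sentence="I am AnCopilot, nice to meet you!",
)
print(f"{elapsed:.6f}s total, {per_sentence:.8f}s/sentence, {throughput:.0f} sentences/s")
```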

The benchmark results are compared below:

Device  Batch Size  Total Sentences  Total Time (s)  Avg Time per Sentence (s)  Throughput (sentences/s)
CUDA    100         10000            13.6438         0.0014                     732.93
CUDA    200         10000            12.1587         0.0012                     822.46
CPU     100         10000            77.3202         0.0077                     129.33
CPU     200         10000            72.6335         0.0073                     137.68
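As a sanity check, the derived columns follow from total time and sentence count alone; a quick recomputation (total times copied from the table above):

```python
# (device, batch size) -> measured total time in seconds, from the table
rows = {
    ("CUDA", 100): 13.6438,
    ("CUDA", 200): 12.1587,
    ("CPU", 100): 77.3202,
    ("CPU", 200): 72.6335,
}
total = 10000  # sentences per run

for (device, batch), total_time in rows.items():
    per_sentence = round(total_time / total, 4)   # avg time per sentence (s)
    throughput = round(total / total_time, 2)     # throughput (sentences/s)
    print(device, batch, per_sentence, throughput)
```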

Test setup: FastAPI wrapper

Test code

# Write 100 texts into the request payload
echo '{"inputs":["I am AnCopilot, nice to meet you!","I am AnCopilot, nice to meet you!",...],"model":"bge-m3"}' > data.json
# Send 100 requests (10,000 sentences in total) with concurrency 10
ab -n 100 -c 10 -p data.json -T application/json http://10.1.251.228:58123/v1/embeddings/bulk
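Hand-writing 100 repeated strings into the echo line is error-prone; the same payload file can be generated with a short script (the payload shape and file name mirror the ab invocation above):

```python
import json

sentence = "I am AnCopilot, nice to meet you!"

# 100 copies of the sentence, matching the "inputs" array in the echo line
payload = {"inputs": [sentence] * 100, "model": "bge-m3"}

with open("data.json", "w") as f:
    json.dump(payload, f)
```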

The benchmark results are compared below:

Device  Batch Size  Total Sentences  Concurrency  Total Time (s)  Avg Time per Request (ms)  Throughput (sentences/s)
CPU     100         10000            10           182.555         1825.549                   55
CPU     100         10000            50           timed out       -                          -
CUDA    100         10000            10           23.903          239.031                    418
CUDA    100         10000            50           23.345          233.448                    428
CUDA    100         10000            100          21.885          218.850                    457
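The derived columns for the ab runs follow from the total wall-clock time, the request count (100), and the sentence count (10,000); a quick recomputation (numbers copied from the table above):

```python
requests = 100     # ab -n 100
sentences = 10000  # 100 sentences per request

for device, concurrency, total_time in [
    ("CPU", 10, 182.555),
    ("CUDA", 10, 23.903),
    ("CUDA", 50, 23.345),
    ("CUDA", 100, 21.885),
]:
    avg_ms = total_time / requests * 1000  # mean latency per request (ms)
    throughput = sentences / total_time    # sentences per second
    print(f"{device} c={concurrency}: {avg_ms:.3f} ms/request, "
          f"{throughput:.0f} sentences/s")
```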

Summary

From the benchmark results we can draw the following conclusions:

  1. Device performance

    • The CUDA (GPU) device clearly outperforms the CPU. At both batch sizes (100 and 200), total time and per-sentence time on CUDA are far lower; the CPU takes roughly 5-6x as long overall.
  2. Effect of batch size

    • A larger batch size helps on CUDA: going from 100 to 200, total time drops slightly (from 13.6438 s to 12.1587 s), per-sentence time falls (from 0.0014 s to 0.0012 s), and throughput rises (from 732.93 to 822.46 sentences/s).
    • A larger batch size also helps on the CPU: going from 100 to 200, total time drops (from 77.3202 s to 72.6335 s), per-sentence time falls slightly (from 0.0077 s to 0.0073 s), and throughput rises from 129.33 to 137.68 sentences/s.
  3. FastAPI wrapper

    • Wrapping the model behind FastAPI costs roughly 40% of the throughput on CUDA: ~733 sentences/s with direct calls versus ~418-457 sentences/s through the HTTP endpoint (the gap includes the serialization and network overhead that ab measures).
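The size of the FastAPI overhead can be quantified directly from the two result tables; a quick check (throughputs in sentences per second, copied from the tables above):

```python
direct = 732.93                # direct call, CUDA, batch size 100 (first table)
via_fastapi = [418, 428, 457]  # FastAPI + ab on CUDA, c=10/50/100 (second table)

for t in via_fastapi:
    drop = 1 - t / direct  # fraction of throughput lost behind the HTTP layer
    print(f"{t} sentences/s -> {drop:.0%} drop vs direct calls")
```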