[A100/MIG] 동일 PID가 여러 GPU에 표시될 때 해결 가이드

A100 80GB 서버에서 하나의 PID가 여러 GPU에 중복 표기되는 현상을 원인 분석(MIG 비활성화, 자원 격리 부재)과 MIG 구성 절차로 해결한 사례 정리.

Posted Jul 18, 2025 Updated Sep 11, 2025

By eun2ce

13 min read

a100 80gb gpu 4장이 장착된 서버에서 모델을 서빙하던 중 하나의 python 프로세스가 모든 gpu에 중복 할당되는 문제가 발생했다. 해결하는 방법을 정리합니다.

문제 현상

동일한 pid가 gpu 0~3번에 표시되고 있고, 각 gpu의 vram이 분산되어 점유되고 있는 상태

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           78914      C   python                                 3752MiB |
|    0   N/A  N/A           81007      C   python                                 3780MiB |
|    1   N/A  N/A           78914      C   python                                 5082MiB |
|    1   N/A  N/A           81007      C   python                                 5082MiB |
|    2   N/A  N/A           78914      C   python                                 4992MiB |
|    2   N/A  N/A           81007      C   python                                 5066MiB |
|    3   N/A  N/A           78914      C   python                                 4234MiB |
|    3   N/A  N/A           81007      C   python                                 4236MiB |
+-----------------------------------------------------------------------------------------+

원인 분석

MIG 비활성화 상태

  
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   32C    P0             81W /  400W |   32759MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |

MIG M. 항목이 disabled

코드에 PyTorch GPU 메모리나 자원 관리에 대한 내용이 없음

대응 방안

대응방안은 두가지 정도로 보였음

torch.cuda.set_per_process_memory_fraction

pytorch에서 set_per_process_memory_fraction를 사용하면 gpu메모리 할당을 제한 할 수 있음 예를 들어 한 프로세스당 GPU 메모리를 50%만 사용하도록 설정할 수 있다.

  
torch.cuda.set_per_process_memory_fraction(0.5, device=0)

그러나 아래와 같은 문제점이 있음

메모리 사용량만 제한, SM(streaming multiprocessor)이나 연산 유닛은 분리하지 못함
두 개 이상 프로세스가 동일 GPU에서 연산을 병행할 경우, 연산 충돌이나 스케줄링 간섭이 발생할 수 있음
추론 지연, batch단위 처리 실패, 기타 타임아웃 이슈 발생 가능

따라서 적절하지 않음

MIG(Multi-Instance GPU)

MIG는 a100 GPU를 물리적으로 분할하여, 각 인스턴스가 메모리 뿐만 아니라 연산자원(SMs)까지 독립적으로 할당받을 수 있도록 해준다. MIG를 활성화하면 각 인스턴스는 고유한 MIG UUID를 가지며, 물리적으로 완전히 분리된 논리 GPU처럼 동작한다.

  
# MIG 모드 활성화 예시
sudo nvidia-smi -i 0 -mig 1

# 인스턴스 구성 예시 (3g.40gb * 2개)
sudo nvidia-mig-config -i 0 -C "3g.40gb,3g.40gb"

이후 각 프로세스나 컨테이너에 다음과 같이 UUID를 지정하여 할당

  
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-GPU-xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx"

장점
- 메모리, 연산 자원(SMs), 캐시 등이 완전 격리
- 서로 다른 프로세스 간 자원 충돌 발생
- nivdia-smi에서도 인스턴스 단위로 모니터링 가능

해결

위와같은 이유로 mig모드를 활성화해 사용하기로 결정

MIG 활성화 상태 확인

  
(base) ubuntu@a100-80g-4:~/works$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxxxxxxxxxxxxxxxxd7)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-c9f6xxxexxxxxx356xc5e)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-5266xxxxxxxxx1439d32e)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-93bc108axxx614b220524)

GPU MIG 활성화 및 분할

GPU 0, 1 만 분할 하도록 한다.

기존 서비스 종료

  
(base) ubuntu@a100-80g-4:~/works$ ps -ef | grep "python"
root        1202       1  0 Jul16 ?        00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
ubuntu    282878       1  1 Jul17 ?        00:19:12 python xxx.py
ubuntu    283741       1  2 Jul17 ?        00:25:31 python xxx1.py

프로세스 kill 후 정상종료 확인

  
(base) ubuntu@a100-80g-4:~/works$ kill 282878 283741
(base) ubuntu@a100-80g-4:~/works$ ps -ef | grep "python"
root        1202       1  0 Jul16 ?        00:00:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
ubuntu    490212  478855  0 10:30 pts/2    00:00:00 grep --color=auto python

사용중인 프로세스가 없는지 확인

  
(base) ubuntu@a100-80g-4:~/works$ nvidia-smi
Fri Jul 18 10:31:13 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   32C    P0             81W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:00:07.0 Off |                    0 |
| N/A   31C    P0             78W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off |   00000000:00:08.0 Off |                    0 |
| N/A   30C    P0             76W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off |   00000000:00:09.0 Off |                    0 |
| N/A   30C    P0             64W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(base) ubuntu@a100-80g-4:~/works$

MIG 모드 활성화

  
(base) ubuntu@a100-80g-4:~/works$ sudo nvidia-smi -i 0 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:00:06.0:In use by another client
00000000:00:06.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
All done.
(base) ubuntu@a100-80g-4:~/works$ sudo nvidia-smi -i 1 -mig 1
Warning: MIG mode is in pending enable state for GPU 00000000:00:07.0:In use by another client
00000000:00:07.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using the device and retry the command or reboot the system to make MIG mode effective.
All done.

# 이후 재시작
(base) ubuntu@a100-80g-4:~/works$ reboot

// MIG 정상 활성화되었는지 확인
nvidia-smi -L

persistence mode가 비활성화 되어있다는 내용

Persistence Mode가 비활성화되어 있어 MIG 인스턴스 생성 시 자동으로 유지되지 않을 수 있다는 경고만 존재합니다 (중요하지는 않지만 성능 및 안정성상 활성화 권장)

  
(base) ubuntu@a100-80g-4:~$ sudo nvidia-smi -i 0 -mig 1
Enabled MIG Mode for GPU 00000000:00:06.0

Warning: persistence mode is disabled on device 00000000:00:06.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
(base) ubuntu@a100-80g-4:~$ sudo nvidia-smi -i 1 -mig 1
Enabled MIG Mode for GPU 00000000:00:07.0

Warning: persistence mode is disabled on device 00000000:00:07.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.

persistence mode 활성화 (선택)

  
(base) ubuntu@a100-80g-4:~$ sudo nvidia-smi -i 0 -pm 1
Enabled Legacy persistence mode for GPU 00000000:00:06.0.
All done.
(base) ubuntu@a100-80g-4:~$ sudo nvidia-smi -i 1 -pm 1
Enabled Legacy persistence mode for GPU 00000000:00:07.0.
All done.
(base) ubuntu@

MIG 인스턴스 분할

-cgi는 Compute GPU Instance Profile, 3g.40gb는 MIG에서 GPU 3개 + 40GB 메모리를 사용하는 인스턴스를 의미 메모리와 연산 슬라이스가 함께 남아야 유효하기때문에 1g.10gb 인스턴스 하나를 더 만들고 싶어도 남은 메모리가 0GB라서 사용 불가

  
(base) ubuntu@a100-80g-4:~$ sudo nvidia-smi mig -i 0 -cgi 3g.40gb -C
Successfully created GPU instance ID  2 on GPU  0 using profile MIG 3g.40gb (ID  9)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 3g.40gb (ID  2)
(base) ubuntu@a100-80g-4:~$ sudo nvidia-smi mig -i 0 -cgi 3g.40gb -C
Successfully created GPU instance ID  1 on GPU  0 using profile MIG 3g.40gb (ID  9)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  1 using profile MIG 3g.40gb (ID  2)

(base) ubuntu@a100-80g-4:~$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-1e46xxx...avcx6f8447)
  MIG 3g.40gb     Device  0: (UUID: MIG-9558a1x...xxx5cv337a)
  MIG 3g.40gb     Device  1: (UUID: MIG-ee15a23...7x25b663c6)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-c9f61e9...37ae356c5e)
  MIG 3g.40gb     Device  0: (UUID: MIG-dd2a82fe...b9e0ead89)
  MIG 3g.40gb     Device  1: (UUID: MIG-32384341...2a09c3b4d)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-52664791...c1439d32e)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-93bc108a...14b220524)

nvidia-smi 에서도 확인 가능

  
(base) ubuntu@a100-80g-4:~/works/qwen_ocr$ nvidia-smi
Fri Jul 18 11:10:59 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                   On |
| N/A   30C    P0             55W /  400W |     213MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:07.0 Off |                   On |
| N/A   29C    P0             52W /  400W |     213MiB /  81920MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off |   00000000:00:08.0 Off |                    0 |
| N/A   28C    P0             55W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off |   00000000:00:09.0 Off |                    0 |
| N/A   29C    P0             57W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    1   0   0  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    2   0   1  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    1   0   0  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  1    2   0   1  |             107MiB / 40192MiB    | 42      0 |  3   0    2    0    0 |
|                  |                 0MiB / 65535MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(base) ubuntu@a100-80g-4:~/works/qwen_ocr$

분할하여 사용

bash

$ CUDA_VISIBLE_DEVICES=MIG-<UUID> python your_script.py

python

python의 경우 torch, tensorflow 등 GPU 관련 모듈 임포트 전에 호출해야함

  
import os

# 사용할 MIG 인스턴스 UUID 지정
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-9558axxxx5337a"

import torch

# 이후부터는 해당 MIG 인스턴스만 인식됨
print("사용 중인 디바이스:", torch.cuda.get_device_name(0))

docker or docker-compose

  
services:
  model-a:
    image: your_image
    environment:
      CUDA_VISIBLE_DEVICES: MIG-9558a1d2-1f99-xxxxx-a84a-e2c326d5337a
    runtime: nvidia

  model-b:
    image: your_image
    environment:
      CUDA_VISIBLE_DEVICES: MIG-ee15a714-cb85-xxx-be77-5d5725b663c6
    runtime: nvidia

infrastructure, gpu

This post is licensed under CC BY 4.0 by the author.