Errors when running Linpack
I am trying to run HPL Linpack on my personal laptop. I am using CentOS 8 in a virtual machine.
Cores allocated: 6
Memory: 12.5 GB
Nodes: 1
With smaller values of N everything runs fine, but when I try to maximize CPU utilization with larger values of N (aiming for 75-80% utilization), I get a different error every time.
ERRORS (each of the errors below appeared in a separate run):
[1617771807.179752] [localhost:3301 :0] sock.c:344 UCX ERROR recv(fd=28) failed: Bad address
[1617771807.188129] [localhost:3298 :0] sock.c:344 UCX ERROR recv(fd=27) failed: Connection reset by peer
[1617771807.249456] [localhost:3298 :0] sock.c:344 UCX ERROR sendv(fd=-1) failed: Bad file descriptor
[localhost:03298] *** An error occurred in MPI_Send
[localhost:03298] *** reported by process [3696427009,2]
[localhost:03298] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
[localhost:03298] *** MPI_ERR_OTHER: known error not in list
[localhost:03298] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[localhost:03298] *** and potentially your MPI job)
_________________________________________________________________________________________
malloc(): corrupted top size
[localhost:06009] *** Process received signal ***
[localhost:06009] Signal: Aborted (6)
[localhost:06009] Signal code: (-6)
[localhost:06009] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f230e65cb20]
[localhost:06009] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f230e2be7ff]
[localhost:06009] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f230e2a8c35]
[localhost:06009] [ 3] /lib64/libc.so.6(+0x7a987)[0x7f230e301987]
[localhost:06009] [ 4] /lib64/libc.so.6(+0x81d8c)[0x7f230e308d8c]
[localhost:06009] [ 5] /lib64/libc.so.6(+0x851f5)[0x7f230e30c1f5]
[localhost:06009] [ 6] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f230e30d412]
[localhost:06009] [ 7] ./xhpl[0x4232e3]
[localhost:06009] [ 8] ./xhpl[0x4202cd]
[localhost:06009] [ 9] ./xhpl[0x41168e]
[localhost:06009] [10] ./xhpl[0x408eff]
[localhost:06009] [11] ./xhpl[0x4018aa]
[localhost:06009] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f230e2aa7b3]
[localhost:06009] [13] ./xhpl[0x401cae]
[localhost:06009] *** End of error message ***
_________________________________________________________________________________________
corrupted size vs. prev_size
[localhost:05847] *** Process received signal ***
[localhost:05847] Signal: Aborted (6)
[localhost:05847] Signal code: (-6)
[localhost:05847] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f07c812eb20]
[localhost:05847] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f07c7d907ff]
[localhost:05847] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f07c7d7ac35]
[localhost:05847] [ 3] /lib64/libc.so.6(+0x7a987)[0x7f07c7dd3987]
[localhost:05847] [ 4] /lib64/libc.so.6(+0x81d8c)[0x7f07c7ddad8c]
[localhost:05847] [ 5] /lib64/libc.so.6(+0x825e6)[0x7f07c7ddb5e6]
[localhost:05847] [ 6] /lib64/libc.so.6(+0x83a1b)[0x7f07c7ddca1b]
[localhost:05847] [ 7] ./xhpl[0x423596]
[localhost:05847] [ 8] ./xhpl[0x4202a6]
[localhost:05847] [ 9] ./xhpl[0x41168e]
[localhost:05847] [10] ./xhpl[0x408eff]
[localhost:05847] [11] ./xhpl[0x4018aa]
[localhost:05847] [12] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f07c7d7c7b3]
[localhost:05847] [13] ./xhpl[0x401cae]
[localhost:05847] *** End of error message ***
I computed N using the formula:
N = int((round(sqrt((memory_per_node * 1024 * 1024 * 1024 * nodes) / 8))) * percent_usage)
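For reference, here is a minimal Python sketch of the formula above. The function names and the `fraction` argument (0.75 for 75%) are my own choices for illustration and are not part of HPL. The second helper only relies on the fact that an N x N matrix of doubles takes 8 * N * N bytes, so it shows how much of the total memory the matrix itself would take for a given N.

```python
import math

BYTES_PER_DOUBLE = 8  # HPL stores the matrix in double precision


def hpl_problem_size(memory_per_node_gb, nodes, fraction):
    """Apply the formula from the post; `fraction` is e.g. 0.75 for 75%."""
    total_bytes = memory_per_node_gb * 1024**3 * nodes
    n_max = round(math.sqrt(total_bytes / BYTES_PER_DOUBLE))
    return int(n_max * fraction)


def matrix_memory_fraction(n, memory_per_node_gb, nodes):
    """Share of total memory taken by an N x N matrix of doubles."""
    total_bytes = memory_per_node_gb * 1024**3 * nodes
    return n * n * BYTES_PER_DOUBLE / total_bytes


if __name__ == "__main__":
    # Values from my setup: 12.5 GB, 1 node, 75% target
    n = hpl_problem_size(12.5, 1, 0.75)
    print(n, matrix_memory_fraction(n, 12.5, 1))
```

Because the formula scales N by the percentage rather than scaling the memory, the matrix ends up taking the square of that fraction of memory (about 56% for a 75% target). That is just a property of the formula and I am not claiming it explains the errors above.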