cv::parallel_for_ gives little improvement
I am testing the cv::ParallelLoopBody class for my image-processing code.
I started by implementing a normalization step, where every pixel has to be divided by a specific value for each channel. This is simple code that should parallelize well.
However, when I test it I see no difference in run time.
Am I doing something wrong here?
This is my class:
    class Parallel_process : public cv::ParallelLoopBody
    {
    private:
        cv::Mat img;          // image to normalize (shares pixel data with the caller's Mat)
        std::vector<int> A;   // per-channel divisors
        int diff;             // number of horizontal stripes the image is split into

    public:
        Parallel_process(cv::Mat inputImage, std::vector<int> AA, int diffVal)
            : img(inputImage), A(AA), diff(diffVal) {}

        virtual void operator()(const cv::Range& range) const
        {
            for (int i = range.start; i < range.end; i++)
            {
                // "in" is stripe i of the original image: img.rows/diff rows, full width
                cv::Mat in(img, cv::Rect(0, (img.rows / diff) * i, img.cols, img.rows / diff));
                std::vector<int> AAA(A);   // local copy of the divisors
                in.forEach<cv::Vec3f>
                (
                    [&AAA](cv::Vec3f& pixel, const int* /*position*/) -> void
                    {
                        pixel[0] /= AAA[0];
                        pixel[1] /= AAA[1];
                        pixel[2] /= AAA[2];
                    }
                );
            }
        }
    };
And in main(), I invoke the functor like this:
    cv::parallel_for_(cv::Range(0, 91), Parallel_process(img, AA, 91)); // my image is 1288x728, so 728/91 = 8 rows per stripe
EDIT
This is my OpenCV configuration:
General configuration for OpenCV 3.3.1 =====================================
Version control: unknown
Extra modules:
Location (extra): /home/jrsros/opencv_contrib-3.3.1/modules
Version control (extra): unknown
Platform:
Timestamp: 2017-12-14T13:05:47Z
Host: Linux 4.10.0-40-generic x86_64
CMake: 3.5.1
CMake generator: Unix Makefiles
CMake build tool: /usr/bin/make
Configuration: Release
CPU/HW features:
Baseline: SSE SSE2 SSE3
requested: SSE3
Dispatched code generation: SSE4_1 SSE4_2 FP16 AVX AVX2
requested: SSE4_1 SSE4_2 AVX FP16 AVX2
SSE4_1 (3 files): + SSSE3 SSE4_1
SSE4_2 (1 files): + SSSE3 SSE4_1 POPCNT SSE4_2
FP16 (1 files): + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
AVX (5 files): + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
AVX2 (8 files): + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
C/C++:
Built as dynamic libs?: YES
C++11: YES
C++ Compiler: /usr/bin/c++ (ver 7.2.0)
C++ flags (Release): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wno-narrowing -Wno-delete-non-virtual-dtor -Wno-comment -Wno-implicit-fallthrough -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -fopenmp -O3 -DNDEBUG -DNDEBUG
C++ flags (Debug): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Winit-self -Wno-narrowing -Wno-delete-non-virtual-dtor -Wno-comment -Wno-implicit-fallthrough -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -fopenmp -g -O0 -DDEBUG -D_DEBUG
C Compiler: /usr/bin/gcc-5
C flags (Release): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-narrowing -Wno-comment -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections -msse -msse2 -msse3 -fvisibility=hidden -fopenmp -O3 -DNDEBUG -DNDEBUG
C flags (Debug): -fsigned-char -W -Wall -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Wformat -Werror=format-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Winit-self -Wno-narrowing -Wno-comment -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffast-math -ffunction-sections -msse -msse2 -msse3 -fvisibility=hidden -fopenmp -g -O0 -DDEBUG -D_DEBUG
Linker flags (Release):
Linker flags (Debug):
ccache: NO
Precompiled headers: NO
Extra dependencies: dl m pthread rt /usr/lib/x86_64-linux-gnu/libGLU.so /usr/lib/x86_64-linux-gnu/libGL.so /usr/lib/x86_64-linux-gnu/libtbb.so cudart nppc nppial nppicc nppicom nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cufft -L/usr/local/cuda-8.0/lib64
3rdparty dependencies:
OpenCV modules:
To be built: cudev core cudaarithm flann hdf imgproc ml objdetect phase_unwrapping plot reg surface_matching video viz xphoto bgsegm cudabgsegm cudafilters cudaimgproc cudawarping dnn face freetype fuzzy img_hash imgcodecs photo shape videoio xobjdetect cudacodec highgui bioinspired dpm features2d line_descriptor saliency text calib3d ccalib cudafeatures2d cudalegacy cudaobjdetect cudaoptflow cudastereo datasets rgbd stereo structured_light superres tracking videostab xfeatures2d ximgproc aruco optflow stitching python2
Disabled: js world contrib_world
Disabled by dependency: -
Unavailable: java python3 ts cnn_3dobj cvv dnn_modern matlab sfm
GUI:
QT: NO
GTK+ 2.x: YES (ver 2.24.30)
GThread : YES (ver 2.48.2)
GtkGlExt: YES (ver 1.2.0)
OpenGL support: YES (/usr/lib/x86_64-linux-gnu/libGLU.so /usr/lib/x86_64-linux-gnu/libGL.so)
VTK support: YES (ver 6.2.0)
Media I/O:
ZLib: /usr/lib/x86_64-linux-gnu/libz.so (ver 1.2.8)
JPEG: /usr/lib/x86_64-linux-gnu/libjpeg.so (ver )
WEBP: /usr/lib/x86_64-linux-gnu/libwebp.so (ver encoder: 0x0202)
PNG: /usr/lib/x86_64-linux-gnu/libpng.so (ver 1.2.54)
TIFF: /usr/lib/x86_64-linux-gnu/libtiff.so (ver 42 - 4.0.6)
JPEG 2000: /usr/lib/x86_64-linux-gnu/libjasper.so (ver 1.900.1)
OpenEXR: build (ver 1.7.1)
GDAL: NO
GDCM: NO
Video I/O:
DC1394 1.x: NO
DC1394 2.x: NO
FFMPEG: YES
avcodec: YES (ver 56.60.100)
avformat: YES (ver 56.40.101)
avutil: YES (ver 54.31.100)
swscale: YES (ver 3.1.101)
avresample: NO
GStreamer:
base: YES (ver 1.8.3)
video: YES (ver 1.8.3)
app: YES (ver 1.8.3)
riff: YES (ver 1.8.3)
pbutils: YES (ver 1.8.3)
OpenNI: NO
OpenNI PrimeSensor Modules: NO
OpenNI2: NO
PvAPI: NO
GigEVisionSDK: NO
Aravis SDK: NO
UniCap: NO
UniCap ucil: NO
V4L/V4L2: NO/YES
XIMEA: NO
Xine: NO
Intel Media SDK: NO
gPhoto2: NO
Parallel framework: TBB (ver 4.4 interface 9002)
Trace: YES (with Intel ITT)
Other third-party libraries:
Use Intel IPP: 2017.0.3 [2017.0.3]
at: /home/jrsros/opencv-3.3.1/build/3rdparty/ippicv/ippicv_lnx
Use Intel IPP IW: sources (2017.0.3)
at: /home/jrsros/opencv-3.3.1/build/3rdparty/ippicv/ippiw_lnx
Use VA: NO
Use Intel VA-API/OpenCL: NO
Use Lapack: NO
Use Eigen: YES (ver 3.2.92)
Use Cuda: YES (ver 8.0)
Use OpenCL: YES
Use OpenVX: NO
Use custom HAL: NO
NVIDIA CUDA
Use CUFFT: YES
Use CUBLAS: YES
USE NVCUVID: NO
NVIDIA GPU arch: 20 30 35 37 50 52 60 61
NVIDIA PTX archs:
Use fast math: YES
OpenCL: <Dynamic loading of OpenCL library>
Include path: /home/jrsros/opencv-3.3.1/3rdparty/include/opencl/1.2
Use AMDFFT: NO
Use AMDBLAS: NO
Python 2:
Interpreter: /usr/bin/python2.7 (ver 2.7.12)
Libraries: /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
numpy: /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)
packages path: lib/python2.7/dist-packages
Python 3:
Interpreter: /usr/bin/python3 (ver 3.5.2)
Python (for build): /usr/bin/python2.7
Java:
ant: NO
JNI: NO
Java wrappers: NO
Java tests: NO
Matlab:
mex: /usr/local/MATLAB/R2017b/bin/mex
Compiler/generator: Not working (bindings will not be generated)
Documentation:
Doxygen: NO
Tests and samples:
Tests: NO
Performance tests: NO
C/C++ Examples: NO
Install path: /usr/local
cvconfig.h is in: /home/jrsros/opencv-3.3.1/build
-----------------------------------------------------------------
Sample code that I use to test the parallelism:
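    #include <opencv2/core.hpp>
    #include <opencv2/highgui.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/core/utility.hpp>
    #include <iostream>
    #include <vector>
    #include <sys/time.h>

    // Wall-clock time in seconds (tic() was not defined in the original snippet).
    static double tic()
    {
        timeval tv;
        gettimeofday(&tv, nullptr);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    class Parallel_process : public cv::ParallelLoopBody
    {
    private:
        cv::Mat img;          // image to normalize, must be CV_32FC3
        std::vector<int> A;   // per-channel divisors
        int diff;             // number of horizontal stripes

    public:
        Parallel_process(cv::Mat inputImage, std::vector<int> AA, int diffVal)
            : img(inputImage), A(AA), diff(diffVal) {}

        virtual void operator()(const cv::Range& range) const
        {
            for (int i = range.start; i < range.end; i++)
            {
                // stripe i of the image: img.rows/diff rows, full width
                cv::Mat in(img, cv::Rect(0, (img.rows / diff) * i, img.cols, img.rows / diff));
                std::vector<int> AAA(A);
                in.forEach<cv::Vec3f>
                (
                    [&AAA](cv::Vec3f& pixel, const int* /*position*/) -> void
                    {
                        pixel[0] /= AAA[0];
                        pixel[1] /= AAA[1];
                        pixel[2] /= AAA[2];
                    }
                );
            }
        }
    };

    int main(int argc, char* argv[])
    {
        if (argc < 2)
            return 1;
        cv::Mat src8 = cv::imread(argv[1]);   // imread returns an 8-bit BGR image
        if (src8.empty())
            return 1;
        cv::Mat src;
        src8.convertTo(src, CV_32F);          // forEach<cv::Vec3f> needs float pixels

        // Placeholder divisors so the example runs; compute the real AA here.
        std::vector<int> AA(3, 1);

        double timeStart = tic(), timeEnd = 0;
        // Normalize the image by AA
        cv::parallel_for_(cv::Range(0, 91), Parallel_process(src, AA, 91));
        timeEnd = tic() - timeStart;
        std::cout << "ALL " << 1 / timeEnd << std::endl << std::endl; // FPS
        return 0;
    }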
I compile my code on Ubuntu Linux with a Core i7-2630QM, 2 GHz, 8 threads (4 cores):
g++ -std=c++1z -Wall -Weffc++ -Ofast -march=native test4.cpp -o test4 `pkg-config --cflags --libs opencv`
EDIT 2
In htop I can see that it does end up using all the threads.
Variant 1
threads,mean(us),min(us),max(us)
1,18798.2,18106,22780
2,9762.84,9427,10397
3,6813.55,6765,9296
4,5317.75,5088,7433
5,5067.11,4931,7552
6,4925.41,4780,9473
7,4797.74,4641,9492
8,4798.18,4504,27244
Variant 2
threads,mean(us),min(us),max(us)
1,18512.8,17780,20084
2,9788.73,9302,11338
3,6850.47,6671,9765
4,5209.64,5022,8831
5,5052.46,4881,7041
6,6851.99,4762,11422
7,5077.32,4624,9886