Implications of using _mm_shuffle_ps on an integer vector
The SSE intrinsics include _mm_shuffle_ps(xmm1, xmm2, immx), which lets you pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the _ps, packed single). If you cast your packed integers (__m128i), then you can use _mm_shuffle_ps as well:
#include <iostream>
#include <immintrin.h>
#include <sstream>

using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}

int main() {
    cout << "Starting SSE test" << endl;
    cout << "integer shuffle" << endl;
    int A[] = {1, -2147483648, 3, 5};
    int B[] = {4, 6, 7, 8};
    __m128i pC;
    __m128i* pA = (__m128i*) A;
    __m128i* pB = (__m128i*) B;
    *pA = (__m128i) _mm_shuffle_ps((__m128) *pA, (__m128) *pB, _MM_SHUFFLE(3, 2, 1, 0));
    pC = _mm_add_epi32(*pA, *pB);
    cout << "A[0] = " << A[0] << endl;
    cout << "A[1] = " << A[1] << endl;
    cout << "A[2] = " << A[2] << endl;
    cout << "A[3] = " << A[3] << endl;
    cout << "B[0] = " << B[0] << endl;
    cout << "B[1] = " << B[1] << endl;
    cout << "B[2] = " << B[2] << endl;
    cout << "B[3] = " << B[3] << endl;
    cout << "pA = " << __m128i_toString<int>(*pA) << endl;
    cout << "pC = " << __m128i_toString<int>(pC) << endl;
}

Snippet of the relevant corresponding assembly (Mac OS X, MacPorts gcc 4.8, -march=native on an Ivy Bridge CPU):
vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd  16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call    __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....

Thus it seemingly works fine on integers, which I expected since the registers are agnostic to types. However, there must be a reason why the docs say that this instruction is only for floats. Does someone know any downsides or implications I have missed?
Accepted answer
There is no integer equivalent of _mm_shuffle_ps. To achieve the same effect in this case, you can do one of the following.
SSE2
*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)), 0xd8);

SSE4.1
*pA = _mm_blend_epi16(*pA, *pB, 0xf0);

or change to the floating-point domain like this:
*pA = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(*pA), _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1, 0)));

But changing domains may incur a bypass delay on some CPUs. Keep in mind that, according to Agner:
The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.
You have to test your code and see which method above is more efficient.
Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:
For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].
Nehalem does have 2 cycles of bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than a sequence of other instructions. The extra instructions have latency too, as well as costing throughput.
The reverse (integer shuffles between FP math instructions) is not as safe:
In Agner Fog's microarchitecture manual, on page 112 in Example 8.3a, he shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) when in the floating-point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).
On Nehalem there are actually five domains. Nehalem seems to be the most affected (the bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant, and this is even more true on Haswell. In fact, for Haswell Agner says he found no delay between SHUFPS and PSHUFD (see page 140).