If you want 32-bit products, PMAXSD is part of SSE4.1. Maybe unpack with zeros (or PMOVZXWD), and use PMADDWD to do 16b * 16b -> 32b vector multiplies. With the odd elements all zero, the horizontal add part of PMADDWD just gets the result of the signed multiply in the even elements.

Fun fact: MOVD and pextrw eax, xmm0, 0 don't need an operand-size prefix to write to eax in 16-bit mode. The 66 prefix is already part of the required encoding. pextrw ax, xmm0, 0 doesn't assemble (with NASM).

Fun fact #2: ndisasm -b16 incorrectly disassembles the MOVQ load as movq xmm0, xmm10. objdump decodes it as a memory load:

$ nasm -fbin 16bit-SSE.asm
$ objdump -b binary -Mintel -D -mi8086 16bit-SSE
3: f3 0f 7e 46 04 movq xmm0,QWORD PTR

Design notes for the 2-shuffle, 2-multiply way:

I looked at trying to only do one PMULLW by creating inputs for it with two shuffles. It would be easy with PSHUFB (with a 16-byte mask constant), but I'm trying to limit it to SSE2 (and maybe code that could be adapted to MMX). (Using psrldq to shift in zeros would produce zero products, but signed products can be < 0, so a spurious zero could win the signed max.) It would need an extra shuffle to get the horizontal max.

(If our elements were unsigned, maybe we could use SSE4.1 PHMINPOSUW on 0 - vec to find the max in one go, but the OP is using signed compares.) We can add 32768 to each element and then use unsigned stuff. Given a signed 16-bit val: rangeshift = val + (1<<15) maps the lowest to 0, and the highest to 65535. (Add, subtract, or XOR (add-without-carry) are all equivalent for this.)

Since we only have an instruction to find the horizontal minimum, we can reverse the range with negation. We need to do that first, because negating after the shift doesn't reverse the unsigned range: 0 stays 0, while 0xFFFF becomes 0x0001, etc. So -val + (1<<15), or mapped = (1<<15) - val, maps our signed values to unsigned in such a way that the lowest unsigned value is the greatest signed value. One edge case: val = -32768 wraps around to 0 with this formula. Using 32767 - val instead (equivalently, val XOR 0x7FFF) reverses the range without that wraparound.
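The range mappings described above can be checked with a small scalar model. This is a minimal sketch in portable C rather than SSE code. The function names (`rangeshift`, `reversed_sub`, `reversed_xor`, `as_signed16`, `hmax_via_hmin`) are hypothetical, and the plain loop only stands in for the horizontal unsigned minimum that PHMINPOSUW would compute over 8 words.

```c
#include <stdint.h>

/* Order-preserving map: -32768 -> 0, 32767 -> 65535.
   val + (1<<15) modulo 2^16 is the same as flipping the sign bit. */
static uint16_t rangeshift(int16_t val) {
    return (uint16_t)((uint16_t)val ^ 0x8000u);
}

/* The reversed mapping from the text: (1<<15) - val, modulo 2^16.
   Note that val = -32768 wraps around to 0. */
static uint16_t reversed_sub(int16_t val) {
    return (uint16_t)(0x8000u - (uint16_t)val);
}

/* Assumed alternative: 32767 - val, i.e. XOR with 0x7FFF.
   Greatest signed value -> 0, least signed value -> 65535. */
static uint16_t reversed_xor(int16_t val) {
    return (uint16_t)((uint16_t)val ^ 0x7FFFu);
}

/* Reinterpret a 16-bit pattern as signed without relying on
   implementation-defined narrowing conversions. */
static int16_t as_signed16(uint16_t u) {
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 65536 : (int32_t)u);
}

/* Signed horizontal max of 8 words, computed as an unsigned
   horizontal min over the reversed mapping, then mapped back. */
static int16_t hmax_via_hmin(const int16_t v[8]) {
    uint16_t min = reversed_xor(v[0]);
    for (int i = 1; i < 8; i++) {
        uint16_t m = reversed_xor(v[i]);
        if (m < min) {
            min = m;
        }
    }
    return as_signed16((uint16_t)(min ^ 0x7FFFu));
}
```

XOR with 0x7FFF is its own inverse, so the same constant maps the winning element back to its signed value. In a vector version that would be one PXOR before the horizontal minimum and one XOR on the extracted result.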
Presumably you weren't looking for a SIMD answer, but I thought it would be interesting to write. And yes, SSE instructions work in 16-bit mode. VEX-encoded instructions don't, so you can't use the AVX 3-operand versions. Fortunately, I was able to write it without any extra MOVDQA instructions anyway, so AVX wouldn't help.

IDK how to answer this the way you probably want without just doing your homework for you. If you're actually interested in a high-performance implementation, rather than just anything that works, please update your question.

Since you only need to return the product of the two highest numbers, you could just produce all 6 pairwise products and take the max. If brute force doesn't work, you aren't using enough :P

Update: I just realized that this will give the wrong answer if the largest pairwise product is from two negative numbers. It will work if you can rule out negative inputs, or otherwise rule out inputs where this is a problem. See below for an SSE4.1 version that finds the max and 2nd-max separately.

This does the trick with no branching, using SSE2. (You could do the same thing in MMX registers using only SSE1, which added the MMX-register version of PMAXSW.) It's just 11 instructions (not counting the prologue/epilogue), and they're all fast, mostly single-uop on most CPUs. (See also the x86 tag wiki for more x86 links.)

The code is untested, but it does assemble (with NASM). We only evaluate 16-bit products, and use signed comparisons on them. The reduction finds the max word element between the bottom halves of xmm1 and xmm2, then uses pshuflw xmm0, xmm1, 0b00001110 (elements[0:1] = elements[2:3], rest don't care) to bring the upper pair of words down, and ends with the maximum product in the low word of xmm0.

I'm assuming the caller only looks at a 16-bit return value. To clear the upper half of EAX, you could use pextrw eax, xmm0, 0 instead of MOVD, since PEXTRW zero-extends the extracted word.
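The brute-force idea, and the mismatch noted in the update, can be illustrated with a scalar model. This is a sketch in portable C, not the SSE2 routine itself. The names `mul_lo16`, `max_pairwise_product16` and `product_of_two_largest16` are hypothetical, and the products are truncated to their low 16 bits and compared as signed values on the assumption that this matches what PMULLW and PMAXSW do per element.

```c
#include <stdint.h>

/* Low 16 bits of a*b, reinterpreted as signed (a model of one PMULLW lane). */
static int16_t mul_lo16(int16_t a, int16_t b) {
    int32_t full = (int32_t)a * (int32_t)b;            /* cannot overflow 32 bits */
    uint16_t lo = (uint16_t)((uint32_t)full & 0xFFFFu);
    return (int16_t)(lo >= 0x8000u ? (int32_t)lo - 65536 : (int32_t)lo);
}

/* Brute force: evaluate all 6 pairwise products and keep the signed max. */
static int16_t max_pairwise_product16(const int16_t v[4]) {
    int16_t best = mul_lo16(v[0], v[1]);
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int16_t p = mul_lo16(v[i], v[j]);
            if (p > best) {
                best = p;
            }
        }
    }
    return best;
}

/* What was actually asked for: the product of the two largest elements. */
static int16_t product_of_two_largest16(const int16_t v[4]) {
    int16_t hi = v[0] > v[1] ? v[0] : v[1];
    int16_t second = v[0] > v[1] ? v[1] : v[0];
    for (int i = 2; i < 4; i++) {
        if (v[i] > hi) {
            second = hi;
            hi = v[i];
        } else if (v[i] > second) {
            second = v[i];
        }
    }
    return mul_lo16(hi, second);
}
```

For non-negative inputs whose products fit in 16 bits, such as {3, 7, 2, 5}, both functions return 35. For {-10, -9, 1, 2} the brute-force max is 90 (from the two negative numbers), while the product of the two largest elements is 2, which is the case the update warns about.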