SPO Project Week 9

Hi all, I was assigned to work on an Open source project. Here is the outline:

Examine the glibc code to identify a function that is a candidate for optimization.

Project Steps

  • Identify a possible approach to the problem
  • Test your approach.
    • You may do this by testing an alternative library containing your implementation of the function.
  • Modify glibc. Use git to track your changes.
  • Submit your change to the glibc community.
    • Make sure that your change does not cause regressions on other platforms, especially x86_64

Getting Started:

I am using strspn.c from the glibc as part of my project.

Definition:

According to cplusplus.com

strspn
size_t strspn ( const char * str1, const char * str2 );
Get span of character set in string
Returns the length of the initial portion of str1 which consists only of characters that are part of str2.

The search does not include the terminating null-characters of either strings, but ends there.

I have to test the code on betty (aarch64 ) to see how it works and what is the algorithm that it is using. here is my first demo code.

#include <stdio.h>
#include "timer.h"
#include <string.h>
int main(void){
 Timer t;
 int i, y=0, text,set;
 double rtime;
 char strtext[] = "1aaaaabeiou";
 char cset[] = "123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz";
 for (; y <= 100000; y++){
 t.start();
 i = strspn(strtext, cset);
 t.stop();
 }
 text = sizeof(strtext);
 set = sizeof(cset);
 rtime = t.currtime();
 printf("strtext length: %d\n", text);
 printf("cset length: %d\n", set);
 printf("The initial number has %d digits.\n", i);
 printf("It took %lf second\n", rtime);
 printf("//////////////////////////\n");
 return 0;

The result of the demo is very simple. it works very fast since the data is not large enough to slow down the function call. So I try a different approach.

Searching for Vectorization:

I start to read more about how to apply the Vectorization concept in C++ code. It turns out that the compiler will do most of the optimization including Vectorize and basic compiler optimization functions. Because when we using a compile option -O3 which in modern gcc turns on auto-vectorization for us. But there is still some step I can do to optimize the code. So I start to research and edit the strspn function base on the result.

  1. adding  __builtin_assume_aligned
    • instead of regular declaration const char *a = str
    • we can allow the compiler to assume that the returned pointer is at least align bytes aligned
    • here is the result const char *a = __builtin_assume_aligned(str, 16);
  2. Inlining
    • is a manual or compiler optimization that replaces a function call site with the body of the called function.
    • this optimization is used in the memset part where we use multiple small memsets to enable inlining on most targets.
  3. Switch case
    • instead of using multiple if statement to check for the result of accepted string. we can replace it with switch case in order to increase the speed of check the conditions.
  4. size_t
    • we can replace the integer type with size_t because the variables will not become negative number.

Here is the edited code

#include <string.h>
#include <stdint.h>
#include 	<libc-internal.h>

#undef strspn
#ifndef STRSPN
# define STRSPN strspn
#endi

/* Return the length of the maximum initial segment
of S which contains only characters in ACCEPT. */
size_t
STRSPN (const char *str, const char *accept)
{
if (accept[0] == '\0')
return 0;
if (__glibc_unlikely (accept[1] == '\0'))
{
//---const char *a = str;
const char *a = __builtin_assume_aligned(str, 16);// (+) allows the compiler to assume that\
the returned pointer is at least align bytes aligned
for (; *str == *accept; str++);
return str - a;
}

/* Use multiple small memsets to enable inlining on most targets. */
//Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).
unsigned char table[256];
unsigned char *p = memset (table, 0, 64);
// (+) Inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function.
memset (p + 64, 0, 64);
memset (p + 128, 0, 64);
memset (p + 192, 0, 64);

unsigned char *s = (unsigned char*) accept;
/* Different from strcspn it does not add the NULL on the table
so can avoid check if str[i] is NULL, since table['\0'] will
be 0 and thus stopping the loop check. */
do
p[*s++] = 1;
while (*s);

s = (unsigned char*) str; // chars which are only 1 byte (8 bits long)\
A char has a range -127 to 127 while an unsigned char has a range of 0 to 255.
if (!p[s[0]]) return 0;
if (!p[s[1]]) return 1;
if (!p[s[2]]) return 2;
if (!p[s[3]]) return 3;
//(+) replace if case with with switch case
//* * * * * * * * * * Added
switch(s){
case !p[s[0]]:
return 0;
case !p[s[1]]:
return 1;
case !p[s[2]]:
return 2;
case !p[s[3]]:
return 3;
}
//* * * * * * * * * *
s = (unsigned char *) PTR_ALIGN_DOWN (s, 4);

//---unsigned int c0, c1, c2, c3;
unsigned size_t c0, c1, c2, c3; // (+) Because the variables will not be negative number
do {
s += 4;
c0 = p[s[0]];
c1 = p[s[1]];
c2 = p[s[2]];
c3 = p[s[3]];
} while ((c0 & c1 & c2 & c3) != 0);

size_t count = s - (unsigned char *) str;
return (c0 & c1) == 0 ? count + c0 : count + c2 + 2;
// (+) break down the conditional statement
//* * * * * * * * * * Added
if (c0 & c1 == 0){
return count + c0;
}else{
return count + c2 + 2 ;
}
//* * * * * * * * * *
}
libc_hidden_builtin_def (strspn)

The next step is to test this code base on the large amount of data which will provide the different results when run in other to compare the result and improve the function.

Resources:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

https://en.wikipedia.org/wiki/Inline_expansion

https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf

https://en.wikipedia.org/wiki/Automatic_vectorization#Techniques

https://blogs.msdn.microsoft.com/nativeconcurrency/2012/04/12/what-is-vectorization/

https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using

https://sourceware.org/ml/libc-alpha/2016-04/msg00016.html

SPO600 Lab 7 Inline Assembler Lab

Hi all, I was assigned to work in my lab for my SPO600 class. Here is the outline of the specifications:

1. Write a version of the Volume Scaling solution from the Algorithm Selection Lab for AArch64 that uses the SQDMULH or SQRDMULH instructions via inline assembler. Test the performance of your solution and compare it to your previous solution(s).

Here is the code structure:

  1. Get data into vector register
    • ld1 {v.8h}, [sample register holding pointer]
  2. Get scaling factor into vector register or SIMD scalar register
    • dup v1.8h, [some register holding volumn]
  3. Scaling using SQDMULH
    • SQDMULH v0.8h, v0.8h, h1
  4. Store data back to array
    • st1 {v0.8h}, [sample register holding pointer] #16

We added an “Int16_t volint” and calculate it by using ” volint = volume * 32767;”

We added these asm snippet to our program:

__asm__ ( "ld1 : {v0.8h} x2, dup : v1.8h, x2 SQDMULH : v0.8h, v0.8h, h1, st1 {v0.8h}, x2 #16 ",
 : //empty output
 : "r"(x), "r"(volint) //input
 :
 );

 

2.Find the assembler in that software, and determine:

  • How much assembly-language code is present, which platform(s) it is used on, why it is there (what it does)
    • The program I look at is SooperLooper , SooperLooper is a live looping sampler capable of immediate loop recording, overdubbing, multiplying, reversing and more. It allows for multiple simultaneous multi-channel loops limited only by your computer’s available memory.
    • the asm code is used in a file name atomic.h. It contains Load and Store instructions and the atomic operation execution which is guaranteed to be complete before interrupt handler executes.
    • the platform that the atomic.h file is executed on is x86_64, here is the snippet of the code.

/**
* atomic_add - add integer to atomic variable
* @i: integer value to add
* @v: pointer of type atomic_t
*
* Atomically adds @i to @v. Note that the guaranteed useful range
* of an atomic_t is only 24 bits.
*/
static __inline__ void atomic_add(int i, atomic_t *v)
{
__asm__ __volatile__(
SMP_LOCK "addl %1,%0"
:"=m" (v->counter)
:"ir" (i), "m" (v->counter));
}

  • Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.
    • The atomic operation is a very powerful feature when it comes to working with multiple threads. Because the program has to wait for a specific thread to be completed done before performing the other thread. This will helps eliminated the possibility of deadlock in the program.

References:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/SQDMULH_advsimd_vec_vector.html

http://essej.net/sooperlooper/