SPO Assignment Phase 2

Hi all. After some research on phase one. I have tested and edited some part of the strspn code in order to make it run and be optimized. But the result is not very optimistic. It turns out that some of my code is not being optimized. However, I have learned many different ways of optimizing. Because of the limited time, I have to end the process but hopefully, someone will try to continue the process and learn more from it.

Here is the first original part that I am looking at

STRSPN (const char *str, const char *accept)
{
 if (accept[0] == '\0')
 return 0;
 if (__glibc_unlikely (accept[1] == '\0'))
 {
 const char *a = str;
 for (; *str == *accept; str++);
 return str - a;
 }

This is my edited code

I removed the “__glibc_unlikely”. This might helps in the case where the prediction is wrong. then it means the processor pipeline will not need to be flushed and reduces cost several cycles. So as long as the prediction is correct most of the time this will tend to be food for performance.

I also added “__builtin_assume_aligned” to allow the compiler to assume that the returned pointer is at least align bytes aligned


STRSPN (const char *str, const char *accept)
{
if (accept[0] == '\0')
return 0;
if (accept[1] == '\0'))
{
//---const char *a = str;
const char *a = __builtin_assume_aligned(str, 16);
for (; *str == *accept; str++);
return str - a;
}

Here is the second original part I am looking at

</pre><pre>s = (unsigned char*) str; // chars which are only 1 byte (8 bits long)\
A char has a range -127 to 127 while an unsigned char has a range of 0 to 255.
if (!p[s[0]]) return 0;
if (!p[s[1]]) return 1;
if (!p[s[2]]) return 2;
if (!p[s[3]]) return 3;
//(+) replace if case with with switch case
//* * * * * * * * * * Added
switch(s){
case !p[s[0]]:
return 0;
case !p[s[1]]:
return 1;
case !p[s[2]]:
return 2;
case !p[s[3]]:
return 3;

In my last post, I decided to change the if and else to switch case But it slow down the program so I did not include it.

Next part is adding size_t. I replaced the integer type with size_t because the variables will not become the negative number.

//unsigned int c0, c1, c2, c3;
unsigned size_t c0, c1, c2, c3; // (+) Because the variables will not be negative number

For testing, I wrote a tester that will call the function 3
times and compare the time between my edited functions with the library one

 

int main() {
printf("Test 1 STRSPN \n ------------------------ \n\n");
int i;
char str[200000 ];
char set[200000 ];

memset(str, '8', 200000);
memset(set, '8', 200000);

clock_t start = clock();
clock_t diff;
i = STRSPN_GNU(str, set);
diff = clock() - start;
int msec = diff * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec / 1000, msec % 1000);
printf("The initial number has %d digits.\n", i);

clock_t start2 = clock(), diff2;
i = STRSPN_MY(str, set);
diff2 = clock() - start2;

int msec2 = diff2 * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec2 / 1000, msec2 % 1000);
printf("The initial number has %d digits.\n", i);

//===============================
printf("Test 2 STRSPN \n ------------------------ \n\n");
char str1[100000];
char set1[100000];

memset(str1, '2', 100000);
memset(set1, '2', 100000);

clock_t start1 = clock();
clock_t diff1;
i = STRSPN_GNU(str1, set1);
diff1 = clock() - start1;
int msec1 = diff1 * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec1 / 1000, msec1 % 1000);
printf("The initial number has %d digits.\n", i);

clock_t start3 = clock(), diff3;
i = STRSPN_MY(str1, set1);
diff3 = clock() - start3;

int msec3 = diff2 * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec3 / 1000, msec3 % 1000);
printf("The initial number has %d digits.\n", i);
//==============================

printf("Test 3 STRSPN \n ------------------------ \n\n");
char str5[200000];
char set5[200000];

memset(str5, "3", 200000);
memset(set5, "12345678", 200000);

clock_t start5 = clock();
clock_t diff5;
i = STRSPN_GNU(str5, set5);
diff5 = clock() - start5;
int msec5 = diff5 * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec5 / 1000, msec5 % 1000);
printf("The initial number has %d digits.\n", i);

clock_t start6 = clock(), diff6;
i = STRSPN_MY(str5, set5);
diff6 = clock() - start6;

int msec6 = diff6 * 1000 / CLOCKS_PER_SEC;
printf("Time taken %d seconds %d milliseconds \n", msec6 / 1000, msec6 % 1000);
printf("The initial number has %d digits.\n", i);

The result of the program is not very good. My function is the second one and it run slower than the library one. The strange thing is that I got different time for each time and run the program. So I choose the average one as the result.

Capture

Resources:

blog.man7.org

stackoverflow.com

https://en.wikipedia.org/wiki/Inline_expansion

SPO Project Week 9

Hi all, I was assigned to work on an Open source project. Here is the outline:

Examine the glibc code to identify a function that is a candidate for optimization.

Project Steps

  • Identify a possible approach to the problem
  • Test your approach.
    • You may do this by testing an alternative library containing your implementation of the function.
  • Modify glibc. Use git to track your changes.
  • Submit your change to the glibc community.
    • Make sure that your change does not cause regressions on other platforms, especially x86_64

Getting Started:

I am using strspn.c from the glibc as part of my project.

Definition:

According to cplusplus.com

strspn
size_t strspn ( const char * str1, const char * str2 );
Get span of character set in string
Returns the length of the initial portion of str1 which consists only of characters that are part of str2.

The search does not include the terminating null-characters of either strings, but ends there.

I have to test the code on betty (aarch64 ) to see how it works and what is the algorithm that it is using. here is my first demo code.

#include <stdio.h>
#include "timer.h"
#include <string.h>
int main(void){
 Timer t;
 int i, y=0, text,set;
 double rtime;
 char strtext[] = "1aaaaabeiou";
 char cset[] = "123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz123456789abcdefghojklmnopqrstuvwxyz";
 for (; y <= 100000; y++){
 t.start();
 i = strspn(strtext, cset);
 t.stop();
 }
 text = sizeof(strtext);
 set = sizeof(cset);
 rtime = t.currtime();
 printf("strtext length: %d\n", text);
 printf("cset length: %d\n", set);
 printf("The initial number has %d digits.\n", i);
 printf("It took %lf second\n", rtime);
 printf("//////////////////////////\n");
 return 0;

The result of the demo is very simple. it works very fast since the data is not large enough to slow down the function call. So I try a different approach.

Searching for Vectorization:

I start to read more about how to apply the Vectorization concept in C++ code. It turns out that the compiler will do most of the optimization including Vectorize and basic compiler optimization functions. Because when we using a compile option -O3 which in modern gcc turns on auto-vectorization for us. But there is still some step I can do to optimize the code. So I start to research and edit the strspn function base on the result.

  1. adding  __builtin_assume_aligned
    • instead of regular declaration const char *a = str
    • we can allow the compiler to assume that the returned pointer is at least align bytes aligned
    • here is the result const char *a = __builtin_assume_aligned(str, 16);
  2. Inlining
    • is a manual or compiler optimization that replaces a function call site with the body of the called function.
    • this optimization is used in the memset part where we use multiple small memsets to enable inlining on most targets.
  3. Switch case
    • instead of using multiple if statement to check for the result of accepted string. we can replace it with switch case in order to increase the speed of check the conditions.
  4. size_t
    • we can replace the integer type with size_t because the variables will not become negative number.

Here is the edited code

#include <string.h>
#include <stdint.h>
#include 	<libc-internal.h>

#undef strspn
#ifndef STRSPN
# define STRSPN strspn
#endi

/* Return the length of the maximum initial segment
of S which contains only characters in ACCEPT. */
size_t
STRSPN (const char *str, const char *accept)
{
if (accept[0] == '\0')
return 0;
if (__glibc_unlikely (accept[1] == '\0'))
{
//---const char *a = str;
const char *a = __builtin_assume_aligned(str, 16);// (+) allows the compiler to assume that\
the returned pointer is at least align bytes aligned
for (; *str == *accept; str++);
return str - a;
}

/* Use multiple small memsets to enable inlining on most targets. */
//Sets the first num bytes of the block of memory pointed by ptr to the specified value (interpreted as an unsigned char).
unsigned char table[256];
unsigned char *p = memset (table, 0, 64);
// (+) Inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function.
memset (p + 64, 0, 64);
memset (p + 128, 0, 64);
memset (p + 192, 0, 64);

unsigned char *s = (unsigned char*) accept;
/* Different from strcspn it does not add the NULL on the table
so can avoid check if str[i] is NULL, since table['\0'] will
be 0 and thus stopping the loop check. */
do
p[*s++] = 1;
while (*s);

s = (unsigned char*) str; // chars which are only 1 byte (8 bits long)\
A char has a range -127 to 127 while an unsigned char has a range of 0 to 255.
if (!p[s[0]]) return 0;
if (!p[s[1]]) return 1;
if (!p[s[2]]) return 2;
if (!p[s[3]]) return 3;
//(+) replace if case with with switch case
//* * * * * * * * * * Added
switch(s){
case !p[s[0]]:
return 0;
case !p[s[1]]:
return 1;
case !p[s[2]]:
return 2;
case !p[s[3]]:
return 3;
}
//* * * * * * * * * *
s = (unsigned char *) PTR_ALIGN_DOWN (s, 4);

//---unsigned int c0, c1, c2, c3;
unsigned size_t c0, c1, c2, c3; // (+) Because the variables will not be negative number
do {
s += 4;
c0 = p[s[0]];
c1 = p[s[1]];
c2 = p[s[2]];
c3 = p[s[3]];
} while ((c0 & c1 & c2 & c3) != 0);

size_t count = s - (unsigned char *) str;
return (c0 & c1) == 0 ? count + c0 : count + c2 + 2;
// (+) break down the conditional statement
//* * * * * * * * * * Added
if (c0 & c1 == 0){
return count + c0;
}else{
return count + c2 + 2 ;
}
//* * * * * * * * * *
}
libc_hidden_builtin_def (strspn)

The next step is to test this code base on the large amount of data which will provide the different results when run in other to compare the result and improve the function.

Resources:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

https://en.wikipedia.org/wiki/Inline_expansion

https://software.intel.com/sites/default/files/m/4/8/8/2/a/31848-CompilerAutovectorizationGuide.pdf

https://en.wikipedia.org/wiki/Automatic_vectorization#Techniques

https://blogs.msdn.microsoft.com/nativeconcurrency/2012/04/12/what-is-vectorization/

https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using

https://sourceware.org/ml/libc-alpha/2016-04/msg00016.html

SPO600 Lab 7 Inline Assembler Lab

Hi all, I was assigned to work in my lab for my SPO600 class. Here is the outline of the specifications:

1. Write a version of the Volume Scaling solution from the Algorithm Selection Lab for AArch64 that uses the SQDMULH or SQRDMULH instructions via inline assembler. Test the performance of your solution and compare it to your previous solution(s).

Here is the code structure:

  1. Get data into vector register
    • ld1 {v.8h}, [sample register holding pointer]
  2. Get scaling factor into vector register or SIMD scalar register
    • dup v1.8h, [some register holding volumn]
  3. Scaling using SQDMULH
    • SQDMULH v0.8h, v0.8h, h1
  4. Store data back to array
    • st1 {v0.8h}, [sample register holding pointer] #16

We added an “Int16_t volint” and calculate it by using ” volint = volume * 32767;”

We added these asm snippet to our program:

__asm__ ( "ld1 : {v0.8h} x2, dup : v1.8h, x2 SQDMULH : v0.8h, v0.8h, h1, st1 {v0.8h}, x2 #16 ",
 : //empty output
 : "r"(x), "r"(volint) //input
 :
 );

 

2.Find the assembler in that software, and determine:

  • How much assembly-language code is present, which platform(s) it is used on, why it is there (what it does)
    • The program I look at is SooperLooper , SooperLooper is a live looping sampler capable of immediate loop recording, overdubbing, multiplying, reversing and more. It allows for multiple simultaneous multi-channel loops limited only by your computer’s available memory.
    • the asm code is used in a file name atomic.h. It contains Load and Store instructions and the atomic operation execution which is guaranteed to be complete before interrupt handler executes.
    • the platform that the atomic.h file is executed on is x86_64, here is the snippet of the code.

/**
* atomic_add - add integer to atomic variable
* @i: integer value to add
* @v: pointer of type atomic_t
*
* Atomically adds @i to @v. Note that the guaranteed useful range
* of an atomic_t is only 24 bits.
*/
static __inline__ void atomic_add(int i, atomic_t *v)
{
__asm__ __volatile__(
SMP_LOCK "addl %1,%0"
:"=m" (v->counter)
:"ir" (i), "m" (v->counter));
}

  • Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.
    • The atomic operation is a very powerful feature when it comes to working with multiple threads. Because the program has to wait for a specific thread to be completed done before performing the other thread. This will helps eliminated the possibility of deadlock in the program.

References:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/SQDMULH_advsimd_vec_vector.html

http://essej.net/sooperlooper/

SPO600 Lab 6 Vectorization Lab

Hi all, I was assigned to work in a lab that and there are the specifications.

1. Write a short program that creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.

Here is my lab6q1.cpp code:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
using namespace std;

int main(){
int first[1000];
int second[1000];
int sum[1000];
int i;
cout<<"Please enter the calue in 1st array\n";
for(i=0;i<1000;i++){
first[i] = rand() %100;
}
cout<<"Please enter the calue in 2nd array\n";
for(i=0;i<1000;i++){
second[i] = rand() %100;
}
cout<<"\n1st array values:\n";
for(i=0;i<1000;i++){
cout<<first[i]<<" ";
}
cout<<"\n2nd array values:\n";
for(i=0;i<1000;i++){
cout<<second[i]<<" ";
}
for(i=0;i<1000;i++){
sum[i]=first[i]+second[i];
}
cout<<"\nSum of two arraies:\n";
for(i=0;i<1000;i++){
cout << sum[i]<<" ";
}
return 0;
}

2. Compile this program on aarchie in such a way that the code is auto-vectorized. Annotate the emitted code (i.e., obtain a dissassembly via objdump -d and add comments to the instructions in <main> explaining what the code does).

I compile the program on Betty Aarch64 with g++ -O3 lab6q1.cpp. The program starting and ending logic of the assembly code is very similar to the regular C++ code. I use “cout” to separate different parts of the code so it is easier to read.

00000000004008c8 &lt;main&gt;:
 4008c8: d1400bff sub sp, sp, #0x2, lsl #12 // set up
 4008cc: d13b83ff sub sp, sp, #0xee0
 4008d0: a9bb7bfd stp x29, x30, [sp,#-80]!
 4008d4: 910003fd mov x29, sp
 4008d8: a90363f7 stp x23, x24, [sp,#48]
 4008dc: 90000001 adrp x1, 400000 &lt;_init-0x7b8&gt; // load x1 with the address label
 4008e0: b0000098 adrp x24, 411000 &lt;_DYNAMIC+0x80&gt; // load x24 with the address label
 4008e4: 91082300 add x0, x24, #0x208
 4008e8: 9132c021 add x1, x1, #0xcb0
 4008ec: a90153f3 stp x19, x20, [sp,#16]
 4008f0: a9025bf5 stp x21, x22, [sp,#32]
 4008f4: f90023f9 str x25, [sp,#64] // store

 //================First array input =========
 4008f8: 97ffffd6 bl 400850 &lt;_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt&gt; // print msg
 4008fc: 910143a0 add x0, x29, #0x50
 400900: 913fc3b5 add x21, x29, #0xff0
 400904: aa0003f3 mov x19, x0
 400908: 52800c94 mov w20, #0x64 // #100
 40090c: 97ffffd9 bl 400870 &lt;rand@plt&gt; // random function call
 400910: 1ad40c01 sdiv w1, w0, w20
 400914: 1b148020 msub w0, w1, w20, w0
 400918: b8004660 str w0, [x19],#4
 40091c: eb15027f cmp x19, x21
 400920: 54ffff61 b.ne 40090c &lt;main+0x44&gt; // branch to label 40090c if not equal
 400924: 90000001 adrp x1, 400000 &lt;_init-0x7b8&gt;
 400928: 91336021 add x1, x1, #0xcd8
 40092c: 91082300 add x0, x24, #0x208

 //================Second array input =========
 400930: 97ffffc8 bl 400850 &lt;_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt&gt; // print msg
 400934: 913fc3a1 add x1, x29, #0xff0
 400938: 913e8034 add x20, x1, #0xfa0
 40093c: aa0103f3 mov x19, x1
 400940: 52800c96 mov w22, #0x64 // #100
 400944: 97ffffcb bl 400870 &lt;rand@plt&gt; // random function call
 400948: 1ad60c01 sdiv w1, w0, w22
 40094c: 1b168020 msub w0, w1, w22, w0
 400950: b8004660 str w0, [x19],#4
 400954: eb14027f cmp x19, x20
 400958: 54ffff61 b.ne 400944 &lt;main+0x7c&gt;// branch to label 400944 if not equal
 40095c: 91082316 add x22, x24, #0x208
 400960: 90000001 adrp x1, 400000 &lt;_init-0x7b8&gt;
 400964: aa1603e0 mov x0, x22
 400968: 91340021 add x1, x1, #0xd00
 40096c: 90000019 adrp x25, 400000 &lt;_init-0x7b8&gt;

 //================First array output =========
 400970: 97ffffb8 bl 400850 &lt;_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt&gt; // print msg
 400974: 910143b3 add x19, x29, #0x50
 400978: 91346337 add x23, x25, #0xd18
 40097c: b8404661 ldr w1, [x19],#4
 400980: aa1603e0 mov x0, x22
 400984: 97ffff9b bl 4007f0 &lt;_ZNSolsEi@plt&gt; // print first array elements
 400988: aa1703e1 mov x1, x23
 40098c: d2800022 mov x2, #0x1 // #1
 400990: 97ffffbc bl 400880 &lt;_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt&gt; // print new line
 400994: eb15027f cmp x19, x21
 400998: 54ffff21 b.ne 40097c &lt;main+0xb4&gt; // branch to label 40097c if not equal
 40099c: 90000001 adrp x1, 400000 &lt;_init-0x7b8&gt;
 4009a0: aa1603e0 mov x0, x22
 4009a4: 91348021 add x1, x1, #0xd20

 //================Second array output =========
 4009a8: 97ffffaa bl 400850 &lt;_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt&gt; // print msg
 4009ac: 913fc3b3 add x19, x29, #0xff0
 4009b0: b8404661 ldr w1, [x19],#4
 4009b4: aa1603e0 mov x0, x22
 4009b8: 97ffff8e bl 4007f0 &lt;_ZNSolsEi@plt&gt;
 4009bc: aa1703e1 mov x1, x23
 4009c0: d2800022 mov x2, #0x1 // #1
 4009c4: 97ffffaf bl 400880 &lt;_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt&gt; // print new line
 4009c8: eb14027f cmp x19, x20
 4009cc: 54ffff21 b.ne 4009b0 &lt;main+0xe8&gt; // branch to label 4009b0 if not equal
 4009d0: d2800000 mov x0, #0x0 // #0
 4009d4: 910143a3 add x3, x29, #0x50
 4009d8: 8b000062 add x2, x3, x0
 4009dc: 913fc3a3 add x3, x29, #0xff0

 //================Second array output =========
 4009e0: 8b000061 add x1, x3, x0
 4009e4: 4c407841 ld1 {v1.4s}, [x2]
 4009e8: d283f202 mov x2, #0x1f90 // #8080
 4009ec: 4c407820 ld1 {v0.4s}, [x1]
 4009f0: 8b1d0042 add x2, x2, x29
 4009f4: 8b000041 add x1, x2, x0
 4009f8: 4ea08420 add v0.4s, v1.4s, v0.4s
 4009fc: 91004000 add x0, x0, #0x10
 400a00: 4c007820 st1 {v0.4s}, [x1]
 400a04: f13e801f cmp x0, #0xfa0
 400a08: 54fffe61 b.ne 4009d4 &lt;main+0x10c&gt; // branch to label 4009d4 if not equal
 400a0c: 91082315 add x21, x24, #0x208
 400a10: 90000001 adrp x1, 400000 &lt;_init-0x7b8&gt;
 400a14: aa1503e0 mov x0, x21
 400a18: 9134e021 add x1, x1, #0xd38
 400a1c: d283f213 mov x19, #0x1f90 // #8080
 400a20: d285e616 mov x22, #0x2f30 // #12080

 //================Sum array output =========
 400a24: 97ffff8b bl 400850 &lt;_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt&gt; // print msg
 400a28: 8b1d0273 add x19, x19, x29
 400a2c: 8b1d02d6 add x22, x22, x29
 400a30: 91346334 add x20, x25, #0xd18
 400a34: b8404661 ldr w1, [x19],#4
 400a38: aa1503e0 mov x0, x21
 400a3c: 97ffff6d bl 4007f0 &lt;_ZNSolsEi@plt&gt; // print sum elements
 400a40: aa1403e1 mov x1, x20
 400a44: d2800022 mov x2, #0x1 // #1
 400a48: 97ffff8e bl 400880 &lt;_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt&gt; // print new line
 400a4c: eb16027f cmp x19, x22
 400a50: 54ffff21 b.ne 400a34 &lt;main+0x16c&gt;
 400a54: a94153f3 ldp x19, x20, [sp,#16]
 400a58: a9425bf5 ldp x21, x22, [sp,#32]
 400a5c: a94363f7 ldp x23, x24, [sp,#48]
 400a60: f94023f9 ldr x25, [sp,#64]
 400a64: a8c57bfd ldp x29, x30, [sp],#80
 400a68: 52800000 mov w0, #0x0 // #0
 400a6c: 913b83ff add sp, sp, #0xee0
 400a70: 91400bff add sp, sp, #0x2, lsl #12
 400a74: d65f03c0 ret

3. Review the vector instructions for AArch64. Find a way to scale an array of sound samples (see Lab 5) by a factor between 0.000-1.000 using SIMD. (Note: you may need to convert some data types). You DO NOT need to code this solution (but feel free if you want to!).

SIMD(Single Instruction Multiple Data) extensions simplify development of application software by offering a single tool-chain and processing device, when compared to architectures with separate programmable DSPs or accelerators. The single tool-chain environment speeds time-to-market as software plays an increasingly important role in product development. The SIMD extensions are completely transparent to the operating system (OS), allowing existing OS ports to be used. New applications running on the OS can be written to explicitly use the SIMD extensions, providing an additional power/performance advantage. ( https://www.arm.com/products/processors/technologies/dsp-simd.php)

 

SPO600 Lab 5 Algorithm Selection Lab

Hi all, I was assigned to work in my lab for my SPO600 class. Here is the outline of the specifications:

For this lab, we are designing two different approaches to simulate the process of adjusting the volume of a sequence of sound samples in the C language.

What is the impact of various optimization levels on the software performance?

  • The optimization levels have different impact on the software performance
Comparison

 

Speed of compile Code size Execute speed Safety
-O0 fast average slow yes
-O1 medium medium medium yes
-O2 medium medium medium yes
-O3 slow big fast no

 

Does the distribution of data matter?

  • Yes, the distribution of data matter for the software.

If samples are fed at CD rate (44100 samples per second x 2 channels), can both algorithms keep up?

  • This many keep up but the time is it take to process is slow

What is the performance of each approach?

  • Xerxes and Betty have different performance profiles, so it’s not reasonable to compare performance between the machines, but it is reasonable to compare the relative performance of the two algorithms in each context. Do you get similar results?
  • On Betty with -O3 optimization level. We have the fastest measurement time on both approaches. Bit-shifting has a faster time than multiplication.
  • On Xerxes, both approaches had similar time.

SPO600 Lab4 Compiled C Lab

Hi all, I was assigned to work in my lab for my SPO600 class. Here is the outline of the specifications:

Create a C program to print “Hello Word” and compile with -g, -O0, -fno-builtin. Use objdump to analysis the object.

(1) Add the compiler option -static. Note and explain the change in size, section headers, and the function call.

With -static:
-rwxrwxr-x. 1 qdtran qdtran 682979 Feb 7 11:20 a.out
Without -static:
-rwxrwxr-x. 1 qdtran qdtran 8486 Jan 31 12:50 a.out
The size of the static file is much bigger than the original one. Because it imports the all the library that was in the program.

(2) Remove the compiler option -fno-builtin. Note and explain the change in the function call. When

When objdump the object we see that the function puts@plt is used instead of the printf@plt. The compiler tries to find differents function to increase the speed of the program.

(3) Remove the compiler option -g. Note and explain the change in size, section headers, and disassembly output. This will reduce the size of the file because we don’t enable the debugging option in the program.

This will reduce the size of the file because we don’t enable the debugging option in the program.

(4) Add additional arguments to the printf() function in your program. Note which register each argument is placed in. (Tip: Use sequential integer arguments after the first string argument. Go up to 10 arguments and note the pattern).

We note that there are more registers being use to store arguments for the program. The arguments are put on stack.

(5) Move the printf() call to a separate function named output(), and call that function from main(). Explain the changes in the object code.

Creating a function in the separate part and call it will be less efficient than coding them as in line format which will create high optimization and only gets called once.

(6) Remove -O0 and add -O3 to the gcc options. Note and explain the difference in the compiled code.

with -O0 the compiler has less debugging information and optimization. While The -O3 has more debugging information and optimization.

 

 

SPO600 Lab3 Assembler Lab

Hi all, I was assigned to work in my lab for my SPO600 class. Here is the outline of the specifications:

Blog about the programs you’ve written.

Describe the experience of writing and debugging in assembler, as compared to writing in other languages. Contrast x86_64 and aarch64 assembler, your experience with each, and your opinions of each.

Include links to the source code for both of your assembler programs.


Program:

This program is used to print a loop that can display a list of text and number.

If we compare this program to other programming languages such as Java, C or C++. This program is written in a much more low-level language. That means the computer can easy understand and execute the program in the fastest way comparing to high-level languages where is much more Human readable language.

x86_64:

The program written in x86_64 can be run on Xerxes. This assembler has a more organizable structure of code comparing to aarch64. Since the program is developed using msg and ASCII to print the message, the x86_64 has an easier reading syntax. for example, labeling with the loop, skip…etc. When moving data from the register using instructions “mov %99, %r10” the data is moved from right to left (put 99 into r10). The other useful feature of x86_64 is call jumping. For example, “jne loop” will take us to the label “loop:” that we were included in the program. this is useful when we have to divide the program into different section such as perform the looping when the value is less than some other value (loop_lt).

Aarch64

The program written in Aarch64 can be run on Betty. This assembler has a different way of organizing the structure of the code. When moving data from the register using instructions “mov r0,99” the data is moved from right to left (load r0 with value 99) is can cause confusion when we have to work with large an amount of code.

Conclusion:

The example of this lab is very interesting and useful for the new programmer who is interested in machine code. For me, at first, I have some difficulties with understanding the syntax and the structures of the code. But after working in the group and discussing with each other and spending some time to review the lecture. I have a better understanding of machine languages and how they are implemented to make the program run faster and more efficient.

Here is the link to our work:

Aarch64: https://www.dropbox.com/s/6r9toblje2n7ytx/lab3Bet.s?dl=0

x86_64: https://www.dropbox.com/s/yvpvpl21rxgnaau/lab3Xer.s?dl=0