SPO600 Lab 6 Vectorization Lab

Hi all, I was assigned this lab, and here are the specifications.

1. Write a short program that creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.

Here is my lab6q1.cpp code:

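#include <iostream>
#include <stdio.h>
#include <stdlib.h>
using namespace std;

int main(){
    int first[1000];
    int second[1000];
    int sum[1000];
    int i;
    // fill both arrays with random values from 0 to 99
    cout<<"Please enter the value in 1st array\n";
    for(i = 0; i < 1000; i++) first[i] = rand() %100;
    cout<<"Please enter the value in 2nd array\n";
    for(i = 0; i < 1000; i++) second[i] = rand() %100;
    // print both input arrays
    cout<<"\n1st array values:\n";
    for(i = 0; i < 1000; i++) cout<<first[i]<<" ";
    cout<<"\n2nd array values:\n";
    for(i = 0; i < 1000; i++) cout<<second[i]<<" ";
    // element-wise sum: this is the loop the compiler can auto-vectorize
    for(i = 0; i < 1000; i++) sum[i] = first[i] + second[i];
    cout<<"\nSum of two arrays:\n";
    for(i = 0; i < 1000; i++) cout << sum[i]<<" ";
    return 0;
}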

2. Compile this program on aarchie in such a way that the code is auto-vectorized. Annotate the emitted code (i.e., obtain a disassembly via objdump -d and add comments to the instructions in <main> explaining what the code does).

I compiled the program on Betty (AArch64) with g++ -O3 lab6q1.cpp; the -O3 level turns on GCC's auto-vectorizer. The assembly code starts and ends in the same order as the C++ source. I used the “cout” calls as markers to separate the different parts of the disassembly so it is easier to read.

00000000004008c8 <main>:
 4008c8: d1400bff sub sp, sp, #0x2, lsl #12 // reserve stack for the three arrays: 0x2000 bytes ...
 4008cc: d13b83ff sub sp, sp, #0xee0 // ... plus 0xee0 = 12000 bytes (3 x 1000 x 4)
 4008d0: a9bb7bfd stp x29, x30, [sp,#-80]! // push frame pointer and link register, 80-byte frame
 4008d4: 910003fd mov x29, sp // x29 = frame pointer
 4008d8: a90363f7 stp x23, x24, [sp,#48] // save callee-saved registers
 4008dc: 90000001 adrp x1, 400000 <_init-0x7b8> // x1 = page holding the string constants
 4008e0: b0000098 adrp x24, 411000 <_DYNAMIC+0x80> // x24 = page holding cout
 4008e4: 91082300 add x0, x24, #0x208 // x0 = address of cout
 4008e8: 9132c021 add x1, x1, #0xcb0 // x1 = address of the first message
 4008ec: a90153f3 stp x19, x20, [sp,#16] // save callee-saved registers
 4008f0: a9025bf5 stp x21, x22, [sp,#32]
 4008f4: f90023f9 str x25, [sp,#64]

 //================First array input =========
 4008f8: 97ffffd6 bl 400850 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt> // print msg
 4008fc: 910143a0 add x0, x29, #0x50 // x0 = start of first[]
 400900: 913fc3b5 add x21, x29, #0xff0 // x21 = end of first[] (0x50 + 4000 bytes)
 400904: aa0003f3 mov x19, x0 // x19 = pointer to the current element
 400908: 52800c94 mov w20, #0x64 // #100
 40090c: 97ffffd9 bl 400870 <rand@plt> // w0 = rand()
 400910: 1ad40c01 sdiv w1, w0, w20 // w1 = w0 / 100
 400914: 1b148020 msub w0, w1, w20, w0 // w0 = w0 - w1 * 100, i.e. rand() % 100
 400918: b8004660 str w0, [x19],#4 // store into first[] and advance the pointer by 4 bytes
 40091c: eb15027f cmp x19, x21 // reached the end of first[]?
 400920: 54ffff61 b.ne 40090c <main+0x44> // no: branch back to 40090c
 400924: 90000001 adrp x1, 400000 <_init-0x7b8>
 400928: 91336021 add x1, x1, #0xcd8 // x1 = address of the second message
 40092c: 91082300 add x0, x24, #0x208 // x0 = address of cout

 //================Second array input =========
 400930: 97ffffc8 bl 400850 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt> // print msg
 400934: 913fc3a1 add x1, x29, #0xff0 // x1 = start of second[]
 400938: 913e8034 add x20, x1, #0xfa0 // x20 = end of second[] (start + 4000 bytes)
 40093c: aa0103f3 mov x19, x1 // x19 = pointer to the current element
 400940: 52800c96 mov w22, #0x64 // #100
 400944: 97ffffcb bl 400870 <rand@plt> // w0 = rand()
 400948: 1ad60c01 sdiv w1, w0, w22 // w1 = w0 / 100
 40094c: 1b168020 msub w0, w1, w22, w0 // w0 = rand() % 100
 400950: b8004660 str w0, [x19],#4 // store into second[] and advance the pointer by 4 bytes
 400954: eb14027f cmp x19, x20 // reached the end of second[]?
 400958: 54ffff61 b.ne 400944 <main+0x7c> // no: branch back to 400944
 40095c: 91082316 add x22, x24, #0x208 // x22 = address of cout, kept for the output loops
 400960: 90000001 adrp x1, 400000 <_init-0x7b8>
 400964: aa1603e0 mov x0, x22
 400968: 91340021 add x1, x1, #0xd00 // x1 = address of the next message
 40096c: 90000019 adrp x25, 400000 <_init-0x7b8> // x25 = page holding the " " separator

 //================First array output =========
 400970: 97ffffb8 bl 400850 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt> // print msg
 400974: 910143b3 add x19, x29, #0x50 // x19 = start of first[]
 400978: 91346337 add x23, x25, #0xd18 // x23 = address of the " " separator
 40097c: b8404661 ldr w1, [x19],#4 // w1 = next element of first[], advance the pointer
 400980: aa1603e0 mov x0, x22 // x0 = address of cout
 400984: 97ffff9b bl 4007f0 <_ZNSolsEi@plt> // print first array elements
 400988: aa1703e1 mov x1, x23
 40098c: d2800022 mov x2, #0x1 // #1, length of the separator
 400990: 97ffffbc bl 400880 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt> // print the " " separator
 400994: eb15027f cmp x19, x21 // reached the end of first[]?
 400998: 54ffff21 b.ne 40097c <main+0xb4> // no: branch back to 40097c
 40099c: 90000001 adrp x1, 400000 <_init-0x7b8>
 4009a0: aa1603e0 mov x0, x22
 4009a4: 91348021 add x1, x1, #0xd20 // x1 = address of the next message

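 //================Second array output =========
 4009a8: 97ffffaa bl 400850 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt> // print msg
 4009ac: 913fc3b3 add x19, x29, #0xff0 // x19 = start of second[]
 4009b0: b8404661 ldr w1, [x19],#4 // w1 = next element of second[], advance the pointer
 4009b4: aa1603e0 mov x0, x22 // x0 = address of cout
 4009b8: 97ffff8e bl 4007f0 <_ZNSolsEi@plt> // print second array elements
 4009bc: aa1703e1 mov x1, x23
 4009c0: d2800022 mov x2, #0x1 // #1, length of the separator
 4009c4: 97ffffaf bl 400880 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt> // print the " " separator
 4009c8: eb14027f cmp x19, x20 // reached the end of second[]?
 4009cc: 54ffff21 b.ne 4009b0 <main+0xe8> // no: branch back to 4009b0

 //================Sum of the two arrays (vectorized loop) =========
 4009d0: d2800000 mov x0, #0x0 // #0, x0 = byte offset into the arrays
 4009d4: 910143a3 add x3, x29, #0x50 // loop top: x3 = start of first[]
 4009d8: 8b000062 add x2, x3, x0 // x2 = address in first[] at this offset
 4009dc: 913fc3a3 add x3, x29, #0xff0 // x3 = start of second[]
 4009e0: 8b000061 add x1, x3, x0 // x1 = address in second[] at this offset
 4009e4: 4c407841 ld1 {v1.4s}, [x2] // load four ints from first[] into v1
 4009e8: d283f202 mov x2, #0x1f90 // #8080, offset of sum[] from x29
 4009ec: 4c407820 ld1 {v0.4s}, [x1] // load four ints from second[] into v0
 4009f0: 8b1d0042 add x2, x2, x29 // x2 = start of sum[]
 4009f4: 8b000041 add x1, x2, x0 // x1 = address in sum[] at this offset
 4009f8: 4ea08420 add v0.4s, v1.4s, v0.4s // vector add: four 32-bit sums in one instruction
 4009fc: 91004000 add x0, x0, #0x10 // advance the offset by 16 bytes (four ints)
 400a00: 4c007820 st1 {v0.4s}, [x1] // store the four results into sum[]
 400a04: f13e801f cmp x0, #0xfa0 // all 4000 bytes processed?
 400a08: 54fffe61 b.ne 4009d4 <main+0x10c> // no: branch back to 4009d4
 400a0c: 91082315 add x21, x24, #0x208 // x21 = address of cout
 400a10: 90000001 adrp x1, 400000 <_init-0x7b8>
 400a14: aa1503e0 mov x0, x21
 400a18: 9134e021 add x1, x1, #0xd38 // x1 = address of the next message
 400a1c: d283f213 mov x19, #0x1f90 // #8080, offset of the start of sum[]
 400a20: d285e616 mov x22, #0x2f30 // #12080, offset of the end of sum[]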

 //================Sum array output =========
 400a24: 97ffff8b bl 400850 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt> // print msg
 400a28: 8b1d0273 add x19, x19, x29 // x19 = start of sum[]
 400a2c: 8b1d02d6 add x22, x22, x29 // x22 = end of sum[]
 400a30: 91346334 add x20, x25, #0xd18 // x20 = address of the " " separator
 400a34: b8404661 ldr w1, [x19],#4 // w1 = next element of sum[], advance the pointer
 400a38: aa1503e0 mov x0, x21 // x0 = address of cout
 400a3c: 97ffff6d bl 4007f0 <_ZNSolsEi@plt> // print sum elements
 400a40: aa1403e1 mov x1, x20
 400a44: d2800022 mov x2, #0x1 // #1, length of the separator
 400a48: 97ffff8e bl 400880 <_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l@plt> // print the " " separator
 400a4c: eb16027f cmp x19, x22 // reached the end of sum[]?
 400a50: 54ffff21 b.ne 400a34 <main+0x16c> // no: branch back to 400a34
 400a54: a94153f3 ldp x19, x20, [sp,#16] // restore callee-saved registers
 400a58: a9425bf5 ldp x21, x22, [sp,#32]
 400a5c: a94363f7 ldp x23, x24, [sp,#48]
 400a60: f94023f9 ldr x25, [sp,#64]
 400a64: a8c57bfd ldp x29, x30, [sp],#80 // restore frame pointer and link register, pop the frame
 400a68: 52800000 mov w0, #0x0 // #0, return value of main
 400a6c: 913b83ff add sp, sp, #0xee0 // release the array storage
 400a70: 91400bff add sp, sp, #0x2, lsl #12
 400a74: d65f03c0 ret

3. Review the vector instructions for AArch64. Find a way to scale an array of sound samples (see Lab 5) by a factor between 0.000-1.000 using SIMD. (Note: you may need to convert some data types). You DO NOT need to code this solution (but feel free if you want to!).

SIMD (Single Instruction, Multiple Data) extensions simplify development of application software by offering a single tool-chain and processing device, when compared to architectures with separate programmable DSPs or accelerators. The single tool-chain environment speeds time-to-market as software plays an increasingly important role in product development. The SIMD extensions are completely transparent to the operating system (OS), allowing existing OS ports to be used. New applications running on the OS can be written to explicitly use the SIMD extensions, providing an additional power/performance advantage. (https://www.arm.com/products/processors/technologies/dsp-simd.php)


SPO600 Lab 5 Algorithm Selection Lab

Hi all, I was assigned this lab for my SPO600 class. Here is the outline of the specifications:

For this lab, we design two different approaches, written in C, to adjusting the volume of a sequence of sound samples.

What is the impact of various optimization levels on the software performance?

  • Each optimization level has a different impact on compile time, code size, and execution speed:

Level  Speed of compile  Code size  Execute speed  Safety
-O0    fast              average    slow           yes
-O1    medium            medium     medium         yes
-O2    medium            medium     medium         yes
-O3    slow              big        fast           no


Does the distribution of data matter?

  • Yes, the distribution of the data matters to how the software performs.

If samples are fed at CD rate (44100 samples per second x 2 channels), can both algorithms keep up?

  • CD rate works out to 88200 samples per second (44100 x 2). Both algorithms may keep up, but the time they take to process the samples is slow.

What is the performance of each approach?

  • Xerxes and Betty have different performance profiles, so it’s not reasonable to compare performance between the machines, but it is reasonable to compare the relative performance of the two algorithms in each context. Do you get similar results?
  • On Betty, the -O3 optimization level gave the fastest measured times for both approaches, and bit-shifting was faster than multiplication.
  • On Xerxes, both approaches took a similar amount of time.

SPO600 Lab 4 Compiled C Lab

Hi all, I was assigned this lab for my SPO600 class. Here is the outline of the specifications:

Create a C program to print “Hello World” and compile it with -g, -O0, and -fno-builtin. Use objdump to analyze the object file.

(1) Add the compiler option -static. Note and explain the change in size, section headers, and the function call.

With -static:
-rwxrwxr-x. 1 qdtran qdtran 682979 Feb 7 11:20 a.out
Without -static:
-rwxrwxr-x. 1 qdtran qdtran 8486 Jan 31 12:50 a.out
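The statically linked file is much bigger than the original one because the linker copies the library code the program needs into the executable instead of leaving it in a shared library that is loaded at run time. For the same reason, the call to printf no longer goes through a @plt stub.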

(2) Remove the compiler option -fno-builtin. Note and explain the change in the function call.

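When we objdump the object, we see that the function puts@plt is called instead of printf@plt. With builtins enabled, the compiler recognizes that this printf call only prints a constant string ending in a newline, with no format specifiers, so it substitutes the simpler puts, which does not have to parse a format string.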

(3) Remove the compiler option -g. Note and explain the change in size, section headers, and disassembly output.

Removing -g reduces the size of the file because the compiler no longer emits debugging information, so the .debug_* sections disappear from the section headers. The instructions in the disassembly stay the same.

(4) Add additional arguments to the printf() function in your program. Note which register each argument is placed in. (Tip: Use sequential integer arguments after the first string argument. Go up to 10 arguments and note the pattern).

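We note that each additional argument is placed in the next argument register. Once the argument registers run out, the remaining arguments are put on the stack. (On AArch64 the first eight integer arguments use x0-x7; on x86_64 the first six use rdi, rsi, rdx, rcx, r8, and r9.)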

(5) Move the printf() call to a separate function named output(), and call that function from main(). Explain the changes in the object code.

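Moving the printf() call into output() adds a second function to the object code: main() now contains a call to output(), and output() has its own entry and exit code around the printf() call. At -O0 the compiler does not inline output(), so this costs an extra call and return compared with calling printf() directly from main(), even though the function is only called once.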

(6) Remove -O0 and add -O3 to the gcc options. Note and explain the difference in the compiled code.

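With -O0 the compiler performs almost no optimization, so the compiled code follows the source statement by statement. With -O3 the compiler optimizes aggressively, so the code for main() is typically shorter and small functions such as output() may be inlined. The optimization level does not change the amount of debugging information; that is controlled by -g.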