Problem 3

The algorithm to compute a 1-D convolution with a 1x4 kernel is specified by the following Python code:

    data = [...]    # previously defined, 1024 elements
    kernel = [...]  # previously defined, 4 elements
    output = list(range(len(data) - len(kernel)))
    for i in range(len(data) - len(kernel)):
        output[i] = 0.
        for j in range(len(kernel)):
            output[i] += data[i+j] * kernel[j]

You are given a RISC-V vector processor with the following vector instructions:

    vld  v0, 0(t0)     # load a vector of floats from address t0 to vector reg v0
    vs   v1, 0(t1)     # store vector floats to address t1 from vector reg v1
    vmul v2, v1, f0    # multiply f0 by elements in vector v1, store result in v2
    vadd v3, v1, v2    # compute v3[i] = v1[i] + v2[i] for all i

Write an assembly program that utilizes the vector processing instructions to compute the convolution specified by the Python program. (Note: you may assume that the system is dynamically typed and preconfigured, and that the vector length register is configured to a length of 1020.)

Given a bus clock rate of f_bus = 1.2 GHz and a CPU clock rate of f_cpu = 3 GHz, what is the total run time of your program? Compared with the basic convolution algorithm given in lecture, which uses only scalar operations (no SIMD), what is the speedup?
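As a reference point, the convolution can be restated as a runnable Python sketch. It assumes the inner loop is meant to accumulate data[i+j] * kernel[j] into output[i]. The function names conv_scalar and conv_shifted and the sample data and kernel values are hypothetical, chosen only for illustration. The second function reorganizes the same arithmetic as one whole-vector pass per kernel tap, which is roughly the shape that load, multiply-by-scalar, and add vector instructions lend themselves to; it is not a solution in assembly and makes no timing claim.

```python
def conv_scalar(data, kernel):
    """Element-by-element form: one output value at a time."""
    n = len(data) - len(kernel)
    output = [0.0] * n
    for i in range(n):
        for j in range(len(kernel)):
            output[i] += data[i + j] * kernel[j]
    return output


def conv_shifted(data, kernel):
    """Same result, organized as one whole-vector pass per kernel tap."""
    n = len(data) - len(kernel)
    acc = [0.0] * n
    for j, k in enumerate(kernel):
        window = data[j:j + n]                       # roughly a vector load starting j elements in
        scaled = [k * x for x in window]             # roughly a vector-times-scalar multiply
        acc = [a + s for a, s in zip(acc, scaled)]   # roughly a vector add
    return acc


# Hypothetical sample inputs (the problem leaves data and kernel unspecified).
data = [float(i % 7) for i in range(1024)]
kernel = [0.25, 0.5, 0.125, 0.125]

assert conv_scalar(data, kernel) == conv_shifted(data, kernel)
print(len(conv_scalar(data, kernel)))  # prints 1020
```

With 1024 data elements and a 4-element kernel, both forms produce 1020 outputs, which matches the vector length stated in the note.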