[SOLVED] Interpolation Algorithm Speed - Help

Postby racerxdl » Fri Sep 26, 2014 8:13 pm

Hi all :D

I decided to try porting the tile interpolator from my HyperSignal project to the Epiphany cores. It is a pretty simple thing: it takes an 8x8 (usually) matrix and upscales it to 256x256 using either bilinear or bicosine interpolation.

Since this is an easy thing to parallelize (I can either parallelize across tiles, since there are a lot of them, or split a single tile), I wanted to give it a try on Epiphany.

After looking around for an easy way to get the Epiphany and the host exchanging information to start work items, I finally got it working.


Here is my Epiphany code:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#include "common.h"
#include "tools.h"
#include "e_lib.h"

#define BUFSTART (0x8f000000)

int main(void) {
  e_coreid_t coreid;
  float x2, y2;
  unsigned char val;

  char workid = 0;

  // Work items live in a buffer that the host can also see
  HSWork *ext_works = (void *) BUFSTART;
  HSWork work;

  // Copy this core's work item into local memory.
  // ext_works is an HSWork pointer, so index it instead of scaling by sizeof again.
  e_dma_copy(&work, &ext_works[workid], sizeof(HSWork));

  // Progress counters that the host polls
  int *curx = (int *)CURRENT_POS, *cury = (int *)(CURRENT_POS + 4);

  *curx = 0;
  *cury = 0;

  for(int y=0;y<OUTPUT_HEIGHT;y++)  {
    *cury = y;
    for(int x=0;x<OUTPUT_WIDTH;x++) {
      *curx = x;
      x2 = ( x / ((OUTPUT_WIDTH)  * (work.sx) ))   * (SAMPLE_WIDTH-1);
      y2 = ( y / ((OUTPUT_HEIGHT) * (work.sy) ))   * (SAMPLE_HEIGHT-1);
      x2 += work.x0;
      y2 += work.y0;
      val = bilinear(work.sample,x2,y2,SAMPLE_WIDTH);
      setval(work.output,x,y,OUTPUT_WIDTH,val);
    }
  }
  *cury = -1;
  *curx = -1;
  work.done = 1;
  work.error = 0;

  // Write the finished work item back for the host
  e_dma_copy(&ext_works[workid], &work, sizeof(HSWork));
  return EXIT_SUCCESS;
}


For reference, this is my common.h file with the HSWork struct:

Code:

#ifndef COMMON_H
#define COMMON_H

#define SAMPLE_WIDTH 8
#define SAMPLE_HEIGHT 8

#define OUTPUT_WIDTH 128
#define OUTPUT_HEIGHT 128

#define SAMPLE_SIZE (SAMPLE_WIDTH * SAMPLE_HEIGHT)
#define OUTPUT_SIZE (OUTPUT_WIDTH * OUTPUT_HEIGHT)

#define CURRENT_POS (0x2000)

#define NUM_WORKS 4
#define BUF_SIZE 32

#define ALIGN8 8

typedef struct __attribute__((aligned(ALIGN8))) {
   int workid;                            //   Work ID
   double x0, y0;                         //   Coordinates
   double sx, sy;                         //   ScaleX and ScaleY
   unsigned char sample[SAMPLE_SIZE];    //   Origin Matrix
   unsigned char output[OUTPUT_SIZE];    //   Output Matrix
   unsigned char done;                  //   Done Flag
   unsigned char error;                   //   Error Flag
   int coreid;                            //   Core ID
} HSWork;

#endif


And here are my tools functions:
Code:

static inline float min(float v1, float v2)   {
   return v1 > v2 ? v2 : v1;
}

static inline unsigned char val(const unsigned char *data, int x, int y, int mw)   {
   return data[y * mw + x];
}


static inline void setval(unsigned char *data, int x, int y, int mw, unsigned char value)   {
   data[y * mw + x] = value;
}
unsigned char bilinear(const unsigned char *data, float x, float y, int mw)   {   
   int rx = (int)(x);
   int ry = (int)(y);
   float fracX = x - rx;
   float fracY = y - ry;
   float invfracX = 1.f - fracX;
   float invfracY = 1.f - fracY;
   unsigned char a = val(data,rx,ry,mw);
   unsigned char b = val(data,rx+1.f,ry,mw);
   unsigned char c = val(data,rx,ry+1.f,mw);
   unsigned char d = val(data,rx+1.f,ry+1.f,mw);
   return ( a * invfracX + b * fracX) * invfracY + ( c * invfracX + d * fracX) * fracY;
}


OK, so things are working fine; the problem is the speed. This is a pretty simple algorithm and it should run fast, but the Epiphany is taking about 50 seconds to upscale 8x8 to 128x128. The ARM host does it in 650 ms in single-thread mode. I know there is something wrong, but I can't find where.
I also checked with e-objdump to make sure those functions were in Epiphany RAM:

Code:
linaro-nano:/media/LINUX_DATA/MathStudies/HyperSignal/Parallella> e-objdump -t Build/epiphany/e_test.elf | grep "bilinear\|val"
00000240 l       .text   000000ba _setval
00000774 l       .text   000000b2 _val
00000828 g       .text   0000054c _bilinear


I don't know what is taking so much time to finish the task. For testing, I read curx and cury from the host every second to see where it is in the array, so I'm sure it is the task itself that takes all those seconds to finish.

Thanks!

EDIT: Just for reference, I'm using only one core to test it.


EDIT2: OK, I reworked the task divider and put a time counter in it. It takes 54 seconds to do one core's task. I also changed the task size to 512x512 so it would use all 16 cores; it takes 54 seconds too, so my tasks are truly independent.

I just don't know why it is taking that long to do a task when the ARM side takes milliseconds.
Last edited by racerxdl on Sat Sep 27, 2014 4:48 am, edited 1 time in total.

Re: Interpolation Algorithm Speed - Help

Postby racerxdl » Sat Sep 27, 2014 4:07 am

Changing the bilinear function to static inline drops the time from 56 seconds to 22 seconds. Why is that? It's a BIG difference in speed o.O

EDIT: OK, I forgot one double mentioned in common.h; changed it to float, and now it's 8 seconds.

I think that's still a bit too high :/

EDIT: SOLVED FINALLY!

OK, so the biggest problem was that GCC was doing the operations in DOUBLE, not FLOAT, and that wrecks the performance. I just simplified the operations and added the typecasts, and now it runs instantaneously. I still need to measure the speed, but it's FAST.
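
A note on where the doubles come from: this is standard C behaviour rather than a GCC quirk. An unsuffixed constant such as 1.0 has type double, and any arithmetic that mixes a float or an int with a double (for example the double fields sx, sy, x0 and y0 in the first version of HSWork) is carried out in double. As far as I know the Epiphany FPU only handles single precision, so each of those operations falls back to slow software routines. The sketch below is only an illustration; the function and parameter names are made up. It uses sizeof, which inspects the type of an expression without evaluating it, to show which type each kind of expression gets.

```c
#include <stddef.h>

/* float * unsuffixed constant: the constant is a double, so the
   whole product is computed as a double. */
size_t size_float_times_unsuffixed(float f) { return sizeof(f * 1.0); }

/* float * f-suffixed constant: stays in float. */
size_t size_float_times_suffixed(float f) { return sizeof(f * 1.0f); }

/* Mirrors x / (OUTPUT_WIDTH * work.sx) with a double sx:
   int / (int * double) is a double division. */
size_t size_int_over_double(int i, double d) { return sizeof(i / (128 * d)); }

/* Mirrors sw_ow * x with a float scale factor: int * float is float. */
size_t size_int_times_float(int i, float f) { return sizeof(i * f); }
```

On a typical compiler the first and third functions return sizeof(double) and the other two return sizeof(float). GCC's -Wdouble-promotion warning can help track down the places where a float is silently promoted.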

The Epiphany code now:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#include "common.h"
#include "tools.h"
#include "e_lib.h"

#define BUFSTART (0x8f000000)

unsigned *vals = (unsigned *)CURRENT_POS; //  Array of four unsigned words: CURX, CURY, SX, SY - Last two written by HOST

int main(void) {
  e_coreid_t coreid = e_get_coreid();
  float x2, y2;
  unsigned char val;
  unsigned row, col;
  unsigned sx, sy;

  sx = vals[2];
  sy = vals[3];

  e_coords_from_coreid(coreid, &row, &col);


  char workid = row * sx + col;

  HSWork *ext_works = (void *) BUFSTART;
  HSWork work;

  e_dma_copy(&work, &ext_works[workid], sizeof(HSWork));

  vals[0] = 0;
  vals[1] = 0;

  const float sw_ow = ((SAMPLE_WIDTH-1) /  (float)OUTPUT_WIDTH ) / work.sx;
  const float sh_oh = ((SAMPLE_HEIGHT-1) / (float)OUTPUT_HEIGHT) /  work.sy;

  for(int y=0;y<OUTPUT_HEIGHT;y++)  {
    vals[1] = y;
    for(int x=0;x<OUTPUT_WIDTH;x++) {
      vals[0] = x;
      x2 = sw_ow * x;
      y2 = sh_oh * y;
      x2 += work.x0;
      y2 += work.y0;
      val = bilinear(work.sample,x2,y2,SAMPLE_WIDTH);
      setval(work.output,x,y,OUTPUT_WIDTH,val);
    }
  }
 
  vals[0] = -1;
  vals[1] = -1;
  work.done = 1;
  work.error = 0;

  e_dma_copy(&ext_works[workid], &work, sizeof(HSWork));
  return EXIT_SUCCESS;
}
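
The updated tools.h was not posted, so the following is only a sketch of what a float-only bilinear could look like, not the code actually used in the project. The name bilinear_f is hypothetical. It keeps the neighbour indices as ints (the first version passed rx+1.f, which converts int to float and back), casts every operand to float explicitly, and rewrites the blend as two horizontal lerps plus one vertical lerp, which gives the same weighting algebraically. It assumes the caller keeps x and y at least one sample away from the right and bottom edges, since it reads the neighbours at rx+1 and ry+1.

```c
/* Hypothetical float-only bilinear sample; an assumption for illustration,
   not the author's final tools.h.
   data: row-major 8-bit samples, mw: samples per row.
   Requires rx + 1 < mw and ry + 1 < number of rows. */
static inline unsigned char bilinear_f(const unsigned char *data,
                                       float x, float y, int mw) {
    int rx = (int)x;               /* integer cell coordinates */
    int ry = (int)y;
    float fx = x - (float)rx;      /* fractional position inside the cell */
    float fy = y - (float)ry;

    /* p points at the top-left neighbour of the 2x2 cell */
    const unsigned char *p = data + ry * mw + rx;

    /* Horizontal lerp on the top row, then on the bottom row */
    float top    = (float)p[0]  + ((float)p[1]      - (float)p[0])  * fx;
    float bottom = (float)p[mw] + ((float)p[mw + 1] - (float)p[mw]) * fx;

    /* Vertical lerp between the two rows; truncates like the original */
    return (unsigned char)(top + (bottom - top) * fy);
}
```

In the loop above it would be called the same way as before, for example bilinear_f(work.sample, x2, y2, SAMPLE_WIDTH).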