VideoCore IV Programmers Manual

Official VideoCore IV 3D Architecture Reference Guide

Disclaimer:

This is a independent documentation project based on a combination of static
analysis and trial and error on real hardware.  This work is 100% independent
from and not sanctioned by or connected with Broadcom or its agents.
No Broadcom documents or materials were used beyond those publicly available.

This work was undertaken and the information provided for non commercial use on
the expectation that hobbyists of all ages will find the details useful for
understanding and working with their Raspberry Pi hardware.

The hope is that Broadcom will be flattered by the interest in the device and
understand the benefits of opening up understanding to a larger audience of
potential customers and developers.

Broadcom should be commended with making their SoC available for a project as
exciting as the Raspberry Pi.

No copyrighted materials are contained in this repository.  Diagrams from patents
have been included here but they are believed to not be covered by copyright.

See:
http://en.wikipedia.org/wiki/Copyright_on_the_content_of_patents_and_in_the_context_of_patent_prosecution).

Contributors:

Phire
Matt
Booto
Herman
David
thubble
mgottschlag
eizo-san
Sarah Ocean

Notes

NOTE: This is in a very early state and is of limited accuracy. This is an early release by request.
Mnemonics wont match official VideoCore IV due to this being an independent work. That said we have tried to match up names with the existing names in the Alphamosaic and Broadcom patents.
Not all instructions have been tested at this stage.
Mnemonics have been aligned vaguely with ARM, AVR given the expected audience (Raspberry Pi users). They are subject to change.
Vector instructions are a work in progress, and all information is very preliminary (even moreso than the rest of this document).
A good cross reference of instructions semantics is Mathias's emulation work at https://github.com/mgottschlag/vctools/blob/master/vcdb/_instructions.yaml and related files.

Programmer's Model

Data Formats

The scalar processor can load and store byte (b), half word (h or s16/u16) and word (u32) quantities to/from memory.

The scalar processor uses 2's complement encoding for arithmetic operations and IEEE 754 single precision floating point format (binary32) for floating point operations.

The vector (or multimedia) instructions can process up to 16 quantities at a time in a SIMD fashion. The vector unit has a 64x64 register bank - that can be addressed in horizontal or vertical fashion to select a horizontal or vertical slice of 16 adjacent values.

Processor Modes (provisional)

The processor has at least 3 states configured by bits 31 and 29 of the processor status register sr (r30):

Supervisor mode, active when bit 29 of sr is set.
Exception mode, active when bit 29 of sr is clear.
User mode, active when bit 31 of sr is set and bit 29 is set.

Memory Map

According to the BCM2835 ARM Peripherals manual, address ranges are:

  0x00000000 - 0x3fffffff  ; L1 and L2 cached
  0x40000000 - 0x7fffffff  ; L2 cache coherent (non allocating)
  0x80000000 - 0xbfffffff  ; L2 cached only
  0xc0000000 - 0xffffffff  ; Direct uncached

Responders are:

  0x00000000 - 0x1fffffff  ; SDRAM
  0x20000000 - 0x20007fff  ; Boot (OTP or) ROM.
  0x7e000000 - 0x7effffff  ; IO

L1 Instruction Cache

16Kb Icache, 32bytes per cache line. Involved in the branch prediction logic on the VPU.

L1 Data Cache

As far as I understand, the VPU does not cache a data cache. Register definitions seem to support this theory.

L2 Cache

128Kb 4 way set associative cache.

Scalar Register Configuration

General purpose 32-bit registers:

r0, r1, ..., r15 are used for both integer, logical and floating point operations.

Extended 32-bit registers:

r16, r17, ..., r24 are available for integer, logical and floating point operations in a range of instructions.
r24 (gp) the global pointer is typically used to refer to the base of the data section.

Special purpose 32-bit registers:

r25 (sp) the stack pointer.
r26 (lr) the link register for saving return address on execution of branch-and-link instructions.
r27
r28 (esp) the stack pointer for exception handlers. During an exception handler r25 refers to r28 and the original r25 is not accessible. In user mode r28 is not accessible and is mapped to r27.
r29 (tp) thread/core pointer. Typically holds a pointer to per thread/core data structure.
r30 (sr) the processor status register (UIS- ---- ---- ---- ---- ---- --cb ZNCV).
- V overflow flag (set by cmp, cmn, btest, and fcmp).
- C carry flag (set by cmp, cmn, btest, and fcmp).
- N negative flag (set by cmp, cmn, btest, and fcmp).
- Z zero flag (set by cmp, cmn, btest, and fcmp).
- CB Vector Register file column base/offset.
- S supervisor flag,
- I interrupt flag,
- U user flag.
r31 (pc) the program counter holding the current execution address.

Processor Control Registers (according to hardware_vc4.h in BRCM driver release):

p0 to p31, access via the mov p<d>, r<a> or mov r<d>, p<a> instructions.
p0 is PRFPXCS for scalar floating point exception control.
p1 is PRCANARY for stack canary (when enabled).
p2 to p9 are unassigned.
p10 [B0] is PRPOWCTL.
p11 [B0] is PRTIMCTL for Closely Coupled Timer Control (core and sleep timers).
p12 [B0] is PRCORTIM for Core Timer Result.
p13 is PRSLPTIM for Sleep Timer Result.
p14 is PROWCNT for Count of outstanding writes.
p15 is PRORCNT for Count of outstanding reads.
p16-p31 are single-bit mutexes, shared between the two VPUs.
p16 is PRSPINL used by spinlock, a lightweight mutex.
p17 is Used by vcos_quickslow_mutex on ThreadX in existing start.elf's.
p18 - p31 are unassigned.

Note: In contrast to many processors, only a small subset of instructions update the Condition Code flags. In particular cmp, cmn, btest, and fcmp set the flags. In contrast, addcmpb, ror, shl, lsr, asr do not set the flags.

The program counter isn't typically accessed directly, due to pipeline restrictions. Direct access using the standard instruction forms will typically raise an illegal instruction exception. Instead alternate ld/st/lea encodings exist to reference addresses relative to the pc.

Special instruction encodings for ld/st are availible for relative r0, r24, sp and pc addressing.

Recommended Calling Convention

Parameters in r0-r5
Return result in r0
Preserve registers r6+
Caller saves r0-r5 (if needed), Callee saves and restores r6+.

Vector Register Configuration

The Vector Register file is a 2d array of 64x64 8 bit integer cells or elements. A horizontal vector is composed of 16 horizontal adjacent cells, whilst a vertical vector is made of 16 vertically adjacent cells. The vectors are just "names" for collections of the underlying integer cells, so it should be understood that the same cells may be appear in multiple different different horizontal and vertical vectors.

Cells / Elements

A single cell P(y,x) is located at horizontal position x and vertical position y:

Vector Register File (as byte cells):
    P(0,0)  P(0,1) ... P(0,63)
    P(1,0)  P(1,1) ... P(1,63)
    ...     ...        ...
    P(63,0) P(63,1)... P(63,63)

To support 16 and 32 bit quantities, the underlying 8-bit cells can be combined:

The name P(y,x) denotes a single 8-bit cell at coordinates (y,x).
The name PX(y,x) denotes a 16-bit value at coordinates (y,x). It is composed of two 8-bit cells.
The name PY(y,x) denotes a 32-bit value at coordinates (y,x). It is composed of four 8-bit cells.

These values are mapped (in a 16-byte strided little endian manner) as follows:

P(y,x) an 8-bit cell, value: P(y,x)
PX(y,x) a 16-bit cell, value: P(y,x) | (P(y,x+16)<<8)
PY(y,x) a 32-bit cell, value: P(y,x) | (P(y,x+16)<<8) | (P(y,x+32)<<16) | (P(y,x+48)<<24)

As the 16 and 32 bit values come from multiple cells, there are limitations on their horizontal positions:

PX(y,x) is limited to 0 <= x <= 15, or 32 <= x <= 63
PY(y,x) is limited to 0 <= x <= 15

The array can be imagined as wrapping, so that P(y,x) should be seen as referring to P(y&63,x&63). This is important when it comes to considering vectors extending beyond the edge of the array (such as H(63,0)).

Another consideration for PX() is that whilst we talk about cell PX(y,32), strictly speaking it is encoded with x=16, but we write PX(y, 32) for simplifying visualizing the arrangement.

Vector Registers

Register addressing modes exist to reference horizontal, or vertical slices of 16 values to be operated on in parallel in a SIMD manner. The 16 values may be accessed as either 8, 16, or 32 quantities. Note that the vector unit operates on 16 input values, the only difference being the bit width of each value.

H(y,x)   Horizontal row of int8           [P(y,x), P(y,x+1), ..., P(y,x+15)]
V(y,x)   Vertical column int8             [P(y,x), P(y+1,x), ..., P(y+15,x)]

HX(y,x)  16-bit Horizontal row of int16   [PX(y,x), PX(y,x+1), ..., PX(y,x+15)]
VX(y,x)  16-bit Vertical column of int16  [PX(y,x), PX(y+1,x), ..., PX(y+15,x)]

HY(y,x)  32-bit Horizontal row of int32   [PY(y,x), PY(y,x+1), ..., PY(y, x+15)]
VY(y,x)  32-bit Vertical column if int32  [PY(y,x), PY(y+1,x), ..., PY(y+15,x)]

Possible alternate names for registers are:

H(y,x),  as HB(y,x), or H8(y,x)
HX(y,x), as HH(y,x), or H16(y,x)
HY(y,x), as HL(y,x), or H32(y,x)
V(y,x),  as VB(y,x), or V8(y,x)
VX(y,x), as VH(y,x), or V16(y,x)
VY(y,x), as VL(y,x), or V32(y,x)

Here B means byte (8 bits), H means halfword (16 bits), and L means longword (32 bits).

Vector Register Notation

When referring to vector registers in instructions we may use the convenience notation D, A, B or Rd, Ra, Rb, or R(yd, xd) etc to refer to a vector register in the Destination, Source A, or Source B position. eg:

vadd D, A, B
vadd Rd, Ra, Rb
vadd R(yd,xd), R(ya, xa), R(yb, xb)
vadd V(yd, xd), H(ya, xa), H(yb, xb)

Replicating Alias

When performing a read access to V(y,x)+0x1000, the vector to be read as:
 [P(y,x), P(y,x), ..., P(y,x)], rather than [P(y,x), P(y+1,x), ..., P(y+15,x)]

Likewise H(y,x)+0x1000, the vector is read as:
 [P(y,x), P(y,x), ..., P(y,x)], rather than [P(y,x), P(y,x+1), ..., P(y, x+15)]

This replicating alias applies to 16 and 32 bit quantities as well. It provides and efficient mechanism for loading all active lanes with the same value. The lower order 12 bits of of the vector position still contribute a valid offset, eg:

mov r0, ( (1<<12) | (2<<6) | (1<<0) )
vmov HX(0,0), HX(0, 16)+r0

results in HX(0,0) = [ PX(2,17), PX(2,17), ..., PX(2,17) ]

1Kb Lookup Table

For efficient implementation of palettized pixel formats or gamma conversions etc, the vector unit supports a 1024 byte lookup table. This allows a 16 element vector to lookup values in parallel. It has sufficient size for 4x 256 byte lookup tables for mapping 8 bit values to 8 bit, or 2x 512 byte tables for mapping 8 bit values to 16 bits, or 1x 1024 byte table for mapping 8 bit values to 32 bits.

Instructions

Instruction Formats

Instructions are encoded in 16, 32, 48 or 80 bits. The basic pattern is:

1111 1xxx xxxx xxxx y:32 z:32    vector80   
1111 0xxx xxxx xxxx y:32         vector48
1110 xxxx xxxx xxxx y:32         scalar48
1xxx xxxx xxxx xxxx y:16         scalar32
0xxx xxxx xxxx xxxx              scalar16

Instructions can be thought of as sequences of little endian 16 bit words. The byte order is thus:

vector80  1 0,  3 2,  5 4,  7 6,  9 8
vector48  1 0,  3 2,  5 4
scalar32  1 0,  3 2,
scalar16  1 0

The exception is the scalar48 instructions which can be thought of as a little endian 16 bit word, followed by a little endian 32 bit word:

scalar48  1 0,  5 4 3 2

Instruction Encodings

Condition Codes

Condition codes for conditional instruction execution are encoded as a 4-bit field:

Bits    Comment     Meaning        Name 

0000    Simple      Equal          eq
0001    Simple      Not Equal      ne
0010    Unsigned    C set, <       cs/lo    Note inverted sense of carry from ARM.
0011    Unsigned    C clear, >=    cc/hs    Note inverted sense of carry from ARM.
0100    Simple      N set          mi
0101    Simple      N clear        pl
0110    Simple      V set          vs
0111    Simple      V clear        vc
1000    Unsigned    >              hi 
1001    Unsigned    <=             ls
1010    Signed      >=             ge
1011    Signed      <              lt
1100    Signed      >              gt
1101    Signed      <=             le 
1110    Simple      Always 
1111    Simple      Never          f

Arithmetic and Logical Operations

Arithmetic and logical operations are encoded as a 4, 5, or 6 bit field depending on the instruction. The table below gives the full 6-bit operation codes; to map an instruction to an operation code, use the following:

4-bit  oooo  -> 0oooo0, ie. only 16 operations are available (mov, add, mul, and so forth)
5-bit ooooo  -> 0ooooo, ie. only the first 32 operations are available
6-bit oooooo -> oooooo, ie. all operations are available

The operations are encoded as follows:

000000    mov      rd, ra              rd = ra                               Move.
000001    cmn      rd, ra              ZNCV = cc(rd + ra)                    Compare values by addition.
000010    add      rd, ra              rd = rd + ra                          Add
000011    bic      rd, ra              rd = rd & ~ra                         Logical bit clear.
000100    mul      rd, ra              rd = rd * ra                          Multiply.
000101    eor      rd, ra              rd = rd ^ ra                          Logical Exclusive Or.
000110    sub      rd, ra              rd = rd - ra                          Subtract.
000111    and      rd, ra              rd = rd & ra                          Logical And.
001000    not      rd, ra              rd = ~ra                              Not / 1s Complement.
001001    ror      rd, ra              rd = rotate_right(rd, ra)             Rotate right.
001010    cmp      rd, ra              ZNCV = cc(rd - ra)                    Compare values.
001011    rsub     rd, ra              rd = ra - rd                          Reverse subtract.
001100    btest    rd, ra              Z = (rd & bit(ra)) == 0               Bit test.
001101    or       rd, ra              rd = rd or ra                         Logical or.
001110    bmask    rd, ra              rd = rd & mask(ra)                    Mask left most bits.
001111    max      rd, ra              rd = max(ra, rd)                      Maximum.
010000    bitset   rd, ra              rd = rd | bit(ra)                     Bit set.
010001    min      rd, ra              rd = min(ra, rd)                      Minimum.
010010    bitclear rd, ra              rd = rd & ~bit(ra)                    Bit clear.
010011    addscale rd, ra << 1         rd = rd + (ra << 1)                   Add scaled by 2.
010100    bitflip  rd, ra              rd = rd ^ bit(ra)                     Bit flip.
010101    addscale rd, ra << 2         rd = rd + (ra << 2)                   Add scaled by 4.
010110    addscale rd, ra << 3         rd = rd + (ra << 3)                   Add scaled by 8.
010111    addscale rd, ra << 4         rd = rd + (ra << 4)                   Add scaled by 16.
011000    signext  rd, ra              rd = sign_extend(rd & mask(ra))       Mask left most bits and sign extend.
011001    neg      rd, ra              rd = -ra                              Negate.
011010    lsr      rd, ra              rd = rd >> ra                         Logical shift right.
011011    msb      rd, ra              rd = msb(ra)                          Position of most significant 1 bit, (msb(0)==-1)
011100    shl      rd, ra              rd = rd << ra                         Logical shift left.
011101    brev     rd, ra              rd = reverse(rd) >> (32-ra)           Reverse bits and shift right.
011110    asr      rd, ra              rd = rd >> ra                         Arithmetic shift right.
011111    abs      rd, ra              rd = abs(ra)                          Absolute value.
100000    mulhd.ss rd, ra, rb          rd = (ra * rb) >> 32                  High half of 32x32 -> 64 bit product; ra, rb are both signed
100001    mulhd.su rd, ra, rb          rd = (ra * rb) >> 32                  High half of 32x32 -> 64 bit product; ra is signed, rb is unsigned
100010    mulhd.us rd, ra, rb          rd = (ra * rb) >> 32                  High half of 32x32 -> 64 bit product; rb is signed, ra is unsigned
100011    mulhd.uu rd, ra, rb          rd = (ra * rb) >> 32                  High half of 32x32 -> 64 bit product; ra, rb are both unsigned
100100    div.ss   rd, ra, rb          rd = (ra / rb)                        Divide; ra, rb are both signed
100101    div.su   rd, ra, rb          rd = (ra / rb)                        Divide; ra is signed, rb is unsigned
100110    div.us   rd, ra, rb          rd = (ra / rb)                        Divide; rb is signed, ra is unsigned
100111    div.uu   rd, ra, rb          rd = (ra / rb)                        Divide; ra, rb are both unsigned
101000    adds     rd, ra, rb          rd = saturate32(ra + rb)
101001    subs     rd, ra, rb          rd = saturate32(ra - rb)
101010    shls     rd, ra, rb          rd = saturate32(ra << rb)
101011    clipsh   rd, rb              rd = saturate16(rb)
101100    addscale rd, ra, rb << 5     rd = ra + (rb << 5)
101101    addscale rd, ra, rb << 6     rd = ra + (rb << 6)
101110    addscale rd, ra, rb << 7     rd = ra + (rb << 7)
101111    addscale rd, ra, rb << 8     rd = ra + (rb << 8)
110000    count    rd, rb              rd = count(rb)
110001    subscale rd, ra, rb << 1     rd = ra - (rb << 1)
110010    subscale rd, ra, rb << 2     rd = ra - (rb << 2)
110011    subscale rd, ra, rb << 3     rd = ra - (rb << 3)
110100    subscale rd, ra, rb << 4     rd = ra - (rb << 4)
110101    subscale rd, ra, rb << 5     rd = ra - (rb << 5)
110110    subscale rd, ra, rb << 6     rd = ra - (rb << 6)
110111    subscale rd, ra, rb << 7     rd = ra - (rb << 7)
111000    subscale rd, ra, rb << 8     rd = ra - (rb << 8)

Where:
  Unary operations ignore ra, ie '<op> rd, ra, rb' behaves as 'rd = <op> rb'
  Operations 57 (111001) to 63 (111111) are undefined and raise a 0x03 exception.
  All bit shifts counts are modulo 32 (ie. masked with &31).
  bit(x) =  1 << x
  mask(x) = bit(x)-1
  saturate16(x) = min(0x7fff, max(-0x8000, x))
  saturate32(x) = min(0x7fffffff, max(-0x80000000, x))
  count(x) = number of 1 bits in x.

Load/Store Operation Width

The width of memory operations is encoded as follows:

  ww is 00 for ld/st for u32 operations.
  ww as 01 for ldh/sth for u16 operations.
  ww as 10 for ldb/stb for u8 operations.
  ww as 11 for ldsh/ldsb for s16/s8 load operation.

NOTE: What would apparently decode as the nonsensical stsh is infact ldsb (load 8 bit signed).

Scalar 16 Bit Instruction Word

0000 0000 0000 0000    bkpt                Halt/Breakpoint

0000 0000 0000 0001    nop                 No operation

0000 0000 0000 0010    sleep               (provisional)
0000 0000 0000 0011    user                (provisional) Enter user mode.

0000 0000 0000 0100    ei                  Enable interrupts
0000 0000 0000 0101    di                  Disable interrupts

0000 0000 0000 0110    cbclr               Clear vector column base ie SR[5:4] = 0
0000 0000 0000 0111    cbadd1              Add 1 to vector column base ie SR[5:4]+=1
0000 0000 0000 1000    cbadd2              Add 2 to vector column base ie SR[5:4]+=2
0000 0000 0000 1001    cbadd3              Add 3 to vector column base ie SR[5:4]+=3

0000 0000 0000 1010    rti                 Return from interrupt, ie. sr = *sp++; pc = *sp++;

0000 0000 001d dddd    swi rd              Software Interrupt.  Raise interrupt 0x20+(rd&0x1f)

0000 0000 010d dddd    b rd                Branch to address in rd, ie. pc = rd.
0000 0000 011d dddd    bl rd               Save pc to lr and branch to target, ie. pc = rd.

0000 0000 1000 dddd    switch.b rd         Table branch byte.  pc = pc + 2*(*(s8 *)(pc+rd))
0000 0000 1010 dddd    switch rd           Table branch short. pc = pc + 2*(*(s16 *)(pc+rd))    

0000 0000 111d dddd    version rd          Get cpu version and core number.

0000 0001 11uu uuuu    swi u               Software Interrupt.  Raise interrupt 0x20+u&0x1f

Memory Operations:

0000 0010 0bbm mmmm    ldm rb-rm,(sp++)    Load registers from stack (highest first).
0000 0010 1bbm mmmm    stm rb-rm,(--sp)    Store registers to stack (lowest first).
0000 0011 0bbm mmmm    ldm rb-rm,pc,(sp++) Load registers from stack and final value into pc.
0000 0011 1bbm mmmm    stm rb-rm,lr,(--sp) Store lr followed by registers onto stack.

                                           where: 
                                           - rb is r0, r6, r16, or r24 for bb == 00, 01, 10, 11.
                                           - rm = (rb+m)&31

                                           If sp is stored, then the value after the store is stored.
                                           If mmmmm is 31 and pc/lr are stored/loaded, then no register
                                           but pc/lr is stored/loaded ("stm lr/ldm pc"). The same
                                           applies at least to "stm r24-r7, lr, (--sp)".

0000 010o oooo dddd    ld   rd, (sp+o*4)   Load from memory relative to stack pointer.
0000 011o oooo dddd    st   rd, (sp+o*4)   Store to memory relative to stack pointer.

0000 1ww0 ssss dddd    ld<w>   rd, (rs)    Load from memory.
0000 1ww1 ssss dddd    st<w>   rd, (rs)    Store to memory.

0001 0ooo oood dddd    add  rd, sp, o*4    rd = sp + o*4

0001 1ccc cooo oooo    b<cc>  $+o*2        Branch on condition to target.

0010 uuuu ssss dddd    ld  rd, (rs+u*4)    rd = *(rs + u*4)
0011 uuuu ssss dddd    st  rd, (rs+u*4)    *(rs + u*4) = rd

Arithmetic and Logical Operations:

010 op:5 s:4 d:4     Rd = Rd op Ra         Register with Register logical and arithmetic operations.
011 op:4 u:5 d:4     Rd = Rd op u          Register with Immediate logical and arithmetic operations.

Scalar 32 Bit Instruction Word

Add an increment then branch conditionally based on a comparison:

1000 cccc aaaa dddd 00ss ssoo oooo oooo    addcmpb<cc> rd, ra, rs, $+o*2   rd += ra; if (rd <cc> rs) branch
1000 cccc iiii dddd 01ss ssoo oooo oooo    addcmpb<cc> rd, i, rs, $+o*2    rd += i;  if (rd <cc> rs) branch
1000 cccc aaaa dddd 10uu uuuu oooo oooo    addcmpb<cc> rd, ra, u, $+o*2    rd += ra; if (rd <cc> u) branch
1000 cccc iiii dddd 11uu uuuu oooo oooo    addcmpb<cc> rd, i, u, $+o*2     rd += i;  if (rd <cc> u) branch

Long conditional branch and branch and link:

1001 cccc 0ooo oooo oooo oooo oooo oooo    b<cc> $+o*2                     if (<cc>) branch to target.
1001 oooo 1ooo oooo oooo oooo oooo oooo    bl $+o*2                        Save pc to lr and branch to target.

Memory access:

# Conditional with indexed
1010 0000 ww0 d:5 a:5 c:4 00 b:5           ld<w>.<cc> rd, (ra + rb << w)   if (<cc>) rd = *(ra+rb*width)
1010 0000 ww1 d:5 a:5 c:4 00 b:5           st<w>.<cc> rd, (ra + rb << w)   if (<cc>) *(ra+rb*width) = d

# 12 bit displacement
1010 001o ww0 d:5 a:5 o:11                 ld<w> rd, (ra+o)                rd = *(ra+o)
1010 001o ww1 d:5 a:5 o:11                 st<w> rd, (ra+o)                *(ra+o) = rd

# Conditional with pre decrement
1010 0100 ww0 d:5 a:5 c:4 00 0 0000        ld<w>.<cc> rd, (--ra)           if (<cc>) rd = *--ra
1010 0100 ww1 d:5 a:5 c:4 00 0 0000        st<w>.<cc> rd, (--ra)           if (<cc>) *--ra = rd

# Conditional with post increment
1010 0101 ww0 d:5 a:5 c:4 00 0 0000        ld<w>.<cc> rd, (ra++)           if (<cc>) rd = *ra++
1010 0101 ww1 d:5 a:5 c:4 00 0 0000        st<w>.<cc> rd, (ra++)           if (<cc>) *ra++ = rd

# 16 bit displacement
1010 1000 ww 0 d:5 o:16                    ld<w> rd, (r24+o)              rd = *(r24+o)
1010 1000 ww 1 d:5 o:16                    st<w> rd, (r24+o)              *(r24+o) = rd
1010 1001 ww 0 d:5 o:16                    ld<w> rd, (sp+o)               rd = *(sp+o)
1010 1001 ww 1 d:5 o:16                    st<w> rd, (sp+o)               *(sp+o) = rd
1010 1010 ww 0 d:5 o:16                    ld<w> rd, (pc+o)               rd = *(pc+o)
1010 1010 ww 1 d:5 o:16                    st<w> rd, (pc+o)               *(pc+o) = rd
1010 1011 ww 0 d:5 o:16                    ld<w> rd, (r0+o)               rd = *(r0+o)
1010 1011 ww 1 d:5 o:16                    st<w> rd, (r0+o)               *(r0+o) = rd

Arithmetic and Logical Operations

1011 00 op:5 d:5 i:16                      <op> rd, i                     rd = rd op i
1011 01 s:5 d:5 i:16                       add rd, rs, i                  rd = rs + i
1011 1111 111 d:5 o:16                     add rd, pc, o                  rd = pc + o

Conditional Arithmetic and Logical Operations:

1100 0 op:6 d:5 a:5 c:4 00 b:5             <op>.<cc> rd, ra, rb           if (<cc>) rd = rd op ra
1100 0 op:6 d:5 a:5 c:4 1 i:6              <op>.<cc> rd, ra, i            if (<cc>) rd = ra op i

where 'op' is taken from the arithmetic operations table above and 'cc' from the condition code table. 'i' is treated as either signed or unsigned, depending on the operation.

Floating Point Operations:

1100 100 fop:4 d:5 a:5 c:4 00 b:5          <fop>.<cc> rd, ra, rb          if (<cc>) rd = ra fop rb
1100 100 fop:4 d:5 a:5 c:4 1 i:6           <fop>.<cc> rd, ra, i           if (<cc>) rd = ra fop float6(i)
Where:
  fop is an operation according to the 4 bit floating point operation table:
    fadd, fsub, fmul, fdiv, fcmp, fabs, frsub, fmax, 
    frcp, frsqrt, fnmul, fmin, fceil, ffloor, flog2, fexp2

  cc is a condition code test (see condition code table);
    eq,ne ,cs/lo, cc/hs, mi, pl, vs, vc, hi, ls, ge, lt, gt, le 

  float6(x) converts a 6 bit binary number to float (1 sign bit, 3 bit exponent, 2 bit mantissa)
      float float6(imm) {
        uint32_t b = 0;
        if (imm & 0x20) {
            b |= 0x80000000;
        }
        int exponent = (imm >> 2) & 0x7;
        if (exponent != 0) {
            b |= (exponent + 124) << 23;
            int mantissa = imm & 0x3;
            b |= mantissa << 21;
        }
        return *(float *)&b;
      }

Floating Point Integer Conversion:

1100 101 0 000 d:5 a:5 c:4 00 b:5          ftrunc.<cc> rd, ra, sasl rb    if (<cc>) rd = (int)(ra*pow(2,rb))
1100 101 0 000 d:5 a:5 c:4 1 i:6           ftrunc.<cc> rd, ra, sasl i     if (<cc>) rd = (int)(ra*pow(2,i))
1100 101 0 001 d:5 a:5 c:4 00 b:5          floor.<cc> rd, ra, sasl rb     if (<cc>) rd = (int)floor(ra*pow(2,rb))
1100 101 0 001 d:5 a:5 c:4 1 i:6           floor.<cc> rd, ra, sasl i      if (<cc>) rd = (int)floor(ra*pow(2,i))
1100 101 0 010 d:5 a:5 c:4 00 b:5          flts.<cc> rd, ra, sasr rb      if (<cc>) rd = ((float)(int)ra)/pow(2,rb)
1100 101 0 010 d:5 a:5 c:4 1 i:6           flts.<cc> rd, ra, sasr i       if (<cc>) rd = ((float)(int)ra)/pow(2,i)
1100 101 0 011 d:5 a:5 c:4 00 b:5          fltu.<cc> rd, ra, sasr rb      if (<cc>) rd = ((float)(uint)ra)/pow(2,rb)
1100 101 0 011 d:5 a:5 c:4 1 i:6           fltu.<cc> rd, ra, sasr i       if (<cc>) rd = ((float)(uint)ra)/pow(2,i)

Where:
  sasl is an arithmetic shift left by a signed amount (ie negative shifts, shift to the right).
  sasr is an arithmetic shift right by a signed amount (ie negative shifts, shift to the left).

Processor Control Register Access:

1100 1100 000 d:5 0000 0000 000 a:5        mov pd, ra                     pd = ra
1100 1100 001 d:5 0000 0000 000 a:5        mov rd, pd                     rd = pa

Scalar 48 Bit Instruction Word

Memory access:

1110 0000 0000 0000 u:32                   j u                                Jump to target.
1110 0001 0000 0000 o:32                   b $+o                              Branch to target.
1110 0010 0000 0000 u:32                   jl u                               Save pc to lr and jump to target.
1110 0011 0000 0000 o:32                   bl $+o                             Save pc to lr and branch to target.

1110 0101 000 d:5 o:32                     add rd, pc, o                      rd = pc + o

1110 0110 ww 0 d:5 s:5 o:27                ld<w>  rd, (rs+o)                  Load from address rs+o to rd+o
1110 0110 ww 1 d:5 s:5 o:27                st<w>  rd, (rs+o)                  Store from rd to address rs+o

1110 0111 ww 0 d:5 11111 o:27              ld<w>  rd, (pc+o)                  Load pc relative.
1110 0111 ww 1 d:5 11111 o:27              st<w>  rd, (pc+o)                  Store pc relative.

Arithmetic and Logical operations with 32 bit immediate:

1110 10 p:5 d:5 u:32                       <op> rd, u                         rd = rd op u
1110 11 s:5 d:5 u:32                       add rd, rs, u                      rd = rs + u

Reminder: the scalar48 instructions are best thought of as a little endian 16 bit word, followed by a little endian 32 bit word. This is in contrast to all other instruction encodings which are best thought of as a series of little endian 16 bit words.

Vector Instructions

There are two major groups of vector instructions - Vector Memory operations and Vector Data operations.

Vector memory operations allow the transfer of register contents to and from memory or the 1 Kb lookup table.

Vector data operations allow computation between two 16 element vectors writing the result back to a third 16 element vector. Vector data operations may also perform scalar register update (SRU), accumulation (ACC), repeated execution (REP2, ..., REP r0) and lane selection based on condition codes.

Vector instructions are encoded in a full 80-bit format or a compact 48-bit format. Instructions in the compact format support fewer options, but generate less icache pressure.

Vector Instruction Execution Model

Refer to the diagram below which has been taken from Patent application EP1320029. Vector Instruction Execution

Simplified View of Vector Instruction Execution

Two 16 element long vectors A and B are fetched from the vector register file.
The Pixel Processing Units ( PPUs ) execute the same operation in parallel ( SIMD ) on the 16 element pairs. D[i] = A[i] op B[i] for i=0...15. Despite the name they may be used for a variety of purposes in addition to graphics.
The result is written back to the register file.

Optional: Instruction Repetition

Instructions can be repeated 2, 4, 8, 16, 32, 64 or a conditional number of times. A single instruction repeated 64 times (using REP 64 ) would be working on a total of 64x16, or 1024 elements.

Optional: Predication / Conditions

Each Pixel Processing Unit has a set of 3 condition flags, Z, N, C which can be updated by executing an instruction (using the SETF flag). In subsequent instructions Pixel Processing Units may be directed to disable themselves based on the state of the condition flags. eg. The IFNZ modifier would mean only enable PPUs with a zero flag that is clear.

Optional: Accumulation

After calculation the PPUs may accumulate results over a number of instructions using an accumulator or ACC located in each PPU. Result values may either be added or subtracted to the accumulator, and the accumulated value may be used to replace the result value for write back to the vector register file. Vector Instruction Execution

Optional: Scalar Result Unit

The scalar result unit is used to optionally update a scalar register (r0...r7) with a value computed from the vector result. For instance SUM, IMIN, IMAX, MAX modifiers may be used to calculate the sum, or index of minimum element, or index of maximum element, or the maximum value from the 16 vector result values.

Example Vector Instructions

Quoting from US Patent Application 20030159023 ( http://www.google.com/patents/US20030159023 ):

EXAMPLE 1

vadd H(0,0), H(0,0), #1

16 horizontally contiguous 8-bit pixels (“H”) are read from the top left (0,0) of the VRF. The PPUs add the immediate value 1 to each and the results are written back to the same location (only the least significant 8 bits are returned because the destination register is an 8-bit register). No flags are changed, and no accumulation occurs.

EXAMPLE 2

vadd −, H(0,0), #0 SETF vadd H(0,0), H(0,0), #1 IFN

These instructions process sixteen horizontally contiguous 8-bit values (“H”) located at (0,0) in the VRF. The first instruction sets the PPU flags according to the values there (but does not need to write the values back anywhere—hence “−” as the destination); the second instruction adds one only to the negative values (“IFN”), leaving the others alone.

EXAMPLE 3

vadd H(0++, 16), H(0++, 0), #0 REP 16

The 16×16 block of 8-bit pixels at (0,0) in the VRF is copied to (0,16) (adding zero is being used here just to copy registers). The H indicates that we read and write 16 horizontally contiguous pixels from the VRF. Upon each of the 16 repetitions (“REP 16”), the x-component of the source and destination are each incremented by 1 (“++”), that is 1 byte.

NOTE: This looks like a typo, I think it should say "...the y-component of the source and destination [position] are each increment by 1 ("++"), ...

EXAMPLE 4

vdist −, H(0++,0), H(0++,16) REP 16 CLRA ACC SUM r0

Computes the distance (absolute difference) between two 16×16 pixel blocks. Each repetition computes the 16 distances between two sets of 16 values and each PPU adds the answer to its own accumulator (“ACC”), which also means these accumulated values become the PPU outputs. Finally we sum these 16 values and write the result, now the total distance between the two blocks, into scalar register r0 (“SUM r0”). The CLRA modifier clears the accumulators on the initial repetition, and is the only modifier to behave differently on subsequent repetitions (on which it does nothing).

NOTE: "CLRA ACC" and "SUM r0" seem mutually exclusive given the bit encodings below. It may be the ability to perform both has been removed since the patent was written.

EXAMPLE 5

vsub −, H(0++,0)+r0, r1 REP 16

A constant value taken from register r1 (in the SRF) is subtracted from every pixel in an arbitrary 16×16 block. The scalar register r0 (in the SRF) supplies the address of the top left corner of the block to be processed. When used in this way, the 6 least significant bits of r0 indicate the x-offset of the block, and the next 6 bits indicate the y-offset.

Putting it All Together

This is much easier to understand than it first appears...

Values Baddr, Aaddr and Daddr are determined from the instruction and the contents of referenced scalar registers. Addressing modes include:

R(y, x), absolute vector register reference. Also R(y++, x), and R(y, x++).
R(y, x) + rs, relative vector register reference where an offset from an absolute position is encoded as a 6:6 bit number in rd. Also R(y++,x)+rs, and R(y,x++)+rs.
R(0, 0) + rs, absolute vector register reference in scalar register, encoded as 6:6 as above. Also R(0++,0)+rs, and R(0,0++)+rs.
rb, a 32 bit value from a scalar register (only available for vector operand B).
rb+#s9, a 32 bit value calculated from the scalar register and constant (only available for vector operand B).
#s16, a 16 bit immediate value (only available for vector operand B).
- , discard value (only available for vector operand D).

For vector load instructions, MEM is fetched from the 32 bit address Baddr: MEM = memory[ Baddr ].
Vectors A and B are taken from the vector register file: A = VRF[ Aaddr ], B = VRF[ Baddr ].
For vector store instructions, A is written to memory: memory[ Baddr ] = A.
OP2 is selected as either B or Baddr : OP2 = choose(B, { Baddr, Baddr, ..., Baddr}).
The pixel processing units (PPU) perform their calculations: D = vop(A, OP2, MEM)
If this is the first cycle of a repeated instruction the accumulator ACC is optionally reset to 0.
Accumulation occurs as specified by the instruction, one of:

No accumulation.
Accumulation: tmp = ACC + D, optional: D = tmp, optional: ACC = tmp.
Subtraction: tmp = ACC - D, optional: D = tmp, optional: ACC = tmp.

Optional: Vector result write back: VRF[ Daddr ] = D.
Optional: Scalar Result Unit: rs = SRU(D), where SRU maps a vector to a scalar value, eg. sum of values: SUM(), index of minimum value: IMIN(), index of maximum value: IMAX().
If the instruction has the SETF bit set, then the conditional codes of each PPU is updated.
Increment Aaddr, Baddr, Daddr according to the instruction specification: this is often an optional step to the adjacent vector position below (horizontal vectors), or to the right (vertical vectors). When Baddr is used as a memory address, it is typically a memory stride to step to the next storage position (a value from a scalar register): eg. (rb+#offset+=rs) would mean start at rb+offset and add rs for each instruction repetition.
If the repeat count or condition is not yet exhausted then goto 2.

Note the expression D = vop(A, OP2, MEM), means the evaluation of a 16 element wide vector expression. In the case of conditional execution: IFZ, IFNZ, IFN, IFNN, IFC, IFNC, then individual pixel processing units may be disabled based on the state of each units condition flags (zero, negative, carry).

Argument format

Vector instruction arguments are normally 10 bits long. The first 4 bits describe the nature of the argument as well as the 2 high order bits of X coordinate; the last 6 bits give the exact Y coordinate (row) in the case of horizontal vectors, or the 2 bits of high order Y coordinate and 4 bits of low order X coordinate (column) in the case of vertical vectors:

0 x5 x4 v y5 y4 xy3 xy2 xy1 xy0

Horizontal:
0000 y5 y4 y3 y2 y1 y0     H(y, 0)
0010 y5 y4 y3 y2 y1 y0     H(y, 16)
0100 y5 y4 y3 y2 y1 y0     H(y, 32)
0110 y5 y4 y3 y2 y1 y0     H(y, 48)
1000 y5 y4 y3 y2 y1 y0     HX(y, 0)
1010 y5 y4 y3 y2 y1 y0     HX(y, 32)
1100 y5 y4 y3 y2 y1 y0     HY(y, 0)

Vertical:
0001 y5 y4 x3 x2 x1 x0     V(y, 0+x)
0011 y5 y4 x3 x2 x1 x0     V(y, 16+x)
0101 y5 y4 x3 x2 x1 x0     V(y, 32+x)
0111 y5 y4 x3 x2 x1 x0     V(y, 48+x)
1001 y5 y4 x3 x2 x1 x0     VX(y, 0+X)
1011 y5 y4 x3 x2 x1 x0     VX(y, 32+x)
1101 y5 y4 x3 x2 x1 x0     VY(y, 0+x)

Other:
1110      For destination discard result, as first source ignore value, 
          as second source use coordinate as scalar value.
1111

Vector 48 Bit Instruction Word

The limitations on the 48 bit instructions are:

vector references are either all horizontal or all vertical, ie the direction of the A and B vectors matches the destination D.
Accumulation (ACC) is not possible,
Scalar Register Update (SRU) is not possible,
Vector coordinates are limited.

Vector Memory operations

Operations supported: vld (vector load), vst (vector store), vlookup (vector gather), vindexwrite (vector gather), vreadlut, vwritelut.

Memory Instructions variants (48 bit) are:

v<mop><width> H(yd,xd)[+rs], H(ya,xa)[+rs], (rb) [SETF] 
v<mop><width> V(yd,xd)[+rs], V(ya,xa)[+rs], (rb) [SETF]
v<mop><width> H(yd,xd)[+rs], H(ya,xa)[+rs], H(yb,xb)[+rs] 
v<mop><width> V(yd,xd)[+rs], V(ya,xa)[+rs], H(yb,xb)[+rs]
v<mop><width> H(yd,xd)[+rs], H(ya,xa)[+rs], #immediate [SETF] [IFZ|IFNZ|IFN|IFNN|IFC|IFNC]
v<mop><width> V(yd,xd)[+rs], V(ya,xa)[+rs], #immediate [SETF] [IFZ|IFNZ|IFN|IFNN|IFC|IFNC]

Encoded as:

1111 00 mop:5 width:2 rs:3 d:10 a:10 z0 111 F:1 rb:6   v<mop><width> Rd[+rs], Ra[+rs], (rb) [SETF]
1111 00 mop:5 width:2 rs:3 d:10 a:10 z0 b:10           v<mop><width> Rd[+rs], Ra[+rs], Rb[+rs]
1111 00 mop:5 width:2 rs:3 d:10 a:10 z1 P:3 F:1 i:6    v<mop><width> Rd[+rs], Ra[+rs], #i [SETF] [IFZ|IFNZ|IFN|IFNN|IFC|IFNC]

Where is B (8), H (16), or L (32), and determines the size of transfers to memory.

  width:2 is 00 to read/write vector elements as bytes.
  width:2 is 01 to read/write vector elements as 16 bits.
  width:2 is 10 to read/write vector elements as 32 bits.
  width:2 is 11 acts as 00, but is probably reserved for future use.

What would in an 80 bit instruction be the direction bits of A and B are used to choose if a single scalar register should be added to the location ie vadd H(0,0), H(1,0)+rs, H(2,0)+rs;
For D, bit 11 of the instruction decides if should be added. eg. vadd H(0,0)+rs, H(1,0), H(2,0)
When B is an immediate value, bits P:3 are used to select a lanes via flag predication.
When B is an address from a scalar register value, or an immediate value, bit F:1 is used to update ALU flags.

Seven types of memory operation have been determined:

Load/Store/Scatter/Gather:

00000  ld            Load vector from memory.  D[i] = *(rb+i*width)
00001  lookupmh      Gather values.            D[i] = *(rb+ACCH[i]*width)
00010  lookupml      Gather values.            D[i] = *(rb+ACC[i]*width)

00100  st            Store vector to memory.   *(rb+i*width) = A[i]
00101  indexwritemh  Scatter values.           *(rb+ACCH[i]*width) = A[i].
00110  indexwriteml  Scatter values.           *(rb+ACC[i]*width) = A[i].

Lookup Table - A 1024 byte lookup-table for fast lookups.

01000  readlut       Lookup values in LUT.     D[i] = lut(B[i]*width). eg. readlut V(0,0), -, V(16,0)
01001  writelut      Write values to LUT.      lut(B[i]*width) = A[i]. eg. writelut 0, V(0,0), (r0)

Accumulator Access:

11000  readacc       Read from accumulator.    D[i] = ACC[i]>>> B[i]

Width:
  ld, lookup (gather), st, indexwrite (scatter), readlut, writelut,  
  all support 8, 16, 32 bit operations (00, 01, 10).  Width (11) seems to be a no-op.

For readacc, the width field behaves differently.
  00  reads acc without saturation
  01  reads acc saturating to 32 bits
  10  no-op
  11  reads acc saturating to 16 bits

Note that load/store, like all vector operations, always operate on 16 values. So loading 32-bit values will load 64 bytes from memory at once.

Vector data operations

Data Instructions (48 bit) have a 6-bit opcode, and have the following variants:

v<op><X> H(yd,xd)[+rs], H(ya,xa)[+rs], (rb) [SETF] 
v<op><X> V(yd,xd)[+rs], V(ya,xa)[+rs], (rb) [SETF]
v<op><X> H(yd,xd)[+rs], H(ya,xa)[+rs], H(yb,xb)[+rs] 
v<op><X> V(yd,xd)[+rs], V(ya,xa)[+rs], H(yb,xb)[+rs]
v<op><X> H(yd,xd)[+rs], H(ya,xa)[+rs], #immediate  [SETF] [IFZ|IFNZ|IFN|IFNN|IFC|IFNC]
v<op><X> V(yd,xd)[+rs], V(ya,xa)[+rs], #immediate  [SETF] [IFZ|IFNZ|IFN|IFNN|IFC|IFNC]

Encoded as:

1111 01 X vop:6 rs:3 d:10 a:10 z0 111 F rb:6   v<op><X> Rd[+rs], Ra[+rs], (rb) [SETF]
1111 01 X vop:6 rs:3 d:10 a:10 z0 b:10         v<op><X> Rd[+rs], Ra[+rs], Rb[+rs]
1111 01 X vop:6 rs:3 d:10 a:10 z1 P:3 F:1 i:6  v<op><X> Rd[+rs], Ra[+rs], #i [SETF] [IFZ|IFNZ|IFN|IFNN|IFC|IFNC]

Where i is a 6-bit immediate value; zero-extended and supplied to each of the 16 parallel operations

The 1-bit field 'X' serves one of two purposes depending on the value of vop:

If vop < 48, then <X> is either 16 (aka. 'H', corresponding to X=0 in the instruction) or 32 (aka. 'L', corresponding to X=1), and determines the sign extension behaviour of source operands. ie. H (16), will sign extend the 16th bit when widening source operands. The operations <op> encoded with vop < 48 are:

000000    vmov     Move.
000001    vbitplanes Make mask.  Rd[i] = (Rb[i]&(1<<i)) ? -1 : 0.
000010    veven    Pluck elements at even index. Rd[i] = (i<8) ? Ra[i*2] : Rb[(i-8)*2].
000011    vodd     Pluck elements at odd index.  Rd[i] = (i<8) ? Ra[1+i*2] : Rb[1+(i-8)*2].
000100    vinterl  Interleave (alternate) lower 8 cells from Va & Vb. (Va[0], Vb[0], Va[1], Vb[1], ...)
000101    vinterh  Interleave (alternate) upper 8 cells from Va & Vb. (Va[8], Vb[8], Va[9], Vb[9], ...)
000110    vbitrev  Reverse bits and shift right. Rd[i] = reverse(Ra[i]) >> ((32-Rb[i])&31)     
000111    vror     Rotate right. Rd[i] = rotate_right(Ra[i], Rb[i]).
001000    vshl     Logical shift left.  Rd[i] = Ra[i] << (Rb[i]&31)
001001    vshls    Saturating shift left.  Rd[i] = shift_left_saturate(Ra[i], Rb[i]&31).
001010    vlsr     Logical shift right. Rd[i] = Ra[i] >>> (R_b[i]&31)
001011    vasr     Arithmetic shift right. Rd[i] = Ra[i] >> (Rb[i]&31) 
001100    vsignshl Logical shift left by signed amount (ie logical shift right if Rb negative).  
001101    vop13    Unused.       R_d[i] = 0
001110    vsignasl Arithmetic shift left by signed amount (ie arithmetic shift right if Rb negative).
001111    vsignasls Saturating Arithmetic shift left by signed amount (ie arithmetic shift right if Rb negative).
010000    vand     Bitwise AND   R_d[i] = R_a[i] &amp; R_b[i]
010001    vor      Bitwise OR    R_d[i] = R_a[i] | R_b[i]
010010    veor     Bitwise XOR   R_d[i] = R_a[i] ^ R_b[i]
010011    vbic     Bitwise Clear R_d[i] = R_a[i] &amp; ~(R_b[i])
010100    vcount   Population count   R_d[i] = (# of nonzero bits in R_a[i]) + (# of nonzero bits in R_b[i])
010101    vmsb     Most significant bit position in Rd[i] = msb(Ra[i]|Rb[i])
010110    vop22    Unused.       R_d[i] = 0
010111    vop23    Unused.       R_d[i] = 0
011000    vmin     Minimum       R_d[i] = MIN(R_a[i], R_b[i])
011001    vmax     Maximum       R_d[i] = MAX(R_a[i], R_b[i])
011010    vdist    Distance      R_d[i] = ABS(R_a[i] - R_b[i])
011011    vdists   Distance      R_d[i] = MIN(MAX_INT<WIDTH>, ABS(R_a[i] - R_b[i]))
011100    vclip    Unsigned Clamp  R_d[i] = MAX(0, MIN(R_a[i], R_b[i])) ; eg R_d[i] = clamp(R_a[i], 0, R_b[i])
011101    vsign    Sign.         Rd[i] = sgn(Ra[i]) + Rb[i].
011110    vclips   Signed Clamp  R_d[i] = MAX(-R_b[i], MIN(R_a[i], R_b[i])) ; eg R_d[i] = clamp(R_a[i], -R_b[i], R_b[i])
011111    vtestmag Compare GE    R_d[i] = (R_a[i] >= R_b[i]); 
100000    vadd     Add           R_d[i] = R_a[i] + R_b[i]
100001    vadds    Add with saturation  R_d[i] = saturate(R_a[i] + R_b[i]) ; saturate to max/min int.
100010    vaddc    Add with carry R_d[i] = R_a[i] + R_b[i] + C[i]
100011    vaddsc   Add with saturation and carry R_d[i] = saturate(R_a[i] + R_b[i] + C[i]) ; saturate to max/min int.
100100    vsub     Subtract      R_d[i] = R_a[i] - R_b[i]
100101    vsubs    Subtract with saturation  R_d[i] = saturate(R_a[i] - R_b[i]) ; saturate to max/min int.
100110    vsubc    Subtract with carry. R_d[i] = R_a[i] - R_b[i] - C[i]
100111    vsubsc   Subtract with saturation and carry R_d[i] = saturate(R_a[i] - R_b[i]) ; saturate to max/min int.
101000    vrsub    Rev. Subtract R_d[i] = R_b[i] - R_a[i] 
101001    vrsubs   Rev. Subtract with saturation R_d[i] = R_b[i] - R_a[i] ; saturate to max/min int.
101010    vrsubc   Rev. Subtract with carry R_d[i] = R_b[i] - R_a[i] - C[i] ;
101011    vrsubsc  Rev. Subtract with saturation and carry R_d[i] = R_b[i] - R_a[i] - C[i] ; saturate to max/min int.
101100    vop44    Unused.       R_d[i] = 0
101101    vop45    Unused.       R_d[i] = 0
101110    vop46    Unused.       R_d[i] = 0
101111    vop47    Unused.       R_d[i] = 0

For vop >= 48, <X> is not part of the instruction name; X instead seems to choose between two separate instruction tables, both based around different usages of a 16x16 -> 32 bit multiplier:

For X=0, the operations are:

110000    vmull.ss    Multiply      R_d[i] = R_a[i] * R_b[i]
110001    vmulls.ss   Saturated Multiply (signed - saturated to max/min int)
110010    vmulm.ss    Multiply middle word result (bytes 2:1) (only byte 2 signed)
110011    vmulms.ss   Saturated Multiply middle word result (signed - saturated to max/min int)
110100    vmulhd.ss   Multiply high word result (round-to-minus-inf/floor) (Ra and Rb signed)
110101    vmulhd.su   Multiply high word result (round-to-minus-inf/floor) (Ra signed, Rb unsigned)
110110    vmulhd.us   Multiply high word result (round-to-minus-inf/floor) (Ra unsigned, Rb signed)
110111    vmulhd.uu   Multiply high word result (round-to-minus-inf/floor) (Ra and Rb unsigned)
111000    vmulhn.ss   Multiply high word result with round-to-nearest/round (Ra and Rb signed)
111001    vmulhn.su   Multiply high word result with round-to-nearest/round (Ra signed, Rb unsigned)
111010    vmulhn.us   Multiply high word result with round-to-nearest/round (Ra unsigned, Rb signed)
111011    vmulhn.uu   Multiply high word result with round-to-nearest/round (Ra and Rb unsigned)
111100    vmulhdt.ss  Multiply high word result with round-to-zero/truncation (Ra and Rb signed)  
111101    vmulhdt.su  Multiply high word result with round-to-zero/truncation (Ra signed, Rb unsigned)
111110    vop62.0     Unused. R_d[i] = 0
111111    vop63.0     Unused. R_d[i] = 0

For X=1, the operations multiply two 16-bit values (32-bit operands are truncated) and gives a full-width 32-bit result:

110000    vop48.1     Unused. R_d[i] = 0
110001    vop49.1     Unused. R_d[i] = 0
110010    vop50.1     Unused. R_d[i] = 0
110011    vop51.1     Unused. R_d[i] = 0
110100    vmul32.ss   Full-width multiply (Ra and Rb are 16-bit signed)
110101    vmul32.su   Full-width multiply (Ra is 16-bit signed, Rb is 16-bit unsigned)
110110    vmul32.us   Full-width multiply (Ra is 16-bit unsigned, Rb is 16-bit signed)
110111    vmul32.uu   Full-width multiply (Ra and Rb are 16-bit unsigned)
111000    vop56.1     Unused. R_d[i] = 0
111001    vop57.1     Unused. R_d[i] = 0
111010    vop58.1     Unused. R_d[i] = 0
111011    vop59.1     Unused. R_d[i] = 0
111100    vop60.1     Unused. R_d[i] = 0
111101    vop61.1     Unused. R_d[i] = 0
111110    vop62.1     Unused. R_d[i] = 0
111111    vop63.1     Unused. R_d[i] = 0

Note: When running code under the standard R-Pi firmware, the instructions with X=1 and vop >= 56 seem to cause the VPU to hang. What happens when running these instructions under custom firmware is unconfirmed.

Vector 80 Bit Instruction Word

Format of the first 48 bits is very similar to 48-bit (above). The main difference is that there are flags for repeating an instruction a specified number of times, and additional vector addressing modes.

Vector Memory operations

v<mop><width> -, A(ya,xa)[+ra], imm16(rs+=rd) [REP] [IFxx] [SETF] 
v<mop><width> D(yd,xd)[+rd], -, imm16(rs+=ra) [REP] [IFxx] [SETF] 
v<mop><width> D(yd,xd)[+rd], A(ya,xa)[+ra], B(yb,xb)[+rb] [REP] [IFxx] [SETF]
v<mop><width> D(yd,xd)[+rd], A(ya,xa)[+ra], #immediate [REP] [IFxx] [SETF]

In the first two forms, rd or ra is added to the memory address at each repetition.
(But the instruction doesn't update the original register value).

Encoded as:

1111 10 mop:5 width:2 r:3 1110 rd:6 a:10 F 0 111 l:7 f_d:6 f_a:6 Ra_x:4 P:3 i:7 rs:4 i:2
1111 10 mop:5 width:2 r:3 d:10 1110 ra:6 F 0 111 l:7 f_d:6 f_a:6 Ra_x:4 P:3 i:7 rs:4 i:2
1111 10 mop:5 width:2 r:3 d:10 a:10 F 0 b:10 f_d:6 f_a:6 Ra_x:4 P:3 f_i:7 f_b:6
1111 10 mop:5 width:2 r:3 d:10 a:10 F 1 l:10 f_d:6 f_a:6 Ra_x:4 P:3 f_i:7 j:6

Where:
  imm = (i<<7) | l, or (j<<10) | l
  r:3    encodes the repetition modifier

Vector data operations

6-bit opcode is denoted by 'v'. Same opcodes as 48-bit (see above).

v<op><X> D, A, B    [REP] [SETF] [IFxx] [SRU|ACC]
v<op><X> D, A, #imm [REP] [SETF] [IFxx] [SRU|ACC]

  REP  = ( REP2 | REP4 | REP8 | REP16 | REP32 | REP64 | REP r0)
  IFxx = ( IFZ | IFNZ | IFN | IFNN | IFC | IFNC )
  SRU  = ( SUMU | SUMS | IMIN | IMAX | MAX) ( r0 | r1 | ... | r7 )
  ACC  = [CLRA] ( UACC | UDEC | SACC | SDEC | UACCH | UDECH | SACCH | SDECH 
                | UADD | USUB | SADD | SSUB | UADDH | USUBH | SADDH | SSUBH )

Encoded as:

1111 11 X v:6 r:3 d:10 a:10 F 0 b:10 f_d:6 f_a:6 Ra_x:4 P:3 f_i:7 f_b:6
1111 11 X v:6 r:3 d:10 a:10 F 1 k:10 f_d:6 f_a:6 Ra_x:4 P:3 f_i:7 j:6

Where:
  imm = (j<<10) | k
  r:3    encodes the repetition modifier
  X:1    encodes between 16 and 32 bit 
  f_i:7  encodes the SRU or ACC settings.
  P:3    encodes lane predication.

Repetition r:3 is encoded:

000                 Do not repeat
001      REP2       Repeat 2x
010      REP4       Repeat 4x
011      REP8       Repeat 8x
100      REP16      Repeat 16x
101      REP32      Repeat 32x
110      REP64      Repeat 64x
111      REP r0     Repeat count comes from scalar register r0.

Each argument, in addition to the 10 bits described above, has an additional 6 bits which are flags describing additional options for the argument (repeat/increment, scalar, offset, etc.) The mnemonics (f_d, f_a, f_b) denote the flags for each argument. The usage of each bit is as follows:

5:2      Scalar register r0-r14, if required; otherwise, all 1's (ie r15 is not available)
1        If set, increment argument by 1 for each instruction repetition.
         Y coordinate is incremented if argument is horizontal; X coordinate if vertical.
0        If set use bits 5:4 of SR as a column offset for the vector reference.  
         ie. P(y,x) becomes P(y,(x+cb*16)&63).

Ra_X, provides additional bits for the position of the Ra vector, allowing it to be fully specified.

3:0      Low order bits for xa in Ra(ya, xa).

So, the complete register encoding takes the form:

Vector destination D is defined by d:10, f_d:6
Vector source      A is defined by a:10, f_a:6 Ra_x:4
Vector source      B is defined by b:10, f_b:6
 
Generally d, a, b are of the form:
 
Horizontal:
  0000 y5 y4 y3 y2 y1 y0     H(y, 0)
  0010 y5 y4 y3 y2 y1 y0     H(y, 16)
  0100 y5 y4 y3 y2 y1 y0     H(y, 32)
  0110 y5 y4 y3 y2 y1 y0     H(y, 48)
  1000 y5 y4 y3 y2 y1 y0     HX(y, 0)
  1010 y5 y4 y3 y2 y1 y0     HX(y, 32)
  1100 y5 y4 y3 y2 y1 y0     HY(y, 0)
  111                        -

Vertical:
  0001 y5 y4 x3 x2 x1 x0     V(y, 0+x)
  0011 y5 y4 x3 x2 x1 x0     V(y, 16+x)
  0101 y5 y4 x3 x2 x1 x0     V(y, 32+x)
  0111 y5 y4 x3 x2 x1 x0     V(y, 48+x)
  1001 y5 y4 x3 x2 x1 x0     VX(y, 0+x)
  1011 y5 y4 x3 x2 x1 x0     VX(y, 32+x)
  1101 y5 y4 x3 x2 x1 x0     VY(y, 0+x)
  111                        -
 
Vector source A vectors take all their y bits from a:10, and their x bits from Ra_x, so this becomes:
 
Horizontal:
  0000 y5 y4 y3 y2 y1 y0     H(y, 0+Ra_x)
  0010 y5 y4 y3 y2 y1 y0     H(y, 16+Ra_x)
  0100 y5 y4 y3 y2 y1 y0     H(y, 32+Ra_x)
  0110 y5 y4 y3 y2 y1 y0     H(y, 48+Ra_x)
  1000 y5 y4 y3 y2 y1 y0     HX(y, 0+Ra_x)
  1010 y5 y4 y3 y2 y1 y0     HX(y, 32+Ra_x)
  1100 y5 y4 y3 y2 y1 y0     HY(y, 0+Ra_x)
  111                        -
 
Vertical:
  0001 y5 y4 y3 y2 y1 y0     V(y, 0+Ra_x)
  0011 y5 y4 y3 y2 y1 y0     V(y, 16+Ra_x)
  0101 y5 y4 y3 y2 y1 y0     V(y, 32+Ra_x)
  0111 y5 y4 y3 y2 y1 y0     V(y, 48+Ra_x)
  1001 y5 y4 y3 y2 y1 y0     VX(y, 0+Ra_x)
  1011 y5 y4 y3 y2 y1 y0     VX(y, 32+Ra_x)
  1101 y5 y4 y3 y2 y1 y0     VY(y, 0+Ra_x)
  111                        -

P:3, controls which vector lanes/elements are active.

2:0      Conditionally enable vector elements: ALL, NONE, IFZ, IFNZ, IFN, IFNN, IFC, IFNC

In addition to the arguments, there are 7 bits of flags which alter the behaviour of the instruction. The mnemonic f_i denotes these flags. The usage of each bit is as follows:

Accumulation

There is a signed 48 bit Accumulator associated with each lane. It saturates to MAX 0x7fffffffffff and MIN 0x800000000000.

6        0             Enable ACC (accumulation) modifier
5        ENA           Enable normal accumulation function.
4        HIGH          Set to work with the top 32 bits of the accumulator (accumulator is 48 bits).
3        SIGN          Sign extend before accumulating. 
2        CLRA          If set, clear accumulator on first iteration.
1        WBA           If set, write new value back to accumulator.
0        SUB           If set, subtract value from accumulator (A=A-x), otherwise add (A=A+x).

Common Accumulation Functions

Clearing accumulator (can be combined with other operations)
   CLRA                                Clear accumulator on initially entering instruction.
 
Accumulation/Decumulation:
   UACC    (ENA|WBA)                  Accumulate with unsigned value. 
   UDEC    (ENA|WBA|SUB)              Decumulate with unsigned value.
   SACC    (ENA|SIGN|WBA)             Accumulate with signed value.
   SDEC    (ENA|SIGN|WBA|SUB)         Decumulate with unsigned value.
   UACCH   (ENA|HIGH|WBA)             Accumulate with unsigned value to high word of accumulator.
   UDECH   (ENA|HIGH|WBA|SUB)         Decumulate with unsigned value to high word of accumulator.
   SACCH   (ENA|HIGH|SIGN|WBA)        Accumulate with signed value to high word of accumulator.
   SDECH   (ENA|HIGH|SIGN|WBA|SUB)    Decumulate with signed value to high word of accumulator.
 
Add/Sub Accumulator to result value (without accumulator update):
   UADD    (ENA)                     D'[i] = ACC[i]+unsigned(D[i]).
   USUB    (ENA|SUB)                 D'[i] = ACC[i]-unsigned(D[i]).
   SADD    (ENA|SIGN)                D'[i] = ACC[i]+signed(D[i]).
   SSUB    (ENA|SIGN|SUB)            D'[i] = ACC[i]-signed(D[i]).
   UADDH   (ENA|HIGH)                D'[i] = ACCH[i]+unsigned(D[i]).
   USUBH   (ENA|HIGH|SUB)            D'[i] = ACCH[i]-unsigned(D[i]).
   SADDH   (ENA|HIGH|SIGN)           D'[i] = ACCH[i]+signed(D[i]).
   SSUBH   (ENA|HIGH|SIGN|SUB)       D'[i] = ACCH[i]-signed(D[i]).

The parts of the names are:
  U, S  Unsigned, Signed value
  ACC   Accumulate:                  ACC[i] += unsigned(D[i]), or ACC[i] += signed(D[i])), D'[i] = ACC[i]
  DEC   Decumulate:                  ACC[i] -= unsigned(D[i]), or ACC[i] -= signed(D[i])), D'[i] = ACC[i]
  ADD   Add accumulator:             D'[i] = ACC[i] + un/signed(D[i]), ACC[i] unchanged
  SUB   Sub accumulator:             D'[i] = ACC[i] - un/signed(D[i]), ACC[i] unchanged
  H     Use high 32 bits of accumulator.

Scalar Register Update

If the top flag bit is set, the subset of 16 active elements are aggregated to a scalar register, and the behaviour of the flags is changed:

6        1             Enable SRU (Scalar Register Update) modifier
5:3      SRU[2:0]      Scalar Register Update function: SUMU, SUMS, (max), IMIN, (max), IMAX, (max), MAX.
2:0      SRU_SX[2:0]   Indicates which scalar register to write to r0, r1 .. r7.

or viewed another way:

  1 000sss  SUMU rs    rs = unsigned sum of ALU results. (0 if no lanes active).
  1 001sss  SUMS rs    rs = signed sum of ALU results. (0 if no lanes active).
  1 010sss  max2 rs
  1 011sss  IMIN rs    rs = index of ALU lane containing smallest signed value (-1 if no lanes active).
  1 100sss  max4 rs
  1 101sss  IMAX rs    rs = index of ALU lane containing largest signed value (-1 if no lanes active).
  1 110sss  max6 rs
  1 111sss  MAX  rs    rs = max signed element of ALU result (MIN_INT if no lanes active).

All calculations occur on signed quantities except SUMU. The functions (max) appear to find the max value like MAX, but the latter is preferred in existing binaries.

examples:

  vmov -, H(0,0) IMIN r0       ; r0 = i, 0<=i<=15, where P(0, 0+i) <= P(0, 0+j) forall 0<=j<=15
  vmov H(1,0), H(0,0), SUMS r1 ; r1 = P(0,0)+P(1,0)+...+P(15,0)

Interrupts

The chip contains a vectored interrupt controller which is programmed through the registers at 0x7e002000. The interrupt vector table contains 128 4-byte entries which contain the address of the interrupt handler for each interrupt.

Interrupt numbers:

0-31: processor internal exceptions
    0: zero
    1: misaligned memory access
    2: division by zero
    3: undefined instruction
    4: forbidden instruction
    5: illegal memory
    6: bus error
    7: floating point
    8: isp
    9: dummy
    10: icache
    11: vec core
    12: bad l2 alias
    13: breakpoint
    14: unknown
    ....
    31: unknown

32-63: software triggered interrupts (swi instruction)

64-127: external hardware interrupts
   64: INTERRUPT_TIMER0
   65: INTERRUPT_TIMER1
   66: INTERRUPT_TIMER2
   67: INTERRUPT_TIMER3
   68: INTERRUPT_CODEC0
   69: INTERRUPT_CODEC1
   70: INTERRUPT_CODEC2
   71: INTERRUPT_JPEG
   72: INTERRUPT_ISP
   73: INTERRUPT_USB
   74: INTERRUPT_3D
   75: INTERRUPT_TRANSPOSER
   76: INTERRUPT_MULTICORESYNC0
   77: INTERRUPT_MULTICORESYNC1
   78: INTERRUPT_MULTICORESYNC2
   79: INTERRUPT_MULTICORESYNC3
   80: INTERRUPT_DMA0
   81: INTERRUPT_DMA1
   82: INTERRUPT_DMA2
   83: INTERRUPT_DMA3
   84: INTERRUPT_DMA4
   85: INTERRUPT_DMA5
   86: INTERRUPT_DMA6
   87: INTERRUPT_DMA7
   88: INTERRUPT_DMA8
   89: INTERRUPT_DMA9
   90: INTERRUPT_DMA10
   91: INTERRUPT_DMA11_12_13_14
   92: INTERRUPT_DMA_ALL
   93: INTERRUPT_UART_SPI0_SPI1
   93: INTERRUPT_AUXIO
   94: INTERRUPT_ARM
   95: INTERRUPT_DMA_VPU
   96: INTERRUPT_HOSTPORT (INTERRUPT_HOSTINTERFACE)
   97: INTERRUPT_VIDEOSCALER
   98: INTERRUPT_CCP2TX
   99: INTERRUPT_SDC
   100: INTERRUPT_DSI0
   101: INTERRUPT_AVE
   102: INTERRUPT_CAM0 (INTERRUPT_CCP2)
   103: INTERRUPT_CAM1 (INTERRUPT_CSI2)
   104: INTERRUPT_HDMI0
   105: INTERRUPT_HDMI1
   106: INTERRUPT_PIXELVALVE1
   107: INTERRUPT_I2C_SLV
   108: INTERRUPT_DSI1
   109: INTERRUPT_PWA0
   110: INTERRUPT_PWA1
   111: INTERRUPT_CPR
   112: INTERRUPT_SMI
   113: INTERRUPT_GPIO0
   114: INTERRUPT_GPIO1
   115: INTERRUPT_GPIO2
   116: INTERRUPT_GPIO3
   117: INTERRUPT_I2C
   118: INTERRUPT_SPI
   119: INTERRUPT_I2SPCM
   120: INTERRUPT_SDIO (INTERRUPT_SDCARDHOST)
   121: INTERRUPT_UART
   122: INTERRUPT_SLIMBUS
   123: INTERRUPT_VEC
   124: INTERRUPT_CPG
   125: INTERRUPT_RNG
   126: INTERRUPT_ASDIO
   127: INTERRUPT_AVSPMON

On an interrupt, the interrupt mode is entered (r25 is mapped to r28), then pc and then sr are pushed onto the stack.

If the least significant bit of the table entry is set, then the supervisor bit (bit 29) or sr is set when the handler is executed. If the least significant bit is not set, an exception is raised if user code (sr bit 31) tries to execute swi.

VideoCore IV Programmers Manual

VideoCore IV Programmers Manual

Notes

Programmer's Model

Data Formats

Processor Modes (provisional)

Memory Map

L1 Instruction Cache

L1 Data Cache

L2 Cache

Scalar Register Configuration

Recommended Calling Convention

Vector Register Configuration

Cells / Elements

Vector Registers

Vector Register Notation

Replicating Alias

1Kb Lookup Table

Instructions

Instruction Formats

Instruction Encodings

Condition Codes

Arithmetic and Logical Operations

Load/Store Operation Width

Scalar 16 Bit Instruction Word

Scalar 32 Bit Instruction Word

Scalar 48 Bit Instruction Word

Vector Instructions

Vector Instruction Execution Model

Simplified View of Vector Instruction Execution

Optional: Instruction Repetition

Optional: Predication / Conditions

Optional: Accumulation

Optional: Scalar Result Unit

Example Vector Instructions

EXAMPLE 1

EXAMPLE 2

EXAMPLE 3

EXAMPLE 4

EXAMPLE 5

Putting it All Together

Argument format

Vector 48 Bit Instruction Word

Vector Memory operations

Vector data operations

Vector 80 Bit Instruction Word

Vector Memory operations

Vector data operations

Accumulation

Common Accumulation Functions

Scalar Register Update

Interrupts

Clone this wiki locally