Technote 1: Decoding and Encoding BUFR messages
Minimum once per year
$Revision: 3$ | $Date: 24/07/2007 13:48:03$
$Source: /usr/lpp/internet/MetoRoot/WWW01/metdb/documentation/technotes/RCS/dmtn1.html,v $
INTRODUCTION
1 TABLES
  1.1 Table B
  1.2 Table D
  1.3 Local tables B & D
  1.4 Code figures and flags
  1.5 Descriptor representation
2 DECODING
  2.1 BUFR message structure and decoding strategy
  2.2 Replication
  2.3 Basic BUFR operations and structure of decode
  2.4 Bit manipulation to construct values
  2.5 Output and display
  2.6 Coordinates and instrumentation elements
  2.7 Increments
3 ENCODING
  3.1 Compression
  3.2 Setting up descriptor sequences
  3.3 Preparation of values to be encoded
  3.4 Run-length encoding
4 QUALITY OPERATIONS
  4.1 Bit maps
  4.2 Bit maps and operators
  4.3 Programming strategy
  4.4 Use of decode output in application programs
5 TO SET UP A BUFR SYSTEM
  5.1 Table access
  5.2 Bit-handling programs
  5.3 Calls to encode & decode

Crown Copyright 1990, 1993, 1995, 2004 Meteorological Office, Fitzroy Road, Exeter, EX1 3PB

Note: This paper has not been published. Permission to quote from it should be obtained from the Met Office.
BUFR is universal in that it contains a description of the data as well as the values. The description gives a list of the elements whose values follow. It does this in a coded form that requires a set of tables to interpret it. BUFR was developed for meteorological data, but can transmit whatever elements have table entries.
BUFR is binary in that values are not confined to some number of decimal digits, as with a character-based code, or to a machine-dependent word-length, but are coded in a number of bits given by the tables, which can be changed if necessary by appropriate operations.
The simplest BUFR message consists of a number of descriptors followed by the values of the corresponding elements. Not all BUFR descriptors correspond to elements: some descriptors represent operations to change the way a value is coded; others make the description more concise, by repeating descriptors (replication) or getting sequences of descriptors from a table, rather than including them in the message.
The most essential component of a BUFR system is the table of elements, Table B. Less essential, in that messages can be made without them, are the table of sequences (Table D) and the set of possible operations.
For each element, the entry in Table B gives a name, the SI units, the number of bits in which to code a value, a scale representing precision (see 1.1), and a reference value to be subtracted from the scaled value to leave a positive number to be encoded.
The operations enable the number of bits, the scale and the reference value to be changed. They also make it possible to add quality control flags, values and differences, to skip fields and so on.
Space taken up by the description can be saved by replication and the use of Table D sequences. Space in the data section can be saved by compression if several similar sets of values are coded together: the set of values for each is expressed as a minimum, an increment width and a set of increments (in a reduced number of bits) to be added to that minimum.
The sequence of descriptors can arrange data in ways not covered by existing code forms. Space and time (coordinate) elements locate the values that follow them. Space and time increments can be defined, so time sequences of regularly occurring values can be encoded. Run-length encoding is provided for images.
The sections which follow describe how the tables have been set up and the various operations encoded at the UK Met Office. These notes are to be read in conjunction with the section on FM 94 BUFR in the WMO Manual on Codes, and concentrate on points which could cause confusion. (See Vol 1.2 Part B Binary Codes)
Table B has 64 element classes, each with room for 256 entries for elements in the class. A class contains e.g. temperatures, or various humidity elements, or year, month, day... second. No information is conveyed by the choice of class for an element (but see 2.6: the distinction between coordinate classes and others is important): it just groups related or similar elements together to show at a glance what entries already exist in that field.
An entry consists of: descriptor, name, units, scale, reference value
and number of bits used to encode a value. The value actually encoded is
  value*(10**scale)-refval.
A positive scale is the number of figures after the decimal point;
a negative scale the number of zeros before the point.
The reference value ensures that a positive value is always encoded.
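The scaling rule above can be sketched in a few lines of Python. This is only an illustration of the formula value*(10**scale)-refval and its inverse; the function names are hypothetical, and the scale and reference value used in the usage note are illustrative numbers rather than entries quoted from Table B.

```python
def encode_value(value, scale, refval):
    """Integer that goes into the bit string: value*(10**scale) - refval."""
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(value * 10 ** scale) - refval


def decode_value(encoded, scale, refval):
    """Inverse operation: add the reference value back, then undo the scale."""
    return (encoded + refval) / 10 ** scale


# A temperature in hundredths (scale 2) with no reference value.
print(encode_value(287.61, 2, 0))
# A negative coordinate with an illustrative reference value of -9000.
print(encode_value(-45.0, 2, -9000), decode_value(4500, 2, -9000))
```

With a negative reference value, subtracting it shifts a negative scaled value such as -4500 up to 4500, which is why a suitable reference value keeps the encoded number non-negative.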
Names are limited to 64 characters - though, confusingly, some names in the Manual on Codes are longer. Units of 'code table', 'flag table' and 'CCITT IA5' (simply 'character' in CREX) call for action from a decoder; the rest are only for display.
There is no limit to the number of descriptors in a Table D sequence, although a limit of 16 was assumed for a long time and hence longer sequences were split into components which are unlikely to be useful in other contexts.
Formally Table D is just a way of cutting down the length of the description in a BUFR message, so the only information about a sequence in Table D is a list of descriptors. But for whole-message sequences the annotations on the right in the Manual on Codes may provide essential information for interpreting the data, despite the aim to make BUFR messages self-explanatory.
(And despite the Backus-Naur definition and "Format for international exchange" of Table D in the Manual on Codes, which do list only descriptors. Note that the exchange format only works if the form of the sequence descriptor is kept different from the descriptors defining the sequence - which can be sequence descriptors (F=3) themselves - by e.g. leaving spaces between F, X & Y. Otherwise counts are needed.)
Class 0 descriptors can transmit another centre's Table B entries in BUFR messages. In principle these could override existing entries for the message in which they occur - but this has never yet been done, so there is no code to implement it. (Entries have been transmitted in separate messages for test purposes; but this can't be automated, as there's no way of telling which following messages the overriding entries should be used for! The local table version in section 1 would have to be used together with the originating centre...)
Input is as in the example below, from a file to be read by a call to LOCALD, which keeps the sequence for later calls.
On the IBM mainframe, the Local D file should be given the DDNAME LOCALSEQ e.g.
  //GO.LOCALSEQ DD DSN=SDB.BUFR.LOCALSEQ(AIREPS),DISP=SHR
On an HP, T3E or IBMSP unix machine, the Local D file should be given the name or symbolic link LOCALSEQ in the run directory (the directory the BUFR executable will run in), or if using the environment variable BUFR_LIBRARY, put LOCALSEQ in the library pointed to by $BUFR_LIBRARY.
Example of LOCALSEQ file:
  309255                    UPPER AIR SIGNIFICANT TEMPERATURES AND WINDS
  001001, 001002, 001011    STATION NUMBER OR CALL SIGN
  005002, 006002, 007001    LATITUDE & LONGITUDE, STATION HEIGHT
  004001, 004002, 004003    DATE (YEAR, MONTH, DAY)
  004004, 004005            HOUR & MINUTE (IF KNOWN) OF LAUNCH
  002011, 002014,           SONDE TYPE, TRACKING SYSTEM
  002013                    RADIATION CORRECTION
  022042                    WATER TEMPERATURE
  104000, 031001, 008002    CLOUD DATA (LOW, MIDDLE, HIGH)
  020012, 020011, 020013    CLOUD TYPE, AMOUNT & BASE FOR EACH LEV
  008001, 106000, 031001    SEMI-STANDARD LEVELS (775MB ETC)
  010004, 010003            PRESSURE & HEIGHT
  012001, 012003            TEMPERATURE & DEW POINT
  011001, 011002            WIND SPEED & DIRECTION
  008001, 103000, 031001    SIGNIFICANT TEMPERATURES
  010004, 012001, 012003    PRESSURE, TEMPERATURE & DEW POINT
  008001, 103000, 031001    SIGNIFICANT WINDS
  010004, 011001, 011002    PRESSURE, WIND SPEED & DIRECTION
But one further table can usefully be made, for decoding purposes only, or rather for displaying data coded in BUFR: it consists of brief descriptions corresponding to the code figures and flags. It seems best to avoid (as far as possible) displaying the code figures themselves: even where these correspond to existing WMO codes, not all users can be expected to know the codes, and many code and flag tables have been made specially for BUFR, either from scratch or by combining existing tables.
As there is no stated limit for their length, descriptions of code figures can be very long, especially where effectively several code figures and flags have been combined, as for present weather. This means that a concise form, say 12 characters, displayable in a table column, is not always easy to achieve. Despite this, most descriptions of code figures have been abbreviated into a reasonably meaningful 12-character form: the remainder will appear as figures in a display, leaving the user to look them up in the Manual on Codes.
Descriptors appear as 6-figure numbers in the BUFR documentation. But if F, XX & YYY in FXXYYY are fields of 2, 6 & 8 bits respectively, then the numerical value of a descriptor is not F*100000+X*1000+Y, but F*16384+X*256+Y.
We therefore need several functions for converting from one form to another:
  from a 16-bit field in section 3 of a BUFR message to separate values of F, X & Y and hence the 6-figure displayable form as above (for, say, error messages);
  from a readable 6-figure form as in the documentation, to the 16-bit form used in encoding and decoding.
The function DESFXY (DESCR,F,X,Y) converts a 16-bit descriptor to values of F, X & Y (all integer); the function IDES (FXXYYY) converts from the 6-figure form F*100000+X*1000+Y to the 16-bit form.
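The two conversions can be sketched in Python as follows. The functions mirror what DESFXY and IDES are described as doing, but they are not the Fortran routines themselves; the Python names and the helper readable are hypothetical.

```python
def des_to_fxy(descr):
    """Split a 16-bit descriptor into F (2 bits), X (6 bits) and Y (8 bits)."""
    f = descr >> 14
    x = (descr >> 8) & 0x3F
    y = descr & 0xFF
    return f, x, y


def ides(fxxyyy):
    """Convert the readable 6-figure form F*100000+X*1000+Y to the 16-bit form."""
    f, rest = divmod(fxxyyy, 100000)
    x, y = divmod(rest, 1000)
    return f * 16384 + x * 256 + y


def readable(descr):
    """6-figure displayable form of a 16-bit descriptor, e.g. for error messages."""
    f, x, y = des_to_fxy(descr)
    return f"{f}{x:02d}{y:03d}"


print(ides(309255), des_to_fxy(ides(309255)), readable(ides(12001)))
```

Keeping the 6-figure form as a zero-padded string avoids losing the leading zero of an element descriptor such as 012001.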
But note that to find a given meteorological element in a message it is generally not enough to find a single descriptor: to find an element like tropopause temperature means finding two descriptors, not necessarily consecutive: 008002 with a value of 3 for the tropopause and only then a temperature descriptor. So in practice a data base interface is needed between a BUFR decode as described here and a meteorological user.
The most important fields are:
(applies to Edition (Ed) 4 with differences listed for earlier editions)
  BUFR + (Ed 2 onwards) total length of message

  Section 1
  octets   item
  1-3      length of section (in octets)
  5-6      originating centre
  7-8      originating sub-centre
  10       flag for inclusion of section 2
  11       data type
  12       data subtype
  16-22    date/time (down to seconds)

  Section 3
  octets   item
  1-3      length of section
  5-6      number of reports
  7        compression flag and observed data flag
  8-9      descriptors
  10-11    descriptors etc

  Section 4
  octets   item
  1-3      length of section
  5->      data bit string

  Section 1 (Ed 2-3)
  octets   item
  1-3      length of section
  5        originating centre
  6        (Ed 3) originating sub-centre
  8        flag for inclusion of section 2
  9        data type
  10       data subtype
  13-17    date/time (down to minutes)
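As an illustration of the Edition 4 layout, the sketch below picks the listed fields out of the start of a message held as Python bytes. It assumes the Section 0 layout of the Manual on Codes (the four characters BUFR, a 3-octet total length and a 1-octet edition number, 8 octets in all) and that the year occupies octets 16-17 of section 1. The function name and dictionary keys are hypothetical, and the sample octets are synthetic, not a real message.

```python
def parse_header_ed4(msg: bytes) -> dict:
    """Return the main section 0 and section 1 fields of an Edition 4 message."""
    if msg[:4] != b"BUFR":
        raise ValueError("message does not start with 'BUFR'")
    edition = msg[7]
    if edition != 4:
        raise ValueError("this sketch only handles Edition 4")
    sec1 = msg[8:]  # section 1 follows the 8 octets of section 0

    def octets(first, last=None):
        # Octets are numbered from 1 within the section, as in the table.
        last = first if last is None else last
        return int.from_bytes(sec1[first - 1:last], "big")

    return {
        "total_length": int.from_bytes(msg[4:7], "big"),
        "edition": edition,
        "section1_length": octets(1, 3),
        "centre": octets(5, 6),
        "sub_centre": octets(7, 8),
        "section2_flag": octets(10),
        "data_type": octets(11),
        "data_subtype": octets(12),
        # year (2 octets), month, day, hour, minute, second
        "date_time": (octets(16, 17), octets(18), octets(19),
                      octets(20), octets(21), octets(22)),
    }


# Synthetic header only: section 0 followed by a 22-octet section 1.
sec1 = bytes([0, 0, 22, 0, 0, 74, 0, 0, 0, 0, 2, 4, 0, 13, 0,
              7, 215, 7, 24, 13, 48, 3])
msg = b"BUFR" + (8 + len(sec1)).to_bytes(3, "big") + bytes([4]) + sec1
print(parse_header_ed4(msg))
```

Earlier editions would need different octet positions, as listed in the table for Ed 2-3.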
The task of decoding as defined here is to achieve a correspondence between descriptors and bits in the data section, so that we know how many bits make up a value, what element it is a value of, any scale changes etc, and then return arrays of descriptors and values in such a way that it remains clear to a calling program which value corresponds to which descriptor.
Conceptually this is a matter of taking Table B entries, perhaps with modified scale figure etc, and adding a further column to give the corresponding value. But in fact there is no need to set up the whole of such an array. If the aim is to display the contents of the message, then lines can be output as they are set up rather than held in core; if not, then what is wanted as output is an array of values with all operations performed and a corresponding array of descriptors to identify the elements (the other columns are only used while an element is being handled and can be discarded when the next element is reached, except when quality operations are possible).
So, although at first it might seem convenient to separate expansion of the description, that is the process of looking up sequences, performing replications, adding quality control fields etc, from the bit manipulation involved in finding the corresponding values, this may be better avoided for reasons of space.
But there are more fundamental reasons for combining expansion of descriptor sequences and bit manipulation. To see why, we need further consideration of the replication operation.
First we must distinguish between explicit and delayed replication. A replication descriptor says how many descriptors to repeat. It may also say how many times to repeat them, but this count (in the descriptor) may be set to zero, in which case it has to be found in the data. This makes sense where, say, the number of levels in a profile is not known beforehand and may vary from profile to profile: delayed replication enables the same sequence of descriptors to be used for all profiles (though obviously not with compression if the count varies).
So a descriptor sequence which includes delayed replication cannot be expanded in isolation from the data. It would be possible to find the replication counts before the values of the elements (by adding up the number of bits to skip) and so keep the two processes more or less separate - but there are further complications.
(The rest of this section is concerned with run-length encoding of images, a rarely-used feature, and its aim is to justify the decoding strategy adopted; it can be skipped by readers interested in neither of these things.)
Replication originally applied only to descriptors: the descriptor sequence was abbreviated to save space and has to be expanded to match the data. But when a replication operator is followed by a data repetition count, rather than an ordinary delayed replication, the data value itself must be repeated the same number of times. This is for run-length encoding of images consisting of a fixed number of values of a given element, the precision being such that many successive values (pixels) may be the same.
For instance, any line of a radar image can be broken up into segments consisting of identical pixel values and segments where the values vary. The first kind of segment calls for data repetition, a descriptor and a value both encoded once to be repeated N times in the output; the second requires replication, N values to be coded in the message and one descriptor repeated N times in the output to correspond. Clearly such a descriptor sequence cannot be expanded in isolation from the data.
The third complication is the replication of coordinate increments. An element in one of the time or place classes immediately before a replication operator is taken to be included in the N-fold replication as an increment to be added N times, but without any further value in the data. There can be increments for more than one coordinate element.
Now consider nested replications, say for coding an image line by line: an outer replication for the number of lines in the image and inner replications to describe each line. The outer replication is preceded by, say, a latitude increment, the inner by a longitude increment; no pixel values occur except inside the inner replication.
Clearly the increment before the outer replication must be distinguished during the decoding process from that before the inner replication, or else it will be replicated again: it must be flagged as already replicated, and only unflagged when the expansion is complete.
In other words, there are descriptor sequences which cannot be reduced to sequences of element descriptors without destroying vital features of their relationship to the data. Hence sections 3 and 4 must be handled together.
F=0: element (class X, element Y in Table B)
       an element can be character or numeric,
       a numeric element can be a number, code figure or flag(s),
       and any element not in Class 31 can have associated fields
F=1: replication (of the following X descriptors Y times)
       Y>0: explicit (count in descriptor)
       Y=0: delayed (count in data, either ordinary replication or data repetition)
F=2: operations
       X=1: change field width (by Y-128 bits)
       X=2: change the scale, i.e. multiply by a power of ten (by 10**(Y-128))
       X=3: change reference values
       X=4: add Y-bit quality control field
       X=5: insert string of Y characters
       X=6: hide local descriptor
       [X=8 is assigned to an operation which combines 1, 2 & 3, but is not operational yet]
       [for quality operations see 4.2]
F=3: sequence (category X, sequence Y in Table D)

F=1
  If replication is delayed, the count is found in the data. Increments immediately before the replication operator are counted and the increment descriptors added to the end of the sequence of descriptors to be replicated. Space is made (as for a sequence) and the replication carried out. The values of any replicated increments will be copied in the output value array.
  If a count in the data is zero, delete all the descriptors that would have been replicated, including the increments, as well as the replication operator and count.
  If the count in the data indicates run-length encoding, flag the element descriptor (assuming that only one element at a time can be run-length encoded) and repeat it, leaving the operation to be completed by repeating the values in the value array. We also need a flag to be set when the descriptors are repeated and then unset when the value has been got from the bit string, to avoid looking in the bit string for further values.
F=2,X=1,2,4
  Width increment, scale increment and stacks of Q/C field width and field meanings are set accordingly and used whenever values of an element are found. Each value is then preceded in the output by the meaning of each field and the field itself, for as many pairs of meaning and value as are currently nested.
F=2,X=3
  Changed reference values are listed (in parallel arrays of descriptor and reference value) and the list consulted whenever values of an element are found.
F=2,X=5
  Inserted characters are put in the same string as character values.
F=2,X=6
  The descriptor and value are skipped.
  [208YYY will be like 202YYY, but change width & reference value accordingly, as well as scale.]
F=3
  Insertion of a sequence is simple. Space is made by moving the remaining descriptors down; the inserted descriptors overwrite the sequence descriptor itself, and scanning of the descriptors continues with no adjustment to the pointer, i.e. with the first descriptor in the inserted sequence.
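The sequence insertion and explicit replication steps can be sketched as below. This is a deliberately reduced illustration, not the structure of the decode program: descriptors are held in 6-figure form, the sequence numbers in the table are hypothetical (not taken from Table D), operators with F=2 are simply left in place, and delayed replication is refused because, as explained in 2.2, it cannot be expanded without the data.

```python
# Hypothetical local sequences, for illustration only.
TABLE_D = {
    363001: [4001, 4002, 4003],
    363002: [363001, 101002, 12001],
}


def split_fxy(fxxyyy):
    """Split the 6-figure form into F, X and Y."""
    f, rest = divmod(fxxyyy, 100000)
    x, y = divmod(rest, 1000)
    return f, x, y


def expand(descriptors, table_d):
    """Expand sequences (F=3) and explicit replications (F=1, Y>0) in place."""
    seq = list(descriptors)
    i = 0
    while i < len(seq):
        f, x, y = split_fxy(seq[i])
        if f == 3:
            # The inserted descriptors overwrite the sequence descriptor;
            # the pointer is not moved, so scanning resumes with the first
            # inserted descriptor.
            seq[i:i + 1] = table_d[seq[i]]
        elif f == 1:
            if y == 0:
                raise NotImplementedError("delayed replication needs the data")
            group = seq[i + 1:i + 1 + x]
            # Replace the operator and its X descriptors by Y copies of them.
            seq[i:i + 1 + x] = group * y
        else:
            i += 1
    return seq


print(expand([363002], TABLE_D))
```

Replicating before expanding means that a sequence descriptor inside a replicated group counts as one descriptor for X, and each copy is expanded when the scan reaches it.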
There are several ways of extracting a value from the bit string. It can be done a bit at a time, testing whether a bit is set in the bit string and building up the value by doubling and either adding one or not adding accordingly.
Our Fortran program takes a slightly more complicated (but faster?) approach, working an octet at a time. If I bits of the string have already been used, we start in octet N=I/8. In this octet NINIT=I-N*8, i.e. MOD(I,8), bits have already been used. The value will extend over NOCTET=(WIDTH+NINIT+7)/8 octets, and in the last of these octets NLAST=WIDTH+NINIT-(NOCTET-1)*8 bits will be used.
The value is segmented in this way, bits being shifted in an octet by multiplying or dividing by powers of 2. A value that fits into one octet is treated as a special case.
A character value is encoded one octet at a time.
A value which is all ones, i.e. equal to 2**WIDTH-1, is missing, except in the case of a one-bit element or associated field, which is simply a flag set on or off.
(An IBM Assembler program is faster, working one 32-bit integer at a time: skip I/32 words, load two words, shift left MOD(I,32) bits to get rid of unwanted bits in previous values and right 32-W bits to align, losing any bits from following values.)
No numerical value has more than 31 bits, so integer values are precise. But real output loses precision if elements have more than 24 bits (IBM single precision). This is a serious problem with flag tables, a few of which have more than 24 flags: an imprecise value can imply a completely misleading flag combination. So until a double-precision version is implemented only the first 24 flags are output: if more are wanted, a full representation (e.g. in octal, as for CREX) of the flag table could be put in the character output. (This means the value corresponding to the n-th flag in a w-flag table is not 2**(w-n) but 2**(min(w,24)-n)).
Example: a 13-bit value split between octets as follows (3+8+2 bits marked +, so NINIT=8-3=5, NLAST=2):
  =====+++ ++++++++ ++======
  octet 1  octet 2  octet 3
Build up the value V as follows:                 in this case:
  V1=MOD(OCTET(1),TWOTO(8-NINIT))                V1=MOD(OCTET(1),8)
  V2=V1*256+OCTET(2)                             V2=V1*256+OCTET(2)
  V =V2*TWOTO(NLAST)+OCTET(3)/TWOTO(8-NLAST)     V=V2*4+OCTET(3)/64
where TWOTO is an array of powers of 2.
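Both approaches described in this section can be sketched in Python for comparison. The octet-at-a-time function follows the N, NINIT, NOCTET and NLAST arithmetic given above, with octets numbered from 0 rather than from 1 as in the Fortran; the function names are hypothetical, and the bit-at-a-time version is included only as a cross-check.

```python
def get_bits_bitwise(octets, i, width):
    """Bit at a time: double the value and add one where a bit is set."""
    v = 0
    for b in range(i, i + width):
        bit = (octets[b // 8] >> (7 - b % 8)) & 1
        v = v * 2 + bit
    return v


def get_bits(octets, i, width):
    """Octet at a time, after I bits of the string have already been used."""
    n = i // 8                              # octet in which the value starts
    ninit = i % 8                           # bits already used in that octet
    noctet = (width + ninit + 7) // 8       # octets the value extends over
    nlast = width + ninit - (noctet - 1) * 8  # bits used in the last octet
    v = octets[n] % 2 ** (8 - ninit)        # drop bits of previous values
    if noctet == 1:
        # Special case: the whole value lies inside one octet.
        return v // 2 ** (8 - nlast)
    for k in range(1, noctet - 1):
        v = v * 256 + octets[n + k]         # whole octets in the middle
    return v * 2 ** nlast + octets[n + noctet - 1] // 2 ** (8 - nlast)


def is_missing(value, width):
    """All bits set means missing, except for one-bit fields."""
    return width > 1 and value == 2 ** width - 1


# The 13-bit example: 3 + 8 + 2 bits, starting after 5 used bits.
data = bytes([0b00000101, 0b10101010, 0b11000000])
print(get_bits(data, 5, 13), get_bits_bitwise(data, 5, 13))
```

For the sample octets the steps are V1=5, V2=5*256+170=1450 and V=1450*4+192/64=5803, matching the bit-at-a-time result.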
For character elements the corresponding value (from our decoder) points to a character string: the value is length*2**16 plus pointer.
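A small sketch of this packing, assuming (as is natural for Fortran, but not stated above) that the pointer is a 1-based position in the character output string. The function names and the sample string are hypothetical.

```python
def pack_char_ref(length, pointer):
    """Value returned for a character element: length*2**16 plus pointer."""
    return length * 2 ** 16 + pointer


def fetch_chars(value, names):
    """Recover the characters that a packed value refers to."""
    length, pointer = divmod(value, 2 ** 16)
    # Assumption: pointer counts from 1, so shift to Python's 0-based index.
    return names[pointer - 1:pointer - 1 + length]


names = "HEATHROWGATWICK"
ref = pack_char_ref(7, 9)
print(ref, fetch_chars(ref, names))
```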
Ideally the N-th descriptor in the output would correspond to the N-th value or N-th row of values, i.e. all operators would have been used and then deleted, leaving only element descriptors. But unfortunately this is not generally so.
In the expansion of the BUFR descriptor sequence the following aims at first sight seem reasonable: (1) to leave a valid sequence of descriptors after any operation, (2) to end up with a sequence in one-to-one correspondence with the values, i.e. with no operators left in it, (3) to end up with a sequence that can be used to re-encode selected subsets of values (reports) from a compressed message, (4) to end up with a sequence which can be used to decode another subset (if there are several subsets in the message with no compression).
Of these aims (3) is questionable, because what is wanted in section 3 of a BUFR message is more likely to be the original than the expanded sequence, (2) requires decisions about whether delayed replication counts are to be put in the output value array and what descriptors should correspond to quality control fields, (1) is unattainable for reasons like those described in 2.2, and (4) is internal to the decoding process, so better abandoned - it's simpler to keep the original sequence and repeat the expansion.
In fact aim (2) is inconsistent with (1) and (3): if our aim is correspondence with the values, and therefore operators are deleted after use, then we may be left with replication counts with no replication operators; if the operators were left, then the descriptor count (X) would have to be adjusted during subsequent operations, which would be difficult.
So the best we can aim for is some correspondence between descriptors and values (essential - though some descriptors may have to be skipped) and the possibility of reencoding starting with the original descriptor sequence (though this would depend on the operations used).
So our output descriptor and value arrays depart from one-to-one correspondence and immediate re-encodability in the following ways:
These decisions, designed to avoid any repetition of descriptor manipulation in the calling program, may seem arbitrary, especially the first one: they meet our current needs (Jan 2004) but clearly the handling of replication may seem unsatisfactory - a better general solution might be 1XX000 in the descriptor array (with XX adjusted to describe the number of output descriptors now replicated - not an easy task!) and Y, the corresponding count, in the values array.
Our BUFR decode provides an optional display of the values (one line each: element name, units, value - if the value is a code figure, then if possible it is replaced by a brief description).
Example of display:
  WMO BLOCK NUMBER              NUMERIC        33
  WMO STATION NUMBER            NUMERIC        946
  LATITUDE (COARSE ACCURACY)    DEGREES        45.00
  LONGITUDE (COARSE ACCURACY)   DEGREES        34.00
  HEIGHT OF STATION             M              205
  TYPE OF STATION               CODE TABLE     MANNED
  YEAR                          YEAR           1996
  MONTH                         MONTH          4
  DAY                           DAY            21
  HOUR                          HOUR           0             3             6             9             12            15
  WIND DIRECTION AT 10M         DEGREES TRUE   170           0             30            60            50            230
  WIND SPEED AT 10M NUMERIC     M/S            3.1           *********     2.1           4.1           3.1           5.1
  CLOUD TYPE                                   NO CL CLOUD   NO CL CLOUD   NO CL CLOUD   CU CAL        NO CL CLOUD   CU CAL
  CLOUD TYPE                                   AC TR LEVEL   AC TR LEVEL   AC TR LEVEL   AC TR LEVEL   AC TR LEVEL   AC TR LEVEL
  CLOUD TYPE                                   NO CH CLOUD   CI FIB (UNC)  CI SPI SHEAF  CI SPI SHEAF  NO CH CLOUD   NO CH CLOUD
But it may be useful to say that a coordinate no longer applies. This can be done by coding a missing value of the coordinate. This may be useful for instrumentation (class 2) as well as time & place: there are no instrumentation elements (though there could be) for traditional observations like SYNOPs, so the instrumentation specified for a non-traditional instrument may have to be cancelled, rather than superseded by an appropriate value for the next element.
(Note also that the proliferation of instrumentation data in BUFR has made some early element names inappropriate: 002003, "type of measuring equipment used", is clearly meant only for PILOTs when the code figures are examined.)
Clearly the current position is obtained by adding the increment, if there is one, to the original position. But what if there is more than one increment for the same element? The general BUFR rules would say the second overrides the first, so add the second increment to the original value; but increments before replications are clearly meant to take effect cumulatively, i.e. the value before the replication count is added repeatedly to the original value.
We must then assume that if a new original position is given, any increment is cancelled. If, for instance, we reach the end of a row in scanning an image, restating the original longitude will take us back to the start of the next row. Until the longitude is restated the increments remain in force, even outside the replication which added them, so that a run-length-encoded row, consisting of several segments, each with its own replication, will accumulate increments along the whole row, rather than go back to the original value at the start of each segment.
So we must assume that increments involved in replications always (not just within the replication) take effect cumulatively: that an increment can be cancelled by resetting the original coordinate at the start of a row, but then each step is always added to the current value of the increment, however many segments there are in the row.
Our decode program returns replicated increments explicitly (without incrementing the coordinate concerned) if an increment descriptor appears before a replication operator: the increments can then be converted to incremented values of the coordinate in a further pass through the output array.
Increments before replication operators are recognised by the presence of the word 'increment' in the name. The matching up of increments and elements incremented is (fortunately) an operation that can be left to be handled outside the basic decode. We suggest incrementing an element only if an element in the same class (in classes 4-7) and with the same units is found with the same name as far as 'increment', or at least with the word 'increment' in its name, so as not to tie the increment recognition process to the word order of English (other centres may use translated element names, and the equivalent of 'increment' could come at the start rather than the end of a name!). But one day there may be an element with 'increment' in its name which despite that is not an increment in the sense of this section, so this is still not a satisfactory proposal.
One BUFR rule about increments is clearly stated: 94.5.4.3 says that a replicated increment is added the first time to give the coordinate of the first set of replicated data, so the original coordinate in the BUFR message must be the first position or time minus the increment.
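The rule can be illustrated with a short sketch, assuming a single increment replicated N times; the function names are hypothetical. The usage line reproduces hours of 0, 3, 6, 9, 12 and 15 purely as an example of the arithmetic.

```python
def replicated_coordinates(original, increment, count):
    """Coordinates of the replicated data: the increment is added the first
    time too, so the k-th set of data has original + k*increment."""
    return [original + k * increment for k in range(1, count + 1)]


def original_for(first, increment):
    """Coordinate to put in the message so that the first set of data gets `first`."""
    return first - increment


start = original_for(0, 3)
print(start, replicated_coordinates(start, 3, 6))
```

So a series starting at hour 0 with a 3-hour increment needs an original coordinate of -3 in the message.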
Compression is done by scanning the values to find the maximum and minimum, allowing for missing data. The increment width is the number of bits needed to code maximum minus minimum plus one, i.e. the smallest M such that max-min+1<2**M. One is added because all ones (i.e. all bits set) would imply missing data; so if max-min=(2**M)-1 for some M, the number of bits needed is not M, but M+1. Missing values are ignored in finding the minimum, but a flag is set if missing values exist: max=min with no values missing means no increments to be coded, but max=min with missing values means one-bit increments, set to 1 if the value is missing.
If a value cannot be encoded in the field width, it is set to missing before it can affect the range of values.
Character values are not compressed. If they are all the same, the increment
width is zero, otherwise it is the original width and the values are the
original character strings.
Examples
(1) values to be coded: 45, 37, 19, 22, 17
    min=17, max=45, max-min=28 requires 5 bits
(2) values to be coded: 21, 3, 13, 34, 5, 8
    min=3, max=34, max-min=31 requires 6 bits, because an increment of 31 in 5 bits would have all 5 bits set and therefore mean missing data.
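The rule for the increment width can be checked against both examples with a short sketch. It assumes the values are the integers that would go into the bit string (already scaled and with the reference value subtracted), uses None for a missing value, and does not handle the case where every value is missing; the function name is hypothetical.

```python
def compress(values):
    """Return (minimum, increment width in bits, increments) for one element."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("all values missing: not covered by this sketch")
    any_missing = len(present) < len(values)
    lo, hi = min(present), max(present)
    if hi == lo and not any_missing:
        return lo, 0, []          # all the same: no increments to be coded
    # Smallest M such that max-min+1 < 2**M; gives 1 when max=min with missing.
    width = (hi - lo + 1).bit_length()
    all_ones = 2 ** width - 1     # reserved to mean missing
    increments = [all_ones if v is None else v - lo for v in values]
    return lo, width, increments


print(compress([45, 37, 19, 22, 17]))
print(compress([21, 3, 13, 34, 5, 8]))
print(compress([7, None, 7]))
```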
This effect is not overridden by replication: if the coordinates in a group of replicated descriptors don't come first, they apply to the first values of the elements which follow in the replicated group and then the second values of the elements before them - then comes a further coordinate change, and so on.
Of course a user who wants all the data in a message knows how to interpret it and won't connect the values and coordinates wrongly. But a general retrieval program (a very ambitious project: we have tried twice & failed!) going through data of different kinds might well look for values of a certain element at given places and times, ignoring other elements, and return wrong data if the coordinates are out of place.
The BUFR package contains a program SCRIPT which will show (without a message to give values) how a sequence will expand: it puts a blank line in front of any coordinate element (or sequence of successive coordinates), hoping that an unexpected break will warn a user that coordinates are misplaced.
Real (decimal) input requires values in the units specified. The scale can be taken as a warning about what rounding will be done in the course of encoding, but the precision of the data should already have been reflected in the description chosen by the user at an earlier stage (e.g. code temperatures in tenths, or in hundredths, or in whole degrees, with a change of scale if necessary). At this stage the user needs only to ensure that temperatures are in Kelvin, rather than Celsius.
The reference value in Table B is also of no concern to the user. For temperature it was possible to choose units (degrees Kelvin) which always give positive values, so no non-zero reference value was needed; for latitude and longitude, however, this is not possible; so the encoding process must subtract a sufficiently large negative number to ensure the number to be encoded is always positive. However, this requires no action by the user.
An example may help. A temperature is normally stored in degrees Kelvin with a scale factor of 2, i.e. in hundredths. (Temperatures in tenths are now avoided because conversions between C & K may make it difficult to get back to the original value.) So real input requires a value such as 287.61 (14.46C); this number will be multiplied by 100 during encoding to give 28761 and this value goes into the bit string (unless there is compression).
Beware that if the scale is changed and the reference value is not zero, it may be necessary for the user to change the reference value to go with the new scale. However a change is not essential if the scale change leads to less precision; and the expected range of values may be such that even for greater precision no change is needed - the reference value only needs to be a large enough negative number. Later editions of BUFR should have an operation which changes width & reference value together with scale, making scale changes much easier.
Beware also of scale changes for precipitation, where negative values are actually code figures and so the reference value should remain constant, irrespective of scale changes. So a trace is always -1 or -2. The encode and decode both assume that a negative value of any class 13 element with a reference value of -1 or -2 is a trace and therefore never scaled.
For character values we make the corresponding number in the value array a pointer to a character string. There is no need for a length, which is given by Table B. Widths of character values are adjustable in Edition 4. "Inserted characters" (operation 5, which gives the length) simply follow on in the input character string with no pointer in the value array.
The first is for straightforward delayed replication, which is explained clearly enough in the documentation. The second is for "run-length encoding" of images: if the range of pixel values is small, so that, when an image is scanned, many successive values will be the same, it is convenient to give the number of identical values rather than encoding the value that many times.
A descriptor pattern which makes this possible without requiring a different sequence of descriptors for each image is as follows. Any row can be broken up into a set of "parcels", each consisting of a string of different values followed by a number of strings of identical values. In this way an image can be described by a general sequence of 17 descriptors (see below), to be expanded using the counts in the data.
The basic BUFR software can encode an image in this way if passed the counts and told to use this descriptor pattern. But this is not the only possible approach to image encoding, so the sequence of descriptors is not embedded in the basic programs, and the above outline can be implemented in various ways: for instance, greater compression could be achieved (at the expense of more elaborate programming) by treating values repeated only 2 or 3 times as if they were different (the values themselves take up less space than the extra counts required).
Our preferred method of encoding is to provide a preliminary call, to a program named RUNLEN, which takes a 2-dimensional array representing an image and returns a sequence of values with counts inserted, ready to be encoded with the descriptors which are likewise returned by the program (with the element concerned, e.g. pixel value, and increments inserted). This is only one way of run-length encoding an image: the user can, of course, replace the call to RUNLEN by any program which produces valid sequences of values and descriptors to be passed to the encoding program.
| No. | Descriptor | Meaning |
|---|---|---|
| 1 | 005001 | initial latitude (minus increment) |
| 2 | 005011 | latitude increment from row to row |
| 3 | 113000 | replicate the rows of the image |
| 4 | 031002 | number of rows |
| 5 | 006001 | initial longitude (minus increment) |
| 6 | 110000 | replicate "parcels" of different and same in row |
| 7 | 031002 | number of parcels in row |
| 8 | 006011 | longitude increment along row |
| 9 | 101000 | repeat a string of different values |
| 10 | 031002 | number of different values |
| 11 | 030001 | descriptor for pixel element itself |
| 12 | 104000 | replicate runs of identical values |
| 13 | 031002 | number of runs |
| 14 | 006011 | longitude increment along row |
| 15 | 101000 | replicate a string of identical values |
| 16 | 031012 | number of identical values |
| 17 | 030001 | descriptor for pixel element itself |
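The splitting of a row into parcels can be sketched as follows. This is not the RUNLEN program: the function names and the `min_run` parameter are hypothetical, and the sketch returns only the parcel structure (different values, then runs with their counts), leaving out the latitude and longitude increments that the full value sequence also carries.

```python
from itertools import groupby


def row_to_parcels(row, min_run=2):
    """Split one image row into parcels.

    Each parcel is (different_values, runs), where runs is a list of
    (count, value) pairs. Values repeated fewer than min_run times are
    treated as "different" values.
    """
    parcels = []
    different, runs = [], []
    for value, group in groupby(row):
        count = len(list(group))
        if count >= min_run:
            runs.append((count, value))
        else:
            if runs:                      # a run has ended, so close the parcel
                parcels.append((different, runs))
                different, runs = [], []
            different.extend([value] * count)
    if different or runs:
        parcels.append((different, runs))
    return parcels


def parcels_to_row(parcels):
    """Expand parcels back into the original row (the decoder's view)."""
    row = []
    for different, runs in parcels:
        row.extend(different)
        for count, value in runs:
            row.extend([value] * count)
    return row


row = [1, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6]
parcels = row_to_parcels(row)
```

Raising `min_run` to 4 corresponds to the variant mentioned above, in which values repeated only 2 or 3 times are coded as if they were different.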
A bit map is a set of values of the one-bit flag element 031031 (0 - data present, 1 - data not present). An N-bit map defines a subset of the N elements (elements rather than descriptors!) preceding an operator of the form 2XX000, where XX=22, 23, 24, 25 or 32. Elements here means effectively values in the data section, i.e. any delayed replication counts are included.
If M bits in a bit map are zero, then values of the corresponding M elements will follow in the data section as the result of any operation which uses this bit map. These values will be corrections, original values, differences, statistics etc as indicated by XX (together with 008023 or 008024 if XX is 24 or 25) or Class 33 elements in the case of 222000. But the values may not follow immediately and may not be consecutive; their positions in the data that follows will be shown by M place-holders of the form 2XX255 or M Class 33 descriptors. The I-th place-holder corresponds to a value of the I-th of the M elements with zeros in the bit map, encoded with its scale, data width & reference value as modified by any operations in force for the original value.
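As a small illustration of how an N-bit map picks out M of the preceding N elements, the following Python sketch uses a hypothetical helper `elements_selected` and invented element names; it shows only the selection, not the handling of place-holders.

```python
def elements_selected(elements, bit_map):
    """Return the elements for which a value will follow.

    elements: the elements preceding the 2XX000 operator, in order.
    bit_map:  values of 031031 (0 = data present, 1 = data not present);
              an N-bit map refers to the last N of those elements.
    """
    if not bit_map:
        return []
    if len(bit_map) > len(elements):
        raise ValueError("bit map is longer than the list of elements")
    referred_to = elements[-len(bit_map):]
    return [element for element, bit in zip(referred_to, bit_map) if bit == 0]


elements = ["pressure", "temperature", "dew point", "wind direction", "wind speed"]
bit_map = [1, 0, 1, 0]          # refers to the last four elements only
selected = elements_selected(elements, bit_map)
```

Here M = 2, so two place-holders (or two Class 33 descriptors) would be expected later in the descriptor sequence, the first for dew point and the second for wind speed.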
The set of operators finally accepted has redundancies resulting from the different versions the proposal went through. Of the four operators added later, 236000, 237000, 237255 & 235000, only 235000 is essential as the proposal now stands, and its definition is too restrictive.
236000 defines a bit map for use later, but a bit map can be recognised without it. 237000 reuses a bit map, but only one bit map can be currently defined, so again the descriptor is unnecessary. 237255 cancels a bit map, but a new bit map, taken to supersede the old one, would have the same effect. Only 235000 is essential: it unsets the end of the set of values referred back to by a bit map, leaving the next 2XX000 (where XX is 22 to 25 or 32) to reset it. Without this all quality operations would refer back to the same point.
Our decode also allows the same bit map to be used for different sets of elements. This possibility is, strictly speaking, ruled out by the operations as currently defined, but taking the least restrictive approach we see no reason why 235000 should cancel the bit map at the same time as changing the set of elements referred back to. If a new bit map follows, it will override the previous one; if not, the previous bit map can be left in force.
The only alternative is to stop the decode because a rule has been broken, whereas it may well be possible to continue successfully. But remember that, while this may be a useful feature, messages should still be encoded to follow the rules as closely as possible, or more restrictive decodes may fail!
Given this log, we need action to carry out quality operations at the following points:
In the case of quality operations, if the message contains several temperatures and a correction to one of them, the decode as described above would print out a temperature but not make it clear which original value was being corrected. Rather than leave higher-level programs with the same manipulation of bit maps to repeat, we need pointers to link original value and correction in the output descriptor array. This array already needs to include scale change and (modified) replication operators as well as element descriptors, because (as explained in 2.3) information which may be needed would otherwise be lost.
As pointers we use the place-holders (because XX gives information about the value added by the quality operation) with numbers set in the top bits. Each place-holder was replaced above by a descriptor; to set these pointers we keep a list of descriptors to be inserted in the sequence before completing the decode. The n-th insertion in this list puts a place-holder with n in the top bits after the original value and an identical place-holder with n set after the correction or whatever value is added. More than one such pointer can follow the original value. We can then get from original value to correction or vice versa by searching for a uniquely identified descriptor.
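The linking scheme can be sketched as below. Several details are assumptions made for illustration: the text says only that the number goes in "the top bits", so the shift of 16 bits is invented here, as are the function names; descriptors are packed in the usual 16-bit form (2 bits F, 6 bits X, 8 bits Y), and 223255 and 012001 are used simply as a sample place-holder and a sample element.

```python
POINTER_SHIFT = 16   # assumption: pointer number sits above the 16-bit descriptor


def fxy(f, x, y):
    """Pack FXXYYY into the 16-bit form."""
    return (f << 14) | (x << 8) | y


def link_values(descriptors, links):
    """Insert numbered place-holders after each linked pair of descriptors.

    links is a list of (original_index, added_index, placeholder); the n-th
    entry produces a place-holder with n set in the top bits.
    """
    after = {}
    for n, (original, added, placeholder) in enumerate(links, start=1):
        pointer = placeholder | (n << POINTER_SHIFT)
        after.setdefault(original, []).append(pointer)
        after.setdefault(added, []).append(pointer)
    output = []
    for index, descriptor in enumerate(descriptors):
        output.append(descriptor)
        output.extend(after.get(index, []))
    return output


def partner(output, index):
    """Find the other occurrence of the pointer at output[index]."""
    return next(i for i, d in enumerate(output)
                if d == output[index] and i != index)


temperature = fxy(0, 12, 1)
placeholder = fxy(2, 23, 255)
# Three temperatures, then a value added by a quality operation for the second.
decoded = [temperature, temperature, temperature, temperature]
output = link_values(decoded, [(1, 3, placeholder)])
```

In `output` the pointer follows the second temperature and the added value, so a search for the same uniquely numbered descriptor leads from one to the other.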
The three main tables can be accessed using the following calls:
CALL TABLEB(X,Y,SCALE,REFVAL,WIDTH,FORMAT,NAME,UNITS) returns the fields of the Table B entry for 0XXYYY, where X & Y (integers) are input arguments and the rest (3 integers and 3 character strings) are returned.
CALL TABLED(X,Y,SEQ,NSEQ) returns the sequence 3XXYYY in Table D, where X & Y are input arguments and NSEQ is the number of descriptors returned in SEQ. All arguments are integer.
CALL CODE(DESCR,VALUE,WORDS) returns in WORDS a description of up to 12 characters corresponding to the code figure VALUE of the descriptor DESCR (both integers).
The following bit-handling programs are unlikely to be needed by users:
VALUE(STRING,IBEFOR,WIDTH) gets a value in WIDTH bits after IBEFOR bits of STRING, where STRING is section 4 of a BUFR message (starting with the length).

VALOUT(STRING,IBEFOR,WIDTH,VALUE) puts VALUE in WIDTH bits after IBEFOR bits of STRING.
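For readers who want to see what such bit handling involves, here is a minimal Python sketch of the same two operations. It is not the code of VALUE or VALOUT: the names `get_bits` and `put_bits` are hypothetical, bits are assumed to be numbered from the most significant bit of the first octet, and nothing is assumed about whether the real routines update IBEFOR.

```python
def get_bits(string, ibefor, width):
    """Return the integer held in `width` bits after `ibefor` bits of `string`."""
    value = 0
    for bit in range(ibefor, ibefor + width):
        octet = string[bit // 8]
        value = (value << 1) | ((octet >> (7 - bit % 8)) & 1)
    return value


def put_bits(string, ibefor, width, value):
    """Store `value` in `width` bits after `ibefor` bits of a bytearray."""
    for i in range(width):
        bit = ibefor + i
        mask = 1 << (7 - bit % 8)
        if (value >> (width - 1 - i)) & 1:
            string[bit // 8] |= mask            # set the bit
        else:
            string[bit // 8] &= ~mask & 0xFF    # clear the bit


section = bytearray(4)
put_bits(section, 5, 15, 28761)     # a 15-bit value starting after 5 bits
recovered = get_bits(section, 5, 15)
```

Because values are not aligned on octet boundaries, a routine of this kind has to work bit by bit (or with shifts and masks across octets), which is why it is kept separate from the rest of the encode and decode.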
ENCODE A VERSION 2 OR 3 BUFR MESSAGE
====================================

CALL ENBUFV2(DESCR,VALUES,NDESCR,NELEM,NOBS,NAMES,DATIME,MESAGE,CMP,L,
EDITION,MASTERTABLE,ORIGCENTRE,DATATYPE,DATASUBTYPE,
VERMASTAB,VERLOCTAB,EXTRASECT1,CHARSECT1,EXTRASECT2,
CHARSECT2,SECT3TYPE)

where

| Argument | Type | Use | Description |
|---|---|---|---|
| DESCR | Integer | in/out | An array of BUFR descriptors, large enough to accommodate any expansion needed. Its length is defined by NDESCR below. The array is changed by a BUFR encode, so it needs to be reset if another encode is to be attempted with the original descriptors. |
| VALUES | Real | input | An array of length NOBS*NELEM of values to be encoded. They should be supplied in the units given by Table B. Missing values should be set to -9999999.0 |
| NDESCR | Integer | in/out | Number of descriptors. If this is zero, the descriptor sequence in MESAGE will be used; if the string needs expansion, NDESCR will be found changed on return. |
| NELEM | Integer | in/out | Number of values implied by the descriptor sequence (not always the final value of NDESCR, because the output descriptors include some operators). |
| NOBS | Integer | input | Number of sets of values (reports) to be encoded together. |
| NAMES | Character | input | A string that contains any character values for which there is a corresponding subscript in array VALUES that points to the start of a field in this string (the length comes from Table B). |
| DATIME | Integer | input | Date/time array, length 5 (year, month, day, hour, minute). |
| MESAGE | Character | output | A string that holds the BUFR message as binary data. |
| CMP | Logical | input | TRUE if compression is required, FALSE if not. |
| L | Integer | output | Length of the BUFR message in octets. |
| EDITION | Integer | input | The BUFR edition number (section 1). Code -99 for the default (=3). |
| MASTERTABLE | Integer | input | The BUFR master table (section 1). Code -99 for the default (=0). |
| ORIGCENTRE | Integer | input | Originating centre (section 1). Code -99 for the default (=74). |
| DATATYPE | Integer | input | Data category type (section 1). Code -99 for the default (=255). |
| DATASUBTYPE | Integer | input | Data category subtype (section 1). Code -99 for the default (=0). |
| VERMASTAB | Integer | input | Version number of master tables (section 1). Code -99 for the default (=11 in Jan 2004, but will change from year to year). |
| VERLOCTAB | Integer | input | Version number of local tables (section 1). Code -99 for the default (=0). |
| EXTRASECT1 | Logical | input | Code TRUE if there is extra data to be added to the end of section 1. If so, the data in CHARSECT1 will be added. |
| CHARSECT1 | Character | input | Extra data to add to the end of section 1. |
| EXTRASECT2 | Logical | input | Code TRUE if there is data to be put in section 2. If so, the data in CHARSECT2 will be added. |
| CHARSECT2 | Character | input | Extra data to put in section 2. |
| SECT3TYPE | Integer | input | Section 3, byte 7 (type of data). Code 1 for observed, 0 otherwise. Code -99 for the default (=1). |

Note: The length of MESAGE cannot be much more than the total length of the three inputs DESCR, VALUES & NAMES. The dimension of DESCR may have to be greater than NELEM, because some manipulations expand before deleting.
DECODE ANY BUFR MESSAGE
=======================

CALL DEBUFR(DESCR,VALUES,NAMES,NDESCR,NOBS,MESAGE,DSPLAY)

where

| Argument | Type | Use | Description |
|---|---|---|---|
| DESCR | Integer | output | Contains a list of descriptors in 16-bit form. |
| VALUES | Real | output | Array of size NOBS*NDESCR of values in the units given by Table B. |
| NAMES | Character | output | A string containing any character values returned, for each of which the VALUES array will contain length*(2^16) plus a subscript pointing to the start of a field in this string, the corresponding descriptor being flagged by adding 2^17. |
| NDESCR | Integer | in/out | Must be set to the length of DESCR and will be returned as the output descriptor count. The length must be at least twice the number of descriptors actually returned, as some workspace is needed by the decode routine. |
| NOBS | Integer | in/out | Must be set to the length of VALUES and will be returned as the number of sets of values (reports, profiles). |
| MESAGE | Character | input | The BUFR message to be decoded. |
| DSPLAY | Logical | input | Set to TRUE for a display of element names and values. |

Unfortunately there is no way of telling how big DESCR, VALUES and NAMES must be without first decoding the message, hence dimensions are passed in NDESCR and NOBS to avoid overwriting.
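The way a character value is recovered from the decode output can be sketched as follows. The helper `character_value` and the sample data are hypothetical; the sketch assumes only what is stated above (length*2^16 plus a subscript in VALUES, and 2^17 added to the descriptor), together with the Fortran convention that the subscript counts from 1.

```python
CHARACTER_FLAG = 2 ** 17    # added to the descriptor of a character element


def character_value(descriptor, value, names):
    """Return the character field for one decoded element, or None.

    descriptor and value are corresponding entries of DESCR and VALUES;
    names is the NAMES string returned by the decode.
    """
    if not descriptor & CHARACTER_FLAG:
        return None                        # an ordinary numeric element
    length, start = divmod(int(value), 2 ** 16)
    return names[start - 1:start - 1 + length]   # subscript is 1-based


names = "EXETER   HEATHROW "
flagged = 271 + CHARACTER_FLAG             # sample descriptor in 16-bit form
first = character_value(flagged, 9 * 2 ** 16 + 1, names)
second = character_value(flagged, 9 * 2 ** 16 + 10, names)
numeric = character_value(3073, 287.61, names)
```

An application program would normally test the flag first, as here, and treat every unflagged entry of VALUES as a number in Table B units.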