GRASS GIS 7 Programmer's Manual
7.9.dev(2021)-e5379bbd7
Large data files which contain data in a matrix format often need to be accessed in a nonsequential or random manner. This requirement complicates the programming.
Methods for accessing the data are to:
(1) read the entire data file into memory and process the data as a two-dimensional matrix,
(2) perform direct access i/o to the data file for every data value to be accessed, or
(3) read only portions of the data file into memory as needed.
Method (1) greatly simplifies the programming effort since i/o is done once and data access is simple array referencing. However, it has the disadvantage that large amounts of memory may be required to hold the data. The memory may not be available, or if it is, system paging of the module may severely degrade performance. Method (2) is not much more complicated to code and requires no significant amount of memory to hold the data, but performing i/o for every data value will certainly degrade performance. Method (3) is a mixture of (1) and (2). Memory requirements are fixed and data is read from the data file only when not already in memory. However, the programming is more complex.
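The memory trade-off between methods (1) and (3) can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the Segment Library; the function names sketch_full_matrix_bytes() and sketch_resident_bytes() are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Bytes needed to hold the whole nrows x ncols matrix in memory (method 1).
 * Returns 0 if the product would overflow a 64-bit unsigned integer. */
uint64_t sketch_full_matrix_bytes(uint64_t nrows, uint64_t ncols, uint64_t len)
{
    uint64_t cells;

    if (ncols != 0 && nrows > UINT64_MAX / ncols)
        return 0;
    cells = nrows * ncols;
    if (len != 0 && cells > UINT64_MAX / len)
        return 0;
    return cells * len;
}

/* Bytes needed to keep nseg segments of srows x scols items in memory
 * (method 3). The amount does not depend on the size of the matrix. */
uint64_t sketch_resident_bytes(uint64_t nseg, uint64_t srows, uint64_t scols,
                               uint64_t len)
{
    return nseg * srows * scols * len;
}
```

For a 50000 x 50000 matrix of 4-byte values, sketch_full_matrix_bytes() gives 10000000000 bytes (about 10 GB), whereas 16 resident segments of 64 x 64 values need 262144 bytes (256 KiB).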
The routines provided in this library are an implementation of method (3). They are based on the idea that if the original matrix were segmented or partitioned into smaller matrices, these segments could be managed to reduce both the memory required and the i/o. Data access along connected paths through the matrix (i.e., moving up or down one row and left or right one column) should benefit.
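To illustrate the partitioning idea, the sketch below maps a (row, col) position of the original matrix to a segment number and a byte offset inside that segment. It assumes segments are numbered row by row across the matrix and that items inside a segment are stored row by row; this layout, the sketch_layout type and the function names are assumptions made for illustration, not a description of the library's internals.

```c
/* Hypothetical description of a segmented matrix. */
typedef struct {
    long nrows, ncols; /* size of the original matrix */
    int srows, scols;  /* size of one segment */
    int len;           /* bytes per data item */
} sketch_layout;

/* Number of segments side by side across the matrix; a partial segment
 * at the right edge counts as a whole one. */
long sketch_segs_per_row(const sketch_layout *lay)
{
    return (lay->ncols + lay->scols - 1) / lay->scols;
}

/* Computes the segment number and the byte offset within that segment
 * for the item at (row, col) of the original matrix. */
void sketch_address(const sketch_layout *lay, long row, long col,
                    long *seg_no, long *offset)
{
    *seg_no = (row / lay->srows) * sketch_segs_per_row(lay) + col / lay->scols;
    *offset = ((row % lay->srows) * (long)lay->scols + col % lay->scols)
              * lay->len;
}
```

With a 1000 x 1000 matrix, 64 x 64 segments and 4-byte items, positions (70, 130) and (71, 131) both fall into segment 18, which is why stepping to a neighbouring cell usually needs no additional i/o.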
In most applications, the original data is not in the segmented format. The data must be transformed from the nonsegmented format to the segmented format. This means reading the original data matrix row by row and writing each row to a new file with the segmentation organization. This step corresponds to the i/o step of method (1).
Then data can be retrieved from the segment file through routines by specifying the row and column of the original matrix. Behind the scenes, the data is paged into memory as needed and the requested data is returned to the caller.
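The paging behaviour described above can be simulated without any file i/o. The sketch below keeps a fixed number of slots and counts how often a segment would have to be loaded from disk. It assumes a least-recently-used replacement policy, which is an assumption of this example and not a statement about the library; all names are hypothetical.

```c
#define SKETCH_MAX_SLOTS 16

/* Hypothetical pager state: which segment each memory slot holds. */
typedef struct {
    long seg_no[SKETCH_MAX_SLOTS];       /* segment in each slot, -1 if empty */
    unsigned long age[SKETCH_MAX_SLOTS]; /* tick of the last use, 0 if unused */
    int nslots;                          /* number of slots in use (1..16) */
    unsigned long tick;                  /* counts calls to the pager */
    long loads;                          /* simulated reads from disk */
} sketch_pager;

/* Prepares a pager with nslots slots, clamped to the range 1..16. */
void sketch_pager_init(sketch_pager *p, int nslots)
{
    int i;

    if (nslots < 1)
        nslots = 1;
    if (nslots > SKETCH_MAX_SLOTS)
        nslots = SKETCH_MAX_SLOTS;
    p->nslots = nslots;
    p->tick = 0;
    p->loads = 0;
    for (i = 0; i < SKETCH_MAX_SLOTS; i++) {
        p->seg_no[i] = -1;
        p->age[i] = 0;
    }
}

/* Requests a segment and returns the slot that holds it. If the segment
 * is not in memory, the least recently used slot is reused and the load
 * counter is incremented. */
int sketch_pager_touch(sketch_pager *p, long seg_no)
{
    int i, victim = 0;

    p->tick++;
    for (i = 0; i < p->nslots; i++) {
        if (p->seg_no[i] == seg_no) {
            p->age[i] = p->tick; /* already in memory */
            return i;
        }
        if (p->age[i] < p->age[victim])
            victim = i; /* remember the oldest slot seen so far */
    }
    p->seg_no[victim] = seg_no;
    p->age[victim] = p->tick;
    p->loads++;
    return victim;
}
```

Repeated requests for the same segment do not increase the load counter, so an access pattern that stays inside a few segments costs only a few simulated reads.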
Names in the Segment Library start with the prefix Segment_. To avoid name conflicts, programmers should not create variables or routines in their own modules which use this prefix.
The routines in the Segment Library are described below, more or less in the order they would logically be used in a module. They use a data structure called SEGMENT, which is defined in the header file grass/segment.h. This header must be included in any code using these routines.
A temporary file needs to be prepared and a SEGMENT structure needs to be initialized before any data can be transferred to the segment file. This can be done with the Segment Library routine:
int Segment_open(SEGMENT *SEG, char *fname, off_t nrows, off_t ncols, int srows, int scols, int len, int nseg), open a new segment structure.
A new file with full path name fname will be created and formatted. The original nonsegmented data matrix has nrows rows and ncols columns. Each segment has srows rows and scols columns. The data items have length len bytes. The number of segments to be retained in memory is given by nseg. This routine calls Segment_format() and Segment_init(), see below. If Segment_open() is used, the routines Segment_format() and Segment_init() must not be used.
Return codes are: 1 ok; else a negative number between -1 and -6 encoding the error type.
Alternatively, the first step is to create a file which is properly formatted for use by the Segment Library routines:
int Segment_format (int fd, off_t nrows, off_t ncols, int srows, int scols, int len), format a segment file
The segmentation routines require a disk file to be used for paging segments in and out of memory. This routine formats the file open for write on file descriptor fd for use as a segment file. A segment file must be formatted before it can be processed by other segment routines. The configuration parameters nrows, ncols, srows, scols, and len are written to the beginning of the segment file which is then filled with zeros.
The corresponding nonsegmented data matrix, which is to be transferred to the segment file, is nrows by ncols. The segment file is to be formed of segments which are srows by scols. The data items have length len bytes. For example, if the data type is int, len is sizeof(int).
Return codes are: 1 ok; else -1 could not seek or write fd, or -3 illegal configuration parameter(s).
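Because the matrix dimensions are rarely exact multiples of the segment dimensions, a segment file can be larger than the raw matrix. The sketch below estimates the number of segments and the size of the data portion of such a file. It assumes that partial segments at the right and bottom edges are stored at full size and it ignores the configuration header; both points are assumptions of this example, and the function names are hypothetical.

```c
#include <stdint.h>

/* Number of segments needed to cover an nrows x ncols matrix.
 * Returns 0 for a segment size of zero, which is not a legal
 * configuration. */
uint64_t sketch_segment_count(uint64_t nrows, uint64_t ncols,
                              uint64_t srows, uint64_t scols)
{
    uint64_t seg_rows, seg_cols;

    if (srows == 0 || scols == 0)
        return 0;
    seg_rows = (nrows + srows - 1) / srows; /* round up */
    seg_cols = (ncols + scols - 1) / scols; /* round up */
    return seg_rows * seg_cols;
}

/* Bytes occupied by the segments themselves, header not included. */
uint64_t sketch_segment_data_bytes(uint64_t nrows, uint64_t ncols,
                                   uint64_t srows, uint64_t scols,
                                   uint64_t len)
{
    return sketch_segment_count(nrows, ncols, srows, scols)
           * srows * scols * len;
}
```

Under these assumptions, a 1000 x 1000 matrix of 4-byte items split into 64 x 64 segments needs 256 segments and 4194304 bytes of segment data, compared with 4000000 bytes for the raw matrix.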
The next step is to initialize a SEGMENT structure to be associated with a segment file formatted by Segment_format().
int Segment_init (SEGMENT *seg, int fd, int nsegs), initialize segment structure
Initializes the seg structure. The file on fd is a segment file created by Segment_format() and must be open for reading and writing. The segment file configuration parameters nrows, ncols, srows, scols, and len, as written to the file by Segment_format(), are read from the file and stored in the seg structure. The nsegs argument specifies the number of segments that will be retained in memory. The minimum value allowed is 1.
Return codes are: 1 if ok; else -1 could not seek or read segment file, or -2 out of memory.
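The return codes listed for Segment_format() and Segment_init() can be turned into readable messages before they are reported to the user. The helper names sketch_format_status() and sketch_init_status() below are hypothetical; the codes and their meanings are taken from the descriptions above.

```c
/* Message for a Segment_format() return code. */
const char *sketch_format_status(int rc)
{
    switch (rc) {
    case 1:
        return "ok";
    case -1:
        return "could not seek or write segment file";
    case -3:
        return "illegal configuration parameter(s)";
    default:
        return "unknown return code";
    }
}

/* Message for a Segment_init() return code. */
const char *sketch_init_status(int rc)
{
    switch (rc) {
    case 1:
        return "ok";
    case -1:
        return "could not seek or read segment file";
    case -2:
        return "out of memory";
    default:
        return "unknown return code";
    }
}
```

A module would typically compare the return value with 1 and, on failure, pass the corresponding message to its own error reporting routine.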
Then data can be written from another file to the segment file row by row:
int Segment_put_row (SEGMENT *seg, char *buf, int row), write row to segment file
Transfers nonsegmented matrix data, row by row, into a segment file. The seg argument is the segment structure that was configured by a call to Segment_init(). The buf argument should contain ncols*len bytes of data to be transferred to the segment file. The row argument specifies the row of the data matrix being transferred.
Return codes are: 1 if ok; else -1 could not seek or write segment file.
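The row buffer passed to the row-oriented routines must hold ncols*len bytes. A minimal sketch of allocating such a buffer with an overflow check follows; sketch_alloc_row() is a hypothetical helper, not a library routine.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates a zero-filled buffer for one matrix row of ncols items,
 * each len bytes long, and stores the buffer size in *nbytes.
 * Returns NULL if the size is zero, would overflow size_t, or the
 * allocation fails. The caller releases the buffer with free(). */
void *sketch_alloc_row(size_t ncols, size_t len, size_t *nbytes)
{
    if (ncols == 0 || len == 0 || ncols > SIZE_MAX / len)
        return NULL;
    *nbytes = ncols * len;
    return calloc(ncols, len);
}
```

For integer data, the call would be sketch_alloc_row(ncols, sizeof(int), &nbytes), and the same buffer can be reused for every row.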
Then data can be read from or written to the segment file randomly:
int Segment_get (SEGMENT *seg, char *value, int row, int col), get value from segment file
Provides random read access to the segmented data. It gets len bytes of data into value from the segment file seg for the corresponding row and col in the original data matrix.
Return codes are: 1 if ok; else -1 could not seek or read segment file.
int Segment_put (SEGMENT *seg, char *value, int row, int col), put value to segment file
Provides random write access to the segmented data. It copies len bytes of data from value into the segment structure seg for the corresponding row and col in the original data matrix.
The data is not written to disk immediately. It is stored in a memory segment until the segment routines decide to page the segment to disk.
Return codes are: 1 if ok; else -1 could not seek or write segment file.
After random reading and writing is finished, the pending updates must be flushed to disk:
int Segment_flush (SEGMENT *seg), flush pending updates to disk
Forces all pending updates generated by Segment_put() to be written to the segment file seg. Must be called after the final Segment_put() to force all pending updates to disk. Must also be called before the first call to Segment_get_row().
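The need for an explicit flush can be illustrated with a small write-back model. In the sketch below, an array named disk stands in for the segment file and an array named mem for a segment held in memory. The model assumes that a read which goes straight to the file does not see updates still held in memory; this is an assumption consistent with the requirement above, not a description of the library's code, and all names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_SEG_BYTES 16

/* Hypothetical model of one cached segment. */
typedef struct {
    unsigned char disk[SKETCH_SEG_BYTES]; /* stands in for the segment file */
    unsigned char mem[SKETCH_SEG_BYTES];  /* copy of the segment in memory */
    int dirty;                            /* 1 if mem has unwritten updates */
} sketch_slot;

/* Starts with a zero-filled file and an identical in-memory copy. */
void sketch_slot_init(sketch_slot *s)
{
    memset(s, 0, sizeof *s);
}

/* Random write: changes only the in-memory copy and marks it dirty. */
void sketch_slot_put(sketch_slot *s, size_t i, unsigned char v)
{
    s->mem[i] = v;
    s->dirty = 1;
}

/* Read that bypasses the in-memory copy and looks at the file only. */
unsigned char sketch_slot_read_disk(const sketch_slot *s, size_t i)
{
    return s->disk[i];
}

/* Writes pending updates to the file and clears the dirty flag. */
void sketch_slot_flush(sketch_slot *s)
{
    if (s->dirty) {
        memcpy(s->disk, s->mem, SKETCH_SEG_BYTES);
        s->dirty = 0;
    }
}
```

In this model, a value stored with sketch_slot_put() becomes visible to sketch_slot_read_disk() only after sketch_slot_flush() has been called.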
Now the data in the segment file can be read row by row and transferred to a normal sequential data file:
int Segment_get_row (SEGMENT *seg, char *buf, int row), read row from segment file
Transfers data from a segment file, row by row, into memory (which can then be written to a regular matrix file). The seg argument is the segment structure that was configured by a call to Segment_init(). The buf argument will be filled with ncols*len bytes of data corresponding to the given row in the data matrix.
Return codes are: 1 if ok; else -1 could not seek or read segment file.
Finally, memory allocated in the SEGMENT structure is freed:
int Segment_release (SEGMENT *seg), free allocated memory
Releases the allocated memory associated with the segment file seg. Does not close the file. Does not flush the data which may be pending from previous Segment_put() calls.
The following routine both deletes the segment file and releases allocated memory:
int Segment_close (SEGMENT *seg), close segment structure
Deletes the segment file and uses Segment_release() to release the allocated memory. No further cleaning up is required.
The following should provide the programmer with a good idea of how to use the Segment Library routines. The examples assume that the data is integer. Creation of a segment file and initialization of the segment structure at once:
Alternatively, the first step is the creation and formatting of a segment file. A file is created, formatted and then closed:
The next step is the conversion of the nonsegmented matrix data into segment file format. The segment file is reopened for read and write and initialized:
Both the segment file and the segment structure are now ready to use, and data can be read row by row from the original data file and put into the segment file:
Of course, if the intention is only to add new values rather than update existing values, the step which transfers data from the original matrix to the segment file, using Segment_put_row(), could be omitted, since Segment_format() will fill the segment file with zeros.
The data can now be accessed directly using Segment_get(). For example, to get the value at a given row and column:
Similarly Segment_put() can be used to change data values in the segment file:
Once the random access processing is complete, the data would be extracted from the segment file and written to a nonsegmented matrix data file as follows:
Finally, the memory allocated for use by the segment routines would be released and the file closed:
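The steps described above can be combined into a single sketch that uses the Segment_open() and Segment_close() variant. It is a minimal illustration under several assumptions: the function demo_segment_roundtrip() is hypothetical, the matrix is held in an ordinary array instead of a data file, the segment size of 64 x 64 and the 4 segments kept in memory are arbitrary choices, and G_tempfile() from the GRASS GIS library is assumed to supply the file name. Compiling it requires the GRASS headers and libraries.

```c
#include <sys/types.h>
#include <grass/gis.h>
#include <grass/segment.h>

/* Hypothetical example: doubles the value at (row, col) of an
 * nrows x ncols integer matrix by passing the matrix through a
 * segment file. Returns 0 on success and -1 on failure. */
int demo_segment_roundtrip(int *matrix, off_t nrows, off_t ncols,
                           off_t row, off_t col)
{
    SEGMENT seg;
    int value;
    off_t r;

    /* Create the segment file and initialize the structure at once.
     * G_tempfile() assumes the GRASS environment has been set up. */
    if (Segment_open(&seg, G_tempfile(), nrows, ncols, 64, 64,
                     sizeof(int), 4) != 1)
        return -1;

    /* Transfer the nonsegmented data row by row. */
    for (r = 0; r < nrows; r++)
        if (Segment_put_row(&seg, matrix + r * ncols, r) != 1)
            goto fail;

    /* Random access: read one value, change it, write it back. */
    if (Segment_get(&seg, &value, row, col) != 1)
        goto fail;
    value *= 2;
    if (Segment_put(&seg, &value, row, col) != 1)
        goto fail;

    /* Pending updates must reach the file before rows are read. */
    Segment_flush(&seg);
    for (r = 0; r < nrows; r++)
        if (Segment_get_row(&seg, matrix + r * ncols, r) != 1)
            goto fail;

    /* Deletes the segment file and releases the memory. */
    Segment_close(&seg);
    return 0;

fail:
    Segment_close(&seg);
    return -1;
}
```

If Segment_format() and Segment_init() are used instead of Segment_open(), the file descriptor has to be managed by the module itself, and Segment_release() followed by closing the file takes the place of Segment_close().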
Performance of the Segment Library routines can be improved by about 10% if srows and scols are each powers of 2; in this case a faster alternative is used to access the segment file. An additional improvement can be achieved if len is also a power of 2. For scattered access to a large dataset, smaller segments, i.e. values of 32, 64, or 128 for srows and scols, seem to provide the best performance. Calculating segment size as a fraction of the data matrix size, e.g. srows = nrows / 4 + 1, will result in very poor performance, particularly for larger datasets.
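One plausible reason why power-of-2 segment sizes help is that division and modulo can be replaced by shifts and bit masks. The sketch below shows that both approaches give the same result when the size is a power of 2. It illustrates the general technique only and is not taken from the library; the function names are hypothetical.

```c
/* Returns log2(n) if n is a power of 2, otherwise -1. */
int sketch_log2_exact(unsigned int n)
{
    int bits = 0;

    if (n == 0 || (n & (n - 1)) != 0)
        return -1;
    while ((n >>= 1) != 0)
        bits++;
    return bits;
}

/* General case: splits a row or column index into the index of the
 * segment and the position inside the segment, for any size > 0. */
void sketch_split_div(unsigned long idx, unsigned int size,
                      unsigned long *seg, unsigned long *within)
{
    *seg = idx / size;
    *within = idx % size;
}

/* Power-of-2 case: the same split with a shift and a mask, where
 * bits is log2 of the segment size. */
void sketch_split_shift(unsigned long idx, int bits,
                        unsigned long *seg, unsigned long *within)
{
    *seg = idx >> bits;
    *within = idx & ((1UL << bits) - 1);
}
```

A module could call sketch_log2_exact() once for srows and once for scols, and use the shift variant only when both results are non-negative.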
The library is loaded by specifying
in the Makefile.
See Compiling and Installing GRASS Modules for a complete discussion of Makefiles.