Geosoft Databases
What is a Geosoft database?
A Geosoft database stores 3D spatial data in a form that allows for both very-large volume data storage and very efficient processing. Geosoft databases are stored on a file system in a file that has extension .gdb
.
The Geosoft database was first designed in 1992 and has evolved to become a de facto standard for airborne and ground geophysical data, geochemical data, and drillhole data of all kinds. All Geosoft programs and third-party applications that use the Geosoft API to read databases are able to work with databases created by any version of Geosoft since 1992, though database features and capabilities that are newer than the reading program will not be available to older programs. This cross-version stability of data has been key to the emergence of Geosoft databases as an industry standard, especially for airborne geophysical surveys.
Fundamentally, Geosoft databases store located data which has either an (x, y) location for 2D data, or an (x, y, z) location for 3D data such that each location may have any number of fields of associated information. Information fields are referred to as 'channels' in the context of a Geosoft Database, and channels may contain numeric data, strings or arrays of numeric data to store things like time-series. For example an airborne magnetic survey may have (x, y, elevation, altimeter, mag) data, and a ground geochemical survey may contain (x, y, elevation, Cu, Au, Hg).
Data in a database is grouped into the concept of 'lines' of data such that each 'line' contains the data from a single airborne survey line, a single ground survey line, a marine ship track, a set of data from a single drill hole, or from any other grouping construct that is suitable for the stored data. Geosoft's desktop application, Oasis montaj, displays a single line of data in a spreadsheet-like form to the user, allowing a user to change lines and change which channels are displayed and the order channels are displayed. Simple databases might place all data into a single line. All lines in a Geosoft database share the channel definitions, of course with data different for each line.
As you work through the exercises in this tutorial you will be introduced to other concepts of Geosoft databases.
Create a new database from a CSV-format data file
A very common form for exchanging located data is in a simple ASCII Comma-Separated-Values (CSV) file. In this exercise we will create a new Geosoft database and import an example CSV file.
The CSV file mag_data.csv
contains located data as follows:
# (X, Y) are on NAD83, UTM Zone 15N X,Y,Z,mag 330000,6629200,211.055369153511,5535 330050,6629200,210.654999327171,5500 330100,6629200,210.286555185181,5476 330150,6629200,209.952679145607,5479 330200,6629200,209.655742280775,5496 330250,6629200,209.397817201372,5544 etc...
In this example, the first line is simply a comment line that identified the spatial coordinate system for the located data. The second line identified the names of the data columns, and data follows starting in line 3. Note that columns of information after the first line are comma-delimited, and each line has 4 columns of information.
Let's write a Python script that will import this data into a new geosoft database. Our initial expectation is to import all the data into a single line, and establish the coordinate system of the spatial channels:
import numpy as np import geosoft.gxpy as gxpy import geosoft.gxpy.gdb as gxdb #create context gxc = gxpy.gx.GXpy() # Open csv-format data file and skip the first line, which is a comment line f = open('mag_data.csv', 'r') f.readline() # the second line contains the channel/field names, from which we create a list of channel names channel_names = f.readline().strip().split(',') #the rest of the file contains data, which we load into a numpy float array data = np.loadtxt(f, delimiter=',') #create a new database from list of channels and numpy data. All data is stored in a single line. gdb = gxdb.Geosoft_gdb.new('mag_data', overwrite=True) line_name = gxdb.create_line_name(0, gxdb.LINE_TYPE_RANDOM, 0) gdb.write_line(line_name, data, channel_names) # set the coordinate system to 'NAD 83 / UTM zone 15N' gdb.coordinate_system = 'NAD83 / UTM zone 15N' # set the mag data units to 'nT' gxdb.Channel(gdb, 'mag').unit_of_measure = 'nT' exit()
line 1 | numpy is the standard library used to work with numerical multi-dimensioned data in Python, and Geosoft modules are designed to most easily work with numpy data. |
---|---|
line 3 | the geosoft.gxpy.gdb module works with Geosoft database. |
line 9 | open the csv file and skip over the first line |
line 13 | the second line contains the names of the data in each column. We will use these names as the channel names in the Geosoft database. Here we read the line, strrip the new-line character and create a list of names by splitting the line using the ',' character as the delimiter. |
line 16 | the numpy module has a versatile method named loadtxt that can be used to import simple ASCII text files. This function returns all the data stored in a 2D numpy array, which will be dimensioned (12670, 4), which means there are 1267 rows of data with 4 data items per row. |
line 19 | here we create a new empty database named 'mag_data', which will result in a file named 'mag_data.gdb'. |
line 20 | all the data will be stored in a single line, and we will accept the default line name generated by the create_line_name function, which by default will be 'L0'. |
line 21 | the write_line method will create channels from the channel_names list and store data from each column of numpy array into each channel respectively. |
line 24 | in Geosoft it is important to identify the coordinate system of all spatial data as soon as data is imported. Once identified, Geosoft applications will automatically convert spatial locations to the coordinate systems of a required context. |
line 27 | Best practice it to identify data units of measurement in the database. |
Opening this database from a Geosoft Desktop application will show the 'L0' line:
Split the data into lines
If we plot the data as a line on a map we see that the data in Line 0 (L0E) would be better organized as separate survey lines. In this case the lines are horizontal, 200 m apart with sampled every 50 m along each line:
Lets modify the script by adding some processing that will use a Geosoft method to split a line into separate lines based on a change in line direction. To split a line into sections we will use a processing function from the GXAPI, which provides the base-level access to the full Geosoft API. The GXAPI function interfaces are not "pythonic", and are more verbose and demand more care when called.
We will use the geosoft.gxapi.GXDU.split_line_xy2 function (GXDU contains many Database Utility functions):
The geosoft.gxpy.gdb module has a Line class and a Channel class that makes is easier to work with lines and channels from a database. Here is the script to import the data and spit the imported line into multiple lines:
import numpy as np import geosoft.gxpy as gxpy import geosoft.gxpy.gdb as gxdb import geosoft.gxapi as gxapi #create context gxc = gxpy.gx.GXpy() # Open csv-format data file and skip the first line, which is a comment line f = open('mag_data.csv', 'r') f.readline() # the second line contains the channel/field names, from which we create a list of channel names channel_names = f.readline().strip().split(',') #the rest of the file contains data, which we load into a numpy float array data = np.loadtxt(f, delimiter=',') #create a new database from list of channels and numpy data. All data is stored in a single line. gdb = gxdb.Geosoft_gdb.new('mag_data_split', overwrite=True) line_name = gxdb.create_line_name() gdb.write_line(line_name, data, channel_names) # set the coordinate system to 'NAD 83 / UTM zone 15N' gdb.coordinate_system = 'NAD83 / UTM zone 15N' # set the mag data units to 'nT' gxdb.Channel(gdb, 'mag').unit_of_measure = 'nT' print(list(gdb.list_lines())) # ['L0'] print(list(gdb.list_channels())) # ['mag', 'X', 'Y', 'Z'] print(gdb.xyz_channels) # ('X', 'Y', 'Z') # split the line into sections knowing lines are E-W, and separated by 200 m. # see https://geosoftinc.github.io/gxpy/9.2/python/GXDU.html?highlight=split_line_xy2#geosoft.gxapi.GXDU.split_line_xy2 split_line_number_start = gxapi.int_ref() split_line_number_start.value = 1 # create instances to the lines and channels needed by the split_line_xy2 function line = gxdb.Line(gdb, 'L0') x_channel = gxdb.Channel(gdb, 'X') y_channel = gxdb.Channel(gdb, 'Y') # lock items as required line.lock = gxdb.SYMBOL_LOCK_READ x_channel.lock = gxdb.SYMBOL_LOCK_WRITE y_channel.lock = gxdb.SYMBOL_LOCK_WRITE # split the original line into segments, based on a lateral distance tolerance of 100 m. gxapi.GXDU.split_line_xy2( gdb.gxdb, line.symbol, gxdb.Channel(gdb, 'X').symbol, gxdb.Channel(gdb, 'Y').symbol, 1, 100.0, gxapi.rDUMMY, gxapi.DU_SPLITLINE_SEQUENTIAL, split_line_number_start, 1, 1) #delete the original line as it is no longer needed gdb.delete_line('L0') # print a list of the new lines print(list(gdb.list_lines())) # ['L1', 'L2', 'L3', 'L4', 'L5', 'L6', ... exit()
line 4 | Includethe geosoft.gxapi module, which will be used at line 49 |
---|---|
line 37 | argument 9 of the split_line_xy2 function (see line 58) requires a geosoft.gxapi.int_ref instance. This is because this argument is passed by reference such that the returned value will be the next line number in the sequence of split lines. Passing by reference is not part of Python so we use an instance of the int_ref class, which has a property value that will hold the value to pass, and will be changed by called function. The gxapi module has predefined classes int_ref (64-bit integers), float_ref (64-bit floating point) and str_ref (unicode strings) specifically for this purpose. |
lines 41-48 | Here we use the Line and Channel classes to create instances to the required line and channels, and we use the instances to set the required locks. |
line 51 | Call the GXDU function to split the line. |
line 65 | Delete the original line as it is no longer required. |
line 70 | When a script terminates any locks on the database are cleared. If you were to continue to use the symbols in the same script it is good practice to call gdb.unlock_all() to unlock all symbols and then re-lock symbols as required. |
After running this script, plotting the database as a line maps produces the following map:
Read, modify and save data
Using numpy
Geosoft databases and data models are designed around the concept of data vector arrays, which support very high performance data processing. Each channel in line holds a single 1-dimensional array of data. Python's standard library for working with data arrays is numpy, and this is a standard part of any scientifically-oriented Python environment. The Geosoft_gdb class provides for reading and writing data directly to/from numpy arrays as well as Geosoft VV vector arrays. This example shows a simple use of numpy and the next section shows the same exercise using a Geosoft VV.
If we open the mag_data_split.gdb database we see that the mag data has values around 5000 nT.
The following script will subtract 5000 from the data in the 'mag' channel and save it to a new channel named 'mag_base':
import geosoft.gxpy as gxpy import geosoft.gxpy.gdb as gxdb gxc = gxpy.gx.GXpy() # open the database, best practice is to use a 'with ...' construct with gxdb.Geosoft_gdb.open('mag_data_split') as gdb: # make a new channel for the output, duplicate properties of 'mag' channel new_mag_channel = gxdb.Channel.new(gdb, 'mag_base', dup='mag', replace=True) # work through each line for line in gdb.list_lines(): print ('processing line {}'.format(line)) # read data from the line. # The read_channel method returns the data as a numpy array, and the fiducial mag_data, fid = gdb.read_channel(line, 'mag') # use simple numpy math to subtract 5000, then save to the new_mag_channel mag_data = mag_data - 5000 gdb.write_channel(line, new_mag_channel, mag_data, fid) exit()
line 10 | In this example we will write the processed data to a new channel named 'mag_base', which will duplicate the properties of the 'mag' channel, such as inheriting the type, units and display properties. |
---|---|
line 13 | The standard pattern for processing data in a database is to work process each line in a loop. Individual lines can hold huge amounts of data, and an entire database can be unreasonably large. By processing data in line segments we are able to handle extremely large data sets. |
line 17 | Here we read a single channel of data from a line and return it in a numpy array. All read functions that return numpy arrays also return the fiducial of the data. The fiducial references the start point and separation between data rows relative to the reference sequence of the data, and reading to numpy arrays will always ensure the data is sampled to the full range of the data at the smallest data increment based on the data requested. Always pass this fiducial back when writing data, in this case at line 23 |
line 22 | numpy is the standard array processing library for scientific Python applications, and as such it provides for very high performance operations on numpy data. You should always use optimal numpy operations to process data, in this case simply subtracting 5000 from the data. |
Open the modified database and display the 'mag_base' channel:
Using Geosoft VV
This example does the same thing as the previous example, but uses the Geosoft vector processing functions from the Geosoft VV library. Geosoft VV functions will deliver even higher performance than numpy, and while you may be limited by the capabilities of the Geosoft API, there are many functions that perform geoscience-specific operations on data (see gxapi.GXVVU). Note that data in a Geosoft VV can also be exposed as a numpy array to provide numpy access when needed.
import geosoft.gxpy as gxpy import geosoft.gxpy.gdb as gxdb import geosoft.gxapi as gxapi gxc = gxpy.gx.GXpy() # open the database, best practice is to use a 'with ...' construct with gxdb.Geosoft_gdb.open('mag_data_split') as gdb: # make a new channel for the output, duplicate properties of 'mag' channel new_mag_channel = gxdb.Channel.new(gdb, 'mag_base', dup='mag', replace=True) # work through each line for line in gdb.list_lines(): print ('processing line {}'.format(line)) # read data from the line. # The read_channel method returns the data in a geosoft VV mag_data = gdb.read_channel_vv(line, 'mag') # use Geosoft GXVVU.translate function to subtract 5000. gxapi.GXVVU.translate(mag_data.gxvv, -5000, 1) gdb.write_channel_vv(line, new_mag_channel, mag_data) exit()
line 11 | returns the data in a geosoft.gxpy.VV instance, in which the fiducial is a property. |
---|---|
line 23 | calling into the gxapi to translate the data. |
Apply an expression to a database
In this exercise we calculate the distance along each line from the starting point of each line. The Pythagorean expression is simple and can be implemented in a single line of numpy code. This example reads multiple channels from the database and uses numpy slicing to work with individual channel columns from a 2D numpy array:
import numpy as np import geosoft.gxpy as gxpy import geosoft.gxpy.gdb as gxdb gxc = gxpy.gx.GXpy() # open the database, best practice is to use a 'with ...' construct with gxdb.Geosoft_gdb.open('mag_data_split') as gdb: # make a distance channel dist_channel = gxdb.Channel.new(gdb, 'distance', dup='x', replace=True) # work through each line for line in gdb.list_lines(): print ('processing line {}'.format(line)) # read data from the line, returns in a 2D numpy array. xy_data, channels_read, fid = gdb.read_line(line, ('x', 'y')) # get the first point (x0, y0) x0 = xy_data[0, 0] y0 = xy_data[0, 1] # use numpy array math to calculate distance in a 1D array dist_array dist_array = np.sqrt((xy_data[:, 0] - x0)**2 + (xy_data[:, 1] - y0)**2) # save the distance gdb.write_channel(line, dist_channel, dist_array, fid) exit()
line 19 | Here we are reading multiple channels from a single line. If we were to omit the list of channels, here expressed as tuple ('x', 'y'), all channels in the line would be returned in a 2D numpy array. But in this case we will get a 2D numpy array that contains (x, y) on each row. The returned channels_read is a list of the channels read. |
---|---|
line 26 | This expression uses numpy vector operations and is therefore very fast. You could also iterate through the array, but this is much slower: calculate distance element-by-element dist_array = np.zeros(len(xy_data)) for i in range(len(dist_array)): dx = xy_data[i, 0] - x0 dy = xy_data[i, 1] - y0 dist_array[i] = math.sqrt(dx **2 + dy **2) |