Recent Articles

Featured

New Brunswick’s Pride of Place and Labour Gaps: A Research Summary

VitalSignsNB & P2P+: Using Data to Investigate Health and Healthcare in New Brunswick

Rethinking Inequality and Young Adult Care In New Brunswick

The People Behind the Numbers: A Look at New Brunswick’s Economic Status and New Brunswickers’ Quality of Life

There are 3 main benefits to keeping your data within a database

It is a point source of truth
You input your data, then never touch it again. All of your work is done by querying the database to pull out the information that you need for a particular task in any order you want, with any filtering you want, and in any format you want, all without changing the underlying database.
It is relational
This means that all of the tables have a column with unique numbers for each row which can link to all of the other tables. A relational database allows for something like fish movements to link to other important variables such as when the sun was setting that day, what the water temperature was, or how much water was being moved through a nearby hydroelectric dam.
It is spatial
Any time there is a latitude and longitude coordinate, the database converts that coordinate into a spatial reference which can be plotted by any Geographic Information System (GIS) such as ArcGIS or QGIS. There is one more benefit – you can do all of this for free.

Developing your Database

Here are the steps and links for the software to download:

This software produces an underlying database called PostgreSQL (SQL=Structured Query Language). Go through the installation prompts, and write down important information, including:

Host: localhost
Post: 5432
Database: choose_a_name (keep everything lowercase and without spaces)
Password: something easily remembered, and appropriate for others to see
Schema: choose_a_name

This software is a manager called PgAdmin for the database you just created. There are multiple managers, but this one works well for creating tables and importing data. Go through the prompts and input the important information from when you created the database.

Structure your data

There are a couple rules to follow when organizing data into tables and columns in a csv (comma separated value) file. The irony is that this will look like Excel at first. Open an Excel file, then save it as a .csv file. CSV files do not allow multiple tabs or hidden formatting which is often an issue in Excel. Some conventions to follow are naming the csv file the same as the table you want to create, and to keep blank csv files with just the column headers which match the database table so that you can easily fill it in as more data comes along to be appended to the existing dataset. Also, make sure all of your table and column names are lowercase and with underscores instead of spaces, and include units when appropriate (ex. detection_datetime_utc, water_temperature_c, weight_g, etc.).

The first column is your primary key which has unique row IDs that allow your database to be relational. Name this column tablename_id. If you want to relate two tables, then include the primary key of table 1 in table 2. The rest of the column header names are up to you and your data – just make sure that any data you put under those headers follows the same formatting. It’s a good idea to end every table with a ‘notes’ column so that you have a place to write in whatever information you need without strict formatting.

Any time you feel like you want to copy and paste the same information from one row to the next with just one or two columns being different – don’t! This is an indicator that you need another table. For example, I have a table with all of the biological information for each fish, but sometimes fish were tagged with two different kinds of tags. Instead of duplicating all of the information about the same fish, I simple created another table for all of my tag information which relates back to the fish table through the primary key. Here is an example of these tables:

You do not need to include the primary keys from all tables into all of your other tables. If I wanted to relate the fish table to another table called receiver which had all of the location information each time a tag was detected. The receiver table would have the tag_id primary key, and then it would also be able to relate to the fish table because the tag table has the fish_id primary key.

Next, create tables in the database with the same structure as the csv files. Do this by right-clicking on the schema and creating a new table and columns. When creating the columns, specify what the datatype will be for each. Here are some common data types:

Once a table has been created, right click on it to import the csv file. Specify that there are headers, and set the delimiter to a comma (because it’s a Comma Separated Value file). Now that the tables have been created and the data imported, I recommend using another database manager called DBeaver to actually work with your data through queries. DBeaver is better than PgAdmin for querying because it reminds you of what columns are in the tables and is overall easier to use. Again, go through the prompts and input the information about the database. Everything that was done in PgAdmin will automatically be in DBeaver because they’re managing the same underlying database.

Querying a Database

Now the fun begins! Right-click on one of your tables and click on ‘Read data in SQL console’ to open a window where you can query your data. Here is the basic structure of a query:

select table.column

from schema.table1

inner join schema.table2 on table1.table2_id=table2.table2_id

where table.column= and/or table.column in (1,2,3) and/or like (‘%string%’), etc.

order by table.column;

Click CTRL+Enter to run query

To further explain, the first line uses ‘select’ to output any column that you would like from any table. This could be things like tag numbers, datetimes, water temperature, etc. Simple math can also be done here, such as count(tag_number) or avg(water_temperature). The second line uses ‘from’ to determine which is the parent table that you’re starting with, but of course you can select columns from other tables as long as they join to the parent table. This is what happens in the third line using ‘inner join’, where you specify the primary key column in both tables which match to create a join. You can join multiple tables by adding more ‘inner join’ lines, which can go between the parent table and table3, or between table2 and table3. The fourth line uses ‘where’ to specific selection criteria. This can be a direct equals sign ‘=’ if the criteria is a single number, or ‘in’ with a list of values in parentheses for a list of numbers, or if the criteria is a string then use ‘like’ with a wild card ‘%’. Many criteria can be applied to the selection by using ‘and’ between each criterion. Finally, always be sure to order your data using ‘order by’, and end a query with a semi-colon ‘;’.

Here is an example:

select fish.tag_datetime_utc, fish.weight_g, fish.notes

from doctorate.fish

inner join doctorate.tag on fish.fish_id=tag.fishid

where fish.lifestage=’adult’ and fish.weight_g between 20 and 30

order by fish.fish_id

You can also connect the database to statistics software such as R.

To connect to your database, type this in R

install.packages(“RPostgreSQL”)

library(RPostgreSQL)

pw = 'your password'

drv <- dbDriver('PostgreSQL')

con <- dbConnect(drv,dbname='your database name', host='localhost’, port=5432, user='postgres', password=pw)

###write whatever script you want###

on.exit(dbUnloadDriver(drv),add=TRUE)

dbDisconnect(con)

Utilizing Spatial Data with GIS Software

Lastly, take advantage of the spatial aspect of the database by connecting to ArcGIS, or the free software QGIS. For these programs to plot latitude and longitude, first convert the coordinates into a new column with the datatype geometry by running the following script/query for each table with latitude and longitude columns:

select AddGeometryColumn('schema’,'table','geom',4326,'POINT',2);

update schema.table set geom=st_setsrid(st_makepoint(longitude,latitude),4326);

To connect to ArcGIS, download this extension then click on New Connection and fill in the important information from when you created your database.

To connect to QGIS, find PostGIS in the menu on the left, right click on New Connection and fill in the important information from when you created your database.

Once you connect, any tables with the geom column will be available for you to add as a GIS layer.

I hope you found this to be a helpful introduction to creating and working with your own database. Remember, you have a friend in me (amanda@babin-labs.com) and also in Google – just copy and paste any errors you come across in quotes with the word “postgresql” and someone on the internet has probably already found a solution.

Good luck and happy querying!

Guest Author

Amanda Babin is a PhD candidate in Biology studying at the University of New Brunswick, working on the Mactaquac Aquatic Ecosystem Study (MAES) with the Canadian Rivers Institute. Amanda grew up in Nova Scotia on the Shubenacadie River. The highest tides in the world – the Bay of Fundy and it’s rivers easily led to a dedication toward aquatic biology. Her exploration started in the intertidal zone, being a boomerang student of St. Andrews, NB. Amanda completed a masters in marine mammology and ocean noise, and is currently working toward a PhD on hydropower reservoir impacts on salmon migration. Supporting water management requires interdisciplinary efforts, integrating scientific results into management policy.

If this guide interests you or if you have further question, please visit Amanda’s Website or send her an Email.

Featured Researcher

Featured

Emma Giberson

Hi! My name is Emma Giberson, and I am entering my Masters of Applied Psychology, Research, and Evaluation (MAPRE) at UNB – Saint John. My Masters is the first step toward getting my Ph.D. in Clinical Psychology.

Nominate a Researcher