A Guide to Database Management & Big Data
These are the days of ‘Big Data’, but even relatively small datasets produced by biological research can be cumbersome to work with. Biologists typically work in Microsoft Excel, which is useful for basic organization and graphing but has its limitations. Excel has a maximum row count of 1,048,576, which sounds sufficient until you consider how easily you can obtain data, such as hourly water temperature recordings, with today’s technology. My research involves using telemetry technologies to locate individual animals through time and space. I’ve tagged 475 fish during my doctorate, each being detected by a network of over 100 receivers throughout the Saint John River at least once a minute! This has produced over 4 million rows of detection data, not to mention all of the environmental variables which might help explain the fish movements.
If a dataset is less than a million rows, why consider software other than Excel? Imagine working in one file, then realizing that another version of the file exists with slightly different formatting, which now also needs to be updated in the same way. Herein lies the problem: it’s too easy to make mistakes. In research, managing multiple files with multiple tabs, all colour-coded and formatted for different purposes, inevitably means that something will get left behind, confused, or forgotten. We’ve all been there – I completed my masters using Excel. By the end of it, I had done so many repetitive tasks like the one above that I’m now forced to work every day with wrist braces to prevent carpal tunnel.
At the beginning of my doctorate, I knew that using Excel was not going to be an option for working with telemetry data. So with the help of some close friends and Google, I taught myself how to build and work with a database, and this is an introduction so that you can do the same.
There are three main benefits to keeping your data within a database:
It is a single source of truth
You input your data, then never touch it again. All of your work is done by querying the database to pull out the information that you need for a particular task in any order you want, with any filtering you want, and in any format you want, all without changing the underlying database.
It is relational
This means that all of the tables have a column with unique numbers for each row which can link to all of the other tables. A relational database allows for something like fish movements to link to other important variables such as when the sun was setting that day, what the water temperature was, or how much water was being moved through a nearby hydroelectric dam.
It is spatial
Any time there is a latitude and longitude coordinate, the database (through the free PostGIS extension) can convert that coordinate into a spatial reference which can be plotted by any Geographic Information System (GIS) such as ArcGIS or QGIS. There is one more benefit – you can do all of this for free.
Developing your Database
Here are the steps and links for the software to download:
This software installs the underlying database, which is called PostgreSQL (SQL = Structured Query Language). Go through the installation prompts, and write down important information, including:
Host: localhost
Port: 5432
Database: choose_a_name (keep everything lowercase and without spaces)
Password: something easily remembered, and appropriate for others to see
Schema: choose_a_name
This software is a manager called pgAdmin for the database you just created. There are multiple managers, but this one works well for creating tables and importing data. Go through the prompts and input the important information from when you created the database.
Structure your data
There are a couple of rules to follow when organizing data into tables and columns in a csv (comma separated value) file. The irony is that this will look like Excel at first: open an Excel file, then save it as a .csv file. CSV files do not allow multiple tabs or hidden formatting, which are often an issue in Excel. Some conventions to follow: name the csv file the same as the table you want to create, and keep blank csv files with just the column headers that match the database table so that you can easily fill them in as more data comes along to be appended to the existing dataset. Also, make sure all of your table and column names are lowercase with underscores instead of spaces, and include units when appropriate (e.g. detection_datetime_utc, water_temperature_c, weight_g).
The first column is your primary key, which holds the unique row IDs that allow your database to be relational. Name this column tablename_id. If you want to relate two tables, then include the primary key of table 1 in table 2. The rest of the column header names are up to you and your data – just make sure that any data you put under those headers follows the same formatting. It’s a good idea to end every table with a ‘notes’ column so that you have a place to write in whatever information you need without strict formatting.
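Putting those rules together, a blank csv for a fish table might contain nothing but a header row like this (the column names are just illustrative):
fish_id,species,length_mm,weight_g,notes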
Any time you feel like you want to copy and paste the same information from one row to the next with just one or two columns being different – don’t! This is an indicator that you need another table. For example, I have a table with all of the biological information for each fish, but sometimes fish were tagged with two different kinds of tags. Instead of duplicating all of the information about the same fish, I simply created another table for all of my tag information which relates back to the fish table through the primary key. Here is an example of these tables:
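A minimal sketch of the two tables (with made-up values) could look like this:
fish: fish_id | species | weight_g | notes
1 | atlantic_salmon | 2500 | tagged twice
2 | atlantic_salmon | 2100 |
tag: tag_id | fish_id | tag_type | notes
1 | 1 | acoustic |
2 | 1 | radio |
3 | 2 | acoustic |
Fish 1 carries two tags, but its biological information is stored only once – the tag table just points back to it through fish_id.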
You do not need to include the primary keys from all tables in all of your other tables. Say I wanted to relate the fish table to another table called receiver, which has all of the location information each time a tag was detected. The receiver table would have the tag_id key, and it would still be able to relate to the fish table because the tag table has the fish_id key.
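Once we get to querying below, that chain of keys is exactly what joins follow. A sketch, with hypothetical column names in the receiver table:
select fish.fish_id, receiver.detection_datetime_utc, receiver.station_name
from doctorate.receiver
inner join doctorate.tag on receiver.tag_id=tag.tag_id
inner join doctorate.fish on tag.fish_id=fish.fish_id;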
Next, create tables in the database with the same structure as the csv files. Do this by right-clicking on the schema and creating a new table and columns. When creating the columns, specify what the datatype will be for each. Here are some common data types:
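integer – whole numbers (counts, IDs)
numeric – decimal numbers (weight_g, water_temperature_c)
text – character strings of any length (species names, notes)
timestamp – a date and time together (detection_datetime_utc)
date – a calendar date on its own
boolean – true/false
PostgreSQL offers many more, but these cover most biological datasets.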
Once a table has been created, right-click on it to import the csv file. Specify that there are headers, and set the delimiter to a comma (because it’s a Comma Separated Value file). Now that the tables have been created and the data imported, I recommend using another database manager called DBeaver to actually work with your data through queries. DBeaver is better than pgAdmin for querying because it reminds you of what columns are in the tables and is overall easier to use. Again, go through the prompts and input the information about the database. Everything that was done in pgAdmin will automatically be in DBeaver because they’re managing the same underlying database.
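As an aside, once you have a SQL console open, the same csv import can be done with a single command instead of the dialog. A sketch – swap in your own table name and file path, and note that copy reads from the database server’s filesystem (the psql client has an equivalent \copy that reads from your own computer):
copy doctorate.fish from '/path/to/fish.csv' with (format csv, header true);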
Querying a Database
Now the fun begins! Right-click on one of your tables and click on ‘Read data in SQL console’ to open a window where you can query your data. Here is the basic structure of a query:
select table.column
from schema.table1
inner join schema.table2 on table1.table2_id=table2.table2_id
where table.column=value and/or table.column in (1,2,3) and/or table.column like '%string%', etc.
order by table.column;
Press CTRL+Enter to run the query.
To further explain, the first line uses ‘select’ to output any column that you would like from any table. This could be things like tag numbers, datetimes, water temperature, etc. Simple math can also be done here, such as count(tag_number) or avg(water_temperature). The second line uses ‘from’ to determine which is the parent table that you’re starting with, but of course you can select columns from other tables as long as they join to the parent table. This is what happens in the third line using ‘inner join’, where you specify the matching primary key columns in both tables to create the join. You can join multiple tables by adding more ‘inner join’ lines, which can go between the parent table and table3, or between table2 and table3. The fourth line uses ‘where’ to specify selection criteria. This can be a direct equals sign ‘=’ if the criterion is a single number, ‘in’ with a list of values in parentheses, or, if the criterion is a string, ‘like’ with a wild card ‘%’. Many criteria can be applied to the selection by using ‘and’ between each criterion. Finally, always be sure to order your data using ‘order by’, and end a query with a semi-colon ‘;’.
Here is an example:
select fish.tag_datetime_utc, fish.weight_g, fish.notes
from doctorate.fish
inner join doctorate.tag on fish.fish_id=tag.fish_id
where fish.lifestage='adult' and fish.weight_g between 20 and 30
order by fish.fish_id;
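The count() and avg() functions mentioned earlier pair naturally with a ‘group by’ line, which collapses the rows into one summary row per group. A sketch using the same tables:
select fish.lifestage, count(tag.tag_id) as number_of_tags, avg(fish.weight_g) as mean_weight_g
from doctorate.fish
inner join doctorate.tag on fish.fish_id=tag.fish_id
group by fish.lifestage
order by fish.lifestage;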
You can also connect the database to statistics software such as R.
To connect to your database, type this in R:
install.packages("RPostgreSQL") # only needs to be run once
library(RPostgreSQL)
pw <- 'your password'
drv <- dbDriver('PostgreSQL')
con <- dbConnect(drv, dbname='your database name', host='localhost', port=5432, user='postgres', password=pw)
### write whatever script you want, e.g. dat <- dbGetQuery(con, 'select * from doctorate.fish;') ###
dbDisconnect(con) # close the connection when finished
dbUnloadDriver(drv)
Utilizing Spatial Data with GIS Software
Lastly, take advantage of the spatial aspect of the database by connecting to ArcGIS, or the free software QGIS. For these programs to plot latitude and longitude, first convert the coordinates into a new column with the datatype geometry by running the following script/query for each table that has latitude and longitude columns (these functions come from the free PostGIS extension):
select AddGeometryColumn('schema','table','geom',4326,'POINT',2);
update schema.table set geom=st_setsrid(st_makepoint(longitude,latitude),4326);
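To confirm the conversion worked, peek at a few rows of the new column – st_astext is a standard PostGIS function that prints geometry in a human-readable form:
select st_astext(geom) from schema.table limit 5;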
To connect to ArcGIS, download this extension then click on New Connection and fill in the important information from when you created your database.
To connect to QGIS, find PostGIS in the menu on the left, right-click on New Connection and fill in the important information from when you created your database.
Once you connect, any tables with the geom column will be available for you to add as a GIS layer.
I hope you found this to be a helpful introduction to creating and working with your own database. Remember, you have a friend in me (amanda@babin-labs.com) and also in Google – just copy and paste any errors you come across in quotes with the word “postgresql” and someone on the internet has probably already found a solution.
Good luck and happy querying!
Guest Author
Amanda Babin is a PhD candidate in Biology at the University of New Brunswick, working on the Mactaquac Aquatic Ecosystem Study (MAES) with the Canadian Rivers Institute. Amanda grew up in Nova Scotia on the Shubenacadie River. The Bay of Fundy – home to the highest tides in the world – and its rivers easily led to a dedication to aquatic biology. Her exploration started in the intertidal zone as a boomerang student of St. Andrews, NB. Amanda completed a masters in marine mammalogy and ocean noise, and is currently working toward a PhD on hydropower reservoir impacts on salmon migration. Supporting water management requires interdisciplinary efforts, integrating scientific results into management policy.
If this guide interests you or if you have further questions, please visit Amanda’s Website or send her an Email.