- Published on
IBDP Chem: Data Analysis Skills
- Authors
- Name
- Jon Chui
Roadmap
The following flowchart shows the path to learning data skills.
The core skills are necessary for every kind of IA. For database IAs, I recommend you to be confident in at least some optional skills. These will open up interesting, less cookie-cutter research questions.
Finally, advanced skill are not taught in class and you would need to follow separate tutorials (online or in small groups) to learn them. These skills not only support unique IAs and Extended Essays, but also will help your personal statements / teacher references glow.
In the next parts you can find the resources we use in our classroom.
Table of Contents
Spreadsheet
This series describes and teaches you all the things you can do with spreadsheets. There is a standard section with skills that all of you should acquire, and a speciality section with skills you can learn to improve your capacity or help tackle an IA.
General skills
Spreadsheet 101
Basic formatting, organization best practices, simple formula, and graphing. You should start with a blank spreadsheet for this.
Excel version of Spreadsheet 101
An introduction to spreadsheet data analysis in Excel, roughly equivalent to 101 above.
Spreadsheet 102
Conditional formatting, complex formulas, data validation, INDEX()
and MATCH()
to build a query engine.
You can create a copy of this spreadsheet to follow along.
Solver
In the sciences, we often acquire tables of data like the following:
Table 1
x | y |
---|---|
1 | 3 |
2 | 5 |
3 | 7 |
4 | 9 |
Table 2
p | q | r |
---|---|---|
1 | 1 | 1 |
2 | 3 | 6 |
3 | 5 | 15 |
4 | 7 | 28 |
Plotting Table 1
as an x-y graph is straightforward and we can easily obtain y = 2x + 1 as the linear regression. In general, spreadsheets Excel in fitting functions of the kind y = f(x)
.
On the other hand, the data in Table 2
is simply r = p x q, but Sheets/Excel wouldn't know what to do with it. The reason is that --- despite the simplicity --- this is now of the form r = f(p, q)
where the output depends on two inputs.
In the context of chemistry, boiling point bp
of a molecule is a function not only of molar mass (MW
), but also of its dipole μ
and hydrogen bond capacity (H
). In other words, bp = f(MW, μ, H)
and we may want to find out their relative contributions. Non-linear regression is a general technique we can use in these cases, and the following tutorial shows how to do this using Excel Solver.
A starting template and the step-by-step solution are available.
Data Visualization
Tableau 101: Story-telling with Data
To illustrate what a data exploration platform like Tableau can do, we will use air quality data collected by the Environmental Protection Department of Hong Kong. Start by downloading the following two data files:
In these files, each row corresponds to data at a particular time (day/hour), and each column a pollutant:
DATE | STATION | CO | FSP | NO2 | NOX | O3 | RSP | SO2 |
---|---|---|---|---|---|---|---|---|
01-05-2021 | SHATIN | N.A. | 15 | 47 | 52 | 85 | 21 | 12 |
02-05-2021 | SHATIN | N.A. | 19 | 45 | 50 | 69 | 27 | 11 |
All pollutant unit were reported in μg/m3 except CO which is in 10μg/m3.
The pollutants monitored were:
CO
= Carbon MonoxideFSP
= Fine Suspended ParticulatesNO2
= Nitrogen DioxideNOX
= Nitrogen OxidesO3
= OzoneRSP
= Respirable Suspended ParticulatesSO2
= Sulphur Dioxide
We will use the data to answer questions such as:
- When do people go to work?
- How much more traffic is in Central than Tai Po?
- Which pollutants are related to one another?
- What pollutants vary with season / time of day?
- How much are people willing to pay to shorten their commute?
Getting Tableau
Please download, install, and request an education license for the Tableau application: https://www.tableau.com/academic/students. Do this now; it takes some days for the company to process your request. When you register, you'll need to enter which school you go to. I don't know why, but the school search does not work in English for LPCUWC; you will need to copy-paste this as the school name: 李寶椿聯合世界書院
.
Examining the data files
The data files are in a .csv
format. csv
stands for Comma Separated Values and are plain text files that spreadsheet programs can import.
Open the Shatin file in Excel or Google Sheets and examine the content.
Drag-n-Drop graphing
Filtering and Coloring
Calculated Fields
DATETIME(STR([DATE]) + " " + STR([HOUR])+":00:00")
Tableau 102: Transforming Data
Tools more sophisticated than spreadsheets not only demand data in rigorous uniform columns, but also in a structure where each row strictly represents one piece of data. This may be different from how you have been structuring your data, or how your data acquisition software (e.g., Logger Pro) gives you data series. In this section we will look at how you can normalize your data for Tableau to consume. We will transform a set of UV-vis spectra for easy exploration. A more in-depth explanation can be found in the Tableau tutorials for Structuring Data.
First, download the UV-vis spectra series csv
, and open this in Sheets/Excel.