Published on

IBDP Chem: Data Analysis Skills

Authors
  • avatar
    Name
    Jon Chui
    Twitter

Roadmap

The following flowchart shows the path to learning data skills.

The core skills are necessary for every kind of IA. For database IAs, I recommend you to be confident in at least some optional skills. These will open up interesting, less cookie-cutter research questions.

Finally, advanced skill are not taught in class and you would need to follow separate tutorials (online or in small groups) to learn them. These skills not only support unique IAs and Extended Essays, but also will help your personal statements / teacher references glow.

Roadmap

In the next parts you can find the resources we use in our classroom.


Table of Contents

Spreadsheet

This series describes and teaches you all the things you can do with spreadsheets. There is a standard section with skills that all of you should acquire, and a speciality section with skills you can learn to improve your capacity or help tackle an IA.

General skills

Spreadsheet 101

Basic formatting, organization best practices, simple formula, and graphing. You should start with a blank spreadsheet for this.

Excel version of Spreadsheet 101

An introduction to spreadsheet data analysis in Excel, roughly equivalent to 101 above.

Spreadsheet 102

Conditional formatting, complex formulas, data validation, INDEX() and MATCH() to build a query engine.

You can create a copy of this spreadsheet to follow along.

Solver

In the sciences, we often acquire tables of data like the following:

Table 1

x y
1 3
2 5
3 7
4 9

Table 2

p q r
1 1 1
2 3 6
3 5 15
4 7 28

Plotting Table 1 as an x-y graph is straightforward and we can easily obtain y = 2x + 1 as the linear regression. In general, spreadsheets Excel in fitting functions of the kind y = f(x).

On the other hand, the data in Table 2 is simply r = p x q, but Sheets/Excel wouldn't know what to do with it. The reason is that --- despite the simplicity --- this is now of the form r = f(p, q) where the output depends on two inputs.

In the context of chemistry, boiling point bp of a molecule is a function not only of molar mass (MW), but also of its dipole μ and hydrogen bond capacity (H). In other words, bp = f(MW, μ, H) and we may want to find out their relative contributions. Non-linear regression is a general technique we can use in these cases, and the following tutorial shows how to do this using Excel Solver.

A starting template and the step-by-step solution are available.


Data Visualization

Tableau 101: Story-telling with Data

To illustrate what a data exploration platform like Tableau can do, we will use air quality data collected by the Environmental Protection Department of Hong Kong. Start by downloading the following two data files:

  1. Shatin 2021-05 Daily

  2. Central & Tai Po 2020 Hourly

In these files, each row corresponds to data at a particular time (day/hour), and each column a pollutant:

DATE STATION CO FSP NO2 NOX O3 RSP SO2
01-05-2021 SHATIN N.A. 15 47 52 85 21 12
02-05-2021 SHATIN N.A. 19 45 50 69 27 11

All pollutant unit were reported in μg/m3 except CO which is in 10μg/m3.

The pollutants monitored were:

  • CO = Carbon Monoxide
  • FSP = Fine Suspended Particulates
  • NO2 = Nitrogen Dioxide
  • NOX = Nitrogen Oxides
  • O3 = Ozone
  • RSP = Respirable Suspended Particulates
  • SO2 = Sulphur Dioxide

We will use the data to answer questions such as:

  • When do people go to work?
  • How much more traffic is in Central than Tai Po?
  • Which pollutants are related to one another?
  • What pollutants vary with season / time of day?
  • How much are people willing to pay to shorten their commute?

Getting Tableau

Please download, install, and request an education license for the Tableau application: https://www.tableau.com/academic/students. Do this now; it takes some days for the company to process your request. When you register, you'll need to enter which school you go to. I don't know why, but the school search does not work in English for LPCUWC; you will need to copy-paste this as the school name: 李寶椿聯合世界書院.

Examining the data files

The data files are in a .csv format. csv stands for Comma Separated Values and are plain text files that spreadsheet programs can import.

Open the Shatin file in Excel or Google Sheets and examine the content.

Drag-n-Drop graphing

Filtering and Coloring

Calculated Fields

DATETIME(STR([DATE]) + " " + STR([HOUR])+":00:00")

Tableau 102: Transforming Data

Tools more sophisticated than spreadsheets not only demand data in rigorous uniform columns, but also in a structure where each row strictly represents one piece of data. This may be different from how you have been structuring your data, or how your data acquisition software (e.g., Logger Pro) gives you data series. In this section we will look at how you can normalize your data for Tableau to consume. We will transform a set of UV-vis spectra for easy exploration. A more in-depth explanation can be found in the Tableau tutorials for Structuring Data.

First, download the UV-vis spectra series csv, and open this in Sheets/Excel.

Tableau showcase: Database IA walkthrough

Manually accessing public chemistry databases

PubChem / ChemSpider

Protein Database

SDBS, a spectral database

Jon's data locator

Database directories

Programming for Chemists

Python 101

Python 102: thermo module

Python 103: pandas for data handling

Python 104: Graphing with plotly

Python 201: Molecular modeling with psi4