
1. Quick statistics re-hash before going into R-specific stuff


Haast tokoeka

Hi! Here are some instructions for running basic parametric statistical analyses in R. Make sure you use an independent information source to familiarise yourself with the limitations and requirements of each test. An important requirement is “independence of X”. I suspect you understand this already and/or intuitively, but get on Wikipedia and have a happy read.

Essential statistics information

Variable type

The variables are the things in an analysis that you want to test for relationships between. Let’s say you are wondering, “do Ohope chicks have a lower hatch weight than those from MEIT?” Then kiwi project (Ohope vs MEIT) is a variable and hatch weight is a variable.

Before conducting your test, you need to know whether your data are categorical (discrete), continuous (scale) or count (the number of something). An example of a categorical variable you might use is kiwi sex or kiwi project. A categorical variable is a type of thing (R calls it a “factor”). A continuous variable might be time or bill length. A continuous variable is one that can be measured on a scale (R stores it as a numeric “vector”). I always visualise a tape-measure or ruler as my scale.

The type of variable you have will dictate which tests you can/cannot use.

In the example above, Kiwi Project is a categorical variable (it can’t be measured on a scale with numbers) while hatch weight is a continuous variable (it can be measured on a scale).
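
To make the factor/vector distinction concrete, here is a minimal sketch of how the two variables from the example might look in R. The object names (kiwi_project, hatch_weight, chicks) and all of the numbers are made up for illustration; they are not real hatch data.

```r
# Hypothetical data: six chicks from two projects (values are made up)
kiwi_project <- factor(c("Ohope", "Ohope", "Ohope", "MEIT", "MEIT", "MEIT"))
hatch_weight <- c(310.5, 295.0, 322.1, 330.4, 318.9, 341.2)  # grams

# A categorical variable is stored as a factor
is.factor(kiwi_project)
levels(kiwi_project)      # the categories R has found

# A continuous variable is stored as a numeric vector
is.numeric(hatch_weight)

# Keep both variables side by side in one data frame
chicks <- data.frame(kiwi_project, hatch_weight)
str(chicks)               # shows the type of each column
```

Running str() on your own data is a quick way to check that R has read each variable as the type you intended.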

NB: Sometimes a variable that might seem like a categorical or continuous variable can actually be a count variable. It depends on the study design and the question you are asking. For example, if you want to compare the number of males that you hatched between the Project Kiwi and Taranaki kiwi projects in 2012, then “number of males” is a count variable, even though males sort of seems like it could be a category (males are a type of “thing”) and it also seems like it could be a continuous variable (you could measure the number of males using a scale from 0 males to infinity males). However, as you only have the number of males from one year/season (2012), you cannot generate an “average” value for the number of males; it’s just one number. Therefore you have count data and it needs to be analysed accordingly. If you were interested in the average number of males hatched from Project Kiwi vs Taranaki over several years, then “number of males” should be treated as a continuous variable.
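
As a rough sketch of what count data look like in R, the snippet below tallies males per project for a single season. The data frame hatch_2012 and its contents are hypothetical, invented only to show the idea.

```r
# Hypothetical hatch records for one season: one row per chick (made up)
hatch_2012 <- data.frame(
  project = factor(c("Project Kiwi", "Project Kiwi", "Project Kiwi",
                     "Taranaki", "Taranaki", "Taranaki", "Taranaki")),
  sex     = factor(c("M", "F", "M", "M", "M", "F", "M"))
)

# Cross-tabulate to count each sex within each project
sex_counts <- table(hatch_2012$project, hatch_2012$sex)
sex_counts

# Number of males per project: one whole number each, not an average
sex_counts[, "M"]
```

Each project ends up with a single whole number of males, which is what makes this count data rather than a set of measurements you could average.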

Response variable or Dependent variable

This can take the form of a continuous or categorical variable. Let’s say you want to see if one variable influences another, for example “does female size influence egg size”? Here egg size is potentially being influenced by female size. Therefore you will test whether egg size is expressed “in response” to female size. Or worded in a different way, you are testing whether egg size is “dependent on” female size. The dependent variable is typically put on the y axis in figures/graphs.

Predictor or Predicting variable or Independent variable

In the above example female size is the predictor or independent variable. You are testing whether egg size is dependent on female size. The independent variable is typically put on the x axis in figures/graphs.
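
R writes this response/predictor relationship as a formula, with the response on the left of the ~ and the predictor on the right. Here is a small sketch using made-up numbers; the object names and units are assumptions for illustration only.

```r
# Hypothetical measurements (values are made up)
female_size <- c(2.4, 2.6, 2.7, 2.9, 3.1, 3.3)   # predictor, e.g. weight in kg
egg_size    <- c(395, 410, 405, 430, 440, 455)   # response, e.g. weight in g

# Read the formula as "egg size depends on female size"
size_formula <- egg_size ~ female_size

# Given a formula, plot() puts the response on the y axis
# and the predictor on the x axis
plot(size_formula, xlab = "Female size", ylab = "Egg size")
```

The same response ~ predictor pattern is used by many of R's test and model functions, so it is worth getting comfortable with it early.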

Sample

The sample is the collection of values you have of a variable. If you wanted to compare bill length of male and female hatchlings, then the data points you gather from females would be your sample of females and the data points you gather from males would be your male sample. It is called a “sample” because it is difficult to measure the entire population of something (in this case all female and male chicks). Usually you can only ever “sample” a population. Some people might use “sample” to refer to both the male and female values simultaneously, or they might refer to the male “sample” and the female “sample”, so when you are communicating, be clear on precisely what you mean (and find out what others mean) when using this word.

Group

Group is used when you want to be a little more specific than “sample”. For example, let’s say you want to see whether egg size significantly differs across different kiwi projects. You gather the egg size data from MEIT, PK and Taranaki. Here you have three “groups” of the categorical variable “kiwi project”. In this example “group” is synonymous with “level”.

Replicate

A replicate is an individual value (datum) within a sample.

Dataset

This can mean a few different things, but usually it refers to all the data from all samples and all variable types. You will analyse your “dataset” or “data” (plural form of datum).
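
The terms above (sample, group, replicate, dataset) can be seen together in one small sketch. The data frame eggs and its values are hypothetical, made up to mirror the MEIT, PK and Taranaki example.

```r
# Hypothetical dataset: egg sizes from three projects (values are made up)
eggs <- data.frame(
  project  = factor(c("MEIT", "MEIT", "PK", "PK", "PK", "Taranaki", "Taranaki")),
  egg_size = c(420, 435, 410, 428, 415, 440, 432)
)

levels(eggs$project)   # the groups (levels) of the categorical variable
table(eggs$project)    # number of replicates in each group
nrow(eggs)             # number of replicates in the whole dataset

# The sample for a single group
pk_sample <- eggs$egg_size[eggs$project == "PK"]
pk_sample
```

Checking table() before running a test is a handy habit, because it shows straight away if one group has far fewer replicates than the others.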

Statistical significance

Each test calculates significance differently, but the premise is usually similar. If two things differ from one another “significantly”, it means that a difference as large as the one you observed would be very unlikely to turn up by chance alone if the two things were really the same. The standard probability threshold that biologists use is an “alpha” of 0.05. This is the same as 1/20.

So if someone tells you their “P value” (probability value) is less than 0.05, then their test is “significant” at alpha 0.05. Loosely, this means that if there were really no difference/relationship, a result at least as extreme as theirs would turn up by chance less than 1 time in 20 (I am not a statistician, so sound off if I am not quite right folks). Note that this is not quite the same as being 95% certain that the result reflects “reality”. In probability theory a probability of 1 is certainty: if something has a probability of 1 of happening, it is certainly going to happen! A probability of 0.05 is a long way from that, which is why a P value below 0.05 is treated as good evidence against “no difference”. Using an alpha of 0.05 is a convention. If someone has a test come back with a P value of more than 0.05 (and their alpha was set at 0.05), then their test is not significant according to their alpha, and they cannot conclude that there is a difference/relationship between the variables analysed. This is not proof that there is no difference, only that the test did not detect one.
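
Here is a minimal sketch of comparing a P value against alpha in R. It assumes a two-sample t-test is the appropriate test, which will not always be the case, and the hatch weights are made up purely for illustration.

```r
# Hypothetical hatch weights in grams (values are made up)
ohope <- c(310, 295, 322, 301, 315, 308)
meit  <- c(330, 319, 341, 325, 336, 329)

alpha <- 0.05                    # the conventional threshold (1/20)

# Two-sample t-test; by default R does not assume equal variances (Welch)
result <- t.test(ohope, meit)
result$p.value                   # the P value from the test

# TRUE means the test is "significant" at the chosen alpha
is_significant <- result$p.value < alpha
is_significant
```

Setting alpha before looking at the P value, as the sketch does, keeps the decision rule honest.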
