4. Preparing data and getting data into R using the Bill length and Egg width practice datasets

Lindsey Gray
Aug 27, 2018

Preparing and Getting data into R

Your data need to be saved in a plain text file that is free of errors. Even small errors in the text file, e.g. typos, unintentional blank spaces or carriage returns, can stop R from reading your data or cause it to read them incorrectly. Base R cannot read Excel files directly (unlike SPSS), although add-on packages such as readxl can; the plain text route described here works without installing anything extra.

Bill length data set – suitable for a two sample t-test

Let’s say you have some data in Excel and want to analyse it in R. For example, you have a dataset of 24 data points: the bill lengths of 12 female and 12 male 5-day-old chicks. These are arranged in Excel in two columns sitting directly side by side, with no blank cells between the values as you move down each column (all values are stacked directly on top of one another). The left-hand column has the word “sex” in the cell directly above the first datum and the right-hand column has the word “bill” (see the Excel file “Billsizediff”). Each row of data represents the values from one individual bird. In the “sex” column you will have the word “female” written out 12 times (once for each female measured) and, directly underneath, the word “male” written out 12 times. Next to each sex entry is that bird’s bill length. Individual identity does not matter in this case, so it is not recorded. To make your text file, select and copy these two columns of data and their headings. Do not copy any surrounding cells.

(NB: This dataset comprises one categorical/discrete variable, “sex” and one continuous/scale variable “bill length”.)
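
As a rough illustration of the layout described above, the sketch below builds the same two-column structure in R and prints it as tab-separated text, which is roughly what the pasted text file should look like. The bill lengths are invented for illustration, only two birds per sex are shown, and the name “layout_example” is just a placeholder.

```r
# Illustrative layout only: the bill lengths below are invented, not real data
layout_example <- data.frame(
  sex  = c("female", "female", "male", "male"),
  bill = c(10.1, 10.4, 11.0, 11.3)
)

# Print as tab-separated text: headings on the first line, then one bird per line
write.table(layout_example, sep = "\t", quote = FALSE, row.names = FALSE)
```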

Next, open an application that writes plain text documents with the file extension “.txt”. On a Mac this is “TextEdit”; on Windows it is “Notepad”. Make sure the file will be written as a “plain text” document and not a “rich text” document. Sometimes the default document type is “rich text” when the application launches, so make sure you change the format (in TextEdit, choose Format > Make Plain Text).

Once you have the plain-text blank document open, you can paste your dataset from Excel in. You should then save the text file with a meaningful name in a meaningful place somewhere on your computer/disk. Don’t forget where you save your files. I usually make a new folder for each “project” I am working on, and then create sub-folders containing files of each type/purpose.
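
Once the file is saved, one possible way to check it for stray blanks or missing values before importing is the function “count.fields()”, which reports how many fields R sees on each line. The sketch below is only an illustration: it writes a small temporary file with invented values and a deliberately missing bill length so that it runs on its own, and it assumes the file is tab-separated. In practice you would point it at your own file.

```r
# Write a small tab-separated file with a deliberate error:
# the third line of the file has no bill value (all values are invented)
tmp <- tempfile(fileext = ".txt")
writeLines(c("sex\tbill", "female\t10.1", "female", "male\t11.0"), tmp)

# Count the fields on every line; a clean two-column file gives 2 on every line
fields <- count.fields(tmp, sep = "\t")
print(fields)

# Report the line numbers that do not have exactly two fields
bad_lines <- which(fields != 2)
print(bad_lines)
```

Note that “count.fields()” skips completely blank lines by default, so the reported line numbers refer to non-blank lines only.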

When you have launched R, you can read in your data (or “dataframe”, as R will refer to it) using the following command (in this example we are calling the dataframe “data1”):

data1<-read.table(file.choose(), header=TRUE)

A window will pop open and you can click on the text file you want to read into R. In the command above, “data1” is the name I have given to the dataframe. If you wanted to call it something else you could; for example, you could write:

values<-read.table(file.choose(), header=TRUE)

and the dataframe would be called “values”.

If you were to now type “values” into the R console, then the dataframe would appear before your eyes like magic.
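
Printing the whole dataframe is fine for 24 rows, but a few other built-in functions give a quicker overview. The sketch below is a self-contained illustration: instead of “file.choose()” it reads a few invented values from a text string, so the numbers are not the real bill length data.

```r
# Stand-in for the imported file: invented values read from a string
# (in practice "values" would come from read.table(file.choose(), header=TRUE))
values <- read.table(text = "sex bill
female 10.1
female 10.4
male 11.0
male 11.3", header = TRUE)

head(values)       # first few rows
str(values)        # column names and the type of each column
summary(values)    # quick summary of each column
table(values$sex)  # number of rows for each sex
```

Checking that “bill” is numeric and that each sex has the expected number of rows is a quick way to confirm the import worked.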

The “header=TRUE” part of the command tells R that your text file includes the names of each column of data and you would like to retain those headings in your dataframe.
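
To see what “header=TRUE” changes, the following sketch reads the same small block of text twice, once with and once without it. The values are invented and are read from a string so that the example runs on its own.

```r
# The same three lines of text, read two ways (values are invented)
txt <- "sex bill
female 10.1
male 11.0"

with_header    <- read.table(text = txt, header = TRUE)
without_header <- read.table(text = txt, header = FALSE)

names(with_header)     # the headings from the file are used as column names
names(without_header)  # R makes up its own column names instead
str(without_header)    # the headings are now a row of data, so bill is no longer numeric
```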

(A dataframe is just one of the many types of objects R creates. Whenever you enter a word directly followed by “<-”, you are asking R to make an object with that name. In the example directly above, you are asking R to make an object called “values”.)
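
As a small illustration of objects, the sketch below uses “<-” to make three different kinds of object and then asks R what type each one is. The object names and values are made up for this example.

```r
# Create three different kinds of object with "<-" (names and values are invented)
n_birds     <- 24
sexes       <- c("female", "male")
small_frame <- data.frame(sex = sexes, bill = c(10.1, 11.0))

class(n_birds)      # "numeric"
class(sexes)        # "character"
class(small_frame)  # "data.frame"

ls()  # lists the objects currently in your workspace
```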

NB: there are many other methods for getting your dataset into R, but I find using the function “file.choose()” the simplest!
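
For comparison, here is a sketch of two of those other methods: giving “read.table()” the file location directly, and using “read.delim()”, which assumes tab-separated text with a header row. The example creates its own temporary file with invented values; in practice you would replace “path” with the location of your own text file.

```r
# Create a small tab-separated file so the example runs on its own (invented values)
path <- tempfile(fileext = ".txt")
writeLines(c("sex\tbill", "female\t10.1", "male\t11.0"), path)

# Option 1: give read.table() the file location instead of choosing it in a window
data_a <- read.table(path, header = TRUE)

# Option 2: read.delim() assumes tab-separated text with headings in the first line
data_b <- read.delim(path)

# Check that both methods produced the same dataframe
all.equal(data_a, data_b)
```

Typing the file location out makes the import repeatable in a saved script, whereas “file.choose()” asks you to click on the file each time.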

You are now ready to run a t-test on the bill length data set saved in the data frame “data1”. See below for instructions on how to conduct the t-test.

Egg width data set – suitable for an ANOVA

Let’s say you want to compare egg widths across different kiwi conservation projects. You want to see whether egg width differs significantly between five projects: Project A, Project B, Project C, Project D and Project E. Set these data up in Excel again in two columns, in “long” format (see the Excel file “Egg size data set”). Give each column a heading. The left-hand column could be called “Project” and the right-hand column “Width”. You “stack” the data from each project on top of one another in the columns. As before, select these two columns of data, paste them into an application that can save them in plain text format (see “Egg size data.txt”), and then import the data into R as a dataframe object, giving it the sensible name “Eggwidth” (note that R is case-sensitive, so “Eggwidth” and “eggwidth” would be different objects), using the following command:

Eggwidth<-read.table(file.choose(), header = TRUE)
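
One caveat worth checking: by default “read.table()” splits each line on any white space, so a label containing a space, such as “Project A”, would be split into two fields. If your text file is tab-separated (which is what pasting from Excel normally produces), adding sep = "\t" keeps each label in one piece. The sketch below shows this on a small temporary file; the widths are invented and it assumes the labels are written as “Project A”, “Project B” and so on.

```r
# Small tab-separated stand-in for the egg width file (widths are invented)
path <- tempfile(fileext = ".txt")
writeLines(c("Project\tWidth",
             "Project A\t75.2",
             "Project A\t76.0",
             "Project B\t74.1",
             "Project B\t73.8"), path)

# sep = "\t" splits on tabs only, so "Project A" stays in one column
Eggwidth <- read.table(path, header = TRUE, sep = "\t")

str(Eggwidth)            # should show two columns: Project and Width
table(Eggwidth$Project)  # how many eggs were read in for each project
```

If your project labels contain no spaces (for example “A”, “B”), the original command works as it is.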

See below for the ANOVA instructions.


EASY-R

Step-by-step instructions on how to use the free statistics program R for absolute beginners, by Biologists Lindsey Gray and Brittany Mitchell.

Download R: https://cran.r-project.org
