Often, the first thing we need to do in data analysis is to load a dataset into R. When we bring spreadsheet-like data (think Microsoft Excel tables), generally shaped like a rectangle, into R, it is represented as what we call a data frame object. A data frame is very similar to a spreadsheet: the rows are the collected observations and the columns are the variables.
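To make the rows-and-columns idea concrete, here is a minimal sketch that builds a very small data frame by hand in base R. The name states is made up for illustration, and the values are a few entries copied from the state-level dataset used in this chapter.

```r
# A tiny hand-built data frame: each row is one observation (a state),
# each column is one variable measured for that state.
# `states` is a hypothetical name chosen for this illustration.
states <- data.frame(
  state = c("AK", "AL", "AR"),
  med_income = c(64222, 36924, 35833),
  party = c("Republican", "Republican", "Republican")
)

nrow(states)   # number of observations (rows)
ncol(states)   # number of variables (columns)
names(states)  # the variable (column) names
```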
The first kind of data we will learn how to load into R (as a data frame) is the spreadsheet-like comma-separated values format (.csv for short).
These files have names ending in .csv, and can be opened and saved from common spreadsheet programs like Microsoft Excel and Google Sheets.
For example, a .csv file named state_property_vote.csv is included with the code for this book.
This file, originally from Data USA, has US state-level property, income, population and voting data from 2015 and 2016.
If we were to open this data in a plain text editor, we would see each row on its own line, and each entry in the table separated by a comma:
state,med_income,med_prop_val,population,mean_commute_minutes,party
AK,64222,197300,733375,10.46830207,Republican
AL,36924,94800,4830620,25.30990746,Republican
AR,35833,83300,2958208,22.40108933,Republican
AZ,44748,128700,6641928,20.58786,Republican
CA,53075,252100,38421464,23.38085172,Democrat
CO,48098,198900,5278906,19.50792188,Democrat
CT,69228,246450,3593222,24.349675,Democrat
DC,70848,475800,647484,28.2534,Democrat
DE,54976,228500,926454,24.45553333,Democrat
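The plain-text view above can also be reproduced from inside R. The sketch below assumes we do not have the book's file at hand, so it first writes the header and two rows of the excerpt to a temporary file (csv_path is a made-up name), then reads the file back as raw text using base R only.

```r
# Write the header and the first two rows of the excerpt to a temporary file.
# This is a stand-in for state_property_vote.csv, not the full dataset.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "state,med_income,med_prop_val,population,mean_commute_minutes,party",
    "AK,64222,197300,733375,10.46830207,Republican",
    "AL,36924,94800,4830620,25.30990746,Republican"
  ),
  csv_path
)

# Read the file back as plain text: one character string per line of the file.
raw_lines <- readLines(csv_path)
raw_lines

# Split the first data row on commas to see its individual entries.
first_row <- strsplit(raw_lines[2], ",", fixed = TRUE)[[1]]
first_row
```

Each element of raw_lines is one row of the table, and splitting a row on commas gives one entry per column.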
To load this data into R, and then to do anything else with it afterwards, we will need to use something called a function.
A function is a special word in R that takes in instructions (we call these arguments) and does something. The function we will use to read a .csv file into R is called read_csv.
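Before meeting read_csv, it may help to see the function-and-arguments pattern with two functions that come with base R. This is only an illustration; round and toupper are not part of the loading workflow, and rounded_commute is a name made up here.

```r
# round is the function; the number and digits = 1 are its arguments.
# 10.46830207 is Alaska's mean commute time from the excerpt above.
rounded_commute <- round(10.46830207, digits = 1)
rounded_commute

# toupper takes a single argument; it is in quotes because it is text, not code.
toupper("ak")
```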
In its most basic use-case, read_csv expects that the data file has column names (headers) and uses a comma (,) to separate the columns.
Below you’ll see the code used to load the data into R using the read_csv function. But there is one extra step we need to do first. Since read_csv is not included in the base installation of R, to be able to use it we have to load it from somewhere else: a collection of useful functions known as a package (often informally called a library). The read_csv function in particular is in the tidyverse package (more on this later), which we load using the library function.
Next, we call the read_csv function and pass it a single argument: the name of the file, "state_property_vote.csv". We have to put quotes around filenames and other letters and words that we use in our code to distinguish them from the special words that make up the R programming language. This is the only argument we need to provide for this file, because our file satisfies everything else the read_csv function expects in the default use-case (which we just discussed). Later in the course, we’ll learn more about how to deal with more complicated files where the default arguments are not appropriate, for example, files that use spaces or tabs to separate the columns, or files with no column names.
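The two steps just described (load the tidyverse, then call read_csv with the file name in quotes) can be sketched as follows. This sketch assumes the tidyverse is installed. So that it runs without the book's file, it first writes the header and three rows of the excerpt to a temporary location; csv_path is a made-up name standing in for "state_property_vote.csv".

```r
library(tidyverse)  # loads readr, the tidyverse package that provides read_csv

# Stand-in for the book's file: header plus the first three rows of the excerpt.
csv_path <- file.path(tempdir(), "state_property_vote.csv")
writeLines(
  c(
    "state,med_income,med_prop_val,population,mean_commute_minutes,party",
    "AK,64222,197300,733375,10.46830207,Republican",
    "AL,36924,94800,4830620,25.30990746,Republican",
    "AR,35833,83300,2958208,22.40108933,Republican"
  ),
  csv_path
)

# One argument is enough here: the file has headers and comma separators.
us_data <- read_csv(csv_path)
us_data
```

If the real file sits in R's working directory, the call would simply be read_csv("state_property_vote.csv"), as described in the text.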
us_data <- readr::read_csv("https://raw.githubusercontent.com/UBC-DSCI/introduction-to-datascience/master/state_property_vote.csv")
us_data  # print the data frame
## Parsed with column specification:
## cols(
## state = col_character(),
## med_income = col_double(),
## med_prop_val = col_double(),
## population = col_double(),
## mean_commute_minutes = col_double(),
## party = col_character()
## )
## # A tibble: 52 x 6
## state med_income med_prop_val population mean_commute_minutes party
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 AK 64222 197300 733375 10.5 Republican
## 2 AL 36924 94800 4830620 25.3 Republican
## 3 AR 35833 83300 2958208 22.4 Republican
## 4 AZ 44748 128700 6641928 20.6 Republican
## 5 CA 53075 252100 38421464 23.4 Democrat
## 6 CO 48098 198900 5278906 19.5 Democrat
## 7 CT 69228 246450 3593222 24.3 Democrat
## 8 DC 70848 475800 647484 28.3 Democrat
## 9 DE 54976 228500 926454 24.5 Democrat
## 10 FL 43355 125600 19645772 24.8 Republican
## # … with 42 more rows
Above you can also see something neat that Jupyter does to help us understand our code: it colours text depending on its meaning in R. For example, you’ll note that functions get bold green text, while letters and words surrounded by quotation marks, like filenames, get blue text.
In case you want to know more (optional): We use the read_csv function from the tidyverse instead of the base R function read.csv because it’s faster and it creates a nicer variant of the base R data frame called a tibble. This has several benefits that we’ll discuss in further detail later in the course.
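For the curious, the difference in what the two functions return can be checked directly. The sketch below assumes the tidyverse is installed and uses a small made-up three-column file (csv_path, base_df and tidy_df are names chosen for this illustration). It only compares the type of object returned and does not measure speed.

```r
library(tidyverse)

# A small stand-in file with three of the columns from the state dataset.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "state,med_income,party",
    "AK,64222,Republican",
    "CA,53075,Democrat"
  ),
  csv_path
)

base_df <- read.csv(csv_path)  # base R reader
tidy_df <- read_csv(csv_path)  # tidyverse reader

class(base_df)       # a plain data frame
class(tidy_df)       # a tibble, which is still a kind of data frame
is_tibble(base_df)
is_tibble(tidy_df)
```

Both objects are data frames, so functions that work on data frames accept either one; the tibble mainly changes how the data is printed and a few defaults.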