## # STS 98 - Lecture 2016.04.05
##
## See [notes.R](notes.R) to follow along in RStudio.
##
## Also see the [R input](r_session.txt) from the lecture.
## Logistics
## ---------
## The Wednesday 1:10 pm discussion will be in Surge III 1283.
##
## The Wednesday 2-4 pm office hours will be in Kerr 165.
##
## The Friday 1-3 pm office hours will be in SSH 1246.
## Inspecting Data
## ---------------
## R ignores lines starting with the # symbol. These are called comments. Use
## comments to plan out what you're trying to do, and to explain tricky pieces
## of code.
## Last time we looked at some functions for working with R's working directory:
##
## * getwd(), setwd(), list.files()
##
## Then we loaded an RDS file with `readRDS()` and looked at some functions for
## inspecting the structure of a data set:
##
## * head(), tail()
## * nrow(), ncol(), dim()
## * str()
## * typeof(), class()
## * rownames(), colnames(), names()
##
## We also saw some functions for inspecting the content of a data set:
##
## * $ operator, to extract a column
## * table()
## * sort()
##
## Let's start with a few more of these.
x = readRDS("data/sts98.rds")
## This data set has everyone's class level, major, and department. We can get
## a quick summary of the data with the `summary()` function:
summary(x)
## This data set only has categorical information, so we just see counts. Last
## time we saw that the `table()` function computes counts, too. Unlike
## `summary()`, which can lump the less common categories into "(Other)",
## `table()` computes the count for every category:
table(x$Major)
sort(table(x$Major))
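## As a minimal sketch with made-up data (the `toy_major` vector is
## hypothetical, not taken from the class roster), `sort()` also accepts
## `decreasing = TRUE` to put the most common category first:
toy_major = c("STS", "ECN", "STS", "STA", "STS", "ECN")
toy_counts = table(toy_major)
sort(toy_counts, decreasing = TRUE)
# The name of the most common category.
names(sort(toy_counts, decreasing = TRUE))[1]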
## You can also use `table()` to count how many observations fall into pairs of
## categories, across two columns. This is called _cross-tabulation_. For
## instance, we can cross-tabulate class level and major:
table(x$Class, x$Major)
## This gives us a very detailed breakdown of the data. Like any other value in
## R, we can save this to a variable. The `addmargins()` function adds row and
## column sums to the table:
ctab = table(x$Major, x$Class)
addmargins(ctab)
## It's a good idea to use variables to save intermediate steps in your code so
## that if something goes wrong, you can figure out where the problem is.
## __Q:__ What happens if we sort the cross-tabulated values?
sort(table(x$Class, x$Major))
## This is probably not what we want. The counts in the table get sorted from
## smallest to largest, but the row and column names are blown away! The
## `sort()` function only works well on vectors. Later on we'll learn how to
## sort a table.
## Here's an example of `addmargins()` for a smaller part of the data:
# Get first 10 rows.
y = head(x, 10)
# Remove categories that aren't present in the first 10 rows; otherwise R
# remembers them.
y$Class = factor(y$Class)
y$Major = factor(y$Major)
# Cross-tabulate and add the margins.
table(y$Class, y$Major)
addmargins(table(y$Class, y$Major))
## File Formats
## ------------
## Data can be stored in many different formats. Most of the time, file names
## end with a dot followed by 3 or more letters. This is called an _extension_
## and indicates the format of the file.
##
## R understands many different formats. In this class we'll see:
##
## Format | Extension | R Function
## ---------------------------- | --------- | ----------
## R data | .rds | readRDS
## R data | .rda | load
## Comma-separated values (CSV) | .csv | read.csv
## Tab-separated values (TSV) | .tsv | read.delim
## Plain-text tables | | read.table
##
## The extension is just a hint! Sometimes it may be incorrect or misleading.
## For the rest of this lecture, we'll use the NOAA Significant Earthquakes
## Database. There's a download link in the `data/` directory of the lecture
## notes.
##
## Despite the `.txt` extension, this is a TSV file, because:
##
## 1. The download page says it's a "tab-delimited format".
## 2. Opening the file in a text editor shows tabs between the columns.
x = read.delim("signif.txt")
## This is a much larger data set than the STS 98 roster.
dim(x)
names(x)
## The `names()` function is unusual because it also allows us to change the
## names. Let's change them to lowercase to save some typing.
names(x) = tolower(names(x))
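## A self-contained sketch of the two points above, using a temporary file and
## made-up values (`toy_path` and `toy_quakes` are hypothetical names). The
## file has a `.txt` extension but tab-separated contents, so `read.delim()`
## is the right reader, and assigning to `names()` renames the columns:
toy_path = tempfile(fileext = ".txt")
writeLines(c("YEAR\tCOUNTRY", "1906\tUSA", "1923\tJAPAN"), toy_path)
toy_quakes = read.delim(toy_path)
names(toy_quakes)
names(toy_quakes) = tolower(names(toy_quakes))
names(toy_quakes)
# Clean up the temporary file.
unlink(toy_path)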
## Missing Values
## --------------
## First of all, do you notice anything confusing about this data set?
str(x)
## A lot of values show up as NA. NA is a special value in R that represents
## missing data. In other words, NA means no measurement was recorded. This is
## common in real data sets.
##
## For instance, in the earthquake data, details are scarce on earthquakes that
## happened thousands of years ago.
##
## With the `table()` function, you can find out how many values are missing by
## setting the `useNA` parameter to `'always'`:
table(x$intensity, useNA = 'always')
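## A small sketch with a made-up vector (`toy_intensity` is hypothetical) shows
## how the `useNA` setting changes the output. By default `table()` drops NAs,
## `'ifany'` adds an NA count only when NAs are present, and `'always'` adds
## one even when it is zero:
toy_intensity = c(7, NA, 9, 7, NA)
table(toy_intensity)
table(toy_intensity, useNA = 'ifany')
table(c(7, 9, 7), useNA = 'always')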
## Many other functions have options for handling NAs as well. For instance, an
## inspection function we didn't discuss yet is `range()`. It just reports the
## smallest and largest values in a vector.
range(x$year)
range(x$intensity)
range(x$intensity, na.rm = TRUE)
## The reasoning is that if there are unknown values, the range is unknown. So
## we have to explicitly tell `range()` to ignore unknown values.
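## The same pattern applies to other summary functions. As a sketch with a
## made-up vector (`toy_deaths` is hypothetical), `sum()`, `mean()`, and
## `max()` also return NA until `na.rm = TRUE` is set:
toy_deaths = c(10, NA, 250, 3)
sum(toy_deaths)
sum(toy_deaths, na.rm = TRUE)
mean(toy_deaths, na.rm = TRUE)
max(toy_deaths, na.rm = TRUE)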
## __Q:__ What's the type of NA?
##
## __A:__ It depends. By default it's logical:
typeof(NA)
## But NA can be converted to any type, to match other data in a vector. For
## instance, the NAs in the intensity column are integers:
head(x$intensity, 1)
typeof(head(x$intensity, 1))
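## To see the conversion on its own, here is a sketch with made-up vectors. R
## also provides typed NA constants such as `NA_integer_` and `NA_character_`:
typeof(c(1L, NA))
typeof(c(2.5, NA))
typeof(c("a", NA))
typeof(NA_integer_)
typeof(NA_character_)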
## If we only need to check the number of NAs, there's an alternative to
## `table()`. The `is.na()` function returns TRUE or FALSE for each element of
## its argument, depending on whether that element is NA.
is.na(NA)
is.na(5)
is.na(c(NA, 4, 18, NA))
is.na(x$intensity)
## Since R represents TRUE as 1 and FALSE as 0, we can sum up the TRUEs and
## FALSEs to get the number of NAs:
sum(is.na(x$deaths))
## __Q:__ What if we want the number of values that aren't missing?
##
## __A:__ There are three ways to do this. Number of observations minus number of
## missing values:
nrow(x) - sum(is.na(x$intensity))
## Number of values minus number of missing values:
length(x$intensity) - sum(is.na(x$intensity))
## Number of values that are not (! means "not") missing:
sum(!is.na(x$intensity))
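## A quick check on a made-up vector (`toy_vals` is hypothetical) that the last
## two approaches agree. Taking the `mean()` of `is.na()` instead of the sum
## gives the proportion of missing values:
toy_vals = c(NA, 4, 18, NA, 7)
length(toy_vals) - sum(is.na(toy_vals))
sum(!is.na(toy_vals))
mean(is.na(toy_vals))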
## Subsets & Conditions
## --------------------
## We can use functions we learned before to find out which countries had the
## most earthquakes:
tail(sort(table(x$country)))
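# In R 4.0.0 and later, read.delim() keeps text columns as character vectors,
# so convert to a factor to get counts from summary().
summary(factor(x$country))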
## What if we just want data on USA earthquakes? The `subset()` function lets
## us take a subset of rows. The first parameter is a data set, and the second
## parameter is a condition.
x_us = subset(x, country == "USA")
## Since `=` already stands for assignment, R uses `==` to stand for equality.
## The result is a vector of TRUE and FALSE values.
5 == 5
5 == 6
c(6, -2, 8) == c(6, 3, 8)
## Notice that we can also compare a vector against a single value:
c(6, -2, 8) == 8
8 == c(6, -2, 8)
## This is called _recycling_, because the single value gets reused for each
## comparison. We didn't discuss it before, but R's vectorized operators
## recycle their arguments in the same way:
8 + c(2, 4, 6)
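## Recycling also applies when both vectors have more than one element: the
## shorter one is repeated to match the longer one. A sketch with made-up
## numbers:
c(1, 2, 3, 4) + c(10, 20)
c(1, 2, 3, 4) == c(1, 5)
## If the longer length is not a multiple of the shorter one, R still recycles
## but gives a warning, which usually signals a mistake:
c(1, 2, 3) + c(10, 20)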
## __Q:__ How can we combine conditions? What if we want quakes that happened
## in the USA or happened before year 0?
##
## __A:__ Use a logical operator like | ("or") or & ("and"). These are
## discussed later on in the notes.
## Condition to get quakes before year 0:
##
## year < 0
##
## Get quakes that are in USA or before year 0:
x_us_bc = subset(x, country == "USA" | year < 0)
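## The same kind of subset on a tiny made-up data frame (`toy_q` and its values
## are hypothetical), so the result is easy to check by eye:
toy_q = data.frame(
  country = c("USA", "JORDAN", "USA", "CHINA"),
  year = c(1906, -2150, -50, 1556)
)
# Rows where the country is USA.
subset(toy_q, country == "USA")
# Rows where the country is USA or the year is before 0 (or both).
subset(toy_q, country == "USA" | year < 0)
# Rows where the country is USA and the year is before 0.
subset(toy_q, country == "USA" & year < 0)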
## __Q:__ Can we save conditions in a variable?
##
## __A:__ Yes. Let's say we wanted to get a count of how many quakes happened
## before year 0. Then:
from_bc = x$year < 0
# Before year 0.
sum(from_bc)
# Year 0 or later.
sum(!from_bc)
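## A saved condition is an ordinary logical vector, so it can be reused. A
## sketch with made-up years (`toy_year` and `toy_bc` are hypothetical names):
## `which()` gives the positions of the TRUE values, and indexing with the
## condition keeps only the matching elements.
toy_year = c(1906, -2150, -50, 1556)
toy_bc = toy_year < 0
sum(toy_bc)
which(toy_bc)
toy_year[toy_bc]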
## Now we've seen `==` and `all()` for working with comparisons, but there are
## several other useful operators and functions:
##
## * Comparisons: ==, <, >, <=, >=
## * Logic Operators
## * or: |, ||
## * and: &, &&
## * not: !
## * quantifiers ("how many"): any(), all()
##
## Let's look at some examples.
## For `|`, left OR right must be TRUE to get TRUE:
TRUE | FALSE
c(TRUE, TRUE, FALSE, FALSE) | c(TRUE, FALSE, TRUE, FALSE)
# Get quakes from year -2150 OR in Jordan. Either or both conditions must be
# true.
subset(x, year == -2150 | country == "JORDAN")
## For `&`, left AND right must be TRUE to get TRUE:
TRUE & FALSE
c(TRUE, TRUE, FALSE, FALSE) & c(TRUE, FALSE, TRUE, FALSE)
# Get quakes from year -2150 AND in Jordan. Both conditions must be true.
subset(x, year == -2150 & country == "JORDAN")
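## Beware! The `||` and `&&` operators only work on single values. In older
## versions of R they silently checked just the first element of each side;
## since R 4.3.0, giving them longer vectors is an error: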
c(TRUE, FALSE, TRUE) || c(TRUE, FALSE, FALSE)
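## With single values, `&&` and `||` evaluate the right side only when it is
## needed to decide the answer. In this sketch the `stop()` calls are never
## reached, because the left side already settles the result:
FALSE && stop("this is never evaluated")
TRUE || stop("this is never evaluated")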
## The not operator returns the opposite:
!TRUE
!c(TRUE, FALSE)
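## The quantifiers `any()` and `all()` from the list above collapse a logical
## vector into a single TRUE or FALSE. A sketch with made-up years
## (`toy_years` is hypothetical):
toy_years = c(1906, -2150, 1556)
# Is at least one year before 0?
any(toy_years < 0)
# Are all of the years before 0?
all(toy_years < 0)
# Are all of the years after -3000?
all(toy_years > -3000)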
## Also note we can abbreviate TRUE and FALSE as T and F:
T | F
## It's safer to use TRUE or FALSE, because T and F are just variables and can
## be assigned other values:
T = 6
## __Q:__ Can I reset T and F?
##
## __A:__ Yes, use `rm()` to remove a variable assignment.
rm(T)
## TRUE and FALSE are special values in R; the logical operators don't work on
## strings:
"TRUE" | "F"
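## If the values really are stored as strings, one option is to convert them
## first with `as.logical()`, which understands spellings such as "TRUE",
## "true", "T", "FALSE", "false", and "F", and returns NA for anything else:
as.logical("TRUE") | as.logical("F")
as.logical(c("true", "F", "yes"))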