Managing Data

Changing Entries

There are many ways to edit the variables in a dataset. In particular, we can create new variables and add them to the dataset with the assignment operator <-. If we are not satisfied with how a variable is coded, we can recode it the same way. Creating new variables is shown below:

testData <- data.frame(x = c(1:10), y = c(11:20))  # dataframe created
testData

##     x  y
## 1   1 11
## 2   2 12
## 3   3 13
## 4   4 14
## 5   5 15
## 6   6 16
## 7   7 17
## 8   8 18
## 9   9 19
## 10 10 20

testData$sum <- testData$x + testData$y      # variable that adds the first two columns
testData$product <- testData$x * testData$y  # variable that multiplies the first two columns
testData

##     x  y sum product
## 1   1 11  12      11
## 2   2 12  14      24
## 3   3 13  16      39
## 4   4 14  18      56
## 5   5 15  20      75
## 6   6 16  22      96
## 7   7 17  24     119
## 8   8 18  26     144
## 9   9 19  28     171
## 10 10 20  30     200
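Recoding an existing variable works the same way as creating one. The sketch below rebuilds the toy data frame under a hypothetical name, recoded, so that it runs on its own and leaves testData untouched. The doubling and the cutoff of 18 are arbitrary choices for illustration.

```r
# Rebuild the example data under a new (hypothetical) name
recoded <- data.frame(x = 1:10, y = 11:20)

# Overwrite a whole column: double every value of x
recoded$x <- recoded$x * 2L

# Recode only the entries that meet a condition: cap y at 18 (arbitrary cutoff)
recoded$y[recoded$y > 18] <- 18L

recoded
```

The logical index on the left of <- selects which entries are replaced; all other entries keep their values.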

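After attach() is called on a dataset, its variables can be referenced by name alone, without the dataset prefix. Calling detach() removes the dataset from the search path, after which the prefix is required again. New variables must still be assigned with the dataset name (for example, testData$sum), because attach() exposes a copy of the data as it was at the time of the call. The code from above is repeated below using attach() and detach():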

testData <- data.frame(x = c(1:10), y = c(11:20))  # dataframe created
testData

##     x  y
## 1   1 11
## 2   2 12
## 3   3 13
## 4   4 14
## 5   5 15
## 6   6 16
## 7   7 17
## 8   8 18
## 9   9 19
## 10 10 20

attach(testData)           # lets us omit the dataset name
testData$sum <- x + y      # a variable that adds the first two columns
testData$product <- x * y  # a variable that multiplies the first two columns
testData

##     x  y sum product
## 1   1 11  12      11
## 2   2 12  14      24
## 3   3 13  16      39
## 4   4 14  18      56
## 5   5 15  20      75
## 6   6 16  22      96
## 7   7 17  24     119
## 8   8 18  26     144
## 9   9 19  28     171
## 10 10 20  30     200

detach(testData)  # the dataset name is required again from here on
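As an alternative to attach() and detach(), base R also provides with() and within(), which look up column names inside a single call and leave the search path alone. The following is a minimal sketch of the same calculation; the data frame name scoped is hypothetical.

```r
# Same toy data as above, under a hypothetical name
scoped <- data.frame(x = 1:10, y = 11:20)

# with() evaluates the expression using the columns of scoped
scoped$sum <- with(scoped, x + y)

# within() evaluates the assignment and returns a modified copy of the data frame
scoped <- within(scoped, product <- x * y)

scoped
```

Because nothing is attached, there is nothing to detach afterwards, and the column lookups cannot be confused with objects of the same name elsewhere in the workspace.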

Defining Factor Types

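If a numeric column really represents group membership (for example, values that fall on one side or the other of a cutoff), we can create a new column that stores the group labels as text. The labels are stored as character strings; factor() can convert them to a true factor if needed. For example: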

attach(testData)
testData$overUnder[product < 100] <- "Bad"    # label products under 100
testData$overUnder[product >= 100] <- "Good"  # label products of 100 or more
testData

##     x  y sum product overUnder
## 1   1 11  12      11       Bad
## 2   2 12  14      24       Bad
## 3   3 13  16      39       Bad
## 4   4 14  18      56       Bad
## 5   5 15  20      75       Bad
## 6   6 16  22      96       Bad
## 7   7 17  24     119      Good
## 8   8 18  26     144      Good
## 9   9 19  28     171      Good
## 10 10 20  30     200      Good

detach(testData)
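The column created above holds character strings. If a true factor is wanted, one possible approach is sketched below using factor() with ifelse(), and alternatively cut(). The data frame name labeled is hypothetical, and the cutoff of 100 matches the example above.

```r
# Rebuild the relevant columns under a hypothetical name
labeled <- data.frame(x = 1:10, y = 11:20)
labeled$product <- labeled$x * labeled$y

# Option 1: build the labels with ifelse(), then convert to a factor
labeled$overUnder <- factor(
  ifelse(labeled$product < 100, "Bad", "Good"),
  levels = c("Bad", "Good")
)

# Option 2: cut() bins a numeric vector directly;
# right = FALSE makes the intervals [-Inf, 100) and [100, Inf)
labeled$overUnderCut <- cut(
  labeled$product,
  breaks = c(-Inf, 100, Inf),
  labels = c("Bad", "Good"),
  right = FALSE
)

levels(labeled$overUnder)  # "Bad" "Good"
table(labeled$overUnder)   # 6 rows labeled Bad, 4 labeled Good
```

Setting levels explicitly fixes the order of the groups, which matters later for tables, plots, and models.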

Rearranging

This section shows how to sort and merge data.

Sorting

testData$dollarAmount <- c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
attach(testData)
sortedA <- testData[order(dollarAmount),]  # sort all rows by dollarAmount, ascending
sortedA2 <- sort(testData$dollarAmount)    # sort() returns only the sorted vector, not the rows
sortedA

##     x  y sum product overUnder dollarAmount
## 5   5 15  20      75       Bad           18
## 1   1 11  12      11       Bad           25
## 6   6 16  22      96       Bad           36
## 3   3 13  16      39       Bad           37
## 8   8 18  26     144      Good           37
## 7   7 17  24     119      Good           47
## 9   9 19  28     171      Good           47
## 4   4 14  18      56       Bad           57
## 2   2 12  14      24       Bad           60
## 10 10 20  30     200      Good           80

sortedA2

##  [1] 18 25 36 37 37 47 47 57 60 80

sortedD <- testData[order(-dollarAmount),]                  # sort all rows by dollarAmount, descending
sortedD2 <- sort(testData$dollarAmount, decreasing = TRUE)  # sorted vector only, descending

sortedD

##     x  y sum product overUnder dollarAmount
## 10 10 20  30     200      Good           80
## 2   2 12  14      24       Bad           60
## 4   4 14  18      56       Bad           57
## 7   7 17  24     119      Good           47
## 9   9 19  28     171      Good           47
## 3   3 13  16      39       Bad           37
## 8   8 18  26     144      Good           37
## 6   6 16  22      96       Bad           36
## 1   1 11  12      11       Bad           25
## 5   5 15  20      75       Bad           18

sortedD2

##  [1] 80 60 57 47 47 37 37 36 25 18

detach(testData)
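The dollar amounts above contain ties (37 and 47 each appear twice). order() accepts several keys, so ties in the first key can be broken by a second one. The sketch below assumes a hypothetical data frame amounts with only the two columns needed, sorted by dollarAmount ascending and then by x descending.

```r
# Hypothetical two-column version of the example data
amounts <- data.frame(
  x = 1:10,
  dollarAmount = c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
)

# First key: dollarAmount ascending; second key: x descending (note the minus sign)
sortedTies <- amounts[order(amounts$dollarAmount, -amounts$x), ]

sortedTies$x  # 5 1 6 8 3 9 7 4 2 10
```

Without the second key, tied rows keep their original relative order, as in the sortedA output above.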

Merging

testDataset2 <- data.frame(numbers = c(1), moreNumbers = c(11))
testDataset2

##   numbers moreNumbers
## 1       1          11

totalSet <- merge(testData, testDataset2)  # no shared columns, so every row of one is paired with every row of the other
totalSet

##     x  y sum product overUnder dollarAmount numbers moreNumbers
## 1   1 11  12      11       Bad           25       1          11
## 2   2 12  14      24       Bad           60       1          11
## 3   3 13  16      39       Bad           37       1          11
## 4   4 14  18      56       Bad           57       1          11
## 5   5 15  20      75       Bad           18       1          11
## 6   6 16  22      96       Bad           36       1          11
## 7   7 17  24     119      Good           47       1          11
## 8   8 18  26     144      Good           37       1          11
## 9   9 19  28     171      Good           47       1          11
## 10 10 20  30     200      Good           80       1          11
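The two datasets above share no column names, so merge() pairs every row of one with every row of the other. More commonly, datasets are merged on a shared key column. The following sketch uses two hypothetical data frames, left and right, with a shared id column, to show the by and all arguments of merge().

```r
# Hypothetical datasets sharing the key column "id"
left  <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
right <- data.frame(id = c(2, 3, 4), group = c("a", "b", "c"))

# Keep only ids present in both datasets (ids 2 and 3)
inner <- merge(left, right, by = "id")

# Keep every id from either dataset; missing values are filled with NA
outer <- merge(left, right, by = "id", all = TRUE)

inner
outer
```

The arguments all.x = TRUE and all.y = TRUE keep unmatched rows from only the first or only the second dataset, respectively.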

Note: rbind() and cbind() can be used to add rows or columns to a dataset. Information about these functions and the use of conditional operators can be found on the Basic Syntax page.
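For quick reference, a minimal sketch of rbind() and cbind() on a small hypothetical data frame named base follows. For rbind(), the new row must have the same column names as the existing data.

```r
# Small hypothetical data frame
base <- data.frame(x = 1:3, y = 4:6)

# rbind() appends rows; column names must match
withRow <- rbind(base, data.frame(x = 4L, y = 7L))

# cbind() appends columns; the new column must have one value per row
withCol <- cbind(base, z = c("a", "b", "c"))

dim(withRow)    # 4 rows, 2 columns
names(withCol)  # "x" "y" "z"
```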

Creating a Subset of the Data

R has powerful indexing features for creating a subset of the data. The following code shows how to select a single row and how to delete a variable from a data frame.

totalSet[2,]        # gets all of row 2

##   x  y sum product overUnder dollarAmount numbers moreNumbers
## 2 2 12  14      24       Bad           60       1          11

totalSet$x <- NULL  # assigning NULL removes the entire column
totalSet

##     y sum product overUnder dollarAmount numbers moreNumbers
## 1  11  12      11       Bad           25       1          11
## 2  12  14      24       Bad           60       1          11
## 3  13  16      39       Bad           37       1          11
## 4  14  18      56       Bad           57       1          11
## 5  15  20      75       Bad           18       1          11
## 6  16  22      96       Bad           36       1          11
## 7  17  24     119      Good           47       1          11
## 8  18  26     144      Good           37       1          11
## 9  19  28     171      Good           47       1          11
## 10 20  30     200      Good           80       1          11
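Indexing can also drop several columns or rows at once without modifying the original data frame. The sketch below uses a hypothetical data frame named wide and shows negative indices, selection by name, and the drop = FALSE argument.

```r
# Hypothetical data frame with four columns
wide <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Negative positions exclude columns: everything except the first
noFirst <- wide[, -1]

# Exclude columns by name using a logical index
noBC <- wide[, !(names(wide) %in% c("b", "c"))]

# Selecting one column returns a vector unless drop = FALSE is given
onlyA <- wide[, "a", drop = FALSE]

# Negative row positions exclude rows: everything except row 2
noRow2 <- wide[-2, ]

names(noFirst)  # "b" "c" "d"
names(noBC)     # "a" "d"
```

Unlike assigning NULL, these forms return a new data frame, so the result has to be stored if it is needed later.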

Sometimes you may want to subset the data by selecting only the rows that match certain values. Suppose, for instance, that you wanted only the rows in the above example where dollarAmount was equal to $25, $57, or $80. You could then use the %in% operator, which tests each value for membership in a set, as shown below.

totalSet[totalSet$dollarAmount %in% c(25, 57, 80), ]
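The same selection can be inverted or written with subset(). The following sketch assumes a hypothetical data frame dollars that holds only an id column and the dollar amounts from the example.

```r
# Hypothetical cut-down version of the example data
dollars <- data.frame(
  id = 1:10,
  dollarAmount = c(25, 60, 37, 57, 18, 36, 47, 37, 47, 80)
)
wanted <- c(25, 57, 80)

# Rows whose dollarAmount is one of the wanted values
keep <- dollars[dollars$dollarAmount %in% wanted, ]

# Negating the test keeps all the other rows instead
rest <- dollars[!(dollars$dollarAmount %in% wanted), ]

# subset() expresses the same filter without repeating the data frame name
viaSubset <- subset(dollars, dollarAmount %in% wanted)

keep$id  # 1 4 10
```

The trailing comma inside the square brackets matters: it tells R that the logical vector selects rows and that all columns should be kept.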

Center for Analytics and Data Science

165 McVey Data Science Building

105 Tallawanda Rd

Oxford, OH 45056

513-529-2279