Importing Data and Writing Files
Importing Common Files in R
R imports files is when the program reads an external file into memory and uses it in code to perform different manipulations and calculations. There are two different ways of doing this - using the RStudio interface or with code.
Note: If you do not save your workspace as a R workspace file, you will need to read in the dataset each time a program is closed and reopen.
- read.table function arguments (begins at 0:15)
- Using read.table to read in a data set (begins at 1:50)
- Using write.table to write out a data set (begins at 4:05)
- Total length (8:32)
There are four main windows of the Rstudio interface. For a review of these different windows, please refer to CREATE LINK An Introduction to the RStudio Interface. In this section we will focus on the Environment, or workspace, window. Notice there are two tabs called Environment and History. If not already selected, clicking on the Environment tab should reveal additional icons.
As pictured below, there are 4 different parts in the Rstudio display - you will find the Import Dataset icon in the Environment window located in the top right:
Clicking on Import Dataset will provide a choice between two options:
- From Text File: Once clicked on, this opens a window where a file can be opened. Text files (ending in .txt) and comma separated value files (ending in .csv) can both be opened, but normal excel files (ending in xls, xlsx, etc.) cannot be read this way unless they are first saved as a csv. There are special packages for reading these kinds of files. Once a dataset is selected and open is clicked, a window will appear showing how the data set will be read and what it will look like after reading. If this is satisfactory (it might need some tweaking to fix object classes, column names, etc.), click Import at the bottom of the window.
- From Web URL: If the dataset can be found via the internet, then the only thing necessary is the URL where the dataset is located; a window will appear asking for the URL when this option is selected. Once provided, click Okay at the bottom of the window and the dataset will be brought in.
The easiest way to import a dataset from an external file is by using a data.(file_type) syntax. There are a number of file types that can be read in this way. The two main ones are comma separated value files (.csv) and text files (.txt). In fact, the point-and-click method mentioned above will show you what code it will run to import the data you've selected.
Note: For Windows users, R gets confused if you use a directory path in your code like:
C:\mydocuments\
because R sees "\" as an escape character. Instead use:
C:\\my documents\\myfile.csv
or
C:/mydocuments/myfile.csv
Also, the directory path doesn’t actually need the “C:” as the beginning part. We only include it in the examples below for clarity.
Importing data from text files
For text files, read.table()
is used. There are many arguments for read.table()
, but the three main ones are the file
, header
, and sep
arguments.
- The
file
argument is used to specify file path, i.e. the route the computer takes to access the file; this is a hierarchy of folders needed to navigate to the file. Each path usually starts with C: (for windows), which signifies that the hierarchy of folders is in the C drive. After C:, a front slash (\) usually follows, signifying the next word is a folder name. For example, if the file path is C:\Users\Username\Downloads.filename.txt, the file path starts in the C drive, enters the Users folder, then enters the Username folder and looks for the file “filename.txt”. It is important that this whole hierarchy is surrounded by quotation marks (" "). - The
header
argument is set to either TRUE or FALSE. If true, this tells the program that there is a header in the file, so the first row will be treated as column names and not as data. If header is set to false, this tells the program that there is no header included and the columns should be given default names by RStudio. - The
sep
argument tells RStudio what each value or piece of information is separated by. The Default separator ofread.table()
is the “white space”, which issep = “ ”
. Other common separators are tab,sep = "\t"
, and comma,sep = ","
.
An example of a read.table()
call for a Comma Separated text file without headers:
read.table(files = "C:\Users\username\Downloads\filename.txt", header = FALSE, sep = ",")
An example of a read.table()
call for a Tab Separated text file with headers:
read.table(files = "C:\Users\username\Downloads\filename.txt", header = T, sep = "\t")
Importing data from .csv files
For a csv file, read.csv()
can be used. This function is basically read.table()
with the field separator argument defaulted to sep = ','
and a few other changes to smooth things out for a csv, since these are 'comma separated' files.
Note: Neither read.table()
nor read.csv()
will produce output in the Console when executed. However, a successful implementation of the function should create a data frame in the Environment window.
Importing data from .xls and .xlsx spreadsheets
Unlike .csv files, the Microsoft Excel file types .xls and .xlsx can contain multiple spreadsheets within the same file. To import data from these types of files, you may want to use the read_excel()
function from the readxl package.
install.packages('readxl') # install package
library(readxl) # load package
read_excel(path = "C:\Users\username\Downloads\filename.xlsx",
sheet = 1, col_names = T, col_types = NULL, na = "", skip = 0)
Notice that, by default, if you only provide the file path, the function will:
- only read in the first spreadsheet since sheet = 1,
- expect column names in the dataset since col_names = T,
- guess each column type since col_type = NULL,
- enter a blank for missing values since
na = ""
, and will not skip any rows sinceskip = 0
.
Also, readxl creates a tibble object when it imports data into your R session. You can index into a tibble the same way you would index into a data frame, list , or table. For example, if you wanted to take the mean of the first column from a tibble data structure called myData, you could use the following:
myData <- read_excel(path = "C:\Users\username\Downloads\filename.xlsx",
sheet = 1, col_names = T, col_types = NULL, na = "", skip = 0)
muFirst_Column <- mean(myData[[1]])
Note: Another useful function of the readxl package is excel_sheets()
. Given the file path, the function will return the names of the spreadsheets for the given Excel file.
Exporting data to a file using code
In R, we can export a dataset using various file formats. Permissible file types are .xls, .xlsx, .csv, and .txt. In the sections below, we will focus on the .txt and .csv file types.
The write.table()
function allows us to export data using different file formats. The separator and header arguments, as discussed earlier in the Importing Files section, can also be used to indicate the export file type. Also, the function write.csv()
can be used for comma separated files.
write.table(dataset, "filename.txt", sep = "\t") # tab separated file
write.table(dataset, "filename.csv", sep = ",") # comma separated file
write.csv(dataset, "filename.csv) # also comma separated file
write.csv(dataset, "C:\Users\Username\Desktop\filename.csv) # writes .csv file to Desktop
write.csv(dataset, "C:\Users\Username\Desktop\filename.csv, col.names = T) # indicating headers
to be written
Using fread for large datasets
For very large data sets, it may not be a good idea to use the base functions of R to bring in data. In fact, large datasets, especially those greater than a few gigabytes, may crash your computer. Using fread()
from the data.table package is a good way to install these datasets. The code for installing and loading the package as well as the basic syntax for the function call is shown below:
install.packages('data.table') # install package from CRAN repository
library(data.table) # load package into current R session
fread('filePath_and_fileName', sep = ',') # basic syntax similar to read.table
For more on R packages, please visit the Installing and Loading Packages page.