Perform group-wise data manipulation and deal with large datasets using R efficiently and effectively.
This book starts with installing R and getting started with R and its libraries. We then discuss the mode and class of R objects and highlight the different R data types along with their basic operations.
The primary focus, group-wise data manipulation with the split-apply-combine strategy, is explained with specific examples. The book also covers specific libraries such as lubridate, reshape2, plyr, dplyr, stringr, and sqldf. You will learn not only about group-wise data manipulation, but also how to efficiently handle date, string, and factor variables, along with different layouts of datasets using the reshape2 package.
By the end of this book, you will have learned about text manipulation using stringr, how to extract data from Twitter using the twitteR library, how to clean raw data, and how to structure your raw data for data mining.
What this book covers
Chapter 1, R Data Types and Basic Operations, discusses the different types of data used in R and their basic operations. Before introducing the data types, this chapter explains what an object in R is and what its mode and class are. The mode of an object can be numeric, character, or logical, whereas its class can be vector, factor, list, data frame, matrix, array, or others. This chapter also shows how to work with objects of different modes, how to convert from one mode to another, and what precautions to take during conversion. Missing values in R, and how to represent missing character and numeric data, are also discussed here. Along with the data types and basic operations, this chapter sheds light on another important aspect that is rarely mentioned in other textbooks: the popular object-naming conventions used in R.
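As a rough illustration of the mode and class distinction described for Chapter 1, the following base R sketch inspects a few objects and shows one common conversion pitfall. The object names (`x`, `f`, `df`, `codes`, `num_chr`) are made up for this example and do not come from the book.

```r
# Hypothetical objects, chosen only to illustrate mode() versus class()
x  <- c(1.5, 2, 3)                              # numeric vector
f  <- factor(c("10", "20", "10"))               # factor with number-like labels
df <- data.frame(a = 1:2, b = c("u", "v"))      # data frame

mode(x);  class(x)    # mode and class are both "numeric"
mode(f);  class(f)    # stored as numeric codes, but the class is "factor"
mode(df); class(df)   # stored as a list, but the class is "data.frame"

# Conversion caution: as.numeric() on a factor returns the internal
# level codes, not the labels. Convert to character first.
codes  <- as.numeric(f)
values <- as.numeric(as.character(f))

# Text that cannot be parsed as a number becomes NA (with a warning)
num_chr   <- c("10", "20", "abc")
converted <- suppressWarnings(as.numeric(num_chr))

# Typed missing values for character and numeric data
missing_chr <- NA_character_
missing_num <- NA_real_
```

Comparing `codes` with `values` shows why a factor should be converted through `as.character()` when its labels are meant to be treated as numbers.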
Chapter 2, Basic Data Manipulation, introduces some special features that we need to consider during data acquisition. It then discusses an important aspect of factor manipulation: what happens when a factor variable is subset, and how to remove unused factor levels. Date processing is also covered using an efficient R package: lubridate. Dealing with date variables using the lubridate package is much more efficient than using other existing packages designed to work with dates. String processing is also highlighted, and the chapter ends with a description of subscripting and subsetting.
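The factor-subsetting issue mentioned for Chapter 2 can be sketched in a few lines of base R. This is only a minimal example with an invented `species` vector, not code taken from the book.

```r
# Hypothetical factor used only for illustration
species <- factor(c("cat", "dog", "bird", "dog"))

# Subsetting keeps every original level, even ones with no observations left
sub <- species[species != "bird"]
levels(sub)          # "bird" is still listed as a level
table(sub)           # and shows up with a count of zero

# droplevels() removes levels that no longer occur in the data
sub_clean <- droplevels(sub)
levels(sub_clean)
```

Leftover empty levels tend to matter in tables, plots, and models, where they can appear as empty groups.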
Chapter 3, Data Manipulation Using plyr, introduces the state-of-the-art approach called split-apply-combine for manipulating datasets. Data manipulation is an integral part of data cleaning and analysis. For large data, it is preferable to perform operations within subgroups of a dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large data it requires a considerable amount of coding and eventually takes more processing time. With large datasets, we can split the data, perform the manipulation or analysis on each piece, and then combine the results into a single output. This chapter discusses the different functions in the plyr package that are used for group-wise data manipulation and for data analysis.
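To make the split-apply-combine idea concrete, here is a minimal sketch written in base R rather than plyr. The `sales` data frame and its column names are invented for this example.

```r
# Hypothetical data: sales amounts recorded per region
sales <- data.frame(
  region = c("east", "west", "east", "west", "east"),
  amount = c(10, 20, 30, 40, 20)
)

# Split: one data frame per region
pieces <- split(sales, sales$region)

# Apply: summarise each piece independently
summaries <- lapply(pieces, function(d) {
  data.frame(
    region      = d$region[1],
    mean_amount = mean(d$amount),
    n           = nrow(d)
  )
})

# Combine: stack the per-group summaries into a single data frame
result <- do.call(rbind, summaries)
result
```

The plyr package wraps these three steps into single calls such as `ddply()`, which is the kind of function Chapter 3 discusses.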
Chapter 4, Reshaping Datasets, deals with the orientation of datasets. Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and it may need some reorientation before certain types of analysis can be performed. Statistical analysis sometimes requires wide data and sometimes long data, so we need to be able to reshape data fluently to meet the requirements. Important functions from the reshape package will be discussed in this chapter with examples.
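The difference between wide and long layouts can be sketched with base R's `reshape()` function. Note that this is a stand-in for the melt and cast functions of the reshape package covered in Chapter 4, and the `wide` data frame and its column names are assumptions made for illustration.

```r
# Hypothetical wide layout: one row per subject, one column per measurement
wide <- data.frame(
  id      = 1:2,
  score_1 = c(10, 20),
  score_2 = c(15, 25)
)

# Convert to long layout: one row per subject and measurement occasion
long <- reshape(
  wide,
  direction = "long",
  varying   = c("score_1", "score_2"),  # columns to stack
  v.names   = "score",                  # name of the stacked value column
  timevar   = "time",                   # name of the occasion column
  times     = 1:2,                      # occasion labels, one per stacked column
  idvar     = "id"                      # column identifying each subject
)
long
```

In the long layout, each measurement occupies its own row, which is the form many grouping and modelling functions expect.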
Chapter 5, R and Databases, talks about using R together with database software. One of the major limitations of R is that it holds data in RAM, so a dataset must be smaller than the available memory. In practice, a dataset can be larger than the capacity of RAM, and sometimes the length of arrays or vectors exceeds the maximum addressable range. To overcome these two limitations, R can be used with databases. This chapter discusses, with examples, how to interact with databases from R, how to deal with large datasets using specialized packages, and how to manipulate data with sqldf.
Bibliography provides a list of the works cited in the book.
Author: Jaynal Abedin