I am working with data summaries for three datasets: first mpg, then Math Anxiety, then Star Trek Books.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggformula)
Loading required package: scales
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Loading required package: ggridges
New to ggformula? Try the tutorials:
learnr::run_tutorial("introduction", package = "ggformula")
learnr::run_tutorial("refining", package = "ggformula")
library(mosaic)
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
The 'mosaic' package masks several functions from core packages in order to add
additional features. The original behavior of these functions should not be affected by this.
Attaching package: 'mosaic'
The following object is masked from 'package:Matrix':
mean
The following object is masked from 'package:scales':
rescale
The following objects are masked from 'package:dplyr':
count, do, tally
The following object is masked from 'package:purrr':
cross
The following object is masked from 'package:ggplot2':
stat
The following objects are masked from 'package:stats':
binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
quantile, sd, t.test, var
The following objects are masked from 'package:base':
max, mean, min, prod, range, sample, sum
library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
library(skimr)
Attaching package: 'skimr'
The following object is masked from 'package:mosaic':
n_missing
Look at the mpg dataset
mpg
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
The table provides information on various car models from 1999 and 2008, highlighting key specifications. The data allows for a detailed comparison of the cars’ performance and specifications across different years and models.
First 10 rows of the mpg dataset
mpg %>% head(10)
# A tibble: 10 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
The table displays the first 10 rows of the mpg dataset, with details on engine size (displacement), number of cylinders, transmission type (automatic or manual), drivetrain (front-wheel or four-wheel drive), and fuel efficiency in city and highway miles per gallon (MPG). In these rows, the engine displacement ranges from 1.8 to 3.1 liters, and the number of cylinders is either 4 or 6. The city MPG varies from 16 to 21, while highway MPG ranges from 25 to 31.
This summary provides an overview of the specifications and performance characteristics of car models in the dataset.
Inspect - mpg dataset
inspect(mpg)
categorical variables:
name class levels n missing
1 manufacturer character 15 234 0
2 model character 38 234 0
3 trans character 10 234 0
4 drv character 3 234 0
5 fl character 5 234 0
6 class character 7 234 0
distribution
1 dodge (15.8%), toyota (14.5%) ...
2 caravan 2wd (4.7%) ...
3 auto(l4) (35.5%), manual(m5) (24.8%) ...
4 f (45.3%), 4 (44%), r (10.7%)
5 r (71.8%), p (22.2%), e (3.4%) ...
6 suv (26.5%), compact (20.1%) ...
quantitative variables:
name class min Q1 median Q3 max mean sd n
1 displ numeric 1.6 2.4 3.3 4.6 7 3.471795 1.291959 234
2 year integer 1999.0 1999.0 2003.5 2008.0 2008 2003.500000 4.509646 234
3 cyl integer 4.0 4.0 6.0 8.0 8 5.888889 1.611534 234
4 cty integer 9.0 14.0 17.0 19.0 35 16.858974 4.255946 234
5 hwy integer 12.0 18.0 24.0 27.0 44 23.440171 5.954643 234
missing
1 0
2 0
3 0
4 0
5 0
The inspection of the mpg dataset reveals two types of variables: categorical and quantitative. Categorical variables include manufacturer, model, transmission, drivetrain, fl (fuel type), and class, with a total of 234 entries and no missing data. Quantitative variables, such as engine displacement, year, number of cylinders, city miles per gallon, and highway miles per gallon, are summarized with key statistics. For example, engine displacement ranges from 1.6 to 7 liters, and city MPG varies from 9 to 35, with an average of 16.86. This overview highlights the structure and details of the dataset, providing both descriptive and numerical insights.
Skim - mpg dataset
skim(mpg)
Data summary
Name                     mpg
Number of rows           234
Number of columns        11
_______________________
Column type frequency:
  character              6
  numeric                5
________________________
Group variables          None
Variable type: character
skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
manufacturer           0              1    4   10      0        15           0
model                  0              1    2   22      0        38           0
trans                  0              1    8   10      0        10           0
drv                    0              1    1    1      0         3           0
fl                     0              1    1    1      0         5           0
class                  0              1    3   10      0         7           0
Variable type: numeric
skim_variable  n_missing  complete_rate     mean    sd      p0     p25     p50     p75  p100  hist
displ                  0              1     3.47  1.29     1.6     2.4     3.3     4.6     7  ▇▆▆▃▁
year                   0              1  2003.50  4.51  1999.0  1999.0  2003.5  2008.0  2008  ▇▁▁▁▇
cyl                    0              1     5.89  1.61     4.0     4.0     6.0     8.0     8  ▇▁▇▁▇
cty                    0              1    16.86  4.26     9.0    14.0    17.0    19.0    35  ▆▇▃▁▁
hwy                    0              1    23.44  5.95    12.0    18.0    24.0    27.0    44  ▅▅▇▁▁
All variables have complete data with no missing values. The dataset covers car specifications and fuel efficiency and is ready for further analysis.
Data Dictionary
Quantitative Data
Engine Displacement (dbl): The engine size in liters.
Model Year (int): The year of the car’s model, either 1999 or 2008 in this dataset.
City Mileage (int): Miles per gallon (MPG) in city driving conditions.
Highway Mileage (int): Miles per gallon (MPG) in highway driving conditions.
Qualitative Data
Manufacturer (chr): The car’s manufacturer, e.g., Audi, Toyota.
Model (chr): The specific car model, e.g., A4, Corolla.
Transmission (chr): The type of transmission, e.g., auto (automatic), manual (m5/m6).
Drivetrain (chr): The type of drivetrain, e.g., f (front-wheel drive), 4 (four-wheel drive).
Fuel (chr): The type of fuel used, e.g., p (premium), r (regular).
Class of Vehicle (chr): The category of the vehicle, e.g., compact, SUV.
Cylinders (int): The number of cylinders in the engine (4, 5, 6, or 8); stored as an integer but treated as a category here.
Here, several variables (cyl, fl, drv, class, and trans) have been converted from their original data types to factors. This transformation changes them into categorical variables, making them more suitable for analysis involving groupings or classifications.
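The conversion code itself is not shown in this output. A minimal sketch of how it could be done with dplyr follows; the object name `mpg_modified` is an assumption for illustration, not a name taken from the original analysis.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Convert the five grouping-style columns to factors.
# `mpg_modified` is a hypothetical name chosen for this sketch.
mpg_modified <- mpg %>%
  mutate(across(c(cyl, fl, drv, class, trans), as.factor))

# Check the new column types
glimpse(mpg_modified)
```

Note that `as.factor()` orders levels alphabetically, so the level order may differ from the one used in the original analysis.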
Average Highway MPG grouped by the number of cylinders
The table summarizes the average highway miles per gallon for cars grouped by the number of cylinders. Cars with 4 cylinders have the highest average highway MPG at 28.80, followed closely by 5-cylinder cars with 28.75, though the 5-cylinder group only includes 4 cars. Cars with 6 cylinders average 22.82 MPG, while cars with 8 cylinders have the lowest fuel efficiency, averaging 17.63 MPG. Overall, the data shows that vehicles with fewer cylinders tend to be more fuel-efficient on the highway, with MPG decreasing as the number of cylinders increases.
Average City MPG grouped by the number of cylinders
The table summarizes the average city miles per gallon based on the number of cylinders. Cars with 4 cylinders have the highest average city MPG at 21.01, followed by 5-cylinder cars at 20.50, though the sample size for 5-cylinder cars is small with only 4 cars. Vehicles with 6 cylinders average 16.22 MPG, while those with 8 cylinders have the lowest city MPG at 12.57. This data shows that cars with fewer cylinders tend to have better fuel efficiency in city driving.
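The two cylinder summaries described above are not accompanied by code or tables in this output. One possible way to compute both averages in a single pass is sketched here; the names `mpg_by_cyl`, `average_hwy`, `average_cty`, and `count` are assumptions modeled on the column names that appear in later tables.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Average highway and city MPG per cylinder count, with group sizes.
# `mpg_by_cyl` is a hypothetical name for this sketch.
mpg_by_cyl <- mpg %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(
    average_hwy = mean(hwy),
    average_cty = mean(cty),
    count = n()
  )

mpg_by_cyl
```

Including `count` alongside the means makes small groups, such as the four 5-cylinder cars, visible when reading the averages.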
Average Highway MPG grouped by the number of cylinders and fuel type
`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 13 × 4
# Groups: cyl [4]
cyl fl average_hwy count
<fct> <fct> <dbl> <int>
1 4 p 27.8 22
2 4 r 28.3 55
3 4 d 43 3
4 4 c 36 1
5 5 r 28.8 4
6 6 p 25.3 17
7 6 r 22.2 60
8 6 e 17 1
9 6 d 22 1
10 8 p 20.8 13
11 8 r 17.5 49
12 8 e 12.7 7
13 8 d 17 1
The table shows the average highway MPG for cars based on cylinders and fuel type. Cars with 4 cylinders are more fuel-efficient, with MPG ranging from 27.82 for premium fuel to 43 for diesel, though the alternative fuel samples are small (three diesel cars and one CNG car). For 6-cylinder cars, MPG falls between 25.29 (premium) and 17 (ethanol). 8-cylinder cars are the least efficient, with 17.51 MPG for regular fuel and 12.71 for ethanol. Overall, cars with fewer cylinders and certain fuels, like diesel, achieve better highway fuel efficiency.
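The `summarise()` message above appears whenever a summary is computed over more than one grouping variable. A hedged sketch of the two-variable grouping, with `.groups = "drop"` added to silence the message and return an ungrouped tibble, could look like this (the name `hwy_by_cyl_fl` is an assumption):

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Average highway MPG per cylinder count and fuel type.
# .groups = "drop" removes the leftover grouping and the message.
hwy_by_cyl_fl <- mpg %>%
  mutate(across(c(cyl, fl), as.factor)) %>%
  group_by(cyl, fl) %>%
  summarise(average_hwy = mean(hwy), count = n(), .groups = "drop")

hwy_by_cyl_fl
```

Because `as.factor()` sorts levels alphabetically, the rows within each cylinder group may appear in a different order than in the table above; the values themselves should match.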
Average City MPG grouped by the number of cylinders and fuel type
`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 13 × 4
# Groups: cyl [4]
cyl fl average_cty count
<fct> <fct> <dbl> <int>
1 4 p 19.9 22
2 4 r 20.8 55
3 4 d 32.3 3
4 4 c 24 1
5 5 r 20.5 4
6 6 p 16.8 17
7 6 r 16.1 60
8 6 e 11 1
9 6 d 17 1
10 8 p 13.8 13
11 8 r 12.7 49
12 8 e 9.57 7
13 8 d 14 1
The table provides a breakdown of the average city miles per gallon by cylinder count and fuel type (fl). For 4-cylinder cars, average city MPG varies by fuel type, with diesel cars achieving the highest at 32.33 MPG, followed by compressed natural gas (24 MPG) and regular fuel (20.78 MPG). 6-cylinder cars show lower MPG, with the single ethanol-fueled car having the lowest efficiency at 11 MPG. For 8-cylinder cars, diesel averages 14 MPG (a single car), premium 13.77 MPG, regular 12.7 MPG, and ethanol the lowest at 9.57 MPG.
Average Highway MPG for different car manufacturers
# A tibble: 15 × 2
manufacturer mean_mileage_manf
<chr> <dbl>
1 audi 26.4
2 chevrolet 21.9
3 dodge 17.9
4 ford 19.4
5 honda 32.6
6 hyundai 26.9
7 jeep 17.6
8 land rover 16.5
9 lincoln 17
10 mercury 18
11 nissan 24.6
12 pontiac 26.4
13 subaru 25.6
14 toyota 24.9
15 volkswagen 29.2
The table displays the average highway miles per gallon for different car manufacturers. Honda leads with the highest average highway MPG at 32.6, followed by Volkswagen (29.22), Hyundai (26.86), and Audi (26.44). Pontiac, Subaru, Toyota, and Nissan also show relatively high fuel efficiency, with averages above 24 MPG. In contrast, Land Rover, Lincoln, Jeep, and Dodge have the lowest average highway MPG, ranging from 16.5 to 17.9.
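Rankings like this are easier to read off when the summary is sorted. A minimal sketch, reusing the `mean_mileage_manf` column name from the table and assuming the object name `hwy_by_manf`:

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Average highway MPG per manufacturer, highest first.
# `hwy_by_manf` is a hypothetical name for this sketch.
hwy_by_manf <- mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_mileage_manf = mean(hwy)) %>%
  arrange(desc(mean_mileage_manf))

# The most and least efficient manufacturers
head(hwy_by_manf, 3)
tail(hwy_by_manf, 3)
```

Swapping `hwy` for `cty` gives the corresponding city ranking.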
Average City MPG for different car manufacturers
# A tibble: 15 × 2
manufacturer mean_mileage_manf
<chr> <dbl>
1 audi 17.6
2 chevrolet 15
3 dodge 13.1
4 ford 14
5 honda 24.4
6 hyundai 18.6
7 jeep 13.5
8 land rover 11.5
9 lincoln 11.3
10 mercury 13.2
11 nissan 18.1
12 pontiac 17
13 subaru 19.3
14 toyota 18.5
15 volkswagen 20.9
The table displays the average city miles per gallon for different car manufacturers. Honda leads with the highest average city MPG at 24.44, followed by Volkswagen (20.93) and Subaru (19.29). Other manufacturers like Toyota (18.53), Nissan (18.08), and Audi (17.61) show moderate fuel efficiency. In contrast, Lincoln, Land Rover, Dodge, Mercury, and Jeep have the lowest city MPG, ranging from 11.3 to 13.5.
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 599 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Age;Gender;Grade;AMAS;RCMAS;Arith
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The Math Anxiety dataset contains 599 rows and focuses on variables such as Age, Gender, Grade, AMAS (Abbreviated Math Anxiety Scale), RCMAS (Revised Children’s Manifest Anxiety Scale), and Arithmetic scores. Each row represents an individual student with their respective attributes. Read with the default comma delimiter, all six fields land in a single column, which is why the file has to be read again with the correct delimiter.
Rows: 599 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (2): Gender, Grade
dbl (3): AMAS, RCMAS, Arith
num (1): Age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset is read using a semicolon (;) delimiter.
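The difference between the two reads can be reproduced on a small stand-in file. This is a sketch only: the three data rows and the temporary file are made up for illustration, since the original file path is not shown in this output.

```r
library(readr)

# Hypothetical sample in the same semicolon-separated layout
sample_lines <- c(
  "Age;Gender;Grade;AMAS;RCMAS;Arith",
  "931;Boy;Primary;16;20;6",
  "938;Girl;Primary;24;25;5",
  "1200;Boy;Secondary;13;14;7"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Comma delimiter: the whole line becomes one column
wrong <- read_csv(tmp, show_col_types = FALSE)

# Semicolon delimiter: six separate columns
right <- read_delim(tmp, delim = ";", show_col_types = FALSE)

ncol(wrong)
ncol(right)
```

readr also offers `read_csv2()` for semicolon-separated files, but it additionally assumes a comma as the decimal mark, so `read_delim()` with an explicit `delim` is the more neutral choice when the decimal convention is unknown.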
The table displays the first 10 rows of the Math Anxiety dataset, which consists of 6 variables: Age, Gender, Grade, AMAS (Abbreviated Math Anxiety Scale), RCMAS (Revised Children’s Manifest Anxiety Scale), and Arithmetic scores. The scores in AMAS, RCMAS, and Arithmetic vary across the students, showcasing different levels of math anxiety and performance. For instance, one student has an AMAS score of 9 and RCMAS score of 20, while another has an AMAS score of 28 and an RCMAS score of 32, indicating variability in anxiety levels among students.
The glimpse provides a quick overview of the structure and types of data within the dataset.
Inspect - math_anxiety
inspect(math_anxiety)
categorical variables:
name class levels n missing
1 Gender character 2 599 0
2 Grade character 2 599 0
distribution
1 Boy (53.9%), Girl (46.1%)
2 Primary (66.9%), Secondary (33.1%)
quantitative variables:
name class min Q1 median Q3 max mean sd n missing
1 Age numeric 37 1061.5 1208 1418.5 1875 1246.49249 223.112183 599 0
2 AMAS numeric 4 18.0 22 26.5 45 21.98164 6.597962 599 0
3 RCMAS numeric 1 14.0 19 25.0 41 19.24040 7.566802 599 0
4 Arith numeric 0 4.0 6 7.0 8 5.30217 2.105220 599 0
The summary of the Math Anxiety dataset, based on the inspect() function, reveals both categorical and quantitative variables. The categorical variables are Gender and Grade, with 53.9% of the entries labeled as Boy and 46.1% as Girl. In terms of Grade, 66.9% belong to the Primary level, while 33.1% are from the Secondary level. The quantitative variables include Age, AMAS (Abbreviated Math Anxiety Scale), RCMAS (Revised Children’s Manifest Anxiety Scale), and Arith (Arithmetic ability). The raw Age values range from 37 to 1875, with a mean of 1246.49 and a standard deviation of 223.11; these are clearly not ages in years, and the column is rescaled in a later step. AMAS scores range from 4 to 45, with a mean of 21.98 and a standard deviation of 6.60. For RCMAS, the range is 1 to 41, with a mean of 19.24 and a standard deviation of 7.57. Finally, Arith scores vary from 0 to 8, with a mean of 5.30 and a standard deviation of 2.11.
Skim - math_anxiety
skim(math_anxiety)
Data summary
Name                     math_anxiety
Number of rows           599
Number of columns        6
_______________________
Column type frequency:
  character              2
  numeric                4
________________________
Group variables          None
Variable type: character
skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Gender                 0              1    3    4      0         2           0
Grade                  0              1    7    9      0         2           0
Variable type: numeric
skim_variable  n_missing  complete_rate     mean      sd  p0     p25   p50     p75  p100  hist
Age                    0              1  1246.49  223.11  37  1061.5  1208  1418.5  1875  ▁▁▇▇▃
AMAS                   0              1    21.98    6.60   4    18.0    22    26.5    45  ▂▆▇▃▁
RCMAS                  0              1    19.24    7.57   1    14.0    19    25.0    41  ▂▇▇▅▁
Arith                  0              1     5.30    2.11   0     4.0     6     7.0     8  ▂▃▃▇▇
The dataset combines two categorical and four numerical variables describing the participants and their anxiety and arithmetic scores. It is complete, with no missing data.
Data Dictionary
Quantitative Data
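Age (dbl): The age of the participant, in years once the raw values are divided by 120 (see the transformation that follows).
AMAS (dbl): Abbreviated Math Anxiety Scale (AMAS) score, indicating the level of math anxiety.
RCMAS (dbl): Revised Children’s Manifest Anxiety Scale (RCMAS) score, indicating the child’s anxiety level.
Arith (dbl): Arithmetic score, ranging from 0 to 8 in this dataset.
Qualitative Data
Gender (chr): Boy or Girl.
Grade (chr): Primary or Secondary.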
In this transformation, the Age column has been scaled down by dividing the values by 120, and the Gender column has been converted into a factor with two levels: “Boy” and “Girl.”
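The transformation code is not shown in this output. A minimal sketch of the two steps, run on a made-up three-row stand-in for the dataset (the rows and the names `math_anxiety_sample` and `math_anxiety_modified` are assumptions):

```r
library(dplyr)

# Hypothetical stand-in rows; the real dataset has 599 rows
math_anxiety_sample <- tibble(
  Age    = c(931, 938, 1200),
  Gender = c("Boy", "Girl", "Boy"),
  AMAS   = c(16, 24, 13)
)

# Rescale Age by 120 and turn Gender into a factor
math_anxiety_modified <- math_anxiety_sample %>%
  mutate(
    Age    = Age / 120,
    Gender = as.factor(Gender)
  )

math_anxiety_modified
```

With this divisor, a raw value of 931 becomes roughly 7.76, which matches the scale of the Age values in the grouped summaries that follow.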
Summary of Average AMAS Scores and Count by Gender
# A tibble: 2 × 3
Gender average_AMAS count
<fct> <dbl> <int>
1 Boy 21.2 323
2 Girl 22.9 276
The summary of average AMAS scores grouped by gender reveals that girls have a slightly higher average AMAS score (22.93) compared to boys (21.17). The dataset contains 323 boys and 276 girls. This suggests that while boys and girls show similar levels of math anxiety, girls have a marginally higher average score in this dataset.
Summary of Average AMAS Scores and Count by Gender and Age Group
`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.
# A tibble: 474 × 4
# Groups: Gender [2]
Gender Age average_AMAS count
<fct> <dbl> <dbl> <int>
1 Boy 7.76 16 1
2 Boy 7.82 24 1
3 Boy 7.83 13 1
4 Boy 7.9 14 1
5 Boy 7.91 11 1
6 Boy 7.94 29 1
7 Boy 7.96 16.5 2
8 Boy 7.98 9 1
9 Boy 7.98 29 1
10 Boy 7.99 9 1
# ℹ 464 more rows
Because Age is a continuous value here, grouping by Gender and Age yields 474 groups, almost all containing a single student, so the averages are mostly individual scores rather than group summaries. To explore how math anxiety varies with age for boys and girls, Age would first need to be binned into bands (for example, whole years).
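One way to get usable age groups is to bin Age before grouping. The sketch below is an assumption about how this could be done, not the original analysis: it uses whole-year bands via `floor()` on a made-up five-row sample, and the names `amas_sample`, `age_band`, and `amas_by_band` are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in rows on the rescaled Age scale
amas_sample <- tibble(
  Gender = factor(c("Boy", "Boy", "Boy", "Girl", "Girl")),
  Age    = c(7.76, 7.82, 8.40, 7.90, 8.10),
  AMAS   = c(16, 24, 13, 20, 22)
)

# Bin Age into whole-year bands, then summarise per Gender and band
amas_by_band <- amas_sample %>%
  mutate(age_band = floor(Age)) %>%
  group_by(Gender, age_band) %>%
  summarise(average_AMAS = mean(AMAS), count = n(), .groups = "drop")

amas_by_band
```

`cut()` with explicit breaks is an alternative when the bands should be wider than one year.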
Summary of Average RCMAS Scores and Count by Gender
# A tibble: 2 × 3
Gender average_RCMAS count
<fct> <dbl> <int>
1 Boy 18.1 323
2 Girl 20.6 276
The table shows a summary of average RCMAS (Revised Children’s Manifest Anxiety Scale) scores grouped by gender. The average RCMAS score for boys is 18.12, based on 323 participants, while the average RCMAS score for girls is 20.55, based on 276 participants. This suggests that, on average, girls exhibit slightly higher anxiety levels as measured by the RCMAS scale compared to boys.
Summary of Average RCMAS Scores and Count by Gender and Age Group
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 16369 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): title;author;date;publisher;identifier;series;subseries;nchap;nword...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
startrek_data
# A tibble: 16,369 × 1
title;author;date;publisher;identifier;series;subseries;nchap;nword;nchar;d…¹
<chr>
1 "Star Trek: Star Trek Movie Tie-In;Alan Dean Foster;2009-05-12;Simon and Sch…
2 "Starfleet Academy: The Delta Anomaly;Rick Barba;2010-11-02;Simon Spotlight;…
3 "Starfleet Academy: The Edge;Rudy Josephs;2010-12-28;Simon Spotlight;9781442…
4 "Starfleet Academy: The Gemini Agent;Rick Barba;2011-06-28;Simon Spotlight;9…
5 "Starfleet Academy: The Assassination Game;Alan Gratz;2012-06-26;Simon Spotl…
6 "Star Trek: Into Darkness;Alan Dean Foster;2013-05-21;Gallery Books;97814767…
7 "Captain's Table 1: War Dragons;James T. Hirk;1998-06-01;Pocket Books;978143…
8 "Captain's Table 5: Once Burned;Mackenzie;1998-10-01;Pocket Books;9780743455…
9 "Captain's Table 6: Where Sea Meets Sky;Christopher Pike;1998-10-01;Pocket B…
10 "For my brother, Ray, who introduced me to Star Trek and helped tune it in b…
# ℹ 16,359 more rows
# ℹ abbreviated name:
# ¹`title;author;date;publisher;identifier;series;subseries;nchap;nword;nchar;dedication`
Read with the default comma delimiter, the Star Trek Books file comes in as 16,369 rows in a single column. These rows are not all books: the fields are actually separated by semicolons, and long free-text fields appear to spill across rows (row 10 above looks like a dedication fragment, not a title). The header shows the intended variables: book title, author, publication date, publisher, an identifier, the series and subseries the book belongs to, the number of chapters, words, and characters, and a field for author dedications.
Rows: 783 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (7): title, author, publisher, identifier, series, subseries, dedication
dbl (3): nchap, nword, nchar
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset is read using a semicolon (;) delimiter.
Dataset - startrek_data
startrek_data
# A tibble: 783 × 11
title author date publisher identifier series subseries nchap nword
<chr> <chr> <date> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Star Tr… Alan … 2009-05-12 Simon an… 1439163391 AV <NA> 18 77035
2 Starfle… Rick … 2010-11-02 Simon Sp… 978144241… AV Starflee… 14 40129
3 Starfle… Rudy … 2010-12-28 Simon Sp… 978144241… AV Starflee… 31 52547
4 Starfle… Rick … 2011-06-28 Simon Sp… 978144241… AV Starflee… 13 39495
5 Starfle… Alan … 2012-06-26 Simon Sp… 978144242… AV Starflee… 30 62030
6 Star Tr… Alan … 2013-05-21 Gallery … 978147671… AV <NA> 17 77438
7 Captain… James… 1998-06-01 Pocket B… 978143910… CT <NA> 21 95110
8 Captain… Macke… 1998-10-01 Pocket B… 978074345… CT <NA> 26 76392
9 Captain… Chris… 1998-10-01 Pocket B… 978143910… CT <NA> 34 78678
10 The Cap… John … 2000-03-01 Pocket B… 978074340… CT <NA> 176 436682
# ℹ 773 more rows
# ℹ 2 more variables: nchar <dbl>, dedication <chr>
First 10 rows of the Star Trek Book dataset
startrek_data %>% head(10)
# A tibble: 10 × 11
title author date publisher identifier series subseries nchap nword
<chr> <chr> <date> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Star Tr… Alan … 2009-05-12 Simon an… 1439163391 AV <NA> 18 77035
2 Starfle… Rick … 2010-11-02 Simon Sp… 978144241… AV Starflee… 14 40129
3 Starfle… Rudy … 2010-12-28 Simon Sp… 978144241… AV Starflee… 31 52547
4 Starfle… Rick … 2011-06-28 Simon Sp… 978144241… AV Starflee… 13 39495
5 Starfle… Alan … 2012-06-26 Simon Sp… 978144242… AV Starflee… 30 62030
6 Star Tr… Alan … 2013-05-21 Gallery … 978147671… AV <NA> 17 77438
7 Captain… James… 1998-06-01 Pocket B… 978143910… CT <NA> 21 95110
8 Captain… Macke… 1998-10-01 Pocket B… 978074345… CT <NA> 26 76392
9 Captain… Chris… 1998-10-01 Pocket B… 978143910… CT <NA> 34 78678
10 The Cap… John … 2000-03-01 Pocket B… 978074340… CT <NA> 176 436682
# ℹ 2 more variables: nchar <dbl>, dedication <chr>
The first 10 rows of the Star Trek Books dataset display information about books published by various publishers, such as Simon and Schuster, Simon Spotlight, Gallery Books, and Pocket Books. The dataset shows identifiers and indicates the series the books belong to, like “AV” or “CT.” Some entries also belong to subseries, such as “Starfleet Academy.” Additionally, the dataset provides details on the number of chapters (nchap) and the total word count (nword) for each book.
This provides a glimpse into the 783 rows and 11 columns, summarizing various characteristics of Star Trek books.
Inspect - startrek_data
inspect(startrek_data)
categorical variables:
name class levels n missing
1 title character 781 783 0
2 author character 277 783 0
3 publisher character 21 772 11
4 identifier character 783 783 0
5 series character 28 783 0
6 subseries character 15 56 727
7 dedication character 372 372 411
distribution
1 Kobayashi Maru (0.3%), Warped (0.3%) ...
2 Peter David (4.9%) ...
3 Pocket Books (67.4%) ...
4 (%) ...
5 TOS (26.8%), TNG (18.6%), SCE (10.7%) ...
6 Typhon Pact (16.1%) ...
7 (%) ...
Date variables:
name class first last min_diff max_diff n missing
1 date Date 1967-01-01 2017-11-28 0 days 485 days 783 0
quantitative variables:
name class min Q1 median Q3 max mean sd
1 nchap numeric 1 13 21 29.0 373 24.58816 21.61247
2 nword numeric 782 52500 70730 90994.5 687175 76190.07535 52453.34633
3 nchar numeric 4337 310520 415964 555866.5 4484069 461822.36271 326062.44928
n missing
1 760 23
2 783 0
3 783 0
The inspection of the Star Trek book dataset reveals a comprehensive breakdown of variables in categorical, date, and quantitative formats. Among the categorical variables, we have entries such as title, author, publisher, identifier, series, subseries, and dedication. These variables contain character data, with subseries and dedication having the most missing values. For date variables, we have the date variable that records the publication dates of books, ranging from 1967-01-01 to 2017-11-28, spanning a 50-year period. The quantitative variables include nchap (number of chapters), nword (number of words), and nchar (number of characters), all numeric with varying distributions. The average number of chapters per book is approximately 24.59, while the average word count is around 76190 words, and character count averages around 461822 characters.
Skim - startrek_data
skim(startrek_data)
Data summary
Name                     startrek_data
Number of rows           783
Number of columns        11
_______________________
Column type frequency:
  character              7
  Date                   1
  numeric                3
________________________
Group variables          None
Variable type: character
skim_variable  n_missing  complete_rate  min    max  empty  n_unique  whitespace
title                  0           1.00    4     58      0       781           0
author                 0           1.00    2    138      0       277           0
publisher             11           0.99    7     26      0        21           0
identifier             0           1.00   10     41      0       783           0
series                 0           1.00    2      6      0        28           0
subseries            727           0.07    4     23      0        15           0
dedication           411           0.48   98  97953      0       372           0
Variable type: Date
skim_variable  n_missing  complete_rate  min         max         median      n_unique
date                   0              1  1967-01-01  2017-11-28  2001-12-14       577
Variable type: numeric
skim_variable  n_missing  complete_rate       mean         sd    p0     p25     p50       p75     p100  hist
nchap                 23           0.97      24.59      21.61     1      13      21      29.0      373  ▇▁▁▁▁
nword                  0           1.00   76190.08   52453.35   782   52500   70730   90994.5   687175  ▇▁▁▁▁
nchar                  0           1.00  461822.36  326062.45  4337  310520  415964  555866.5  4484069  ▇▁▁▁▁
The categorical variables like title, author, and publisher are mostly complete, although subseries has significant missing values (727 missing entries) and dedication has 411 missing entries. The date column spans from 1967 to 2017 with a median date around December 14, 2001. For numeric variables, nchap (number of chapters) has 23 missing values, with an average of around 25 chapters per book. The nword (number of words) and nchar (number of characters) columns are complete, showing an average of 76,190 words and 461,822 characters per book.
Data Dictionary
Quantitative Data
nchap (dbl): The number of chapters in the book.
nword (dbl): The number of words in the book.
nchar (dbl): The number of characters (including spaces and punctuation) in the book.
date (date): The publication date of the book.
Qualitative Data
title (chr): The title of the book.
author (chr): The name of the author who wrote the book.
publisher (chr): The publishing company responsible for releasing the book.
identifier (chr): A unique identifier for the book.
series (chr): The main series of the Star Trek universe to which the book belongs (e.g., TOS for The Original Series, TNG for The Next Generation).
subseries (chr): A subcategory or subseries within the main Star Trek series (e.g., Typhon Pact, Starfleet Academy).
dedication (chr): A text string containing the book’s dedication, if applicable.
Summary of Average Number of Characters and Book Count by Author
# A tibble: 277 × 3
author average_characters count
<chr> <dbl> <int>
1 A. C. Crispin 571858 1
2 A.C. Crispin 480611. 3
3 Aaron Rosenberg 87922. 3
4 Adam “mojo” Lebowitz; Robert Bonchune; Jonathan Lan… 144868 1
5 Alan Dean Foster 498784. 2
6 Alan Gratz 349595 1
7 Allyn Gibson 144935 1
8 Andrew J. Robinson 641696 1
9 Andy Mangels 576557 1
10 Andy Mangels and Michael A. Martin 571987 1
# ℹ 267 more rows
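This gives insight into the volume of work and average text length produced by different authors in the dataset. Note that author names are not spelled consistently: “A. C. Crispin” and “A.C. Crispin” are counted as two separate authors, so some per-author counts are split across spelling variants.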
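The table above lists “A. C. Crispin” and “A.C. Crispin” as separate authors. The sketch below shows one possible way to merge such variants before summarising; the sample rows, their character counts, and the names `books_sample` and `chars_by_author` are made up for illustration, and the regular expression only handles the specific pattern of spaced initials.

```r
library(dplyr)

# Hypothetical stand-in rows; the nchar values are invented
books_sample <- tibble(
  author = c("A. C. Crispin", "A.C. Crispin", "A.C. Crispin", "Alan Gratz"),
  nchar  = c(500000, 400000, 600000, 349595)
)

# Remove the space between consecutive initials ("A. C." -> "A.C."),
# then compute the average character count and book count per author
chars_by_author <- books_sample %>%
  mutate(author = gsub("(?<=\\.)\\s+(?=[A-Z]\\.)", "", author, perl = TRUE)) %>%
  group_by(author) %>%
  summarise(average_characters = mean(nchar), count = n())

chars_by_author
```

Other kinds of inconsistency (different capitalisation, co-author lists with varying separators) would need their own cleaning rules.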
Summary of Average Number of Words and Book Count by Publisher
The summary of the Star Trek book dataset, grouped by publisher, provides insights into the average number of words and the total book count for each publisher. For example, Abrams Publications has one book with an average of 132,041 words, while Aladdin has 28 books with an average of 23,800 words.
Summary of Average Word Count and Total Books by Series
# A tibble: 28 × 3
series average_words count
<chr> <dbl> <int>
1 AV 58112. 6
2 CT 157688. 5
3 DS9 91641. 83
4 DSC 89367 1
5 DTI 59215. 5
6 ENT 82724. 19
7 KE 74668. 4
8 MIR 115916. 5
9 MISC 89974. 13
10 MYR 145127 3
# ℹ 18 more rows
The grouping is done by series, and it shows a range of series with their calculated average word counts and the total number of books within each series. For example, the “TOS” series has 210 books with an average word count of 76,522.67, while the “CT” series has only 5 books but a much higher average word count of 157,687.80. There is also a wide variation in word count across different series, such as “YA-DS9” having only 23,298 words on average compared to more substantial works like “MYR” with 145,127 words. This summary helps identify how the different series compare in terms of length and volume, highlighting the diversity of content.
Summary of Total Books and Average Word Count by Series and Author
`summarise()` has grouped output by 'series'. You can override using the
`.groups` argument.
# A tibble: 419 × 4
# Groups: series [28]
series author total_books average_words
<chr> <chr> <int> <dbl>
1 AV Alan Dean Foster 2 77236.
2 AV Alan Gratz 1 62030
3 AV Rick Barba 2 39812
4 AV Rudy Josephs 1 52547
5 CT Christopher Pike 1 78678
6 CT James T. Hirk 1 95110
7 CT John J. Ordover and Dean Wesley Smith 1 436682
8 CT Keith R.A. Decandido 1 101577
9 CT Mackenzie 1 76392
10 DS9 Andrew J. Robinson 1 113304
# ℹ 409 more rows
This allows for an analysis of how different authors contribute to various Star Trek series in terms of the number of books they have written and the average length (in words) of their works.
Summary of Total Books and Average Character Count by Year and Author
`summarise()` has grouped output by 'year(date)'. You can override using the
`.groups` argument.
# A tibble: 624 × 4
# Groups: year(date) [51]
`year(date)` author total_books average_characters
<dbl> <chr> <int> <dbl>
1 1967 James Blish 1 235524
2 1968 James Blish 1 232094
3 1969 James Blish 1 224369
4 1970 James Blish 1 207001
5 1971 James Blish 1 239859
6 1972 James Blish 4 268216.
7 1973 James Blish 1 314575
8 1974 James Blish 1 310749
9 1975 James Blish 1 332807
10 1976 Sondra Marshak 1 431759
# ℹ 614 more rows
This helps analyze trends over time, showing how many books each author contributed in a particular year and the typical length of those books in terms of characters.
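The `year(date)` column name in the output suggests the grouping was done directly on `year(date)`. A sketch of the same idea with a named `year` column, which gives a cleaner header, follows; the three sample rows (two of them with invented character counts) and the names `books_by_year` and `chars_by_year` are assumptions.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in rows; only the first nchar value comes from the table
books_by_year <- tibble(
  date   = as.Date(c("1967-01-01", "1972-02-01", "1972-11-01")),
  author = c("James Blish", "James Blish", "James Blish"),
  nchar  = c(235524, 260000, 270000)
)

# Extract the publication year, then summarise per year and author
chars_by_year <- books_by_year %>%
  mutate(year = year(date)) %>%
  group_by(year, author) %>%
  summarise(
    total_books = n(),
    average_characters = mean(nchar),
    .groups = "drop"
  )

chars_by_year
```

Naming the column in `mutate()` (or writing `group_by(year = year(date), author)`) avoids the backticked `` `year(date)` `` header seen above.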
Summary of Total Books and Average Character Count by Publisher and Series
This helps in analyzing the contribution of different publishers to various Star Trek series and the typical length of books published within those series. It provides insight into the publishing trends, helping to compare the output volume and book length across different series and publishers.
Summary of Average Word Count and Total Books by Subseries
# A tibble: 16 × 3
subseries average_words count
<chr> <dbl> <int>
1 Academy 106196 1
2 Dark Passions 52072. 2
3 Day of Honor 116344. 6
4 Destiny 147208 4
5 Dominion War 70683. 5
6 Gateways 116709 1
7 Mirror Universe Trilogy 95785. 3
8 Prey 97729 3
9 Section 31 75524. 6
10 Starfleet Academy 48550. 4
11 The Badlands 60314. 2
12 The Brave and the Bold 66966 2
13 The Fall 95050. 5
14 Totality 80817. 3
15 Typhon Pact 124866. 9
16 <NA> 74781. 727
The summarized data from the Star Trek book dataset provides insights into the average word count and total number of books for each subseries. For example, the “Academy” subseries has 1 book with an average word count of 106,196, while “Day of Honor” spans 6 books with an average of 116,344 words. The “Destiny” subseries consists of 4 books averaging 147,208 words, and the “Typhon Pact” subseries features 9 books with an average of 124,866 words. Some subseries, like “Prey,” with 3 books averaging 97,729 words, reflect moderately sized collections. Finally, the NA row accounts for the 727 books with no recorded subseries, which average about 74,781 words.