Session 2 — Data wrangling

Language Analytics in R

Author

Peter Gilles

0.1 Learning objectives

By the end of this session you will be able to:

Use dplyr verbs to filter, select, mutate, arrange, and summarize data.
Chain operations with the pipe (|>).
Use group_by() for group-wise summaries.
Combine tidyr operations (pivot, separate, unite) where relevant.
Apply the tidyverse workflow to language and corpus data.

1 Tidyverse and dplyr

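The tidyverse: a collection of R packages for data manipulation and visualization that share a common design and common data structures.
dplyr verbs: each verb takes a data frame as its first argument and returns a new data frame; columns are referred to by name, without quoting.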

1.1 Filter and select

filter() — keep rows that satisfy a condition.
select() — keep or drop columns (with helpers like starts_with(), contains()).

[Add code chunks and examples using a linguistic or corpus dataset.]
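
A minimal sketch of what such a chunk could look like. The `tokens` table and all its values are invented for illustration and are not course data; the column names (`doc_id`, `token`, `lemma`, `pos`, `token_len`) are assumptions.

```r
library(dplyr)

# Hypothetical toy token table; all values are invented for illustration.
tokens <- tibble(
  doc_id    = c("d1", "d1", "d1", "d2", "d2", "d2"),
  token     = c("Houses", "are", "big", "The", "house", "stood"),
  lemma     = c("house", "be", "big", "the", "house", "stand"),
  pos       = c("NOUN", "AUX", "ADJ", "DET", "NOUN", "VERB"),
  token_len = c(6L, 3L, 3L, 3L, 5L, 5L)
)

# filter(): keep only rows whose part of speech is NOUN
nouns <- tokens |>
  filter(pos == "NOUN")

# filter() with two conditions; conditions separated by a comma are combined with AND
long_nouns <- tokens |>
  filter(pos == "NOUN", token_len > 5)

# select(): keep columns by name and with a helper
token_cols <- tokens |>
  select(doc_id, starts_with("token"))

# select(): drop a column with a minus sign
no_pos <- tokens |>
  select(-pos)
```

Note that `filter()` changes the number of rows and `select()` the number of columns; neither modifies `tokens` itself, so the result has to be assigned to a name if it is needed later.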

1.2 Mutate and arrange

mutate() — add or modify columns (e.g. derived variables, recoding).
arrange() — sort rows (ascending/descending).

[Add code chunks and examples.]
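
One possible example, again on an invented token table (the data and the new column names `token_lower`, `n_char`, and `word_class` are assumptions made for this sketch). The two-way split into content and function words is a deliberate simplification.

```r
library(dplyr)

# Hypothetical toy token table; values are invented for illustration.
tokens <- tibble(
  token = c("Houses", "are", "big", "The", "house", "stood"),
  pos   = c("NOUN", "AUX", "ADJ", "DET", "NOUN", "VERB")
)

# mutate(): add derived columns and recode an existing one
tokens_derived <- tokens |>
  mutate(
    token_lower = tolower(token),   # lower-cased form
    n_char      = nchar(token),     # token length in characters
    word_class  = if_else(          # recode POS into two coarse classes
      pos %in% c("NOUN", "VERB", "ADJ"),
      "content",
      "function"
    )
  )

# arrange(): longest tokens first, ties sorted alphabetically
tokens_sorted <- tokens_derived |>
  arrange(desc(n_char), token_lower)
```

`arrange()` sorts in ascending order by default; wrapping a column in `desc()` reverses the order for that column only.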

1.3 Summarize and group_by

summarize() — one row per group (or one row total).
group_by() — define groups for group-wise operations (e.g. by lemma, by document).

[Add code chunks: e.g. counts per category, means per group.]
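
A sketch of counts per category and means per group, assuming a small invented token table (the data and the summary column names are hypothetical).

```r
library(dplyr)

# Hypothetical toy token table; values are invented for illustration.
tokens <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
  lemma  = c("house", "be", "big", "the", "house", "stand"),
  pos    = c("NOUN", "AUX", "ADJ", "DET", "NOUN", "VERB"),
  n_char = c(6L, 3L, 3L, 3L, 5L, 5L)
)

# Counts per category: number of tokens for each part of speech
pos_counts <- tokens |>
  group_by(pos) |>
  summarize(n_tokens = n())

# Several summaries per group: one row per document
doc_summary <- tokens |>
  group_by(doc_id) |>
  summarize(
    n_tokens = n(),                # tokens in the document
    n_lemmas = n_distinct(lemma),  # distinct lemmas in the document
    mean_len = mean(n_char)        # mean token length
  )

# count() is a shorthand for group_by() followed by summarize(n = n())
pos_counts_short <- tokens |>
  count(pos, name = "n_tokens")
```

Without a preceding `group_by()`, `summarize()` collapses the whole table to a single row.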

2 Pipes and chaining

Pipe |>: passes the result of one step as the first argument to the next (native pipe, available from R 4.1).
Style: one verb per line, indent after |>.
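
To make the rewriting rule concrete, the sketch below writes the same two-step operation once as a nested call and once as a pipe. The `lemmas` table is invented for illustration.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
lemmas <- tibble(lemma = c("house", "tree", "house", "stand"))

# Nested calls: read from the inside out
nested <- arrange(count(lemmas, lemma), desc(n))

# Piped: read from top to bottom, one verb per line
piped <- lemmas |>
  count(lemma) |>
  arrange(desc(n))

# Both forms are the same computation
same_result <- isTRUE(all.equal(nested, piped))
```

`x |> f(y)` is evaluated as `f(x, y)`, which is why the piped version needs no intermediate names.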

[Add a full pipeline example: load → filter → group_by → summarize → arrange.]
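
A possible shape for such a pipeline, under stated assumptions: the CSV content is invented and written to a temporary file only so that the sketch runs on its own, and base `read.csv()` stands in for whatever loading function the course uses (`readr::read_csv()` would be the tidyverse equivalent).

```r
library(dplyr)

# Write a small invented CSV to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "doc_id,lemma,pos",
  "d1,house,NOUN",
  "d1,be,AUX",
  "d1,house,NOUN",
  "d2,tree,NOUN",
  "d2,house,NOUN",
  "d2,stand,VERB"
), csv_path)

noun_freq <- read.csv(csv_path) |>   # load
  as_tibble() |>
  filter(pos == "NOUN") |>           # keep nouns only
  group_by(lemma) |>                 # one group per lemma
  summarize(
    freq   = n(),                    # tokens per lemma
    n_docs = n_distinct(doc_id)      # documents the lemma occurs in
  ) |>
  arrange(desc(freq))                # most frequent lemma first
```

Each line after the first adds exactly one verb, so individual steps can be commented out or inspected by running the pipeline up to that line.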

3 Tidyr (when needed)

pivot_longer() / pivot_wider() — reshape between long and wide.
separate() / unite() — split or combine columns.

[Add a short example if relevant for the course data.]
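
A short sketch of both pairs of functions on invented data (the `wide` frequency table and the `tagged_token` column are hypothetical). Recent tidyr versions mark `separate()` as superseded in favour of `separate_wider_delim()` and related functions, but it still works and is used here because it is the function named above.

```r
library(dplyr)
library(tidyr)

# Invented wide table: one row per lemma, one frequency column per document
wide <- tibble(
  lemma = c("house", "tree"),
  d1    = c(2L, 0L),
  d2    = c(1L, 1L)
)

# pivot_longer(): one row per lemma-document combination
long <- wide |>
  pivot_longer(cols = c(d1, d2), names_to = "doc_id", values_to = "freq")

# pivot_wider(): back to one column per document
wide_again <- long |>
  pivot_wider(names_from = doc_id, values_from = freq)

# Invented column of lemma_POS strings
tagged <- tibble(tagged_token = c("house_NOUN", "stand_VERB"))

# separate(): split one column into two at the underscore
split_cols <- tagged |>
  separate(tagged_token, into = c("lemma", "pos"), sep = "_")

# unite(): paste the two columns back together
rejoined <- split_cols |>
  unite("tagged_token", lemma, pos, sep = "_")
```

The long format is usually the one that `group_by()` and `summarize()` expect, since the grouping variable (`doc_id`) is then a column rather than a set of column names.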

4 Hands-on exercises

Filter and summarize a dataset (e.g. population, LOD examples, or word counts).
Build a pipeline from raw data to a summary table.

5 Summary and further reading

R for Data Science (R4DS): chapters on data transformation and tidy data.
R for Linguists — datasets and dplyr.