Session 2 — Data wrangling

Language Analytics in R

Author

Peter Gilles

0.1 Learning objectives

By the end of this session you will:

  • Use dplyr verbs to filter, select, mutate, arrange, and summarize data.
  • Chain operations with the pipe (|>).
  • Use group_by() for group-wise summaries.
  • Reshape and tidy data with tidyr (pivot_longer(), pivot_wider(), separate(), unite()) where relevant.
  • Apply the tidyverse workflow to language and corpus data.

1 Tidyverse and dplyr

  • The tidyverse: a set of packages for data manipulation and visualization.
  • dplyr verbs: each takes a data frame as its first argument and returns a new data frame; columns are referred to by bare name.

1.1 Filter and select

  • filter() — keep rows that satisfy a condition.
  • select() — keep or drop columns (with helpers like starts_with(), contains()).

[Add code chunks and examples using a linguistic or corpus dataset.]
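
As a minimal sketch of filter() and select(), the chunk below uses a small token table. The table tokens and all its values are invented for illustration and are not the course data.

```r
library(dplyr)

# Hypothetical toy token table; every value here is made up.
tokens <- tibble(
  doc_id  = c("d1", "d1", "d1", "d2", "d2", "d2"),
  token   = c("The", "cats", "sleep", "A", "dog", "barks"),
  lemma   = c("the", "cat", "sleep", "a", "dog", "bark"),
  pos     = c("DET", "NOUN", "VERB", "DET", "NOUN", "VERB"),
  n_chars = c(3L, 4L, 5L, 1L, 3L, 5L)
)

# filter(): keep only rows tagged as noun or verb
content_words <- tokens |>
  filter(pos %in% c("NOUN", "VERB"))

# Conditions separated by commas are combined with AND
long_nouns <- tokens |>
  filter(pos == "NOUN", n_chars > 3)

# select(): keep columns by name or with helpers
slim <- tokens |>
  select(doc_id, starts_with("tok"), contains("pos"))

# select() with a minus sign drops columns instead
no_length <- tokens |>
  select(-n_chars)
```

Neither verb modifies tokens itself; each call returns a new table, which is why the results are assigned to new names.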

1.2 Mutate and arrange

  • mutate() — add or modify columns (e.g. derived variables, recoding).
  • arrange() — sort rows (ascending/descending).

[Add code chunks and examples.]
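
One possible illustration of mutate() and arrange(), assuming a small word-frequency table. The table words, its frequencies, and the cut-off of three characters are hypothetical choices made only for this sketch.

```r
library(dplyr)

# Hypothetical word-frequency table; the counts are invented.
words <- tibble(
  word = c("house", "and", "language", "I", "Luxembourg"),
  freq = c(120L, 950L, 45L, 300L, 80L)
)

words_out <- words |>
  mutate(
    n_chars      = nchar(word),                              # derived variable
    rel_freq     = freq / sum(freq),                         # share of all tokens
    length_class = if_else(n_chars <= 3, "short", "long")    # recoding
  ) |>
  arrange(desc(freq))                                        # most frequent first

# Ascending order is the default; ties are broken by later columns
by_length <- words_out |>
  arrange(n_chars, word)
```

Columns created earlier in a mutate() call (here n_chars) can be used by later expressions in the same call.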

1.3 Summarize and group_by

  • summarize() — one row per group (or one row total).
  • group_by() — define groups for group-wise operations (e.g. by lemma, by document).

[Add code chunks: e.g. counts per category, means per group.]
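
Counts per category and means per group could look roughly like this. The token table is again a made-up stand-in, and the column names (doc_id, lemma, n_chars) are assumptions rather than the course data.

```r
library(dplyr)

# Hypothetical token table; values are invented for illustration.
tokens <- tibble(
  doc_id  = c("d1", "d1", "d1", "d2", "d2", "d2", "d2"),
  lemma   = c("cat", "sleep", "cat", "dog", "bark", "dog", "dog"),
  n_chars = c(4L, 5L, 3L, 3L, 5L, 4L, 3L)
)

# One row per document: token count, type count, mean token length
per_doc <- tokens |>
  group_by(doc_id) |>
  summarize(
    n_tokens   = n(),
    n_types    = n_distinct(lemma),
    mean_chars = mean(n_chars),
    .groups    = "drop"      # return an ungrouped table
  )

# Without group_by(), summarize() collapses the whole table to one row
overall <- tokens |>
  summarize(n_tokens = n(), mean_chars = mean(n_chars))

# count() is a shortcut for group_by() + summarize(n = n())
lemma_counts <- tokens |>
  count(doc_id, lemma, sort = TRUE)
```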

2 Pipes and chaining

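  • Pipe |>: pass the result of the left-hand side as the first argument of the call on the right, so x |> f(y) is f(x, y). The native pipe requires R 4.1 or later.
  • Style: one verb per line; end each line with |> and indent the lines that follow.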

[Add a full pipeline example: load → filter → group_by → summarize → arrange.]
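
A sketch of the full pipeline (load, filter, group_by, summarize, arrange) might read as follows. To keep it self-contained, the "load" step reads inline CSV text with base R's read.csv(); in practice this would presumably be a file path. The data and column names are hypothetical.

```r
library(dplyr)

# Stand-in for a file on disk; the contents are invented for illustration.
csv_text <- "doc_id,lemma,pos
d1,cat,NOUN
d1,sleep,VERB
d1,cat,NOUN
d1,the,DET
d2,dog,NOUN
d2,bark,VERB
d2,dog,NOUN
d2,cat,NOUN"

noun_counts <- read.csv(text = csv_text) |>   # load
  filter(pos == "NOUN") |>                    # keep nouns only
  group_by(lemma) |>                          # one group per lemma
  summarize(
    n      = n(),                             # tokens per lemma
    n_docs = n_distinct(doc_id)               # documents the lemma occurs in
  ) |>
  arrange(desc(n), lemma)                     # most frequent first

noun_counts
```

Written without the pipe, the same steps would nest as arrange(summarize(group_by(filter(...)))), which has to be read from the inside out; the piped version reads top to bottom in the order the steps run.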

3 Tidyr (when needed)

  • pivot_longer() / pivot_wider() — reshape between long and wide.
  • separate() / unite() — split or combine columns.

[Add a short example if relevant for the course data.]
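
A short tidyr sketch, assuming a wide lemma-by-document count table and a column of lemma_POS strings. Both tables and the underscore separator are hypothetical examples, not the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per lemma, one count column per document
wide <- tibble(
  lemma = c("cat", "dog"),
  d1    = c(2L, 0L),
  d2    = c(1L, 2L)
)

# Wide to long: one row per lemma-document pair
long <- wide |>
  pivot_longer(cols = c(d1, d2), names_to = "doc_id", values_to = "n")

# Long back to wide
back <- long |>
  pivot_wider(names_from = doc_id, values_from = n)

# Hypothetical tagged tokens in "lemma_POS" form
tagged <- tibble(tagged = c("cat_NOUN", "bark_VERB"))

# separate(): split one column into two at the underscore
parts <- tagged |>
  separate(tagged, into = c("lemma", "pos"), sep = "_")

# unite(): paste the two columns back together
joined <- parts |>
  unite("tagged", lemma, pos, sep = "_")
```

The long format is usually the one that group_by() and summarize() expect, so pivot_longer() often comes first in a pipeline.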

4 Hands-on exercises

  • Filter and summarize a dataset (e.g. population, LOD examples, or word counts).
  • Build a pipeline from raw data to a summary table.

5 Summary and further reading