Session 1 — Introduction: R workflow and first steps
Language Analytics in R
Author
Peter Gilles
0.1 Learning objectives
By the end of this session you will:
Use RStudio and run R code in the console and in a Quarto document.
Create and inspect objects with the assignment operator (<-).
Write clear names and comments, and apply basic code style.
Call functions with correct syntax and use the pipe (|>).
Run and render a minimal Quarto (.qmd) script.
Load and inspect text/language data (e.g. CSV) and understand basics of data import for corpora.
This course focuses on linguistic and language data in R. Examples and exercises use words, frequencies, and simple text data so you can quickly apply the same ideas to your own corpora and experiments.
Run code from the editor with Cmd+Enter (Mac) or Ctrl+Enter (Windows/Linux).
1.2 Getting help
Prefix a command with ?, e.g. ?mean, to open the help page for that command. Or use the Help pane in RStudio.
# Access help for a function by name:
?mean

# Alternatively, use the help() function:
help("mean")

# Search help topics with apropos() or help.search():
apropos("lin")  # Shows all objects/functions with "lin" in the name
help.search("linear model")  # Looks up help topics related to linear models

# View example usage for functions:
example(mean)
mean> x <- c(0:10, 50)
mean> xm <- mean(x)
mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
# See the vignette (long-form documentation) for a package:
# vignette("dplyr")
There are numerous R cheatsheets that summarise command usages in a structured way.
1.3 Packages
Install a package once with install.packages("package_name") or use the packages tab in RStudio. Load it every time you start R or open a document with library(package_name):
# Core packages for data import and manipulation
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
1.5 Coding basics
You can use R as a calculator (e.g. for reaction times, accuracy, or counts):
1 / 200 * 30
[1] 0.15
(59 + 73 + 2) / 3
[1] 44.66667
sin(pi / 2)
[1] 1
Create objects with assignment:
calc <- (59 + 73 + 2) / 3

# e.g. store a value like VOT (ms) or word length
vot_ms <- 25
word_length <- nchar("linguistics")
Every assignment has the form:
object_name <- value
Read it as: “object name gets value”.
Tip: In RStudio use Alt + - (minus) to insert <- with spaces.
The value is stored but not printed. Type the object name to inspect it:
vot_ms
[1] 25
word_length
[1] 11
Combine values into a vector with c() — like a column of observations (e.g. reaction times, word lengths, or a list of words):
# numeric vector: e.g. reaction times in ms
rt_ms <- c(320, 415, 380, 290, 410)

# character vector: e.g. word forms or conditions
words <- c("cat", "dog", "mouse")

# a rose is a rose — last expression is printed in the report
stein <- c("a", "rose", "is", "a", "rose", "is", "a", "rose")
print(stein)
[1] "a" "rose" "is" "a" "rose" "is" "a" "rose"
A character vector treats every token as an independent string. For categorical data (e.g. conditions or parts of speech), we can convert a character vector to a factor, which stores each unique value once as a level.
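A minimal sketch of the conversion, using the stein vector from above:

```r
# A character vector with repeated tokens
stein <- c("a", "rose", "is", "a", "rose", "is", "a", "rose")

# Convert to a factor: each unique token becomes a level
stein_fct <- factor(stein)
levels(stein_fct)  # the unique values: "a" "is" "rose"
table(stein_fct)   # token counts per level: a = 3, is = 2, rose = 3
```

table() on a factor gives a quick frequency count, which is often the first step in a word-frequency analysis.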
Vectors and factors hold a single column of values. For tabular data (rows and columns), R uses data frames. A tibble is a modern, tidyverse-style data frame: it prints nicely, doesn’t convert strings to factors by default, and works well with the pipe and dplyr.
2.1 Creating a tibble
Use tibble() to build a table from named vectors (each vector becomes a column). Columns can be referred to by name in later code.
library(tibble)

# Example: a small table (e.g. word, frequency, part of speech)
lex <- tibble(
  word = c("the", "cat", "sat", "on", "mat"),
  freq = c(100, 5, 3, 50, 2),
  pos  = c("det", "noun", "verb", "prep", "noun")
)
lex
# A tibble: 5 × 3
word freq pos
<chr> <dbl> <chr>
1 the 100 det
2 cat 5 noun
3 sat 3 verb
4 on 50 prep
5 mat 2 noun
With data.frame() you get a classic R data frame; tibbles are usually preferred in the tidyverse because of their print format and consistent behaviour.
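A quick way to see the difference (a sketch, assuming the tibble package is installed): both objects are data frames, but tibbles are stricter about column access.

```r
library(tibble)

df  <- data.frame(word = c("cat", "dog"), freq = c(5, 7))
tbl <- tibble(word = c("cat", "dog"), freq = c(5, 7))

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame" — a tibble is also a data frame

df$wo   # base data frames silently partial-match $wo to $word
tbl$wo  # tibbles warn about the unknown column and return NULL
```

The stricter behaviour of tibbles catches typos in column names early, which is one reason they are preferred in tidyverse code.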
2.2 Inspecting structure
Print: typing the name shows the first rows and column types (tibbles don’t flood the console).
glimpse() (dplyr): one row per column, with type and a few values.
nrow(), ncol(), names(): number of rows, number of columns, column names.
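Applied to the lex table from above, these helpers look like this (a self-contained sketch that rebuilds lex so the chunk runs on its own):

```r
library(tibble)
library(dplyr)

lex <- tibble(
  word = c("the", "cat", "sat", "on", "mat"),
  freq = c(100, 5, 3, 50, 2),
  pos  = c("det", "noun", "verb", "prep", "noun")
)

glimpse(lex)  # one row per column: name, type, first values
nrow(lex)     # 5 rows
ncol(lex)     # 3 columns
names(lex)    # "word" "freq" "pos"
```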
Use $ to get one column as a vector: lex$word, lex$freq. You can use that in expressions (e.g. mean(lex$freq)) or pass it to functions. Later you’ll use dplyr verbs (filter, select, mutate) to work with whole tables without pulling columns out by hand.
lex$word
[1] "the" "cat" "sat" "on" "mat"
mean(lex$freq)
[1] 32
3 Comments and names
3.1 Comments
R ignores everything after # on a line. Use comments to explain why you did something (e.g. why you chose a threshold or excluded a condition), not just what the code does.
# word list for lexical decision practice
words <- c("cat", "dog", "mouse")

# mean reaction time (ms)
mean(rt_ms)
[1] 363
3.2 Naming conventions
Names must start with a letter and can contain only letters, numbers, _, and . (dot).
Use snake_case: lowercase words separated by _ (e.g. word_list, response_time).
R is case-sensitive and does not fix typos: word_list and Word_list are different objects.
Tip: Type a prefix and press Tab for autocomplete; use ↑ or Cmd+↑ to recall previous commands.
Example: seq() creates sequences (e.g. for item IDs or conditions):
seq(from = 1, to = 10)
[1] 1 2 3 4 5 6 7 8 9 10
seq(1, 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
You can omit argument names for the first arguments. Use Tab after typing a function name to see arguments and help.
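A quick check that positional and named calls really do the same thing:

```r
# Positional arguments are matched in order; names make the call explicit
a <- seq(1, 10)
b <- seq(from = 1, to = 10)
identical(a, b)  # TRUE — both produce the integers 1 to 10
```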
Strings (text) need matched quotation marks — essential when working with words and corpora:
# quotation marks around strings
word <- "hello"

# Compare strings with == (e.g. for accuracy coding)
"linguistics" == "linguistics"
[1] TRUE
"linguistics"=="lingiustics"
[1] FALSE
If you see a + in the console, R is waiting for more input (often a missing " or )). Press Escape to cancel.
5 Code style
Good style makes code easier to read for you and others. We follow the tidyverse style guide.
5.1 Names (reminder)
Use lowercase, numbers, and _.
Prefer long, descriptive names (e.g. participant_response_time over prt).
5.2 Spaces
Put spaces around +, -, ==, <, etc. (except ^).
Put spaces around <-.
Put a space after every comma.
# Strive for: spaces around operators and after commas
n_char <- nchar("word")
mean(rt_ms, na.rm = TRUE)
[1] 363
# Avoid: n_char<-nchar("word") or mean(rt_ms,na.rm=TRUE)
5.3 Pipes
The pipe |> (or %>% from the magrittr package) passes the result on the left into the function on the right. Read it as “then”.
Put a space before |> and put |> at the end of the line.
Indent the next line by two spaces.
c("Méindeg", "Dënschdeg", "Mëttwoch") |>nchar()
[1] 7 9 8
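A longer pipeline following the same style rules, sketched with the lex table from section 2 (rebuilt here so the chunk is self-contained; assumes dplyr is loaded):

```r
library(tibble)
library(dplyr)

lex <- tibble(
  word = c("the", "cat", "sat", "on", "mat"),
  freq = c(100, 5, 3, 50, 2)
)

# Space before |>, pipe at the end of the line, next line indented two spaces
lex |>
  mutate(n_char = nchar(word)) |>
  arrange(desc(freq))
```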
We’ll use the pipe with language data in the next section.
5.4 Sectioning comments
Break long scripts into sections:
# Load data --------------------------------------

# Plot data --------------------------------------
In RStudio: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux) inserts a section header.
Tip: Install the styler package and use the command palette (Cmd+Shift+P) to format code automatically.
6 Scripting in Quarto
6.1 Why Quarto?
Combine text, code, and output in one document — ideal for reporting on corpus or experiment results.
Reproducible: re-run and re-render to update tables and figures.
Output to HTML, PDF, or other formats.
6.2 Running code in a .qmd file
Create an empty Quarto file in RStudio: File > New File > Quarto Document …
Code chunks are enclosed in ```{r} and ```.
Run one chunk: Cmd+Enter (Mac) or Ctrl+Enter (Windows/Linux).
Render the whole document: Cmd+Shift+K (or Ctrl+Shift+K) or click Render.
Example chunk options:
#| label: my-chunk
#| echo: true
#| eval: true
label: unique name for the chunk.
echo: whether to show the code in the output.
eval: whether to run the code.
7 Working with text data and corpora (data import)
For language analytics we usually work with tables (e.g. CSV or Excel) or text files. Here we focus on reading tabular data with readr and tidyverse tools. We will cover tidytext and quanteda in Session 3 (corpus analysis).
7.1 Reading a CSV file
Use read_csv() from the readr package (included in tidyverse) to read comma-separated files. Example: population by commune (Luxembourg), file rnrpp-population-commune.csv (same folder as this document, or adjust the path if needed).
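The import step itself can be sketched as follows (the path matches the file described above, and the column names are the ones used in the code below; adjust if your copy lives elsewhere):

```r
library(readr)
library(dplyr)

# Read the CSV (adjust the path if the file is elsewhere)
pop <- read_csv("rnrpp-population-commune.csv")

# Inspect the result: column names and types
glimpse(pop)
names(pop)  # expect e.g. COMMUNE_NOM, FEMMES_MINEURES, HOMMES_MINEURS, ...
```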
# Mean population per commune (total = sum of the four count columns)
mean(pop$FEMMES_MINEURES + pop$HOMMES_MINEURS + pop$FEMMES_MAJEURES + pop$HOMMES_MAJEURS)
[1] 6774.01
7.4 Aggregating the data
With dplyr you can aggregate by groups (e.g. totals, means). Example: add a total population per commune, then show the top 10 communes by total population; or summarize across all rows (e.g. national totals by category).
# Total population per commune (using the pipe)
pop |>
  mutate(
    total = FEMMES_MINEURES + HOMMES_MINEURS + FEMMES_MAJEURES + HOMMES_MAJEURS
  ) |>
  arrange(desc(total)) |>
  select(COMMUNE_NOM, total) |>
  head(10)
So: load with read_csv(), inspect with glimpse(), names(), head(), access columns with $, and aggregate with mutate(), summarise(), arrange(), and select() (and later group_by() for group-wise summaries).
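The summarise() step mentioned above can be sketched like this (assuming pop has been loaded as in section 7.1):

```r
library(dplyr)

# National totals per category: summarise() collapses all rows into one
pop |>
  summarise(
    femmes_mineures = sum(FEMMES_MINEURES),
    hommes_mineurs  = sum(HOMMES_MINEURS),
    femmes_majeures = sum(FEMMES_MAJEURES),
    hommes_majeurs  = sum(HOMMES_MAJEURS)
  )
```

With group_by() before summarise(), the same call would produce one row per group instead of one row overall.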
For text mining workflows with tidytext and quanteda, see Session 3.
8 Import LOD dictionary data (XML)
8.1 Getting the file
The LOD linguistic dataset is published as open data by the Luxembourg government (Zenter fir d’Lëtzebuerger Sprooch) under CC0 (“LOD – Linguistesch Daten (resource)”). Download the zip, extract it, and place the XML file (new_lod-art.xml, ~74 MB) in a folder in your project (e.g. data/). Below we read it from that folder and inspect its structure.
Install once: install.packages("xml2"). Then load with library(xml2).
8.2 Reading the XML file
Read the XML from the file path. The file is large, so the first parse may take a few seconds.
library(xml2)
Warning: package 'xml2' was built under R version 4.5.2
# Path to the XML file (e.g. in data/; adjust if your file is elsewhere)
lod_xml_path <- "new_lod-art.xml"

# Read XML from the folder
lod_xml <- read_xml(lod_xml_path)

# Root element
xml_name(lod_xml)
[1] "lod"
xml_children(lod_xml) |> length()  # number of top-level children (entries)
[1] 33737
8.3 Structure of the LOD XML
The document has a single root element <lod>, which contains many <entry> elements. Each entry represents one headword (lemma) and its dictionary information. The hierarchy is:
1. Entry level (<entry id="...">)
<lemma> — the headword (e.g. “A”).
<ipa> — IPA pronunciation (e.g. “aː”).
<microStructure> — grammatical and semantic information.
2. Inside <microStructure>
<partOfSpeech gen="..."> — part of speech (e.g. SUBST, with optional gender).
<inflection type="..."> — inflectional forms (e.g. plural), with <form nRuleForm="..."> for each form.
<translation> — equivalent in the target language.
<semanticClarifier> — short clarification (e.g. “Sehorgan” for “Auge”).
<examples count="..."> — example sentences:
<example> (optional id):
<text> — Luxembourgish example: <word> elements for each token, <inflectedHeadword> for the headword form used; optional <attribute> (e.g. EGS = example).
<gloss> (optional) — gloss/paraphrase, again with <word> elements.
So each entry → microStructure → grammaticalUnit → one or more meaning → translations in several languages and examples (with optional glosses).
8.4 How XPath works
XPath is a language for selecting nodes in an XML (or HTML) tree. In R you use it with xml2 via xml_find_first() (one node) and xml_find_all() (all matching nodes). The expression is a path that describes where to go in the tree.
XPath patterns at a glance:

nodename: direct child with that name (lemma → children of the current node named lemma).
/nodename: child from the root (/lod → root element; /lod/entry → every entry directly under lod).
//nodename: any descendant, at any depth (//entry → every entry in the document).
.//nodename: any descendant from the current node (inside an entry, .//lemma finds lemma anywhere under that entry).
*: any element, a wildcard (./* → all direct children of the current node).
[@attr='value']: filter by attribute (targetLanguage[@lang='de'] → targetLanguage whose lang is de).
path1/path2: path2 under path1, read as “then” (.//meaning/examples/example → example under examples under meaning).
Examples on the LOD:
//entry — every entry in the document (from root).
//entry/lemma — every lemma that is a direct child of an entry.
.//lemma — the lemma of the current entry (when you’re already on one entry).
.//meaning/targetLanguage[@lang='en']/translation — English translations under the current entry.
.//meaning/examples/example — all example nodes under the current entry.
The dot (.) is the current node; .// means “starting from here, any descendant”. Without the dot, // starts from the document root.
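These patterns can be tried on a tiny in-memory document (a hypothetical toy structure mirroring the LOD layout, so the chunk runs without the 74 MB file):

```r
library(xml2)

# A toy document mirroring the LOD layout
doc <- read_xml('
  <lod>
    <entry id="1">
      <lemma>Haus</lemma>
      <microStructure>
        <meaning>
          <targetLanguage lang="en"><translation>house</translation></targetLanguage>
          <targetLanguage lang="de"><translation>Haus</translation></targetLanguage>
        </meaning>
      </microStructure>
    </entry>
  </lod>')

# // searches from the root, at any depth
xml_find_all(doc, "//lemma") |> xml_text()  # "Haus"

# [@attr='value'] filters by attribute
xml_find_all(doc, "//targetLanguage[@lang='en']/translation") |> xml_text()  # "house"

# . is the current node; .// searches below it
entry <- xml_find_first(doc, "//entry")
xml_find_first(entry, ".//lemma") |> xml_text()  # "Haus"
```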
8.5 Inspecting one entry in R
To see the structure of a single entry and extract a few fields:
# First entry
first_entry <- xml_find_first(lod_xml, "//entry")

# Lemma and IPA
xml_find_first(first_entry, ".//lemma") |> xml_text()
xml_find_first(first_entry, ".//ipa") |> xml_text()
# Number of examples in the first meaning
xml_find_first(first_entry, ".//meaning/examples") |> xml_attr("count")
[1] "2+"
8.6 Exploring the examples section
Each meaning can contain an <examples> block with one or more <example> nodes. Each <example> has at least one <text> element whose children are <word> and <inflectedHeadword> (and optionally <attribute>), in order — that is the Luxembourgish example sentence. Below we collect several example sentences from random entries and put them into a tibble so you can work with them like any table (filter, count, join, etc.).
Helper: turn one <example> node into the sentence string (first <text> only):
# Sentence text from one <example>: first <text>, then concatenate
# child elements (word, inflectedHeadword, etc.)
get_sentence <- function(ex) {
  text_node <- xml_find_first(ex, ".//text")
  if (inherits(text_node, "xml_missing")) return(NA_character_)
  tokens <- xml_find_all(text_node, "./*")
  paste(xml_text(tokens), collapse = " ")
}
Collect examples and build a tibble: sample a number of entries, take up to two examples per entry, then combine lemma and sentence into a data frame.
library(tibble)

all_entries <- xml_find_all(lod_xml, "//entry")

set.seed(123)
n_entries <- 30
sampled <- sample(length(all_entries), min(n_entries, length(all_entries)))

rows <- list()
for (i in seq_along(sampled)) {
  entry <- all_entries[[sampled[i]]]
  lemma <- xml_find_first(entry, ".//lemma") |> xml_text()
  exs <- xml_find_all(entry, ".//meaning/examples/example")
  # Take at most 2 examples per entry
  exs <- exs[1:min(2, length(exs))]
  for (ex in exs) {
    sent <- get_sentence(ex)
    if (!is.na(sent) && nchar(sent) > 0) {
      rows[[length(rows) + 1]] <- list(lemma = lemma, sentence = sent)
    }
  }
}

# One row per example sentence
lod_examples <- tibble(
  lemma = vapply(rows, function(r) r$lemma, character(1)),
  sentence = vapply(rows, function(r) r$sentence, character(1))
)
lod_examples
# A tibble: 34 × 2
lemma sentence
<chr> <chr>
1 Belsch meng bescht Frëndin ass Belsch
2 Urbanistin d' Urbanistin schafft Mesuren aus, déi zur Berouegung vum Trafi…
3 befalen bei waarmem, fiichtem Wieder ginn d' Riewen nawell gär vun enge…
4 Finanzwiesen d' EU fuerdert méi eng streng Reglementatioun vum europäesche F…
5 Finanzwiesen als Ekonomistin hues de gutt Chancen, am Finanzwiesen eng Aarbe…
6 Virbau wéi mir d' Haus renovéiert hunn, hu mer eisen Hall duerch e Vir…
7 Interferenz fir Interferenzen tëschent den elektreschen Apparater ze vermei…
8 Interferenz bei Kanner, déi méisproocheg opwuessen, kënnt et dacks zu sproo…
9 Kärebrout eise Bäcker mécht wonnerbaart Kärebrout
10 Kärebrout kanns de nach zwee Kärebrout bei de Bäcker siche goen?
# ℹ 24 more rows
You can now use dplyr or tidyr on lod_examples (e.g. count sentences per lemma, filter by length, or join with other data). The same idea scales to all entries if you loop over every entry and every example and bind the rows.
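For instance, counting examples per lemma or filtering by sentence length (sketched here with a small stand-in tibble built from the output above, so the chunk runs without the XML file; on the real data, replace the stand-in with the lod_examples built above):

```r
library(dplyr)
library(tibble)

# Stand-in for lod_examples (the real one comes from the XML above)
lod_examples <- tibble(
  lemma    = c("Kärebrout", "Kärebrout", "Belsch"),
  sentence = c("eise Bäcker mécht wonnerbaart Kärebrout",
               "kanns de nach zwee Kärebrout bei de Bäcker siche goen?",
               "meng bescht Frëndin ass Belsch")
)

# Example sentences per lemma, most frequent first
lod_examples |> count(lemma, sort = TRUE)

# Keep only longer sentences (more than 30 characters)
lod_examples |> filter(nchar(sentence) > 30)
```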
This structure lets you extract lemmas, translations, parts of speech, inflections, and example sentences (with glosses) for lexical and corpus-oriented analyses in R.
9 Hands-on exercises
Typo. Why does this fail? (Look very carefully at the object name.)
my_variable <- 10
my_varıable
Vectors and strings. In a new code chunk:
Create a character vector test_words with at least three words (e.g. c("word1", "word2", "word3")).
Use length(test_words) and nchar(test_words) and interpret the result.
Comparisons. Run these and say what they return (TRUE/FALSE) and why:
"dog"=="dog""dog"!="cat"nchar("linguistics") >5
Data import and inspection. Create a short CSV string with columns word and count (e.g. three rows). Read it with read_csv(), then use glimpse() and $ to inspect the count column and compute mean(count).
Style. Restyle this into a clear, multi-line pipe (spaces, line breaks, indentation):