Session 1 — Introduction: R workflow and first steps

Language Analytics in R

Author

Peter Gilles

0.1 Learning objectives

By the end of this session you will:

  • Use RStudio and run R code in the console and in a Quarto document.
  • Create and inspect objects with the assignment operator (<-).
  • Write clear names and comments, and apply basic code style.
  • Call functions with correct syntax and use the pipe (|>).
  • Run and render a minimal Quarto (.qmd) script.
  • Load and inspect text/language data (e.g. CSV) and understand basics of data import for corpora.

This course focuses on linguistic and language data in R. Examples and exercises use words, frequencies, and simple text data so you can quickly apply the same ideas to your own corpora and experiments.

The material draws on R for Data Science (2e) (chapters 2–4), R for Linguists, and other linguistics-oriented R resources (see Further reading).

1 RStudio and running R code

1.1 The RStudio interface

  • Console: type R code and see results.
  • Editor: write and save scripts (e.g. .qmd or .R).
  • Environment: view objects you have created (datasets, variables).
  • Files / Plots / Help: navigate files, view plots, read help.

Run code from the editor with Cmd+Enter (Mac) or Ctrl+Enter (Windows/Linux).

1.2 Getting help

Prefix a command with ?, e.g. ?mean, to open the help page for that function, or use the Help pane in RStudio.

# Access help for a function by name:
?mean

# Alternatively, use the help() function:
help("mean")

# Search help topics with apropos() or help.search():
apropos("lin")  # Shows all objects/functions with "lin" in the name
 [1] ".linearGradientPattern"           ".QuartoInlineRender"             
 [3] ".tilingPattern"                   "[.DLLInfoList"                   
 [5] "$.DLLInfo"                        "abline"                          
 [7] "ceiling"                          "contourLines"                    
 [9] "file.link"                        "file.symlink"                    
[11] "findLineNum"                      "getCallingDLL"                   
[13] "getCallingDLLe"                   "getDLLRegisteredRoutines.DLLInfo"
[15] "getNativeSymbolInfo"              "getSrcLines"                     
[17] "globalCallingHandlers"            "line"                            
[19] "lines"                            "lines.default"                   
[21] "loglin"                           "make.link"                       
[23] "matlines"                         "print.DLLInfo"                   
[25] "print.DLLInfoList"                "qqline"                          
[27] "readline"                         "readLines"                       
[29] "smooth.spline"                    "spline"                          
[31] "splinefun"                        "splinefunH"                      
[33] "Sys.readlink"                     "unlink"                          
[35] "withCallingHandlers"              "writeLines"                      
[37] "xspline"                         
help.search("linear model")  # Looks up help topics related to linear models

# View example usage for functions:
example(mean)

mean> x <- c(0:10, 50)

mean> xm <- mean(x)

mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
# See the vignette (long-form documentation) for a package:
# vignette("dplyr")

There are numerous R cheatsheets that summarise commands and their usage in a structured way.

1.3 Packages

Install a package once with install.packages("package_name") or via the Packages tab in RStudio. Load it every time you start R or open a document with library(package_name):

# Core packages for data import and manipulation
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2

1.4 Coding basics

You can use R as a calculator (e.g. for reaction times, accuracy, or counts):

1 / 200 * 30
[1] 0.15
(59 + 73 + 2) / 3
[1] 44.66667
sin(pi / 2)
[1] 1

Create objects with assignment:

calc <- (59 + 73 + 2) / 3

# e.g. store a value like VOT (ms) or word length
vot_ms <- 25
word_length <- nchar("linguistics")

Every assignment has the form:

object_name <- value

Read it as: “object name gets value”.

Tip: In RStudio use Alt + - (minus) to insert <- with spaces.

The value is stored but not printed. Type the object name to inspect it:

vot_ms
[1] 25
word_length
[1] 11

Combine values into a vector with c() — like a column of observations (e.g. reaction times, word lengths, or a list of words):

# numeric vector: e.g. reaction times in ms
rt_ms <- c(320, 415, 380, 290, 410)

# character vector: e.g. word forms or conditions
words <- c("cat", "dog", "mouse")

# a rose is a rose — last expression is printed in the report
stein <- c("a", "rose", "is", "a", "rose", "is", "a", "rose")
print(stein)
[1] "a"    "rose" "is"   "a"    "rose" "is"   "a"    "rose"

A character vector treats every token as an independent string. To work with the distinct types (and count them), we can convert the vector to a factor, which stores the unique values as levels.

stein.fac <- as.factor(c("a", "rose", "is", "a", "rose", "is", "a", "rose"))
print(stein.fac)
[1] a    rose is   a    rose is   a    rose
Levels: a is rose

Count the tokens per level with table():

table(stein.fac)
stein.fac
   a   is rose 
   3    2    3 
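Relative frequencies follow directly from such a table. A minimal sketch with base R's prop.table(), re-creating the factor so the snippet is self-contained:

```r
# Token frequencies for the Stein line
stein.fac <- as.factor(c("a", "rose", "is", "a", "rose", "is", "a", "rose"))
freq <- table(stein.fac)

# Relative frequencies: each count divided by the total number of tokens (8)
prop.table(freq)
```

Here a and rose each make up 3/8 = 0.375 of the tokens, and is 2/8 = 0.25.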

2 Tibbles and data frames

Vectors and factors hold a single column of values. For tabular data (rows and columns), R uses data frames. A tibble is a modern, tidyverse-style data frame: it prints nicely, doesn’t convert strings to factors by default, and works well with the pipe and dplyr.

2.1 Creating a tibble

Use tibble() to build a table from named vectors (each vector becomes a column). Columns can be referred to by name in later code.

library(tibble)

# Example: a small table (e.g. word, frequency, part of speech)
lex <- tibble(
  word = c("the", "cat", "sat", "on", "mat"),
  freq = c(100, 5, 3, 50, 2),
  pos  = c("det", "noun", "verb", "prep", "noun")
)
lex
# A tibble: 5 × 3
  word   freq pos  
  <chr> <dbl> <chr>
1 the     100 det  
2 cat       5 noun 
3 sat       3 verb 
4 on       50 prep 
5 mat       2 noun 

With data.frame() you get a classic R data frame; tibbles are usually preferred in the tidyverse because of their print format and consistent behaviour.
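A quick sketch of the relationship: as_tibble() converts a classic data frame into a tibble without changing the data (the tibble package must be loaded).

```r
library(tibble)

# A classic data frame ...
df <- data.frame(word = c("a", "rose"), freq = c(3, 2))

# ... converted to a tibble: same columns, tibble printing and behaviour
tbl <- as_tibble(df)
class(tbl)  # a tibble is still a data frame underneath
```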

2.2 Inspecting structure

  • Print: typing the name shows the first rows and column types (tibbles don’t flood the console).
  • glimpse() (dplyr): one row per column, with type and a few values.
  • nrow(), ncol(), names(): number of rows, number of columns, column names.
glimpse(lex)
Rows: 5
Columns: 3
$ word <chr> "the", "cat", "sat", "on", "mat"
$ freq <dbl> 100, 5, 3, 50, 2
$ pos  <chr> "det", "noun", "verb", "prep", "noun"
nrow(lex)
[1] 5
names(lex)
[1] "word" "freq" "pos" 

2.3 Accessing columns

Use $ to get one column as a vector: lex$word, lex$freq. You can use that in expressions (e.g. mean(lex$freq)) or pass it to functions. Later you’ll use dplyr verbs (filter, select, mutate) to work with whole tables without pulling columns out by hand.

lex$word
[1] "the" "cat" "sat" "on"  "mat"
mean(lex$freq)
[1] 32
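As a small preview of those verbs, a sketch on the lex table from above (assuming dplyr is loaded, e.g. via library(tidyverse)):

```r
library(dplyr)
library(tibble)

lex <- tibble(
  word = c("the", "cat", "sat", "on", "mat"),
  freq = c(100, 5, 3, 50, 2),
  pos  = c("det", "noun", "verb", "prep", "noun")
)

# Keep only nouns, add a log-frequency column, show two columns
lex |>
  filter(pos == "noun") |>
  mutate(log_freq = log10(freq)) |>
  select(word, log_freq)
```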

3 Comments and names

3.1 Comments

R ignores everything after # on a line. Use comments to explain why you did something (e.g. why you chose a threshold or excluded a condition), not just what the code does.

# word list for lexical decision practice
words <- c("cat", "dog", "mouse")

# mean reaction time (ms)
mean(rt_ms)
[1] 363

3.2 Naming conventions

  • Names must start with a letter and contain only letters, numbers, _, and . (dot).
  • Use snake_case: lowercase words separated by _ (e.g. word_list, response_time).
response_time_ms <- 320
word_list <- c("the", "cat", "sat")
# Avoid: responseTimeMs, Word.List

R is case-sensitive and does not fix typos: word_list and Word_list are different objects.
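You can check whether a given name actually exists in your environment with exists():

```r
word_list <- c("the", "cat", "sat")

exists("word_list")  # TRUE: this object was created above
exists("Word_list")  # FALSE: capital W, a different (nonexistent) name
```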

Tip: Type a prefix and press Tab for autocomplete; use ↑ (or Cmd+↑) in the console to recall previous commands.

4 Calling functions

Functions are called like this:

function_name(argument1 = value1, argument2 = value2, ...)

Example: seq() creates sequences (e.g. for item IDs or conditions):

seq(from = 1, to = 10)
 [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

You can omit argument names for the first few arguments; they are then matched by position. Press Tab after typing a function name to see its arguments and help.
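A short illustration of positional versus named matching. Named arguments may appear in any order:

```r
# Positional: matched as from = 1, to = 10
seq(1, 10)

# Named: order is irrelevant
seq(to = 10, from = 1)

identical(seq(1, 10), seq(to = 10, from = 1))  # TRUE
```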

Strings (text) need matched quotation marks — essential when working with words and corpora:

# quotation marks around strings
word <- "hello"

# Compare strings with == (e.g. for accuracy coding)
"linguistics" == "linguistics"
[1] TRUE
"linguistics" == "lingiustics"
[1] FALSE

If you see a + in the console, R is waiting for more input (often a missing " or )). Press Escape to cancel.

5 Code style

Good style makes code easier to read for you and others. We follow the tidyverse style guide.

5.1 Names (reminder)

  • Use lowercase, numbers, and _.
  • Prefer long, descriptive names (e.g. participant_response_time over prt).

5.2 Spaces

  • Put spaces around +, -, ==, <, etc. (except ^).
  • Put spaces around <-.
  • Put a space after every comma.
# Strive for: spaces around operators and after comma
n_char <- nchar("word")
mean(rt_ms, na.rm = TRUE)
[1] 363
# Avoid: n_char<-nchar("word") or mean(rt_ms,na.rm=TRUE)

5.3 Pipes

The pipe |> (or %>% from the magrittr package) passes the result on its left into the function on its right. Read it as “then”.

  • Put a space before |> and put |> at the end of the line.
  • Indent the next line by two spaces.
c("Méindeg", "Dënschdeg", "Mëttwoch") |>
  nchar()
[1] 7 9 8

We’ll use the pipe with language data in the next section.

5.4 Sectioning comments

Break long scripts into sections:

# Load data --------------------------------------

# Plot data --------------------------------------

In RStudio: Cmd+Shift+R (Mac) or Ctrl+Shift+R (Windows/Linux) inserts a section header.

Tip: Install the styler package and use the command palette (Cmd+Shift+P) to format code automatically.

6 Scripting in Quarto

6.1 Why Quarto?

  • Combine text, code, and output in one document — ideal for reporting on corpus or experiment results.
  • Reproducible: re-run and re-render to update tables and figures.
  • Output to HTML, PDF, or other formats.

6.2 Running code in a .qmd file

  • Create an empty Quarto file in RStudio: File > New File > Quarto Document …
  • Code chunks are enclosed in ```{r} and ```.
  • Run one chunk: Cmd+Enter (Mac) or Ctrl+Enter (Windows/Linux).
  • Render the whole document: Cmd+Shift+K (or Ctrl+Shift+K) or click Render.

Example chunk options:

#| label: my-chunk
#| echo: true
#| eval: true
  • label: unique name for the chunk.
  • echo: whether to show the code in the output.
  • eval: whether to run the code.
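Putting these options together, a minimal .qmd file might look like this (a sketch; the title and chunk contents are placeholders to adapt):

````markdown
---
title: "My first analysis"
format: html
---

Some introductory text about the data.

```{r}
#| label: word-lengths
#| echo: true
#| eval: true
words <- c("cat", "dog", "mouse")
nchar(words)
```
````

Render it with Cmd+Shift+K (or Ctrl+Shift+K) to get an HTML page containing the text, the code, and its output.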

7 Working with text data and corpora (data import)

For language analytics we usually work with tables (e.g. CSV or Excel) or text files. Here we focus on reading tabular data with readr and tidyverse tools. We will cover tidytext and quanteda in Session 3 (corpus analysis).

7.1 Reading a CSV file

Use read_csv() from the readr package (included in the tidyverse) to read comma-separated files. Example: population by commune (Luxembourg), file rnrpp-population-commune.csv (in the same folder as this document; adjust the path if needed).

pop <- read_csv("rnrpp-population-commune.csv")
pop

# A tibble: 102 × 6
   COMMUNE_CODE COMMUNE_NOM FEMMES_MINEURES HOMMES_MINEURS FEMMES_MAJEURES
   <chr>        <chr>                 <dbl>          <dbl>           <dbl>
 1 1101         Beaufort                312            306            1250
 2 1102         Bech                    104            134             557
 3 0802         Beckerich               271            287            1178
 4 1103         Berdorf                 216            244             906
 5 0401         Bertrange               839            953            3899
 6 0301         Bettembourg            1091           1167            4821
 7 0702         Bettendorf              269            293            1203
 8 1201         Betzdorf                394            487            1663
 9 0502         Bissen                  337            314            1462
10 1202         Biwer                   194            194             727
# ℹ 92 more rows
# ℹ 1 more variable: HOMMES_MAJEURS <dbl>

7.2 Inspecting the dataset

  • Print the data frame: type its name.
  • View(pop): opens a spreadsheet-like viewer in RStudio.
  • glimpse(): compact view of columns and types.
glimpse(pop)
Rows: 102
Columns: 6
$ COMMUNE_CODE    <chr> "1101", "1102", "0802", "1103", "0401", "0301", "0702"…
$ COMMUNE_NOM     <chr> "Beaufort", "Bech", "Beckerich", "Berdorf", "Bertrange…
$ FEMMES_MINEURES <dbl> 312, 104, 271, 216, 839, 1091, 269, 394, 337, 194, 160…
$ HOMMES_MINEURS  <dbl> 306, 134, 287, 244, 953, 1167, 293, 487, 314, 194, 174…
$ FEMMES_MAJEURES <dbl> 1250, 557, 1178, 906, 3899, 4821, 1203, 1663, 1462, 72…
$ HOMMES_MAJEURS  <dbl> 1234, 575, 1177, 927, 3798, 4573, 1286, 1587, 1450, 79…
nrow(pop)
[1] 102
ncol(pop)
[1] 6
names(pop)
[1] "COMMUNE_CODE"    "COMMUNE_NOM"     "FEMMES_MINEURES" "HOMMES_MINEURS" 
[5] "FEMMES_MAJEURES" "HOMMES_MAJEURS" 
head(pop, 5)
# A tibble: 5 × 6
  COMMUNE_CODE COMMUNE_NOM FEMMES_MINEURES HOMMES_MINEURS FEMMES_MAJEURES
  <chr>        <chr>                 <dbl>          <dbl>           <dbl>
1 1101         Beaufort                312            306            1250
2 1102         Bech                    104            134             557
3 0802         Beckerich               271            287            1178
4 1103         Berdorf                 216            244             906
5 0401         Bertrange               839            953            3899
# ℹ 1 more variable: HOMMES_MAJEURS <dbl>

7.3 Accessing columns

Use $ to get one column as a vector (e.g. to compute totals or means):

pop$COMMUNE_NOM
  [1] "Beaufort"                "Bech"                   
  [3] "Beckerich"               "Berdorf"                
  [5] "Bertrange"               "Bettembourg"            
  [7] "Bettendorf"              "Betzdorf"               
  [9] "Bissen"                  "Biwer"                  
 [11] "Boulaide"                "Bourscheid"             
 [13] "Bous"                    "Bous-Waldbredimus"      
 [15] "Clervaux"                "Colmar-Berg"            
 [17] "Consdorf"                "Contern"                
 [19] "Dalheim"                 "Diekirch"               
 [21] "Differdange"             "Dippach"                
 [23] "Dudelange"               "Echternach"             
 [25] "Ell"                     "Erpeldange-sur-Sûre"    
 [27] "Esch-sur-Alzette"        "Esch-sur-Sûre"          
 [29] "Ettelbruck"              "Feulen"                 
 [31] "Fischbach"               "Flaxweiler"             
 [33] "Frisange"                "Garnich"                
 [35] "Goesdorf"                "Grevenmacher"           
 [37] "Groussbus-Wal"           "Habscht"                
 [39] "Heffingen"               "Helperknapp"            
 [41] "Hesperange"              "Junglinster"            
 [43] "Käerjeng"                "Kayl"                   
 [45] "Kehlen"                  "Kiischpelt"             
 [47] "Koerich"                 "Kopstal"                
 [49] "Lac de la Haute-Sûre"    "Larochette"             
 [51] "Lenningen"               "Leudelange"             
 [53] "Lintgen"                 "Lorentzweiler"          
 [55] "Luxembourg"              "Mamer"                  
 [57] "Manternach"              "Mersch"                 
 [59] "Mertert"                 "Mertzig"                
 [61] "Mondercange"             "Mondorf-les-Bains"      
 [63] "Niederanven"             "Nommern"                
 [65] "Pétange"                 "Parc Hosingen"          
 [67] "Préizerdaul"             "Putscheid"              
 [69] "Rambrouch"               "Reckange-sur-Mess"      
 [71] "Redange/Attert"          "Reisdorf"               
 [73] "Remich"                  "Roeser"                 
 [75] "Rosport-Mompach"         "Rumelange"              
 [77] "Saeul"                   "Sandweiler"             
 [79] "Sanem"                   "Schengen"               
 [81] "Schieren"                "Schifflange"            
 [83] "Schuttrange"             "Stadtbredimus"          
 [85] "Steinfort"               "Steinsel"               
 [87] "Strassen"                "Tandel"                 
 [89] "Troisvierges"            "Useldange"              
 [91] "Vallée de l'Ernz"        "Vianden"                
 [93] "Vichten"                 "Waldbillig"             
 [95] "Waldbredimus"            "Walferdange"            
 [97] "Weiler-la-Tour"          "Weiswampach"            
 [99] "Wiltz"                   "Wincrange"              
[101] "Winseler"                "Wormeldange"            
pop$FEMMES_MAJEURES
  [1]  1250   557  1178   906  3899  4821  1203  1663  1462   727   635   687
 [13]     0  1300  2546  1013   827  2000   951  3111 12393  1973  9274  2567
 [25]   678  1117 15436  1330  4186   965   519   880  2040   960   662  2210
 [37]   938  2114   623  2052  7066  3690  4791  4116  2969   524  1093  1908
 [49]   874   911   889  1133  1399  1952 56612  4620   942  4330  2332   961
 [61]  3034  2431  2992   604  8540  1686   744   462  1969  1214  1348   557
 [73]  1765  2923  1521  2299   422  1572  8019  2135   896  4827  1763   831
 [85]  2541  2580  4504   925  1429   849  1162   923   573   782     2  3864
 [97]  1012  1042  3269  1942   623  1337
# Mean population per commune (total = sum of the four count columns)
mean(pop$FEMMES_MINEURES + pop$HOMMES_MINEURS + pop$FEMMES_MAJEURES + pop$HOMMES_MAJEURS)
[1] 6774.01

7.4 Aggregating the data

With dplyr you can aggregate by groups (e.g. totals, means). Example: add a total population per commune, then show the top 10 communes by total population; or summarize across all rows (e.g. national totals by category).

# Total population per commune (using the pipe)
pop |>
  mutate(
    total = FEMMES_MINEURES + HOMMES_MINEURS + FEMMES_MAJEURES + HOMMES_MAJEURS
  ) |>
  arrange(desc(total)) |>
  select(COMMUNE_NOM, total) |>
  head(10)
# A tibble: 10 × 2
   COMMUNE_NOM         total
   <chr>               <dbl>
 1 "Luxembourg"       137663
 2 "Esch-sur-Alzette"  38269
 3 "Differdange"       31420
 4 "Dudelange"         22377
 5 "Pétange"           21290
 6 "Sanem"             19609
 7 "Hesperange"        17260
 8 "Schifflange"       11678
 9 "Bettembourg"       11652
10 "Mamer"             11582
# One-row summary: national totals by category (sum across all communes)
pop |>
  summarise(
    femmes_mineures = sum(FEMMES_MINEURES),
    hommes_mineurs  = sum(HOMMES_MINEURS),
    femmes_majeures = sum(FEMMES_MAJEURES),
    hommes_majeurs  = sum(HOMMES_MAJEURS),
    total          = sum(FEMMES_MINEURES + HOMMES_MINEURS + FEMMES_MAJEURES + HOMMES_MAJEURS)
  )
# A tibble: 1 × 5
  femmes_mineures hommes_mineurs femmes_majeures hommes_majeurs  total
            <dbl>          <dbl>           <dbl>          <dbl>  <dbl>
1           63118          66173          280678         280980 690949

So: load with read_csv(), inspect with glimpse(), names(), head(), access columns with $, and aggregate with mutate(), summarise(), arrange(), and select() (and later group_by() for group-wise summaries).
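group_by() can already be sketched with the small lex table from Section 2 (assuming dplyr is loaded):

```r
library(dplyr)
library(tibble)

lex <- tibble(
  word = c("the", "cat", "sat", "on", "mat"),
  freq = c(100, 5, 3, 50, 2),
  pos  = c("det", "noun", "verb", "prep", "noun")
)

# Group-wise summary: mean frequency and number of words per part of speech
lex |>
  group_by(pos) |>
  summarise(mean_freq = mean(freq), n = n())
```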

For text mining workflows with tidytext and quanteda, see Session 3.

8 Import LOD dictionary data (XML)

8.1 Getting the file

The LOD linguistic dataset is published as open data by the Luxembourg government (Zenter fir d’Lëtzebuerger Sprooch) under a CC0 licence (“LOD – Linguistesch Daten (Resource)”). Download the zip archive, extract it, and place the XML file (new_lod-art.xml, ~74 MB) in a folder in your project (e.g. data/). Below we read it from that folder and inspect its structure.

Install once: install.packages("xml2"). Then load with library(xml2).

8.2 Reading the XML file

Read the XML from the file path. The file is large, so the first parse may take a few seconds.

library(xml2)
Warning: package 'xml2' was built under R version 4.5.2
# Path to the XML file (e.g. in data/; adjust if your file is elsewhere)
lod_xml_path <- "new_lod-art.xml"

# Read XML from the folder
lod_xml <- read_xml(lod_xml_path)

# Root element
xml_name(lod_xml)
[1] "lod"
xml_children(lod_xml) |> length()   # number of top-level children (entries)
[1] 33737

8.3 Structure of the LOD XML

The document has a single root element <lod>, which contains many <entry> elements. Each entry represents one headword (lemma) and its dictionary information. The hierarchy is:

1. Entry level (<entry id="...">)

  • <lemma> — the headword (e.g. “A”).
  • <ipa> — IPA pronunciation (e.g. “aː”).
  • <microStructure> — grammatical and semantic information.

2. Inside <microStructure>

  • <partOfSpeech gen="..."> — part of speech (e.g. SUBST, with optional gender).
  • <inflection type="..."> — inflectional forms (e.g. plural), with <form nRuleForm="..."> for each form.
  • <grammaticalUnit> — groups one or more meanings.

3. Meaning level (<meaning id="..." video="...">)

  • <number> — sense number.
  • <targetLanguage lang="de|fr|en|pt|nl"> — translations:
    • <translation> — equivalent in the target language.
    • <semanticClarifier> — short clarification (e.g. “Sehorgan” for “Auge”).
  • <examples count="..."> — example sentences:
    • <example> (optional id):
      • <text> — Luxembourgish example: <word> elements for each token, <inflectedHeadword> for the headword form used; optional <attribute> (e.g. EGS = example).
      • <gloss> (optional) — gloss/paraphrase, again with <word> elements.

So each entry → microStructure → grammaticalUnit → one or more meaning elements → translations in several languages and examples (with optional glosses).

8.4 How XPath works

XPath is a language for selecting nodes in an XML (or HTML) tree. In R you use it with xml2 via xml_find_first() (one node) and xml_find_all() (all matching nodes). The expression is a path that describes where to go in the tree.

  • nodename: direct child with that name. Example: lemma selects the children of the current node named lemma.
  • /nodename: child starting from the root. Example: /lod is the root element; /lod/entry is every entry directly under lod.
  • //nodename: any descendant, at any depth. Example: //entry is every entry in the document.
  • .//nodename: any descendant of the current node. Example: inside an entry, .//lemma finds a lemma anywhere under that entry.
  • *: any element (wildcard). Example: ./* selects all direct children of the current node.
  • [@attr='value']: filter by attribute. Example: targetLanguage[@lang='de'] selects targetLanguage elements whose lang is de.
  • path1/path2: “then”, path2 under path1. Example: .//meaning/examples/example selects example under examples under meaning.

Examples on the LOD:

  • //entry — every entry in the document (from root).
  • //entry/lemma — every lemma that is a direct child of an entry.
  • .//lemma — the lemma of the current entry (when you’re already on one entry).
  • .//meaning/targetLanguage[@lang='en']/translation — English translations under the current entry.
  • .//meaning/examples/example — all example nodes under the current entry.

The dot (.) is the current node; .// means “starting from here, any descendant”. Without the dot, // starts from the document root.
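You can try these expressions on a tiny inline document before tackling the full LOD file. The two-entry XML below is invented for illustration; it only mimics the LOD structure:

```r
library(xml2)

# Miniature document mimicking the LOD structure (invented data)
mini <- read_xml('
<lod>
  <entry id="1">
    <lemma>Rous</lemma>
    <microStructure>
      <grammaticalUnit>
        <meaning>
          <targetLanguage lang="en"><translation>rose</translation></targetLanguage>
        </meaning>
      </grammaticalUnit>
    </microStructure>
  </entry>
  <entry id="2"><lemma>Kaz</lemma></entry>
</lod>')

# //entry: every entry, from the root
length(xml_find_all(mini, "//entry"))                                     # 2

# //entry/lemma: every lemma directly under an entry
xml_text(xml_find_all(mini, "//entry/lemma"))                             # "Rous" "Kaz"

# Attribute filter: only English translations
xml_text(xml_find_all(mini, "//targetLanguage[@lang='en']/translation"))  # "rose"
```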

8.5 Inspecting one entry in R

To see the structure of a single entry and extract a few fields:

# First entry
first_entry <- xml_find_first(lod_xml, "//entry")

# Lemma and IPA
xml_find_first(first_entry, ".//lemma") |> xml_text()
[1] "A"
xml_find_first(first_entry, ".//ipa")    |> xml_text()
[1] "aː"
# Part of speech
xml_find_first(first_entry, ".//partOfSpeech") |> xml_text()
[1] "SUBST"
xml_find_first(first_entry, ".//partOfSpeech") |> xml_attr("gen")
[1] "N"
# First meaning: translations (e.g. German, English)
xml_find_all(first_entry, ".//meaning/targetLanguage[@lang='de']/translation") |> xml_text()
 [1] "Auge"             "blaues Auge"      "Veilchen"         "Glasauge"        
 [5] "vor Augen führen" "Auge"             "Fettauge"         "Auge"            
 [9] "Keim"             "Knospe"           "Auge"            
xml_find_all(first_entry, ".//meaning/targetLanguage[@lang='en']/translation") |> xml_text()
 [1] "eye"               "black eye"         "glass eye"        
 [4] "ocular prosthetic" "to make aware of"  "to show"          
 [7] "to demonstrate"    "fat globule"       "eye"              
[10] "bud"               "pip"               "spot"             
# Number of examples in the first meaning
xml_find_first(first_entry, ".//meaning/examples") |> xml_attr("count")
[1] "2+"

8.6 Exploring the examples section

Each meaning can contain an <examples> block with one or more <example> nodes. Each <example> has at least one <text> element whose children are <word> and <inflectedHeadword> (and optionally <attribute>), in order — that is the Luxembourgish example sentence. Below we collect several example sentences from random entries and put them into a tibble so you can work with them like any table (filter, count, join, etc.).

Helper: turn one <example> node into the sentence string (first <text> only):

# Sentence text from one <example>: first <text>, then concatenate child elements (word, inflectedHeadword, etc.)
get_sentence <- function(ex) {
  text_node <- xml_find_first(ex, ".//text")
  if (inherits(text_node, "xml_missing")) return(NA_character_)
  tokens <- xml_find_all(text_node, "./*")
  paste(xml_text(tokens), collapse = " ")
}

Collect examples and build a tibble: sample a number of entries, take up to two examples per entry, then combine lemma and sentence into a data frame.

library(tibble)

all_entries <- xml_find_all(lod_xml, "//entry")
set.seed(123)
n_entries  <- 30
sampled    <- sample(length(all_entries), min(n_entries, length(all_entries)))
rows       <- list()

for (i in seq_along(sampled)) {
  entry  <- all_entries[[sampled[i]]]
  lemma  <- xml_find_first(entry, ".//lemma") |> xml_text()
  exs    <- xml_find_all(entry, ".//meaning/examples/example")
  # Take at most 2 examples per entry
  exs    <- exs[seq_len(min(2, length(exs)))]  # seq_len() is safe when an entry has no examples
  for (ex in exs) {
    sent <- get_sentence(ex)
    if (!is.na(sent) && nchar(sent) > 0)
      rows[[length(rows) + 1]] <- list(lemma = lemma, sentence = sent)
  }
}

# One row per example sentence
lod_examples <- tibble(
  lemma    = vapply(rows, function(r) r$lemma,    character(1)),
  sentence = vapply(rows, function(r) r$sentence, character(1))
)

lod_examples
# A tibble: 34 × 2
   lemma        sentence                                                        
   <chr>        <chr>                                                           
 1 Belsch       meng bescht Frëndin ass Belsch                                  
 2 Urbanistin   d' Urbanistin schafft Mesuren aus, déi zur Berouegung vum Trafi…
 3 befalen      bei waarmem, fiichtem Wieder ginn d' Riewen nawell gär vun enge…
 4 Finanzwiesen d' EU fuerdert méi eng streng Reglementatioun vum europäesche F…
 5 Finanzwiesen als Ekonomistin hues de gutt Chancen, am Finanzwiesen eng Aarbe…
 6 Virbau       wéi mir d' Haus renovéiert hunn, hu mer eisen Hall duerch e Vir…
 7 Interferenz  fir Interferenzen tëschent den elektreschen Apparater ze vermei…
 8 Interferenz  bei Kanner, déi méisproocheg opwuessen, kënnt et dacks zu sproo…
 9 Kärebrout    eise Bäcker mécht wonnerbaart Kärebrout                         
10 Kärebrout    kanns de nach zwee Kärebrout bei de Bäcker siche goen?          
# ℹ 24 more rows

You can now use dplyr or tidyr on lod_examples (e.g. count sentences per lemma, filter by length, or join with other data). The same idea scales to all entries if you loop over every entry and every example and bind the rows.
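That full loop can be wrapped in a function. The sketch below runs on a miniature invented document so it is self-contained; the same call should work on lod_xml (the function name and the mini data are our own, not part of the LOD distribution):

```r
library(xml2)
library(tibble)

# Collect every (lemma, sentence) pair from an LOD-style document
collect_examples <- function(doc) {
  entries <- xml_find_all(doc, "//entry")
  rows <- lapply(entries, function(entry) {
    lemma <- xml_text(xml_find_first(entry, ".//lemma"))
    exs   <- xml_find_all(entry, ".//meaning/examples/example")
    sents <- vapply(exs, function(ex) {
      # Concatenate the token elements (word, inflectedHeadword, ...) of <text>
      paste(xml_text(xml_find_all(ex, ".//text/*")), collapse = " ")
    }, character(1))
    tibble(lemma = lemma, sentence = sents)  # zero rows if the entry has no examples
  })
  do.call(rbind, rows)
}

# Miniature document for demonstration (invented)
mini <- read_xml('
<lod>
  <entry><lemma>Rous</lemma>
    <meaning><examples count="1">
      <example><text><word>eng</word><word>rout</word><inflectedHeadword>Rous</inflectedHeadword></text></example>
    </examples></meaning>
  </entry>
</lod>')

collect_examples(mini)
```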

8.7 Summary of the LOD XML structure

  • Root: lod, the container for all entries.
  • Entry: entry (attribute id), one headword with lemma, ipa, and microStructure.
  • Micro structure: microStructure, holding partOfSpeech, inflection(s), and grammaticalUnit(s).
  • Unit: grammaticalUnit, grouping one or more meanings.
  • Meaning: meaning (attributes id, video), one sense with number, targetLanguage (translation, semanticClarifier), and examples.
  • Example: example, with text (word, inflectedHeadword, optional attribute) and an optional gloss.

This structure lets you extract lemmas, translations, parts of speech, inflections, and example sentences (with glosses) for lexical and corpus-oriented analyses in R.

9 Hands-on exercises

  1. Typo. Why does this fail? (Look very carefully at the object name.)

    my_variable <- 10
    my_varıable
  2. Vectors and strings. In a new code chunk:

    • Create a character vector test_words with at least three words (e.g. c("word1", "word2", "word3")).
    • Use length(test_words) and nchar(test_words) and interpret the result.
  3. Comparisons. Run these and say what they return (TRUE/FALSE) and why:

    "dog" == "dog"
    "dog" != "cat"
    nchar("linguistics") > 5
  4. Data import and inspection. Create a short CSV string with columns word and count (e.g. three rows). Read it with read_csv(), then use glimpse() and $ to inspect the count column and compute mean(count).

  5. Style. Restyle this into a clear, multi-line pipe (spaces, line breaks, indentation):

    word_data|>filter(freq>3)|>arrange(freq)|>select(word,freq)
