This is an R Markdown script showing how I analyzed the phonological distances between Eurasian lects based on Phonotacticon 1.0 for my thesis “Phonological areas in Eurasia” (2024).
First, I set up the environment of R Markdown.
options(scipen = 100, digits = 3)
Then I load the required R packages.
library(data.table)
library(tidytable)
library(stringr)
library(stringi)
library(geosphere)
library(plotly)
library(geodata)
library(plyr)
library(tibble)
library(forcats)
library(purrr)
library(Rfast)
library(e1071)
library(caret)
library(dplyr)
library(anocva)
library(profmem)
library(stargazer)
library(lme4)
library(viridis)
library(spdep)
library(spatialreg)
library(xtable)
library(ggforce)
library(magrittr)
library(vegan)
I load Phonotacticon.
Phonotacticon <- read.csv("Phonotacticon1_0.csv") %>%
as.data.table()
Phonotacticon
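I subset the sample lects. I drop lects whose onsets, nuclei, or codas are marked ?, as well as lects whose onsets or codas contain a run of two or more underspecified consonants (CC) or two adjacent long bracketed sets. All the sample lects are listed below.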
Eurasia <- Phonotacticon %>%
.[Onset != '?' &
Nucleus != '?' &
Coda != '?' &
!grepl('C{2,}', Onset) &
!grepl('C{2,}', Coda) &
!grepl('\\[.{10}.*?\\]\\[.{10}.*?\\]', Onset) &
!grepl('\\[.{10}.*?\\]\\[.{10}.*?\\]', Coda)] %>%
.[, .(Lect, Phoneme, Tone, Onset, Nucleus, Coda)]
Eurasia$Lect
## [1] "A'ou" "Akajeru"
## [3] "Amdo Tibetan" "Angami Naga"
## [5] "Ao Naga" "Archi"
## [7] "Aromanian" "Arpitan"
## [9] "Arvanitika Albanian" "Asho Chin"
## [11] "Assamese" "Asturian-Leonese-Cantabrian"
## [13] "Atong (India)" "Avar"
## [15] "Baba Malay" "Badaga"
## [17] "Bagvalal" "Bantawa"
## [19] "Basque" "Betta Kurumba"
## [21] "Bezhta" "Bih"
## [23] "Bisu" "Biyo"
## [25] "Bodo-Mech" "Bolyu"
## [27] "Bonan" "Budukh"
## [29] "Bugan" "Bujhyal"
## [31] "Bulo Stieng" "Bunan"
## [33] "Burmese" "Burushaski"
## [35] "Cao Miao" "Catalan"
## [37] "Central Bai" "Central Chong"
## [39] "Central Hongshuihe Zhuang" "Central Khmer"
## [41] "Chak" "Chintang"
## [43] "Chitwania Tharu" "Chong of Chanthaburi"
## [45] "Chothe" "Chukchi"
## [47] "Chut" "Chuvash"
## [49] "Cosao" "Cypriot Arabic"
## [51] "Daai Chin" "Dagur"
## [53] "Daman-Diu Portuguese" "Dandami Maria"
## [55] "Danish" "Daohua"
## [57] "Dari" "Darma"
## [59] "Deori" "Dhimal"
## [61] "Dhivehi" "Domari"
## [63] "Dongxiang" "Duhumbi"
## [65] "Dumi" "Dungan"
## [67] "Duoluo Gelao" "Dutch"
## [69] "Dzongkha" "E"
## [71] "Eastern Katu" "Eastern Kayah"
## [73] "Eastern Magar" "Eastern Newari"
## [75] "Eastern Panjabi" "Eastern Tamang"
## [77] "Enu" "Ersu"
## [79] "Estonian Swedish" "Evenki"
## [81] "Forest Enets" "French"
## [83] "Friulian" "Galo"
## [85] "Gan Chinese" "Gata'"
## [87] "Georgian" "German"
## [89] "Gheg Albanian" "Gilaki"
## [91] "Godoberi" "Godwari"
## [93] "Gujarati" "Gurani"
## [95] "Hakka Chinese" "Halbi"
## [97] "Halh Mongolian" "Hills Karbi"
## [99] "Hindi" "Hinuq"
## [101] "Hmong Njua" "Hokkaido Ainu"
## [103] "Honi" "Hui Chinese"
## [105] "Hungarian" "Icelandic"
## [107] "Ingrian" "Irula of the Nilgiri"
## [109] "Italian" "Iu Mien"
## [111] "Japanese" "Japhug"
## [113] "Jarawa (India)" "Jejueo"
## [115] "Jennu Kurumba" "Jerung"
## [117] "Jinyu Chinese" "Jiongnai Bunu"
## [119] "Kabardian" "Kadar"
## [121] "Kado" "Kaduo"
## [123] "Kashmiri" "Kathmandu Valley Newari"
## [125] "Katso" "Kayan Lahwi"
## [127] "Kazakh" "Kelantan-Pattani Malay"
## [129] "Ket" "Khams Tibetan"
## [131] "Khasi" "Khezha Naga"
## [133] "Khinalug" "Khmu"
## [135] "Kirghiz" "Kirmanjki"
## [137] "Kman" "Kodava"
## [139] "Koi" "Koireng"
## [141] "Komi-Zyrian" "Konda-Dora"
## [143] "Konkan Marathi" "Korean"
## [145] "Korku" "Korra Koraga"
## [147] "Koryak" "Kotia-Adivasi Oriya-Desiya"
## [149] "Kucong" "Kui (India)"
## [151] "Kumaoni" "Kumarbhag Paharia"
## [153] "Kumyk" "Kurtokha"
## [155] "Kuy" "Kyerung"
## [157] "Lachi" "Ladino"
## [159] "Lahu" "Lak"
## [161] "Lakkia" "Lambadi"
## [163] "Lamjung-Melamchi Yolmo" "Lao"
## [165] "Lashi" "Laven"
## [167] "Laz" "Leh Ladakhi"
## [169] "Lepcha" "Lhomi"
## [171] "Liangmai Naga" "Limbu"
## [173] "Lisu" "Longchuan Achang"
## [175] "Macedonian" "Maithili"
## [177] "Malacca-Batavia Portuguese Creole" "Malavedan"
## [179] "Malayalam" "Manchu"
## [181] "Mandarin Chinese" "Mang"
## [183] "Mangghuer" "Manipuri"
## [185] "Mao Naga" "Maonan"
## [187] "Maram Naga" "Marathi"
## [189] "Marwari (India)" "Mewati"
## [191] "Milang" "Min Bei Chinese"
## [193] "Min Nan Chinese" "Miyako"
## [195] "Mlabri" "Modern Greek"
## [197] "Moken" "Mon"
## [199] "Mongghul" "Moyon"
## [201] "Muduga" "Mulam"
## [203] "Mundari" "Nanai"
## [205] "Narua" "Naukan Yupik"
## [207] "Negidal" "Neo-Mandaic"
## [209] "Nepali" "Nganasan"
## [211] "Nihali" "Nimadi"
## [213] "Nocte Naga" "North Azerbaijani"
## [215] "North-Central Dargwa" "Northeastern Thai"
## [217] "Northern Jinghpaw" "Northern Pashto"
## [219] "Northern Pinghua" "Northern Pumi"
## [221] "Northern Thai" "Northern Yukaghir"
## [223] "Northwestern Kolami" "Nung (Myanmar)"
## [225] "Nuristani Kalasha" "Nyahkur"
## [227] "Odia" "Oki-No-Erabu"
## [229] "Oroch" "Ostfränkisch"
## [231] "Pa-Hng" "Pacoh"
## [233] "Paite Chin" "Pela"
## [235] "Peripheral Mongolian" "Phom Naga"
## [237] "Piemontese" "Pite Saami"
## [239] "Pnar" "Pontic"
## [241] "Portuguese" "Pu-Xian Chinese"
## [243] "Purik-Sham-Nubra" "Pwo Eastern Karen"
## [245] "Rabha" "Rajbanshi"
## [247] "Ravula" "Russia Buriat"
## [249] "Russian" "Rutul"
## [251] "Sadri" "Sadu"
## [253] "Sakha" "Sangkong"
## [255] "Sani" "Santali"
## [257] "Saurashtra" "Sedang"
## [259] "Selkup" "Semelai"
## [261] "Shixing" "Sholaga"
## [263] "Sichuan Yi" "Sikkimese"
## [265] "Sindhi" "Sinhala"
## [267] "Situ" "Solu-Khumbu Sherpa"
## [269] "Sora" "South Azerbaijani"
## [271] "South Wa" "Southeast Pashayi"
## [273] "Southern Altai" "Southern Amami-Oshima"
## [275] "Southern Jinghpaw" "Southern Pashto"
## [277] "Southern Pumi" "Southern Qiang"
## [279] "Southern Rengma Naga" "Southern Yukaghir"
## [281] "Southwestern Dargwa" "Sri Lanka Malay"
## [283] "Standard Malay" "Stau-Dgebshes"
## [285] "Sui" "Sunwar"
## [287] "Tai Do-Mene-Yo" "Tamil"
## [289] "Tangam" "Tatar"
## [291] "Thado Chin" "Thai"
## [293] "Thakali" "Thangmi"
## [295] "Thulung" "Tibetan"
## [297] "Toda" "Tsat"
## [299] "Tsez" "Tshangla"
## [301] "Tulu" "Tundra Nenets"
## [303] "Tuvinian" "Udihe"
## [305] "Uighur" "Vaagri Booli"
## [307] "Vach-Vasjugan" "Varhadi-Nagpuri"
## [309] "Vietnamese" "Waddar"
## [311] "Wambule" "Wayu"
## [313] "Welsh" "West Yugur"
## [315] "Western Armenian" "Western Magar"
## [317] "Western Muya" "Western Ong-Be"
## [319] "Western Parbate Kham" "Western Puroik"
## [321] "Western Tamang" "Western Xiangxi Miao"
## [323] "Westphalic" "Wu Chinese"
## [325] "Wuding-Luquan Yi" "Wutunhua"
## [327] "Yakkha" "Yerong-Southern Buyang"
## [329] "Yongbei Zhuang" "Youle Jinuo"
## [331] "Yue Chinese" "Zaiwa"
## [333] "Zauzou" "Zbu"
## [335] "Zeme Naga"
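The two regular-expression filters above can be illustrated on a few made-up onset strings. This is only a sketch: the strings are hypothetical rather than taken from Phonotacticon, and it assumes that bracketed sets within one sequence are written back to back with no space between them.

```r
# Hypothetical onset strings, for illustration only.
onsets <- c("p t k",                        # plain segments
            "C",                            # one underspecified consonant
            "CC",                           # two underspecified consonants in a row
            "[ptk][rl]",                    # two short bracketed sets
            "[ptkbdgmnszl][rljwmnptkbd]")   # two long bracketed sets

# TRUE where a run of two or more C's occurs.
has_cc <- grepl("C{2,}", onsets)

# TRUE where two adjacent bracketed stretches of ten or more characters occur.
has_long_sets <- grepl("\\[.{10}.*?\\]\\[.{10}.*?\\]", onsets)

# A string is kept only if neither filter flags it.
data.frame(onsets, has_cc, has_long_sets, kept = !has_cc & !has_long_sets)
```

In this toy input only "CC" and the last string are dropped; a single C and short bracketed sets pass both filters.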
I make a list of lects and their geographical coordinates.
Lect_LonLat <- Phonotacticon %>%
.[Lect %in% Eurasia$Lect] %>%
.[, .(Lect, lon, lat)]
Lect_LonLat
For visualizations, I prepare a map of Eurasia. First, I load map data.
map <- map_data("world")
head(map)
Then I create a map of Eurasia.
EurasiaMap <- ggplot(map, aes(x = long, y = lat)) +
geom_polygon(aes(group = group),
fill = "white",
color = "darkgrey",
linewidth = 0.2) +
coord_map("ortho",
orientation = c(20, 70, 0),
xlim = c(10, 130),
ylim = c(0, 90)) +
theme_void()
EurasiaMap
I load a modified version of PanPhon. The output below shows its first ten segments and first ten featural values.
PanPhon <- fread("PanPhonPhonotacticon1_0.csv") %>%
as.data.table() %>%
unique(by = 'ipa')
PanPhon
Check if all phonemic transcriptions are present in PanPhon:
Transcriptions <- Eurasia$Phoneme %>%
str_split_fixed(pattern = ' ', n = Inf) %>%
as.data.table() %>%
melt(measure.vars = colnames(.)) %>%
select(-variable) %>%
filter(value != '') %>%
distinct() %>%
mutate(Correct = value %in% PanPhon$ipa)
all(Transcriptions$Correct)
## [1] TRUE
I define a function that writes an xtable object to a file as LaTeX code with booktabs rules:
booktabs <- function(x, y) {
addtorow <- list()
addtorow$pos <- list(-1, 0, nrow(x))
addtorow$command <- c('\\toprule ', '\\midrule ', '\\bottomrule ')
print(x,
file = y,
include.rownames = FALSE,
add.to.row = addtorow,
hline.after = NULL)
}
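A possible way to call this function is sketched below. The table contents and the temporary file are hypothetical, and the function is repeated inside the block only so that the sketch runs on its own. The resulting LaTeX needs `\usepackage{booktabs}` in the preamble for the three rule commands to compile.

```r
library(xtable)

# Copy of the function above, repeated so that the sketch runs on its own.
booktabs <- function(x, y) {
  addtorow <- list()
  addtorow$pos <- list(-1, 0, nrow(x))
  addtorow$command <- c('\\toprule ', '\\midrule ', '\\bottomrule ')
  print(x, file = y, include.rownames = FALSE,
        add.to.row = addtorow, hline.after = NULL)
}

# Hypothetical two-row table, for illustration only.
toy <- data.frame(Lect = c("Lect A", "Lect B"), Onsets = c(12L, 30L))

# Write the table to a temporary .tex file and read it back.
out_file <- tempfile(fileext = ".tex")
booktabs(xtable(toy), out_file)
tex_lines <- readLines(out_file)
tex_lines
```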
In this section, I will analyze the sequences of each lect.
I arrange PanPhon segments in alphabetical order.
PanPhonOrder <- PanPhon$ipa[
order(-nchar(PanPhon$ipa),
PanPhon$ipa)]
head(PanPhonOrder, 10)
## [1] "h͡d̪͡ɮ̪ʲʷ⁺" "h͡d̪͡ɮ̪ʷː⁺" "h͡d̪͡ɮ̪ʷˠ⁺" "h͡d̪͡ɮ̪ʷˤ⁺" "h͡d̪͡z̪ʲʷ⁺" "h͡d̪͡z̪ʷː⁺" "h͡d̪͡z̪ʷˠ⁺" "h͡d̪͡z̪ʷˤ⁺"
## [9] "h͡t̪͡ɬ̪ʲʷ⁺" "h͡t̪͡ɬ̪ʷː⁺"
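Sorting the longest segments first matters because a regex alternation takes the first alternative that matches. The following sketch shows the effect on a toy ASCII inventory, in which "ts" stands for an affricate written with two characters; the inventory is an assumption made for illustration and is not part of PanPhon.

```r
library(stringi)

# Toy inventory: "ts" is a single segment written with two characters.
segments <- c("a", "s", "t", "ts")

# Alternation in the order given: "t" and "s" come before "ts".
regex_unsorted <- paste0("(?:", paste(segments, collapse = "|"), ")")

# Alternation sorted as in PanPhonOrder: longest first, then alphabetical.
sorted <- segments[order(-nchar(segments), segments)]
regex_sorted <- paste0("(?:", paste(sorted, collapse = "|"), ")")

split_unsorted <- stri_extract_all_regex("tsa", regex_unsorted)[[1]]
split_sorted <- stri_extract_all_regex("tsa", regex_sorted)[[1]]

split_unsorted  # "t" "s" "a": the affricate is broken up
split_sorted    # "ts" "a": the affricate is kept whole
```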
I create a regular expression from the PanPhon segments in order to split sequences into segments.
PanPhonRegex <- paste0("(?:",
paste(PanPhonOrder, collapse="|"),
'|B|C|Č|F|G|Ł|L|N|O|P|R|S|T|V|W|X|Z',
")")
str_trunc(PanPhonRegex, 100)
## [1] "(?:h͡d̪͡ɮ̪ʲʷ⁺|h͡d̪͡ɮ̪ʷː⁺|h͡d̪͡ɮ̪ʷˠ⁺|h͡d̪͡ɮ̪ʷˤ⁺|h͡d̪͡z̪ʲʷ⁺|h͡d̪͡z̪ʷː⁺|h͡d̪͡z̪ʷˠ⁺|h͡d̪͡z̪ʷˤ⁺|h͡t̪͡ɬ..."
I also create a PanPhon regex that includes brackets, in order to detect segments within brackets (e. g. [ptk], meaning “/p/, /t/, or /k/”).
PanPhonRegexBrackets <- paste0('(?:',
'(?<=\\[).*?(?=\\])|',
paste(PanPhonOrder, collapse="|"),
'|B|C|Č|F|G|Ł|L|N|O|P|R|S|T|V|W|X|Z',
')')
str_trunc(PanPhonRegexBrackets, 100)
## [1] "(?:(?<=\\[).*?(?=\\])|h͡d̪͡ɮ̪ʲʷ⁺|h͡d̪͡ɮ̪ʷː⁺|h͡d̪͡ɮ̪ʷˠ⁺|h͡d̪͡ɮ̪ʷˤ⁺|h͡d̪͡z̪ʲʷ⁺|h͡d̪͡z̪ʷː⁺|h͡d̪͡z̪ʷˠ⁺|..."
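The bracket-aware pattern relies on a lookbehind for `[` and a lookahead for `]`, so that the content of a bracketed set is returned as one piece without the brackets themselves. A minimal sketch with a toy alternation in place of the full PanPhon list (the six single segments are chosen only for illustration):

```r
library(stringi)

# Toy pattern: bracket contents first, then a few single segments.
toy_regex <- paste0("(?:",
                    "(?<=\\[).*?(?=\\])|",
                    "p|t|k|r|l|s",
                    ")")

# Hypothetical sequences: plain, one bracketed set, two bracketed sets.
pieces <- stri_extract_all_regex(c("spl", "[ptk]r", "s[ptk][rl]"), toy_regex)
pieces
```

Each bracketed set comes back as a single element ("ptk", "rl"), which can then be split into its member segments in a later step.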
I define “classes”, i. e. underspecified segments transcribed in capitals (e. g. P for plosives).
Classes <- PanPhon %>%
mutate(B = cons == 1 & lab == 1,
C = cons == 1,
Č = cons == 1 & delrel == 1 & son == -1 & cont == -1,
`F` = cons == 1 & cont == 1 & son == -1,
G = grepl('j|w|ɥ|ɰ', ipa),
Ł = cons == 1 & cor == 1 & lat == 1,
L = cons == 1 & cont == 1 & cor == 1 & son == 1,
N = nas == 1 & syl == -1,
P = cons == 1 & cont == -1 & delrel == -1 & son == -1,
R = cont == 1 & son == 1 & syl == -1 & !grepl('h|ɦ', ipa),
S = cons == 1 & cont == 1 & cor == 1 & son == -1,
`T` = cons == 1 & son != 1,
V = cons == -1 & cont == 1 & son == 1 & syl == 1,
W = syl == -1 & voi == 1,
X = syl == -1 & voi == -1,
Z = cont == 1 & syl == -1) %>%
select(ipa, B, C, Č, `F`, G, Ł, L, N, P, R, S, `T`, V, W, X, Z) %>%
pivot_longer(cols = -ipa,
names_to = 'Class',
values_to = 'Value') %>%
filter(Value) %>%
select(-Value)
Classes
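To make the class definitions concrete, the sketch below applies the conditions for P and F to a tiny feature table in base R. The feature values are hypothetical: they mimic the layout of PanPhon (1 for plus, -1 for minus) but are not copied from it.

```r
# Hypothetical feature values for five segments (1 = plus, -1 = minus).
toy_features <- data.frame(
  ipa    = c("p", "b", "s", "m", "a"),
  cons   = c( 1,   1,   1,   1,  -1),
  cont   = c(-1,  -1,   1,  -1,   1),
  delrel = c(-1,  -1,  -1,  -1,  -1),
  son    = c(-1,  -1,  -1,   1,   1)
)

# Same conditions as P (plosives) and F (fricatives) in the definition above.
is_P <- with(toy_features, cons == 1 & cont == -1 & delrel == -1 & son == -1)
is_F <- with(toy_features, cons == 1 & cont == 1 & son == -1)

# Long format: one row per (segment, class) pair, like the Classes table.
toy_classes <- rbind(
  data.frame(ipa = toy_features$ipa[is_P], Class = "P"),
  data.frame(ipa = toy_features$ipa[is_F], Class = "F")
)
toy_classes
```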
I extract phonemes from the phonemic inventories.
Phonemes <- stri_extract_all_regex(Eurasia$Phoneme,
pattern = PanPhonRegex,
simplify = TRUE) %>%
as.data.table() %>%
mutate(Lect = Eurasia$Lect) %>%
melt(id.vars = 'Lect',
variable.name = 'Number',
value.name = 'ipa') %>%
select(-Number) %>%
filter(ipa != '')
Phonemes
I subset lect, onsets, nuclei, and codas from Phonotacticon.
LectONC <- Eurasia %>%
.[, .(Lect, Onset, Nucleus, Coda)] %>%
melt(id.vars = 'Lect',
variable.name = 'Category',
value.name = 'Sequence')
LectONC
I extract the sequences from onset, nucleus, and coda categories.
Sequences <- LectONC[, tstrsplit(Sequence, ' ', fixed = FALSE)] %>%
.[, c('Lect', 'Category') := .(LectONC$Lect, LectONC$Category)] %>%
melt(id.vars = c('Lect', 'Category'),
variable.name = 'Number',
value.name = 'Sequence') %>%
.[, -c('Number')] %>%
.[!is.na(Sequence)] %>%
distinct()
Sequences
I subset sequences that include underspecified segments (transcribed in capital letters).
Capitals <-
Sequences %>%
.[grepl('B|C|Č|F|G|Ł|L|N|O|P|R|S|T|V|W|X|Z', Sequence)] %>%
.[, -c('Category')] %>%
distinct()
Capitals
I convert the capital letters into the corresponding phonemes in each lect. For example, P (“plosive”) in Italian is converted to all the plosive phonemes in the Italian phonemic inventory.
Decapitalized <-
stri_extract_all_regex(Capitals$Sequence,
pattern = PanPhonRegex,
simplify = TRUE) %>%
as.data.table() %>%
.[, c('Lect', 'Sequence') :=
.(Capitals$Lect, Sequence = Capitals$Sequence)] %>%
melt(id.vars = c('Lect', 'Sequence'),
variable.name = 'Order',
value.name = 'Class') %>%
.[, Order := as.integer(as.factor(Order))] %>%
.[Class != ''] %>%
merge(Classes, all = TRUE, allow.cartesian = TRUE) %>%
.[, ipa := if_else(is.na(ipa), Class, ipa)] %>%
.[, -c('Class')] %>%
merge(Phonemes) %>%
setorder(col = Order) %>%
split(by = c('Lect', 'Sequence')) %>%
lapply(function(x)
split(x, by = 'Order')) %>%
lapply(function(x)
lapply(x, function(x)
x <- x$ipa)) %>%
lapply(function(x)
expand.grid(x) %>%
do.call(what = paste0)) %>%
enframe() %>%
unnest() %>%
as.data.table() %>%
separate(col = name,
into = c('Lect', 'Sequence'),
sep = '\\.') %>%
setnames('value', 'NewSequence') %>%
merge(Sequences, all = TRUE) %>%
mutate(Sequence =
if_else(!is.na(NewSequence),
NewSequence,
Sequence)) %>%
.[, -c('NewSequence')]
Decapitalized
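The core idea of this step can be sketched in base R for a single lect and a single sequence: a class symbol is replaced only by those members of the class that the lect actually has. Both the inventory and the class membership below are hypothetical.

```r
# Hypothetical inventory of one lect and hypothetical members of class P.
inventory <- c("p", "t", "k", "s", "r", "a")
class_P <- c("p", "b", "t", "d", "k", "g", "q")

# Only the plosives present in the lect's inventory can replace P.
plosives_in_lect <- intersect(class_P, inventory)

# Replace P in the underspecified sequence "Pr" with each of them in turn.
expanded <- vapply(plosives_in_lect,
                   function(seg) sub("P", seg, "Pr", fixed = TRUE),
                   character(1), USE.NAMES = FALSE)
expanded
```

The full pipeline above does the same for every lect and sequence at once, using merges with Classes and Phonemes in place of `intersect()`.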
I split the sequences into segments, including bracketed segments (such as [ptk] for “/p/, /t/, or /k/”).
ToUnbracket <- stri_extract_all_regex(Decapitalized$Sequence,
pattern = PanPhonRegexBrackets,
simplify = TRUE) %>%
as.data.table() %>%
mutate(Lect = Decapitalized$Lect,
Category = Decapitalized$Category,
Sequence = Decapitalized$Sequence) %>%
melt(id = c('Lect', 'Category', 'Sequence'),
variable.name = 'Order',
value.name = 'ipa') %>%
mutate(Order = Order %>%
as.factor() %>%
as.integer()) %>%
filter(ipa != "")
ToUnbracket
I subset bracketed sequences.
Bracketed <- ToUnbracket %>%
filter(grepl('\\[', Sequence))
Bracketed
I convert the bracketed sequences into all logically possible sequences. For example, Laven’s sequence [bdɟɡ] [rl] is converted into /br/, /bl/, /dr/, /dl/, /ɟr/, /ɟl/, /ɡr/, and /ɡl/.
Unbracketed <- Bracketed$ipa %>%
stri_extract_all_regex(pattern = PanPhonRegex, simplify = TRUE) %>%
as.data.table() %>%
mutate(Sequence = Bracketed$Sequence,
Order = Bracketed$Order) %>%
melt(id.vars = c('Sequence', 'Order'),
variable.name = 'Number',
value.name = 'ipa') %>%
filter(ipa != '') %>%
select(-Number) %>%
setorder(col = Order) %>%
split(by = 'Sequence') %>%
lapply(function(x)
split(x, by = 'Order')) %>%
lapply(function(x)
lapply(x, function(x)
x <- x$ipa)) %>%
lapply(function(x)
expand.grid(x) %>%
do.call(what = paste0)) %>%
enframe() %>%
unnest() %>%
setnames(c('name', 'value'),
c('Sequence', 'NewSequence')) %>%
as.data.table()
Unbracketed
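The expansion of the Laven example can be reproduced in isolation with `expand.grid()`, which builds the Cartesian product of the alternatives in each slot. This is a sketch of the idea only; the slots are typed in by hand instead of being extracted with the regex.

```r
# Alternatives in each slot of the bracketed sequence [bdɟɡ][rl].
slots <- list(c("b", "d", "ɟ", "ɡ"),
              c("r", "l"))

# Cartesian product of the slots, one row per combination.
combinations <- expand.grid(slots, stringsAsFactors = FALSE)

# Paste the columns of each row together to obtain the sequences.
new_sequences <- do.call(paste0, combinations)
new_sequences
length(new_sequences)  # 4 * 2 = 8
```

`expand.grid()` varies the first slot fastest, so the order of the result differs from the order in the prose, but the set of eight sequences is the same.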
I join the unbracketed sequences to the whole list of sequences. Then I split the sequences into segments (e. g. /pl/ into /p/ and /l/).
Segments <-
stri_extract_all_regex(
Unbracketed$NewSequence,
pattern = PanPhonRegex,
simplify = TRUE) %>%
as.data.table() %>%
mutate(Sequence = Unbracketed$Sequence,
NewSequence = Unbracketed$NewSequence) %>%
pivot_longer(cols = -c(Sequence, NewSequence),
names_to = 'Order',
values_to = 'NewIPA') %>%
mutate(Order = Order %>%
as.factor() %>%
as.integer()) %>%
filter(NewIPA != '') %>%
full_join(ToUnbracket) %>%
mutate(Sequence =
if_else(
!is.na(NewSequence),
NewSequence,
Sequence),
ipa =
if_else(
!is.na(NewIPA),
NewIPA,
ipa)) %>%
select(-NewSequence, -NewIPA) %>%
as.data.table()
## Joining with `by = join_by(Sequence, Order)`
Segments
In this section, I will measure the length of each sequence, where length is the number of segments that make up a sequence.
First, I take the highest segment order within each sequence as its length.
Sequences_length <- Segments %>%
.[, .(Length = max(Order)), by = .(Lect, Category, Sequence)]
Sequences_length
I join the length of each sequence to segments.
Segments <- left_join(Segments, Sequences_length)
## Joining with `by = join_by(Sequence, Lect, Category)`
Segments
In this section, I will show how I measure the distance between two sequences, e. g. between /pl/ and /spl/.
First, I find the maximal length of all sequences.
MaxLength <- max(Sequences_length$Length)
MaxLength
## [1] 6
I count the number of all the split segments.
Segments_number <- nrow(Segments)
Segments_number
## [1] 81807
In order to measure the distance between two sequences of different lengths, I assign different “positions” to each sequence. As the maximal length of all sequences is six, a sequence of only one segment has six positions within these six slots (from 0 to 5).
Sequences_rep <- bind_rows(rep(list(Segments), MaxLength)) %>%
mutate(Position = rep(0:(MaxLength - 1),
each = Segments_number)) %>%
mutate(Order = Order + Position) %>%
filter(Length + Position <= MaxLength) %>%
select(-Length)
Sequences_rep
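The positions can be spelled out for individual sequence lengths. The helper below, `placements()`, is a hypothetical name introduced only for this sketch; it lists the slots that a sequence of a given length occupies at each position allowed by the filter `Length + Position <= MaxLength`.

```r
max_length <- 6  # maximal sequence length reported above

# Slots occupied by a sequence of `len` segments at every admissible position.
# Assumes len <= max_length.
placements <- function(len, max_length) {
  positions <- 0:(max_length - len)
  do.call(rbind, lapply(positions, function(pos) {
    data.frame(Position = pos, Order = seq_len(len) + pos)
  }))
}

one_segment <- placements(1, max_length)    # e.g. /p/
two_segments <- placements(2, max_length)   # e.g. /pl/

unique(one_segment$Position)    # 0 1 2 3 4 5
unique(two_segments$Position)   # 0 1 2 3 4
two_segments
```

A two-segment sequence therefore has five positions, and at the last one (position 4) it fills slots 5 and 6.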
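I join segments with their phonological features (retrieved from PanPhon). Each feature name is suffixed with the slot that its segment occupies (its order shifted by the position), so that every slot has its own set of feature columns; slots that a sequence does not fill are set to 0.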
Sequences_features <- Sequences_rep %>%
left_join(PanPhon, by = 'ipa') %>%
melt(id = c('Lect',
'Category',
'Sequence',
'Order',
'ipa',
'Position'),
variable.name = 'Feature',
value.name = 'Value') %>%
mutate(Feature = paste0(Feature, Order)) %>%
dcast(Lect + Category + Sequence + Position ~ Feature,
value.var = 'Value',
fun.aggregate = sum,
fill = 0) %>%
mutate(SequencePosition = paste0(Sequence, Position)) %>%
select(-Lect, -Category, -Position, -Sequence) %>%
distinct()
Sequences_features
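The resulting layout can be sketched by hand for one sequence. The example below uses only two features with hypothetical values, and `sequence_vector()` is a hypothetical helper written for this illustration; the actual table is built by the `melt()` and `dcast()` calls above.

```r
max_length <- 6  # maximal sequence length reported above

# Hypothetical values of two features for two segments.
toy_features <- data.frame(ipa  = c("p", "l"),
                           cons = c(1, 1),
                           son  = c(-1, 1))

# One named column per feature and slot: cons1, son1, ..., cons6, son6.
slot_names <- paste0(rep(c("cons", "son"), times = max_length),
                     rep(seq_len(max_length), each = 2))

# Feature vector of a sequence (a character vector of segments) at a position;
# slots that the sequence does not fill stay at 0.
sequence_vector <- function(segments, position) {
  out <- setNames(numeric(length(slot_names)), slot_names)
  for (i in seq_along(segments)) {
    row <- toy_features[toy_features$ipa == segments[i], ]
    out[paste0("cons", i + position)] <- row$cons
    out[paste0("son", i + position)] <- row$son
  }
  out
}

pl_at_0 <- sequence_vector(c("p", "l"), 0)
pl_at_1 <- sequence_vector(c("p", "l"), 1)
pl_at_1[pl_at_1 != 0]  # only slots 2 and 3 are filled at position 1
```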