Uiser:Illandancient/Word frequencies

Word frequencies

At about eleven thirty on Tuesday 25th August 2020, Reddit user u/Ultach posted in the r/Scotland forum about their discovery that a large part of the Scots language wikipedia had been created and edited by a teenager from North Carolina who didn't speak the Scots language. The Scots wikipedia at the time had about 57,950 articles, of which around 20,000 were created by the North Carolinian, who had amassed over 200,000 edits over an eight-year period.

There was a fuss on Reddit, a fuss in some media, and a group of Scots speakers and wikipedians came together to fix the problem. Whilst I only have a passing knowledge of the Scots language, I was at the time reading Northern and Insular Scots by Robert McColl Millar, about the dialects of Scots spoken in Caithness, Orkney and Shetland. It was interesting stuff. Inspired by the book, I thought that perhaps various data-munging techniques could be used to study the North Carolinian Scots dialect used on wikipedia, and also the fixed Scots dialect used on wikipedia, which might eventually amount to a standardised spelling system, and that the Scots Wikipedia could be presented as a corpus of the Scots language.

Acting quickly, I found a script on GitHub that could scrape wikipedia and get the word frequencies. Realising I'd never get it up and running myself, I contacted the creator, Ilya Semenov, to commission a scrape of the Scots wiki. He very kindly sent through the Scots word frequency list for the 20200801 wikidump in just a few minutes. This proved invaluable in the first few days of the Scots wiki clean-up exercise.

It was used to identify the most commonly used words of the North Carolina dialect, so that automated tools could be used to fix spellings.

Python scripts

I had a go at running the word frequency script myself, but ran into many problems.

I hadn't used Python in a number of years (8) and my laptop wasn't happy about trying to resurrect old programming environments.
The script didn't want to work on a Windows computer, so I had to resort to digging out an old Raspberry Pi Linux computer that hadn't been used for a number of years (4).
Flashing the Raspberry Pi with the latest OS led to it running crazy slowly: webpages took minutes to load, and file movements were jerky and slow.
The word-freq script required a wikipedia extractor script whose latest GitHub commit was broken.
Using the latest version of Python (3.4) seemed to break everything, whereas an older version (2.7), although no longer supported, seemed to run everything.
The missus bought me a new fast SD card from Amazon, which arrived in less than 24 hours, but this too proved to be crazy slow.
I eventually got the scripts to work in the early morning of 01-09-2020. It takes 4449 seconds to process the current scotswiki of 57,000 articles.

Comparing word lists

Armed with two different Scots word frequency lists, one from the start of August 2020 and one from the start of September, it would be possible to compare them to see which new words had been introduced and whether the counts of words had increased or decreased.
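A minimal sketch of that comparison in Python, assuming each word list has already been loaded into a dictionary mapping word to count. The function name `frequency_changes` and the sample counts are made up for illustration; they are not taken from the real dumps or from the original script.

```python
def frequency_changes(old_counts, new_counts):
    """Return (word, change) pairs, biggest decrease first."""
    # Consider every word that appears in either list
    words = set(old_counts) | set(new_counts)
    changes = {w: new_counts.get(w, 0) - old_counts.get(w, 0) for w in words}
    # Drop words whose count did not move
    changed = [(w, d) for w, d in changes.items() if d != 0]
    # Sort by change, then alphabetically so the order is stable
    return sorted(changed, key=lambda pair: (pair[1], pair[0]))


# Tiny made-up counts, for illustration only (not real dump figures)
old = {"keeng": 10, "king": 2, "the": 50}
new = {"keeng": 1, "king": 11, "the": 50, "croun": 3}

for word, delta in frequency_changes(old, new):
    print(f"{word} ({delta:+d})")
```

A word missing from one list is treated as having a count of zero there, so newly introduced words show up as increases and removed words as decreases.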

Comparing the English word list and the Scots word list might provide lists of words unique to English, unique to Scots and common to both languages. It should be noted that many Scots words just happen to be spelled exactly the same way as many English words, although the definitions, usage and grammar vary.

Old vs New Scots Wiki comparison

Comparing the 01-08-2020 Scots word list with a 01-09-2020 Scots word list should elicit the North Carolina Scots dialect, which, whilst a linguistic dead-end, might be useful.

keeng (-1221)
televeesion (-850)
daughter (-621)
than (-593)
years (-499)
miles (-467)
lairge (-455)
built (-327)
creautit (-318)
months (-295)
each (-212)
well (-198)
operating (-195)
brought (-175)
given (-169)
perhaps (-166)
system (-144)
himself (-142)
systems (-133)
haeve (-112)
lairgest (-105)
was (-92)
with (-92)
keengs (-79)
herself (-79)
such (-77)
they (-68)
their (-68)
more (-64)
height (-63)
large (-63)
father (-61)
mother (-61)
together (-61)
break (-59)
brother (-59)
thought (-58)
forward (-57)
kernel (-56)
family (-54)
computer (-53)
tournament (-53)
were (-52)
various (-49)
program (-49)
memory (-48)
windows (-48)
hardware (-42)
total (-41)
used (-41)
file (-41)
use (-40)
os (-40)
also (-39)
insects (-39)
programs (-37)
user (-37)
through (-36)
word (-35)
granddaughter (-34)
unix (-34)
eight (-34)
computers (-33)
these (-32)
one (-32)
linux (-32)
has (-31)
tha (-30)
software (-30)
developed (-29)
mode (-29)
old (-28)
other (-27)
device (-27)
access (-26)
example (-25)
resources (-25)
have (-24)
can (-23)
only (-23)
en (-23)
many (-23)
there (-22)
interface (-22)
toun (-21)
after (-21)
breaks (-21)
server (-21)
aa (-20)
all (-20)
small (-19)
keeng's (-19)
insect (-19)
daughters (-19)
freebsd (-18)
lairger (-18)
number (-18)
over (-18)
any (-18)
code (-17)

(It looks like the automatic replacing of words such as keeng has been successful, but a number of articles on computer operating systems seem to have changed too.)

Comparing the word frequency lists from several languages could help to understand trends in wiki word usage. For example, the equivalent words for 'city', 'province' and 'municipality' are more common on wiki pages than general-usage word lists would lead us to expect.

Uniquely English words

As a starting point, wikipedia user James Salsman has a list of English words not usually seen in Scots:

afraid after angry ball before behind between blow carefully cattle child cloth clothes creature cry dig dirty do doubt down dusk dusty ewe fancy four friend from girl going have head hold house hundred knock kye live make mud my now old one our out over pet potatoes shake sing small smell spin stay strange strike stubborn stupid take to today tomorrow town two ugly upside very water what which who woman you

This could serve as the kernel of a whitelist for identifying uniquely English words if we compare their frequency or rank within the Scots and English word frequency lists.

Looking at merely the top 20,000 words in each language, there are 11,515 words common to both languages, and therefore 8,485 words unique to each. This is somewhat flawed, as the English wikipedia contains a total of over a million unique words, whilst the Scots wikipedia contains around 55,000 unique words. A google sheet of the word list comparison can be found here.
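The top-N comparison can be sketched with Python sets. This is only an illustration under assumptions: the names `top_words` and `compare_top` are hypothetical, the counts are made up, and ties are broken alphabetically, which the original comparison may not have done.

```python
def top_words(counts, n):
    """Return the set of the n most frequent words (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:n]}


def compare_top(counts_a, counts_b, n):
    """Return (common, only_in_a, only_in_b) for the top n words of each list."""
    top_a = top_words(counts_a, n)
    top_b = top_words(counts_b, n)
    return top_a & top_b, top_a - top_b, top_b - top_a


# Made-up counts, for illustration only
scots = {"the": 9, "tae": 8, "wis": 7, "in": 6, "rare": 1}
english = {"the": 9, "to": 8, "was": 7, "in": 6, "odd": 1}

common, only_scots, only_english = compare_top(scots, english, 4)
print(sorted(common), sorted(only_scots), sorted(only_english))
```

When both lists hold at least n words, each "unique" set has n minus the number of common words, which is the same arithmetic as 20,000 - 11,515 = 8,485.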

The top twenty most frequently used words that are common to both are as follows, with Scots wiki occurrences.

the (417,894)
an (182,133)
in (159,508)
is (112,395)
as (46,494)
it (34,234)
on (30,780)
for (30,530)
that (23,256)
at (18,388)
or (18,090)
are (17,643)
he (17,047)
his (15,935)
its (15,848)
which (12,692)
of (12,633)
population (12,613)
municipality (11,635)
de (10,525)

Most of these are to be expected, except "de", which is barely an English word. There's something going on here.

The top twenty most frequently used words that are unique to the Scots wikipedia are as follows

tae (74,257)
wis (59,140)
bi (32,329)
wi (30,901)
frae (26,145)
ceety (17,840)
haes (16,541)
ane (12,166)
aw (10,999)
toun (10,742)
aurie (10,281)
destrict (9,301)
locatit (8,893)
haed (8,846)
maist (8,258)
efter (8,123)
pairt (7,353)
twa (7,133)
ither (6,756)
hae (6,505)

I'm skeptical whether "ceety" or "aurie" should be anywhere near a most frequently used Scots word list.

The top twenty most frequently used words that are unique to the English wikipedia are as follows

given
himself
brought
defeated
opening
competed
township
households
moving
featuring
accepted
providing
household
surrounding
painting
losing
resulting
suggested
allowing
founding

These are all words that have been eliminated mechanically using bots, so no surprises. If we instead ignore words ending with 'ing', the next twenty words are:-

unknown
slightly
height
band's
herself
ncaa
norway
roof
follow
perhaps
users
collected
paintings
grounds
musicians
musician
moth
owners
thousands
wounded

"ncaa" might come as a surprise: it has 104,048 occurrences. A bit of wiki-fu reveals that it is the en:National Collegiate Athletic Association; perhaps there are 104,048 individual athletes with their own wikipedia pages. The Scots should count their blessings that they have avoided this, although there does seem to be an abundance of pages about European royalty and death metal bands.

Comparisons with Gaelic

Once I was aware that there was actually a Gaelic language wikipedia, albeit with only 14,000 pages, I processed it for a word frequency list, which was then compared with Scots. The results can be found on a google spreadsheet here.

It seems that, due to the small size of the Gaelic corpus here, the word list 'bottoms out'. I only consider the top 20,000 words in each language, which ought to filter out words that are only in a handful of articles and keep the ones in most common use. There are a lot of words showing up as being common to both languages when, in actual fact, looking at the numbers of occurrences, they are very rare in Gaelic and very common in Scots.

Scots Gaelic word count comparisons
Word	Scots occurrences	Gaelic occurrences
or	18,090	64
are	17,643	36
he	17,047	46
his	15,935	76
its	15,848	42

Due to the comparative rarity of these words in Gaelic, should they be considered unique to Scots (outwith English)? The Gaelic wikipedia is about a quarter the size of the Scots wikipedia, so we might expect there to be only a factor of four difference between the most common words.

Perhaps the entire comparison of overlapping word spaces needs to be done by hand, aided by an algorithm, rather than completely automatically. Maybe considering the percentile rank of each word in each language, and the ratio between the two, would eliminate any artifacts of corpus size. For example, "its" is a 99th percentile word in Scots but a 1st percentile word in Gaelic, so effectively it is unique to Scots. This might take more processing than Excel can manage alone; I'll have to roll my sleeves up and get back into Perl.
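The percentile idea could look something like the following Python sketch. It is an assumption-laden illustration rather than a worked-out method: the percentile rank here is the share of distinct words with a strictly lower count, the function names are hypothetical, the counts are made up, and it uses the difference between the two ranks instead of their ratio so that a word absent from one list does not cause a division by zero.

```python
from bisect import bisect_left


def percentile_ranks(counts):
    """Map each word to the percentage of distinct words with a strictly lower count."""
    sorted_counts = sorted(counts.values())
    total = len(sorted_counts)
    return {
        word: 100.0 * bisect_left(sorted_counts, count) / total
        for word, count in counts.items()
    }


def percentile_gap(word, counts_a, counts_b):
    """Percentile rank in list A minus rank in list B; a missing word ranks 0."""
    ranks_a = percentile_ranks(counts_a)
    ranks_b = percentile_ranks(counts_b)
    return ranks_a.get(word, 0.0) - ranks_b.get(word, 0.0)


# Made-up counts, for illustration only
scots = {"its": 100, "tae": 90, "wis": 50, "bi": 10}
gaelic = {"its": 1, "agus": 100, "an": 90, "air": 50}

print(percentile_gap("its", scots, gaelic))
```

A large positive gap marks a word that is common in the first list but marginal in the second, regardless of how big either corpus is. Where to draw the cut-off would still need deciding by hand.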

Article count and stats

2005-06-23 00:00 0 articles
2020-08-01 56,524 unique words, 4,458,169 total words (two letters or more, three occurrences or more)
2020-08-29 23:30 57,956 articles
2020-09-01 09:39 57,984 articles
2020-09-01 56,828 unique words, 4,500,147 total words (two letters or more, three occurrences or more)
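The word counts above only include words of two letters or more that occur three times or more. A rough Python sketch of counting under those two thresholds is given here. The tokenising rule (lowercase runs of letters, with internal apostrophes allowed) is an assumption, not necessarily what the word-freq script does, and the sample sentence is made up.

```python
import re
from collections import Counter


def word_frequencies(text, min_length=2, min_count=3):
    """Count words, keeping those long enough and frequent enough."""
    # Assumed tokenising rule: runs of letters, optionally joined by apostrophes
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return {w: c for w, c in counts.items() if c >= min_count}


# Made-up sample text, for illustration only
sample = (
    "The keeng's croun wis a braw croun. A keeng's croun, the croun "
    "o a keeng's faither, wis the best."
)

freqs = word_frequencies(sample)
print(len(freqs), "unique words,", sum(freqs.values()), "total words")
```

Under these thresholds the "total words" figure only adds up the occurrences of words that passed both filters, so it will be lower than a raw word count of the same text.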

Citogenesis

Citogenesis is a satirical theory about where citations come from, as explained in this XKCD comic: made up on wikipedia -> used in an official document -> cited on wikipedia. There appear to be a few examples on the Scots wiki.

conseeder
proposeetion
peteetion

These words are very rare on the Scottish Corpus website, yet are very common on the Scots wikipedia. It is believed that the Online Scots dictionary harvested words and definitions from wikipedia. These words have been used in the Scots translation of the Scottish Parliament document "Your Scottish Pairlament". This document has subsequently been cited in the Scots wikipedia, thus completing the circle.

It is possible that these spellings are used in the north-east Scots dialect: 'conseeder' was used by north-eastern author Sheena Blackhall in 1996, but the other words have not been.