Linguistic Data Analysis

This data analysis was undertaken as part of Cal’s Stat 215A.

## Please set the path before running: it should point to where the main data live.

## libraries:
library(MASS)        # load before the tidyverse so dplyr::select is not masked
library(tidyverse)   # also provides ggplot2 and readr
library(magrittr)
library(kableExtra)
library(dbscan)
library(factoextra)
library(fastcluster)
library(FactoMineR)
library(NbClust)
library(cluster)
library(cowplot)
library(clValid)
library(ggfortify)
library(clustree)
library(dendextend)
library(corrplot)
library(GGally)
library(knitr)
library(gplots)
library(wesanderson)
library(maps)
library(crosstalk)
library(gridExtra)
library(leaflet)
library(Rtsne)
library(irlba)

# US state outlines and a blank theme for the maps
state_df <- map_data("state")

my_map_theme <- theme_void()


# load the data
ling_data <- read.table("https://raw.githubusercontent.com/malvikarajeev/linguisticSurvey/master/lingData.txt", header = TRUE)
ling_location <- read.table("https://raw.githubusercontent.com/malvikarajeev/linguisticSurvey/master/lingLocation.txt", header = TRUE)
# question_data contains three objects: quest.mat, quest.use, all.ans


library(repmis)
source_data("https://github.com/malvikarajeev/linguisticSurvey/blob/master/question_data.RData?raw=true")
## [1] "quest.mat" "quest.use" "all.ans"
# questions 50-122 are the lexical questions of interest
answers <- all.ans[50:122]

#### using the 'zipcode' package
library(zipcode)
data("zipcode")
### pad ZIPs to five digits with leading zeros (some were read in as 3- or 4-digit numbers)
ling_data$ZIP <- sprintf("%05d", as.numeric(ling_data$ZIP))
# merge in the city, state, and coordinates from the zipcode reference table
t2 <- merge(ling_data, zipcode, by.x = 'ZIP', by.y = 'zip')

# drop the ID and user-entered city/state columns, plus the original lat/long
t2 <- t2[, -c(2:4, 72, 73)]

names(t2)[69:72] <- c("CITY", "STATE", "lat", "long")

ling_data <- t2

# for each question, record its number and how many answer choices it has
l <- length(all.ans)

structure <- matrix(numeric(l * 2), l, 2)
for (i in 1:l) {
  temp <- all.ans[[i]]
  structure[i, 1] <- temp$qnum[1]
  structure[i, 2] <- length(temp$ans.let)
}

structure <- as.data.frame(structure)
names(structure) <- c('ques.num', 'number_choices')

# questions 50-122 give each person up to 73 responses;
# for every question, structure$number_choices records how many answer choices it has

## missing from the data: questions 112, 113, 114, 116, 122


# subset to questions 50-122 and drop the questions missing from the survey data
struc <- structure[50:122, ]
struc <- struc[-c(63:65, 67, 73), ]
struc <- struc[-59, ]

#source("/Users/malvikarajeev/Desktop/stat215/stat-215-a/lab2/R/clean.R")
# drop the ZIP and location columns, keeping the 67 question-response columns
dummy <- ling_data[, -c(1, 69:72)]

names(dummy) <- struc$ques.num
answer_no <- struc$number_choices
N <- nrow(ling_data)

# build the binary (one-hot) response matrix question by question
l <- as.data.frame(matrix(numeric(N), N))

for (i in 1:length(names(dummy))) {
  df <- dummy[i]
  names(df) <- "answers"
  # one zero vector per respondent, with one slot per answer choice
  df$recode <- list(rep(0, answer_no[i]))
  # set the slot of the chosen answer to 1 (an answer of 0 leaves the row all zeros)
  df$recode <- Map(function(x, y) `[<-`(x, y, 1), x = df$recode, y = df$answers)

  temp <- data.frame(matrix(unlist(df$recode), nrow = length(df$recode), byrow = TRUE))
  l <- c(l, temp)
}
l <- data.frame(l)

# drop the placeholder first column
l <- l[, -1]
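
## sanity check on the recoding (a minimal sketch, run on the matrix just built):
## every answered question contributes exactly one 1 per respondent,
## so no row can sum to more than the 67 questions
stopifnot(all(rowSums(l) <= 67))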



#binary <- read.csv("~/Desktop/stat215/stat-215-a/lab2/data/binary.csv")
#binary <- binary[,-1] ##serial numbers
binary <- l

########################################################
## fix the column names using 'structure':
## create a vector of names of the form "<question>_<choice>"

create_names <- function(x) {
  return(lapply(x$number_choices, function(n) seq_len(n)))
}

names_col <- create_names(struc)
names_ans <- unlist(sapply(seq_len(nrow(struc)), function(x) {
  paste(struc$ques.num[[x]], names_col[[x]], sep = "_")
}))

names(binary) <- names_ans

######################################################

# attach location info (the respondent ID was dropped during the ZIP merge,
# so rows are matched by position)
binary$lat <- ling_data$lat
binary$long <- ling_data$long
binary$city <- ling_data$CITY
binary$state <- ling_data$STATE

## keep only the first three digits of each ZIP

binary$zip <- substr(as.character(ling_data$ZIP), 1, nchar(ling_data$ZIP) - 2)

## drop individuals who didn't answer all 67 questions

binary <- binary[rowSums(binary[, 1:468]) == 67, ]
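
## how much data does the completeness filter drop?
## (a minimal check; the exact proportion depends on the downloaded survey file)
nrow(binary) / N   # fraction of respondents who answered all 67 questions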

########################### BY ZIP
## group by 3-digit ZIP: drop lat, long, city, and state (columns 469:472),
## keeping zip as the grouping key

temp <- binary[, -(469:472)]

by_zip <- temp %>% group_by(zip) %>% summarise_all(sum)

## to add back columns for lat, long, state, etc., group ling_data by 3-digit ZIP
## and report the MODE of each required column within the ZIP:
## the most frequent state, city, latitude, and longitude

ling_data$newZIP <- substr(as.character(ling_data$ZIP), 1, nchar(ling_data$ZIP) - 2)

# most frequent value of a vector (ties go to the value seen first)
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
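
## toy examples of get_mode (values made up for illustration)
get_mode(c("Berkeley", "Oakland", "Berkeley"))  ## "Berkeley"
get_mode(c(3, 3, 7, 7))                         ## 3: ties go to the first value seen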

temp <- ling_data %>% group_by(newZIP) %>% summarise(state = get_mode(STATE), city = get_mode(CITY),
                                                     lat = get_mode(as.numeric(lat)), long = get_mode(as.numeric(long)))


##by_zip_ll has all the added columns

by_zip_ll <- merge(by_zip, temp[,c("lat","long", "newZIP","state","city")], 
                   by.x = 'zip', by.y = 'newZIP', all.x = T)

## add census-region info for each state using the built-in state datasets
by_zip_ll$state <- as.character(by_zip_ll$state)

data(state)
state_info <- data.frame(stringsAsFactors = FALSE,
                         state.abb, state.region)

by_zip_ll <- merge(by_zip_ll, state_info, by.x = "state", by.y = "state.abb")

## finally, remove Alaska and Hawaii to keep the maps to the contiguous US

by_zip_ll <- by_zip_ll %>% filter(!(state %in% c('AK', 'HI')))

just_zip <- by_zip_ll
## convert counts to relative frequencies for PCA, k-means, DBSCAN, etc.

# columns 3:470 hold the answer counts; 1, 2, and 471:474 are state/zip/location info
temp <- by_zip_ll[, -c(1, 2, 471:474)]

temp <- t(apply(temp, 1, function(i) i / sum(i)))  # transpose because apply returns columns

## sanity check
# rowSums(temp)

by_zip_ll[, -c(1, 2, 471:474)] <- temp
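
## an explicit assertion for the sanity check above
## (a minimal sketch, assuming the column indices used above)
stopifnot(all(abs(rowSums(by_zip_ll[, -c(1, 2, 471:474)]) - 1) < 1e-8))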

Introduction

The study of aggregate linguistic properties over spatial variation is called dialectometry, a sub-branch of dialectology, the study of dialects. Because language variation is complex both geographically and dynamically, computational techniques that can handle large amounts of granular data, and statistical techniques that can draw inferences from that data, are pivotal to the advancement of dialectometry.

In 2003, a dialect survey was conducted as part of an expansion of an initiative started by Professor Bert Vaux at Harvard University. The Dialect Survey uses a series of questions, including rhyming word pairs and vocabulary words, to explore words and sounds in the English language. The survey was conducted to obtain a contemporary view of American English dialectal variation. It started as an online survey, with a final tally of around 47,000 respondents. For this report, we’re interested in the lexical-variant questions rather than phonetic variation.

By analysing the responses to these questions, we investigate the geographical structure present in the data. In this report, we explore some dimension reduction methods and cluster observations into geographically meaningful groups using k-means and hierarchical bipartite spectral clustering.

Dataset

The survey dataset contains a set of 122 questions, each with around 47,000 responses. For our analyses and clustering, we group the data by the first three digits of each respondent’s ZIP code. A U.S. three-digit ZIP Code Area is identified by the first three digits of a ZIP Code. The first digit of a five-digit ZIP Code divides the United States into 10 large groups of states, numbered from 0 in the Northeast to 9 in the far West.

Within these areas, each state is divided into an average of 10 smaller geographical areas, identified by the second and third digits. These digits, together with the first, identify a sectional center facility or a mail processing facility area.

There are around 800 such areas. Each question has a varying number of possible responses, summarised in the ‘answers’ data. Each row represents an individual’s response, along with their city, state, and ZIP; because these fields were user-entered, they are extremely messy (especially the city). The main dataset, ‘ling_data’, contains this information. In the data cleaning section, I explain how we addressed these challenges.
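
As an illustration, a three-digit area is simply the leading prefix of a five-digit ZIP (the ZIP below is made up for the example):

substr("94709", 1, 3)
## [1] "947"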

Data Cleaning

  1. The first step was to fix ling_data. I used the ‘zipcode’ package, which lists all the unique ZIP codes of the United States along with the corresponding city and state. Before merging ling_data with this dataset, I had to add a leading ‘0’ to the four-digit ZIPs. After merging on the ZIP code, I was able to discard the messy user-entered ‘city’ and ‘state’ entries.

  2. After that, I subsetted the data to our questions of interest, i.e. the lexical questions. Then I converted the ~47,000 x 67 categorical response matrix to a ~47,000 x 468 binary matrix. To illustrate: question 65 has 6 options, so if person A picked option 4, their corresponding entry becomes (0,0,0,1,0,0); see the sketch after this list. I also renamed the columns after the answer options.

  3. Then I removed all respondents who hadn’t answered all the questions, that is, those whose rows in the binary matrix did not sum to 67. This avoids skewing the data.

  4. Next, I grouped by the three-digit ZIP column, summing the responses and taking the mode of the city, state, latitude, and longitude within each ZIP area. I removed Alaska and Hawaii from the dataset to make graphical representation easier.

  5. Finally, I kept two dataframes for analyses: the one described above, and one in which every count within a ZIP is scaled by the total number of responses in that ZIP. This normalises ZIPs with very many or very few respondents.
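
To make step 2 concrete, here is a toy version of the recoding, with a made-up choice; the cleaning pipeline above does this per question inside a loop:

# a respondent who picked option 4 of question 65's six options
choice <- 4
n_options <- 6
replace(rep(0, n_options), choice, 1)
## [1] 0 0 0 1 0 0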

Exploratory Data Analysis

I picked question 105 (‘What do you call a carbonated beverage?’) and question 65 (‘What do you call the insect that glows in the dark?’) because they involve words that people use in everyday speech, and the answer is usually an either-or choice.
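
As a first look, the modal answer to question 105 can be mapped by three-digit ZIP area. The sketch below is minimal: it assumes the indicator columns kept the names 105_1, 105_2, … constructed in the cleaning step, and the point size and colour scale are arbitrary choices.

# indicator columns for question 105
ans_cols <- dplyr::select(just_zip, starts_with("105_"))

# most common answer to question 105 within each three-digit ZIP area
q105 <- just_zip %>% mutate(top_answer = factor(apply(ans_cols, 1, which.max)))

ggplot(q105) +
  geom_polygon(data = state_df, aes(x = long, y = lat, group = group),
               fill = "grey95", colour = "grey70") +
  geom_point(aes(x = long, y = lat, colour = top_answer), size = 1) +
  my_map_theme +
  labs(colour = "Q105 answer")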