Modeling regions of linguistic phenomena, including the inherent spatial dependencies

UFSP SpuR: SynMod+

Merlin Unterfinger, Geography UZH

25 8 2017

Abstract

In this project a new approach to model linguistic regions is presented. To study the spatial distribution of linguistic phenomena concepts from Geography are applied on a linguistic data set, which was collected in the project ‘Syntactic Atlas of German-speaking Switzerland’ (SADS) during the years 2000 till 2002. Continuous spatial surfaces for the manifestations and intensity of the ‘infinitival complementizer’ linguistic phenomena in the Swiss-German language are generated by combining techniques from machine learning, like the Support Vector Machine (SVM) and a spatial regularization of the classification map (MRF) with other well-known geographic concepts (Tobler’s first law, hiking function and least cost paths) to create linguistic regions, which take the inherent spatial dependencies between the data points into account. Trudgill’s gravity index finds application to weight the linguistic influence of two neighboring points on each other.

The project is supervised by Curdin Derungs, head of the GISLab, a research group of the ‘UFSP Language and Space’ initiative at the University of Zurich.

1 Libraries

The following libraries are needed, to execute the code:

e1071 – Support vector Machine – (Meyer et al. 2015)
FNN – Fast nearest neighbor search – (Beygelzimer et al. 2013)
foreign – Open foreign data formats (SPSS) – (R Core Team 2016)
gdistance – Least cost path, walking time – (Etten 2017)
ggplot2 – Plots – (Wickham 2009)
GISTools – Plots – (Brunsdon and Chen 2014)
gridExtra – Plots – (Auguie 2016)
rasterVis – Plots of raster – (Perpinan and Hijmans 2016)
rgdal – Open Shapefiles – (R. Bivand, Keitt, and Rowlingson 2016)
rgeos – Geoprocessing – (R. Bivand and Rundel 2016)
sp – Spatial datatypes – (E. J. Pebesma and Bivand 2005; Bivand, Pebesma, and Gomez-Rubio 2013)
spatialEco – Plots – (Evans 2017)
plyr – Plots – (Wickham 2011)

2 Introduction

3 Methods

3.1 Data preprocessing

3.1.1 Linguistic data

The database for this project is the Syntactic Atlas of German-speaking Switzerland (SADS). In the years between 2000 and 2002 about 3’200 participants answered four questionnaires at 383 different survey sites. The questions asked, were about syntactic phenomena of the Swiss German language. Compared to other linguistic surveys, SADS is special, because it has multiple participants respondents per study site and therefore covers the local linguistic diversity. Per study site from 3 to 26 persons (median: 8) were involved in the survey (Bucheli and Glaser 2002; Jeszenszky and Weibel 2016).
In this project, four phenomena of the SADS data set are used, which belong to a syntactical construct that is called ‘infinitival complementizer’. An infinitival complementizer is a so-called purposive infinitival clause, since the clause expresses a purpose. An example from the English language would be the difference between ‘in order’ to or ‘to’. The original data set has been aggregated to the level of the Swiss communities. This means each community is represented by a point (point data). Due to the strongly varying number of participants per survey site, a majority voting per survey site and phenomenon is not suitable. Therefore, a random sampling on survey site is applied to achieve a statistically more stable data base for the study. Each survey site was sampled 8 (median of persons per survey site) times with replacement. Then a majority vote of a phenomenon’s manifestations was applied on the random sampled data set. The majority vote values (class assignments) serve as input for the further steps.

3.1.2 Population

The population data of the Swiss communities in the year 2013 is provided by the Federal Office for statistics (BfS). The data set (STATPOP 2013) represents the population of Switzerland aggregated to hectare raster cells. Each raster cell is stored as a polygon. For easier access on the data, the polygons have been rasterized to a GeoTiff (with 100m resolution) using the to raster module of ArcGIS.

3.1.3 Geodata

The digital elevation model DHM200 of Swisstopo delivers the terrain height information used in this study. For visualization purposes the outline of Switzerland is used from the SwissBOUNDARIES3D data set from Swisstopo. The outline of Switzerland was simplified to save processing time, when plotting.

3.1.4 Overview

Overview plot of the communities for phenomenon i1. The point size represents the number of participants per survey site.

Overview plot of the communities for phenomenon iv6. The point size represents the number of participants per survey site.

Overview plot of the communities for phenomenon i11. The point size represents the number of participants per survey site.

Overview plot of the communities for phenomenon iv14. The point size represents the number of participants per survey site.

Structure of the processed data set (random sampling & majority vote).
Easting	Northing	BFS	Name	Count	i1	i6	i11	iv14
786007.8	194653.8	3871_1	Klosters-Serneus	8	2	2	1	2
780616.8	204792.1	3893_1	Sankt Antönien	8	2	2	2	2
778152.1	198598.1	3882_1	Küblis	6	2	2	2	2
771204.8	204549.3	3962_1	Schiers	11	2	2	2	2
761946.6	196105.1	3945_1	Trimmis	7	2	2	2	2

Table 1 shows the structure of the processed data set. Each community is represented as a data point, which has a row in the table. The phenomena are represented by the majority vote (dominant variant) of the random sampling in a particular community.

3.2 Support vector machine

3.2.1 Parameter selection

To select the best parameters for the training of the SVM, with a radial basis function (RBF) kernel, a Grid Search is performed. The aim is to find the best combination of gamma and cost based on a previously given selection of possible values for each parameter. This step is repeatedly for the four linguistic phenomena and its corresponding data record. The grid search is performed by applying the tune.svm() function from the e1071 package. In the following table, the best parameters for the training of a SVM on this data set, found by a grid search, are listed (Karatzoglou, Meyer, and Hornik 2006).

The best parameters from grid search.
	Gamma	Cost	Error	Dispersion
i1	1.5	10	0.120	0.047
i6	1.5	1	0.089	0.055
i11	2.0	1	0.104	0.037
iv14	2.0	1	0.068	0.031

3.2.2 Training of the SVM

With the evaluated parameters from the previous section, a SVM is trained for each phenomenon. The trained SVM is of type ‘RBF’, which means it uses a Gaussian kernel and therefore is a non parametric/nonlinear algorithm.

3.3 Spatial context

3.3.1 Setup grid

A regular grid of data points is expanded over the area of Switzerland. The grid extent is taken from a buffered (8000m) convex hull drawn around all data points.

The extent and resolution of the grid.
	Min	Max	Pixel	Resolution
Easting	566091	794007	152	1500
Northing	88511	300150	141	1500

By applying the extract() function from the raster package on the grid data points and the data layers of the population STATPOP13 and the height DHM25_100m the values at the grid data points position are extracted.

3.3.2 Nearest neighbors

The nearest neighbors of each grid point are extracted by applying the spDists() function from the sp package on the grid. Then the knn.index() function from the sp package extracts the indices of the k nearest neighbors.

3.3.3 Hiking function

Tobler (1993) defined a hiking function, which is representing the walking velocity \(W\) [\(\frac ms\)] dependent on the slope of a path: \[W = \frac{6e^{-3.5|m + 0,05|}}{3.6}\] Because the units in this study are meters and seconds, the function has to be divided by 3.6 to get from \(\frac {km}h\) to \(\frac ms\)

Toblers hiking function.

3.3.4 Walking time

By using the the gdistance package, least cost paths between all neighbors are estimated and the time for walking these paths is extracted with the costDistance() function. The least cost path estimation is based on a transition layer (conductance), which is generated with the height differences of the pixels in the DHM. In a next step Toblers hiking function is applied on the Slope in 8 directions of each pixel. This produces a transition layer with the walking velocity \(W\) in the 8 possible directions at every pixel in the DHM.

Least cost paths to the k neighbors of an example point.

Because searching least cost paths on such a huge (global) transition layer is computationally expensive, multiple local (smaller) transition layers are generated with a moving window approach that has a variable size. By doing so for every grid point and its \(k\) neighbors, a local transition layer is generated and the walking time for the least costs paths to the neighbors is computed locally.

Histogram of the extracted walking times for the grid neighbors.

After the walking time extraction, one big data frame, containing the edges (ind1, ind2) is created and completed with population values (p1, p2) for the start and the end point of each edge. Just for completeness the heights (h1, h2) and the euclidean distance (dx) between the points is added. Then the mean slope (m) for each edge is calculated. It is important to state that the edges appear twice in the data frame, this is to represent the two possible directions of the edges.

3.3.5 Gravity index

To get a meaningful index of the linguistic influence of two neighbors on each other, a concept from Trudgill (1974) is applied: \[I_{ij}=\frac{P_i P_j}{d_{ij}^2}\times\frac{P_i}{P_i+P_j}\] \(P_i\) stands for the population size of the data point \(i\) (community) and \(d_{ij}\) is representing the walking time between the two data points \(i\) and \(j\).

Some example edges of neighbors in the grid, showing the attribute values.
ind1	ind2	p1	p2	h1	h2	dx	m	t	I
13425	13426	306	1	251.199	257.025	1509.380	-0.004	1086.908	0.000
13425	13424	306	158	251.199	266.575	1509.380	-0.010	1113.375	0.026
13425	13325	306	8	251.199	257.800	1500.987	-0.004	1100.758	0.002
13425	13523	306	32	251.199	247.144	1500.987	0.003	1062.060	0.008
13425	13324	306	93	251.199	276.977	2128.660	-0.012	1584.168	0.009
13425	13524	306	1	251.199	254.737	2128.660	-0.002	1525.586	0.000

3.3.6 Weighted asymmetric adjacency matrix

The edge list and the gravity index, representing linguistic influence of two neighbors on each other, is now filled into an weighted asymmetric adjacency matrix. This is done by creating an empty \(m \times n\) matrix and then using ind1 as \(m\) and ind2 as \(n\) to fill in the \(I\) (gravity index) values.

Weighted (linguistic influence) asymmetric adjacency matrix of the grid neighbours.

3.4 Spatial regularization of the classification map

3.4.1 Markov random field

The Markov Random Field (MRF) is a statistical method based on a Bayesian approach, which considers spatial dependencies in the decision rule, instead of integrating them in the classification stage. The optimization is done by Iterated Conditional Modes (ICM), which is defined over the minimum energy per data point: \[y^* = \underset{y\in C}{\operatorname{argmin}} - [\sum\limits_{i=1}^n\log(p(y_i|x_i)) + \beta\sum\limits_{i=1}^n\sum_{j\in N_i}V(y_i,y_j)]\] In the MRF the probabilities of the pixel-wise SVM classification \(p(y_i|x_i)\) are used as starting point (a priori) for a spatial regularization of the classification map. The spatial dependencies are represented by the term \(V(y_i,y_j)\). Because it is not possible to include all the surrounding data points into the decision rule, the influence is limited to local neighborhoods \(N\) (also called cliques). In our case the \(N\) is given by the nearest grid neighbors. The minima of the global energy is approached by minimizing the local energies in the MRF. A new classification map is obtained by taking the most probable class affiliation \(y^*\) (of all possible classes \(y\in C\)) based on the posterior probabilities of the spatial regularization. (Besag 1986).

The MRF is constructed using the following components as starting point: (I) probabilities - These are extracted from a prediction done by the SVM trained in the previous part; (II) weighted edges - These are given by the asymmetric adjacency matrix created of the nearest grid neighbors and the index of linguistic influence \(I_{ij}\) and (III) the class affiliation of the data points given through the SVM prediction.

3.4.2 Energy function

To minimize the global energy of the MRF, there has to be defined an energy function for the local energies of the data points. In this project the energy function definition leans on a paper of Tarabalka et al. (2010). So the the local energy of a data point is defined as: \[U(x_{i})=U_{linguistic}(x_{i}) + U_{spatial}(x_{i})\] The linguistic energy term only uses the SVM probabilities and is computed as: \[U_{linguistic}(x_{i}) = -ln\{P(x_i|L_i) \}\] And the spatial energy term is calculated using the following equation: \[U_{spatial}(x_{i}) = \beta \frac{\sum_{x_j\in N_i} I_{ij}(1-\delta(L_i, L_j))}{\sum_{x_j\in N_i} I_{ij}}\] where \(\delta(.,.)\) is the Kronecker delta function (\(\delta(a, b) = 1\) if \(a =b\), and \(\delta(a, b) = 0\) otherwise) and \(\beta\) is a parameter that controls the importance of the spatial versus spectral energy terms. \(I_{ij}\) is the index of the linguistic influence of between two neighbors, defined in the a previous section.

3.4.3 Minimizing the global energy of the MRF

The global energy of the MRF is defined as the sum of all local energies: \[E = \sum\limits_{i=1}^NU(x_{i})\] The global energy is minimized through the iterative minimization of the local energies. The iterative minimization of the local energies is performed by an adoption of the Metropolis algorithm from Metropolis et al. (1953), based on stochastic relaxation and annealing described by S. Geman and Geman (1984). Per iteration a randomly chosen data point is assigned to a random class. Based on the observed change \(\Delta U = U^{new}(x_{i})-U^{old}(x_{i})\) of the local energy, the random class assignment is either kept or discarded. If \(\Delta U\) is negative, the new class assignment is accepted. To avoid the problem of converging to a local minima, also some class changes with a positive \(\Delta U\) are allowed. But these are only accepted with the probability \(p = exp(\frac{-\Delta U}{T})\) (stochastic relaxation), where \(T\) (temperature) is a control parameter, which is getting smaller with every iteration (annealing) and thus allows fewer class changes with positive \(\Delta U\) values (Tarabalka et al. 2010).

3.5 Evaluation maps

To be capable of comparing the results of the spatial regularization approach with existing linguistic region modeling approaches, two visualization techniques from a paper of Sibler et al. (2012) are reproduced. Both methods are on Thiessen polygon aggregation level around the data points.

3.5.1 Majority vote

The simplest way to visualize a surface of linguistic regions, is to conduct a majority vote on each sampling location and then color the area of the corresponding Thiessen polygon with the mostly occurring class.

3.5.2 Kernel density estimation

Another approach to create continuous surfaces of linguistic regions, is to calculate intensity values with a kernel density estimation at the sampling location for every possible variant of a phenomena and then search for the maximum intensity of the different variants at every sampling point. This, similarly to the majority vote, determines the class in which the corresponding Thiessen polygon gets colored. The complete methodology can be found in the paper of Rumpf et al. (2009). The bandwidth for the kernel density estimation is determined by Normal Reference Distribution, a well-supported rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. The mean of the different calculated bandwidths is close to 40’000m and therefore the bandwidth is set constantly to 40’000m.

4 Results

4.1 Energies after spatial regularization

Development of the global energies (in logarithmic scale) during the iterative minimization of the local energies. Spatial regularizaion with a) the inverse walking time and b) the gravity index describing the linguistic influence of two neighbors on each other.

Development of the temperatures during the iterative minimization of the local energies (annealing).

The mean energies of the four phenomena after the spatial regularization with a) the inverse walking time and b) the gravity index describing the linguistic influence of two neighbors on each other.

4.2 Inverse walking time and gravity index

4.2.1 Phenomenon i1

Classification result of the spatial regularization with a) the inverse walking time and b) the gravity index describing the linguistic influence of two neighbors on each other.

4.2.2 Phenomenon i6

Classification result of the spatial regularization with a) the inverse walking time and b) the gravity index describing the linguistic influence of two neighbors on each other.

4.2.3 Phenomenon i11

Classification result of the spatial regularization with a) the inverse walking time and b) the gravity index describing the linguistic influence of two neighbors on each other.

4.2.4 Phenomenon iv14

Classification result of the spatial regularization with a) the inverse walking time and b) the gravity index describing the linguistic influence of two neighbors on each other.

4.3 Method comparison maps

4.3.1 Phenomenon i1

Classification result of a) a majority voting on Thiessen polygon aggregation level, b) a kernel density estimation on Thiessen polygon aggregation level, c) a spatial regularization with the inverse walking time and d) a spatial regularization with the gravity index describing the linguistic influence of two neighbors on each other.

4.3.2 Phenomenon i6

4.3.3 Phenomenon i11

4.3.4 Phenomenon iv14

4.4 Difference maps

4.4.1 Phenomen i1

Difference images between KDE map and spatial regularization of a) a spatial regularization with the inverse walking time and d) a spatial regularization with the gravity index.

4.4.2 Phenomen i6

Difference images between KDE map and spatial regularization of a) a spatial regularization with the inverse walking time and d) a spatial regularization with the gravity index.

4.4.3 Phenomen i11

Difference images between KDE map and spatial regularization of a) a spatial regularization with the inverse walking time and d) a spatial regularization with the gravity index.

4.4.4 Phenomen iv14

Difference images between KDE map and spatial regularization of a) a spatial regularization with the inverse walking time and d) a spatial regularization with the gravity index.

5 Conclusion

References

Auguie, Baptiste. 2016. GridExtra: Miscellaneous Functions for “Grid” Graphics. https://CRAN.R-project.org/package=gridExtra.

Besag, Julian. 1986. “On the Statistical Analysis of Dirty Pictures.” Journal of the Royal Statistical Society 48 (3): 259–302.

Beygelzimer, Alina, Sham Kakadet, John Langford, Sunil Arya, David Mount, and Shengqiao Li. 2013. FNN: Fast Nearest Neighbor Search Algorithms and Applications. https://cran.r-project.org/package=FNN.

Bivand, Roger S, Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. Applied spatial data analysis with {R}, Second edition. Springer, NY. http://www.asdar-book.org/.

Bivand, Roger, and Colin Rundel. 2016. rgeos: Interface to Geometry Engine - Open Source (GEOS). https://cran.r-project.org/package=rgeos.

Bivand, Roger, Tim Keitt, and Barry Rowlingson. 2016. rgdal: Bindings for the Geospatial Data Abstraction Library. https://cran.r-project.org/package=rgdal.

Brunsdon, Chris, and Hongyan Chen. 2014. GISTools: Some Further Gis Capabilities for R. https://CRAN.R-project.org/package=GISTools.

Bucheli, Claudia, and Elvira Glaser. 2002. “The Syntactic Atlas of Swiss German Dialects: Empirical and Methodological Problems.” In Syntactic Microvariation, edited by S Barbiers, L Cornips, and S van der Kleij, 2nd ed., 41–73. Amsterdam: Meertens Institute Electronic Publications in Linguistics.

Etten, Jacob van. 2017. “{R} Package {gdistance}: Distances and Routes on Geographical Grids.” Journal of Statistical Software 76 (13): 1–21. doi:10.18637/jss.v076.i13.

Evans, Jeffrey S. 2017. SpatialEco. https://CRAN.R-project.org/package=spatialEco.

Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6): 721–41. doi:10.1109/TPAMI.1984.4767596.

Jeszenszky, Péter, and Robert Weibel. 2016. “Modeling transitions between syntactic variants in the dialect continuum.” In The 19th Agile International Conference on Geographic Information Science, Helsinki (Finnland), 14 June 2016 - 17 June 2016. June.

Karatzoglou, Alexandros, David Meyer, and Kurt Hornik. 2006. “Support Vector Algorithm in R.” Journal of Statistical Software 15 (9): 1–28.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of state calculations by fast computing machines.” Journal Chemical Physics 21 (6): 1087–92. doi:http://dx.doi.org/10.1063/1.1699114.

Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2015. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://cran.r-project.org/package=e1071.

Pebesma, Edzer J, and Roger S Bivand. 2005. “Classes and methods for spatial data in {R}.” R News 5 (2): 9–13. https://cran.r-project.org/doc/Rnews/.

Perpinan, Oscar, and Robert Hijmans. 2016. rasterVis. http://oscarperpinan.github.io/rastervis/.

R Core Team. 2016. Foreign: Read Data Stored by Minitab, S, Sas, Spss, Stata, Systat, Weka, DBase, . https://CRAN.R-project.org/package=foreign.

Rumpf, Jonas, Simon Pickl, Stephan Elspa, Werner König, and Volker Schmidt. 2009. “Structural analysis of dialect maps using methods from spatial statistics.” Zeitschrift Fuer Dialektologie Und Linguistik 76 (3): 280.

Sibler, Pius, Robert Weibel, Elvira Glaser, and Gabriela Bart. 2012. “Cartographic Visualization in Support of Dialectology.” Proceedings - AutoCarto 2012 - Columbus (OH, USA), no. 9: 18.

Tarabalka, Yuliya, Mathieu Fauvel, Jocelyn Chanussot, and Jon Atli Benediktsson. 2010. “SVM and MRF-Based Method for Accurate Classification of Hyperspectral Images.” IEEE Geoscience and Remote Sensing Letters 7 (4). IEEE - Institute of Electrical; Electronics Engineers: 736–40.

Tobler, Waldo. 1993. “Non-Isotropic Geographic Modeling.” NCGIA Technical Reports 93 (1): 5.

Trudgill, Peter. 1974. “Linguistic change and diffusion: description and explanation in sociolinguistic dialect geography.” Language in Society 2. UZH Hauptbibliothek / Zentralbibliothek Zürich: 215–46. doi:10.1017/S0047404500004358.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.