Abstract
In this project a new approach to modelling linguistic regions is presented. To study the spatial distribution of linguistic phenomena, concepts from geography are applied to a linguistic data set collected in the project ‘Syntactic Atlas of German-speaking Switzerland’ (SADS) between 2000 and 2002. Continuous spatial surfaces for the manifestations and intensity of the ‘infinitival complementizer’ phenomena in Swiss German are generated by combining techniques from machine learning, namely a Support Vector Machine (SVM) and a spatial regularization of the classification map with a Markov Random Field (MRF), with well-known geographic concepts (Tobler’s first law, the hiking function and least cost paths) to create linguistic regions that take the inherent spatial dependencies between the data points into account. Trudgill’s gravity index is used to weight the linguistic influence of two neighboring points on each other.
The project is supervised by Curdin Derungs, head of the GISLab, a research group of the ‘UFSP Language and Space’ initiative at the University of Zurich.
The following libraries are needed to execute the code:
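The exact set depends on the local setup; the list below is a sketch inferred from the packages cited in the references and mentioned in the text.

```r
# Sketch of the packages used (inferred from the references); the exact set may differ.
library(foreign)     # reading the SADS survey data exports
library(sp)          # spatial classes, spDists()
library(rgdal)       # reading / writing geodata
library(rgeos)       # buffer and convex hull operations
library(raster)      # raster handling, extract()
library(rasterVis)   # raster visualization
library(gdistance)   # transition layers, least cost paths, costDistance()
library(FNN)         # k nearest neighbour search, knn.index()
library(e1071)       # SVM training and grid search, svm() / tune.svm()
library(spatialEco)  # additional spatial utilities
library(GISTools)    # additional GIS utilities
library(plyr)        # split-apply-combine helpers
library(ggplot2)     # plotting
library(gridExtra)   # arranging multiple plots
```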
The database for this project is the Syntactic Atlas of German-speaking Switzerland (SADS). Between 2000 and 2002, about 3’200 participants answered four questionnaires at 383 different survey sites. The questions concerned syntactic phenomena of Swiss German. Compared to other linguistic surveys, SADS is special because it has multiple respondents per survey site and therefore covers the local linguistic diversity. Between 3 and 26 persons per survey site (median: 8) took part in the survey (Bucheli and Glaser 2002; Jeszenszky and Weibel 2016).
In this project, four phenomena of the SADS data set are used, which belong to a syntactic construction called the ‘infinitival complementizer’. An infinitival complementizer introduces a so-called purposive infinitival clause, i.e. a clause that expresses a purpose. An English example would be the difference between ‘in order to’ and ‘to’. The original data set has been aggregated to the level of the Swiss communities, which means each community is represented by a point (point data). Due to the strongly varying number of participants per survey site, a simple majority vote per survey site and phenomenon is not suitable. Therefore, random sampling per survey site is applied to obtain a statistically more stable data base for the study: each survey site is sampled 8 times (the median number of persons per survey site) with replacement, as sketched below. A majority vote over a phenomenon’s manifestations is then applied to the resampled data set. The majority vote values (class assignments) serve as input for the further steps.
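A minimal sketch of this resampling step, assuming a data frame answers with one row per informant and hypothetical columns site_id and variant:

```r
set.seed(1)  # make the random sampling reproducible

# answers: one row per informant; site_id and variant are hypothetical column names
majority_per_site <- function(answers, n_draws = 8) {
  sapply(split(answers$variant, answers$site_id), function(v) {
    draw <- sample(v, size = n_draws, replace = TRUE)  # resample 8 informants with replacement
    names(which.max(table(draw)))                      # majority vote on the resampled answers
  })
}
```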
The population data of the Swiss communities for the year 2013 is provided by the Federal Statistical Office (BFS). The data set (STATPOP 2013) represents the population of Switzerland aggregated to hectare raster cells, each of which is stored as a polygon. For easier access to the data, the polygons have been rasterized to a GeoTiff (100 m resolution) using the ‘To Raster’ conversion tools of ArcGIS.
The digital elevation model DHM200 of Swisstopo provides the terrain height information used in this study. For visualization purposes, the outline of Switzerland is taken from the SwissBOUNDARIES3D data set of Swisstopo. The outline was simplified to save processing time when plotting.
Easting | Northing | BFS | Name | Count | i1 | i6 | i11 | iv14 |
---|---|---|---|---|---|---|---|---|
786007.8 | 194653.8 | 3871_1 | Klosters-Serneus | 8 | 2 | 2 | 1 | 2 |
780616.8 | 204792.1 | 3893_1 | Sankt Antönien | 8 | 2 | 2 | 2 | 2 |
778152.1 | 198598.1 | 3882_1 | Küblis | 6 | 2 | 2 | 2 | 2 |
771204.8 | 204549.3 | 3962_1 | Schiers | 11 | 2 | 2 | 2 | 2 |
761946.6 | 196105.1 | 3945_1 | Trimmis | 7 | 2 | 2 | 2 | 2 |
Table 1 shows the structure of the processed data set. Each community is represented as a data point and occupies one row of the table. The phenomena are represented by the majority vote (dominant variant) of the random sampling in the particular community.
To select the best parameters for the training of the SVM with a radial basis function (RBF) kernel, a grid search is performed. The aim is to find the best combination of gamma and cost from a previously defined selection of candidate values for each parameter. This step is repeated for each of the four linguistic phenomena and its corresponding data record. The grid search is performed with the tune.svm() function from the e1071 package. The following table lists the best parameters for training an SVM on this data set, as found by the grid search (Karatzoglou, Meyer, and Hornik 2006).
Phenomenon | Gamma | Cost | Error | Dispersion |
---|---|---|---|---|
i1 | 1.5 | 10 | 0.120 | 0.047 |
i6 | 1.5 | 1 | 0.089 | 0.055 |
i11 | 2.0 | 1 | 0.104 | 0.037 |
iv14 | 2.0 | 1 | 0.068 | 0.031 |
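A sketch of such a grid search with tune.svm() for phenomenon i1; the candidate parameter values and the use of the community coordinates as predictors are assumptions:

```r
library(e1071)

# Grid search over gamma and cost for phenomenon i1 (the radial kernel is the svm() default);
# the candidate values and the coordinate predictors are assumptions.
tuned <- tune.svm(factor(i1) ~ Easting + Northing, data = dat,
                  gamma = c(0.5, 1, 1.5, 2), cost = c(1, 10, 100))
summary(tuned)           # error and dispersion for every parameter combination
tuned$best.parameters    # best gamma / cost combination
```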
With the parameters evaluated in the previous section, an SVM is trained for each phenomenon. The trained SVM uses an RBF (Gaussian) kernel and is therefore a non-parametric/nonlinear classifier.
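A minimal sketch for phenomenon i1, again assuming the coordinates as predictors and using the tuned parameters from the table above:

```r
library(e1071)

# RBF-kernel SVM for phenomenon i1; probability = TRUE provides the class
# probabilities needed later as a priori input for the MRF.
fit_i1 <- svm(factor(i1) ~ Easting + Northing, data = dat,
              kernel = "radial", gamma = 1.5, cost = 10, probability = TRUE)
```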
A regular grid of data points is spanned over the area of Switzerland. The grid extent is taken from a buffered (8000 m) convex hull drawn around all data points; a sketch of this step follows the table below.
Axis | Min | Max | Pixels | Resolution [m] |
---|---|---|---|---|
Easting | 566091 | 794007 | 152 | 1500 |
Northing | 88511 | 300150 | 141 | 1500 |
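A sketch of the grid construction, assuming the community points are stored in a SpatialPointsDataFrame called data_pts:

```r
library(sp)
library(rgeos)

# Buffered convex hull around all community points
hull     <- gConvexHull(data_pts)
buffered <- gBuffer(hull, width = 8000)

# Regular 1500 m grid over the extent of the buffered hull
grid_coords <- makegrid(buffered, cellsize = 1500)
names(grid_coords) <- c("Easting", "Northing")
grid_pts <- SpatialPoints(grid_coords, proj4string = CRS(proj4string(data_pts)))
```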
By applying the extract() function from the raster package to the grid data points and the data layers of the population (STATPOP13) and the elevation (DHM25_100m), the values at the grid point positions are extracted.
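A sketch of this step, assuming the rasterized layers are available as statpop13 and dhm:

```r
library(raster)

# statpop13 and dhm are the rasterized population and elevation layers (names assumed)
pop_grid    <- extract(statpop13, grid_pts)  # population at every grid point
height_grid <- extract(dhm, grid_pts)        # terrain height at every grid point
```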
The nearest neighbors of each grid point are determined by applying the spDists() function from the sp package to the grid; the knn.index() function from the FNN package then extracts the indices of the k nearest neighbors.
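A sketch of the neighbour search; the neighbourhood size k is an assumption:

```r
library(sp)
library(FNN)

xy    <- coordinates(grid_pts)   # Easting/Northing matrix of the grid points
d_mat <- spDists(xy)             # pairwise planar distances (potentially large matrix)

k      <- 8                      # neighbourhood size (assumption)
nn_idx <- knn.index(xy, k = k)   # n x k matrix with the indices of the k nearest neighbours
```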
Tobler (1993) defined a hiking function which represents the walking velocity \(W\) [\(\frac{m}{s}\)] as a function of the slope \(m\) of a path: \[W = \frac{6e^{-3.5|m + 0.05|}}{3.6}\] Because the units in this study are meters and seconds, the function is divided by 3.6 to convert from \(\frac{km}{h}\) to \(\frac{m}{s}\).
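As a small helper, the hiking function can be written directly in R:

```r
# Tobler's hiking function: walking speed in m/s as a function of the slope m (dh/dx)
tobler <- function(m) 6 * exp(-3.5 * abs(m + 0.05)) / 3.6

tobler(0)      # ~1.40 m/s on flat terrain
tobler(-0.05)  # ~1.67 m/s, the maximum, on a gentle downhill
```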
Using the gdistance package, least cost paths between all neighbors are estimated and the walking time along these paths is extracted with the costDistance() function. The least cost path estimation is based on a transition layer (conductance), which is generated from the height differences between the pixels of the DHM. In a next step, Tobler’s hiking function is applied to the slope in the 8 directions around each pixel. This produces a transition layer containing the walking velocity \(W\) in the 8 possible directions at every pixel of the DHM.
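A sketch of this step following the gdistance documentation; the elevation raster dem and the coordinate matrices from_xy and to_xy are assumptions:

```r
library(raster)
library(gdistance)

# Height difference between adjacent cells in 8 directions (asymmetric)
hd    <- transition(dem, transitionFunction = function(x) x[2] - x[1],
                    directions = 8, symm = FALSE)
slope <- geoCorrection(hd)                  # divide by cell distance -> slope

# Apply Tobler's hiking function to every adjacency
adj   <- adjacent(dem, cells = 1:ncell(dem), directions = 8)
speed <- slope
speed[adj] <- 6 * exp(-3.5 * abs(slope[adj] + 0.05)) / 3.6   # walking speed in m/s

conductance <- geoCorrection(speed)         # divide speed by distance -> conductance

# Walking time in seconds along the least cost paths between point pairs
t_walk <- costDistance(conductance, from_xy, to_xy)
```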
Because searching least cost paths on such a huge (global) transition layer is computationally expensive, multiple local (smaller) transition layers are generated with a moving window approach of variable size. For every grid point and its \(k\) neighbors, a local transition layer is generated and the walking time along the least cost paths to the neighbors is computed locally.
After the walking time extraction, one large data frame containing the edges (ind1, ind2) is created and completed with the population values (p1, p2) of the start and end point of each edge. For completeness, the heights (h1, h2) and the Euclidean distance (dx) between the points are added, and the mean slope (m) of each edge is calculated. Note that every edge appears twice in the data frame, representing the two possible directions of the edge.
To get a meaningful index of the linguistic influence of two neighbors on each other, a concept from Trudgill (1974) is applied: \[I_{ij}=\frac{P_i P_j}{d_{ij}^2}\times\frac{P_i}{P_i+P_j}\] \(P_i\) stands for the population size of data point \(i\) (community) and \(d_{ij}\) is the walking time between the two data points \(i\) and \(j\).
ind1 | ind2 | p1 | p2 | h1 | h2 | dx | m | t | I |
---|---|---|---|---|---|---|---|---|---|
13425 | 13426 | 306 | 1 | 251.199 | 257.025 | 1509.380 | -0.004 | 1086.908 | 0.000 |
13425 | 13424 | 306 | 158 | 251.199 | 266.575 | 1509.380 | -0.010 | 1113.375 | 0.026 |
13425 | 13325 | 306 | 8 | 251.199 | 257.800 | 1500.987 | -0.004 | 1100.758 | 0.002 |
13425 | 13523 | 306 | 32 | 251.199 | 247.144 | 1500.987 | 0.003 | 1062.060 | 0.008 |
13425 | 13324 | 306 | 93 | 251.199 | 276.977 | 2128.660 | -0.012 | 1584.168 | 0.009 |
13425 | 13524 | 306 | 1 | 251.199 | 254.737 | 2128.660 | -0.002 | 1525.586 | 0.000 |
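As a cross-check, the index column I of the table above can be recomputed directly from the population and walking time columns (assuming the edge data frame is called edges):

```r
# Trudgill's gravity index for each edge, using the walking time t as the distance d_ij
edges$I <- (edges$p1 * edges$p2 / edges$t^2) * (edges$p1 / (edges$p1 + edges$p2))
```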
The Markov Random Field (MRF) is a statistical method based on a Bayesian approach, which considers spatial dependencies in the decision rule instead of integrating them in the classification stage. The optimization is done by Iterated Conditional Modes (ICM), which is defined over the minimum energy per data point: \[y^* = \underset{y\in C}{\operatorname{argmin}} \left[-\sum\limits_{i=1}^n\log(p(y_i|x_i)) + \beta\sum\limits_{i=1}^n\sum_{j\in N_i}V(y_i,y_j)\right]\] In the MRF the probabilities of the pixel-wise SVM classification \(p(y_i|x_i)\) are used as the starting point (a priori) for a spatial regularization of the classification map. The spatial dependencies are represented by the term \(V(y_i,y_j)\). Because it is not possible to include all surrounding data points in the decision rule, the influence is limited to local neighborhoods \(N\) (also called cliques); in our case \(N\) is given by the nearest grid neighbors. The minimum of the global energy is approximated by minimizing the local energies in the MRF. A new classification map is obtained by taking the most probable class affiliation \(y^*\) (of all possible classes \(y\in C\)) based on the posterior probabilities of the spatial regularization (Besag 1986).
The MRF is constructed using the following components as a starting point: (I) probabilities, extracted from a prediction made by the SVM trained in the previous part; (II) weighted edges, given by the asymmetric adjacency matrix created from the nearest grid neighbors and the index of linguistic influence \(I_{ij}\); and (III) the class affiliation of the data points given by the SVM prediction.
To minimize the global energy of the MRF, an energy function for the local energies of the data points has to be defined. In this project the definition of the energy function follows the paper of Tarabalka et al. (2010). The local energy of a data point is defined as: \[U(x_{i})=U_{linguistic}(x_{i}) + U_{spatial}(x_{i})\] The linguistic energy term only uses the SVM probabilities and is computed as: \[U_{linguistic}(x_{i}) = -\ln\{P(x_i|L_i)\}\] The spatial energy term is calculated using the following equation: \[U_{spatial}(x_{i}) = \beta \frac{\sum_{x_j\in N_i} I_{ij}(1-\delta(L_i, L_j))}{\sum_{x_j\in N_i} I_{ij}}\] where \(\delta(.,.)\) is the Kronecker delta function (\(\delta(a, b) = 1\) if \(a = b\), and \(\delta(a, b) = 0\) otherwise) and \(\beta\) is a parameter that controls the importance of the spatial versus the linguistic energy term. \(I_{ij}\) is the index of the linguistic influence between two neighbors, defined in a previous section.
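A sketch of this local energy in R; the object names and shapes (SVM probability matrix prob, neighbour index matrix nn_idx, weight matrix I_w aligned with nn_idx, and the current labels) are assumptions:

```r
# Local energy of grid point i for a candidate class L_i (sketch).
# prob:   n x C matrix of SVM class probabilities (classes coded as column indices)
# labels: current class assignment of all grid points
# nn_idx: n x k matrix of neighbour indices; I_w: n x k matrix of the weights I_ij
local_energy <- function(i, L_i, prob, labels, nn_idx, I_w, beta = 1) {
  U_linguistic <- -log(prob[i, L_i])
  nb  <- nn_idx[i, ]                        # neighbourhood N_i
  dis <- as.numeric(labels[nb] != L_i)      # 1 - Kronecker delta
  U_spatial <- beta * sum(I_w[i, ] * dis) / sum(I_w[i, ])
  U_linguistic + U_spatial
}
```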
The global energy of the MRF is defined as the sum of all local energies: \[E = \sum\limits_{i=1}^NU(x_{i})\] The global energy is minimized through the iterative minimization of the local energies, which is performed by an adaptation of the Metropolis algorithm (Metropolis et al. 1953) based on stochastic relaxation and annealing as described by S. Geman and Geman (1984). In each iteration a randomly chosen data point is assigned to a random class. Based on the observed change \(\Delta U = U^{new}(x_{i})-U^{old}(x_{i})\) of the local energy, the random class assignment is either kept or discarded. If \(\Delta U\) is negative, the new class assignment is accepted. To avoid converging to a local minimum, some class changes with a positive \(\Delta U\) are also allowed, but these are only accepted with probability \(p = \exp(\frac{-\Delta U}{T})\) (stochastic relaxation), where \(T\) (temperature) is a control parameter that decreases with every iteration (annealing) and thus allows fewer class changes with positive \(\Delta U\) values (Tarabalka et al. 2010).
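A sketch of one Metropolis update and the annealing loop, building on local_energy() above; classes are assumed to be coded as column indices of prob, and the starting temperature, cooling rate and iteration count are illustrative values:

```r
# One Metropolis update: propose a random class for a random grid point and accept it
# depending on the energy change and the current temperature T (stochastic relaxation).
metropolis_step <- function(labels, classes, T, ...) {
  i       <- sample(length(labels), 1)   # random data point
  new_lab <- sample(classes, 1)          # random candidate class
  dU <- local_energy(i, new_lab,   labels = labels, ...) -
        local_energy(i, labels[i], labels = labels, ...)
  if (dU < 0 || runif(1) < exp(-dU / T)) labels[i] <- new_lab
  labels
}

# Annealing: the temperature shrinks every iteration, allowing fewer uphill moves over time.
T <- 4
for (iter in 1:50000) {
  labels <- metropolis_step(labels, classes = 1:ncol(prob), T,
                            prob = prob, nn_idx = nn_idx, I_w = I_w, beta = 1)
  T <- 0.999 * T
}
```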
To be able to compare the results of the spatial regularization approach with existing approaches for modeling linguistic regions, two visualization techniques from the paper of Sibler et al. (2012) are reproduced. Both methods aggregate the results to Thiessen polygons around the data points.
The simplest way to visualize a surface of linguistic regions is to conduct a majority vote at each sampling location and then color the area of the corresponding Thiessen polygon with the most frequently occurring class.
Another approach to creating continuous surfaces of linguistic regions is to calculate intensity values with a kernel density estimation at the sampling locations for every possible variant of a phenomenon and then to determine, at every sampling point, the variant with the maximum intensity. Similarly to the majority vote, this determines the class in which the corresponding Thiessen polygon is colored. The complete methodology can be found in the paper of Rumpf et al. (2009). The bandwidth for the kernel density estimation is determined by the normal reference distribution rule, a well-supported rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator. Since the mean of the calculated bandwidths is close to 40’000 m, the bandwidth is set to a constant 40’000 m.
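A minimal sketch of the intensity idea with an (unnormalized) Gaussian kernel rather than the exact estimator of Rumpf et al. (2009); the objects xy (community coordinate matrix) and variant (dominant variant per community) are assumptions:

```r
library(sp)

# Intensity of each variant at every sampling location with a Gaussian kernel (simplified)
h    <- 40000                              # bandwidth in metres
d    <- spDists(xy)                        # pairwise distances between communities
kern <- exp(-d^2 / (2 * h^2))              # Gaussian kernel weights

intensity <- sapply(levels(factor(variant)), function(v)
  rowSums(kern[, variant == v, drop = FALSE]))       # intensity surface of each variant
dominant  <- colnames(intensity)[max.col(intensity)] # variant with maximum intensity
```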
Auguie, Baptiste. 2016. GridExtra: Miscellaneous Functions for “Grid” Graphics. https://CRAN.R-project.org/package=gridExtra.
Besag, Julian. 1986. “On the Statistical Analysis of Dirty Pictures.” Journal of the Royal Statistical Society 48 (3): 259–302.
Beygelzimer, Alina, Sham Kakadet, John Langford, Sunil Arya, David Mount, and Shengqiao Li. 2013. FNN: Fast Nearest Neighbor Search Algorithms and Applications. https://cran.r-project.org/package=FNN.
Bivand, Roger S, Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. Applied spatial data analysis with R, Second edition. Springer, NY. http://www.asdar-book.org/.
Bivand, Roger, and Colin Rundel. 2016. rgeos: Interface to Geometry Engine - Open Source (GEOS). https://cran.r-project.org/package=rgeos.
Bivand, Roger, Tim Keitt, and Barry Rowlingson. 2016. rgdal: Bindings for the Geospatial Data Abstraction Library. https://cran.r-project.org/package=rgdal.
Brunsdon, Chris, and Hongyan Chen. 2014. GISTools: Some Further Gis Capabilities for R. https://CRAN.R-project.org/package=GISTools.
Bucheli, Claudia, and Elvira Glaser. 2002. “The Syntactic Atlas of Swiss German Dialects: Empirical and Methodological Problems.” In Syntactic Microvariation, edited by S Barbiers, L Cornips, and S van der Kleij, 2nd ed., 41–73. Amsterdam: Meertens Institute Electronic Publications in Linguistics.
Etten, Jacob van. 2017. “R Package gdistance: Distances and Routes on Geographical Grids.” Journal of Statistical Software 76 (13): 1–21. doi:10.18637/jss.v076.i13.
Evans, Jeffrey S. 2017. SpatialEco. https://CRAN.R-project.org/package=spatialEco.
Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (6): 721–41. doi:10.1109/TPAMI.1984.4767596.
Jeszenszky, Péter, and Robert Weibel. 2016. “Modeling transitions between syntactic variants in the dialect continuum.” In The 19th AGILE International Conference on Geographic Information Science, Helsinki (Finland), 14–17 June 2016.
Karatzoglou, Alexandros, David Meyer, and Kurt Hornik. 2006. “Support Vector Machines in R.” Journal of Statistical Software 15 (9): 1–28.
Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of state calculations by fast computing machines.” Journal of Chemical Physics 21 (6): 1087–92. doi:10.1063/1.1699114.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2015. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://cran.r-project.org/package=e1071.
Pebesma, Edzer J, and Roger S Bivand. 2005. “Classes and methods for spatial data in R.” R News 5 (2): 9–13. https://cran.r-project.org/doc/Rnews/.
Perpinan, Oscar, and Robert Hijmans. 2016. rasterVis. http://oscarperpinan.github.io/rastervis/.
R Core Team. 2016. foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, … https://CRAN.R-project.org/package=foreign.
Rumpf, Jonas, Simon Pickl, Stephan Elspaß, Werner König, and Volker Schmidt. 2009. “Structural analysis of dialect maps using methods from spatial statistics.” Zeitschrift für Dialektologie und Linguistik 76 (3): 280.
Sibler, Pius, Robert Weibel, Elvira Glaser, and Gabriela Bart. 2012. “Cartographic Visualization in Support of Dialectology.” Proceedings - AutoCarto 2012 - Columbus (OH, USA), no. 9: 18.
Tarabalka, Yuliya, Mathieu Fauvel, Jocelyn Chanussot, and Jon Atli Benediktsson. 2010. “SVM- and MRF-Based Method for Accurate Classification of Hyperspectral Images.” IEEE Geoscience and Remote Sensing Letters 7 (4): 736–40.
Tobler, Waldo. 1993. “Non-Isotropic Geographic Modeling.” NCGIA Technical Reports 93 (1): 5.
Trudgill, Peter. 1974. “Linguistic change and diffusion: description and explanation in sociolinguistic dialect geography.” Language in Society 2: 215–46. doi:10.1017/S0047404500004358.
Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
———. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software 40 (1): 1–29. http://www.jstatsoft.org/v40/i01/.