class: left, bottom, title-slide

.title[
# Efficient and reliable geocoding of Twitter data
]
.subtitle[
## for spatial linkage with official statistics
]
.author[
### Long Nguyen, Dorian Tsolak, Anna Karmann, Stefan Knauff, Simon Kühne
]
.date[
### AAPOR 2022 · Chicago, IL
]

---
class: middle

## Agenda

- Potentials of Twitter data for regional analyses of public opinion<br><br>
- Data<br><br>
- Geocoding<br><br>
- Evaluation<br><br>
- Example use cases<br><br>
- Discussion<br><br>

.footnote[Link to slides: https://lo-ng.netlify.app/slides/2022-05-13-aapor-twitter-geocoding]

---
class: inverse center middle

# Potentials of Twitter data for regional analyses of public opinion

---
class: highlight-last

<br><br>

- Twitter: one of the most common sources of digital data on public attitudes and behaviour

  Data on tweets and profiles is comparatively easy to collect thanks to the publicly accessible API

--

- Use of geographic information about tweets or profiles (typically geolocation provided by Twitter in the form of geocoordinates) for regional analyses:

  Political polling (Beauchamp 2017), conspiracy theories (Stephens 2020), anti-immigrant attitudes (Menshikova & van Tubergen 2022), health behaviour (Martinez et al. 2018, Widener & Li 2014), happiness (Mitchell et al. 2013)

--

- Mostly in combination with data from other sources (survey data, official statistics, etc.)

--

→ Geographic indicators are needed to aggregate data at a specific (e.g. administrative) level and combine it with data from other sources

---
background-color: white

<br>

#### Problem: Ready-to-use geoinformation – i.e., GPS geotags by Twitter – is only available for a tiny fraction of tweets.

- Sloan & Morgan (2015): 0.85% of tweets worldwide have Twitter geotags

--

#### But:

- Alternative source of geoinformation: location given by the user as free text in the profile

.center[
<img src="./img/twitter-long.png" width="640px" style="display: block; margin: auto;" />
]

---
class: middle

### Research question

How can we use user profile locations to increase the amount of geolocated Twitter data?

---
class: inverse center middle

# Data

---

### Data collection

- Real-time collection of tweets since October 2018 via the Twitter [Filter-API](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/overview)
  - so-called 1% API: ≤ 1% of the global rate (≈ 6000 tweets per second, Tromble et al. 2017) can be collected at any given time

--

- Our filter:
  - Tweets that were marked as German by Twitter's language detection and...
  - ... contained one of the [100 most frequent words in the German language](https://www.ids-mannheim.de/digspra/kl/projekte/methoden/derewo) (IDS Mannheim).

--

- Over 1.1 billion tweets; on average 15.7 tweets per second

--

- For the analysis in this presentation: Subset 10/15/2018 to 10/14/2021, excluding retweets, excluding professional ("verified") accounts: 866 million tweets

---

### Ready-to-use geoinformation

- Only **1.52 million** (**0.18%**) of the collected tweets were geotagged with coordinates by Twitter
  - **51 180** (**0.31%**) of the total 16.6 million users

--

<br>

NUTS-3 regions (districts or major cities) with the fewest users based on Twitter geotags:

.small[
|NUTS-3 |Name              | Users|NUTS-3 |Name              | Users|NUTS-3 |Name                        | Users|
|:------|:-----------------|-----:|:------|:-----------------|-----:|:------|:---------------------------|-----:|
|DEB3G  |Kusel             |     6|DE247  |Coburg            |    11|DE23A  |Tirschenreuth               |    12|
|DE255  |Schwabach         |     9|DE267  |Haßberge          |    11|DEB37  |Pirmasens, kreisfreie Stadt |    12|
|DEG0D  |Sömmerda          |     9|DEG0N  |Eisenach, Stadt   |    11|DEG0M  |Altenburger Land            |    13|
|DE272  |Kaufbeuren        |    10|DE234  |Amberg-Sulzbach   |    12|DEG0L  |Greiz                       |    13|
|DE926  |Holzminden        |    11|DEG06  |Eichsfeld         |    12|DE231  |Amberg                      |    13|
|DE22C  |Dingolfing-Landau |    11|DEG0A  |Kyffhäuserkreis   |    12|DE277  |Dillingen a.d.Donau         |    13|
]

---

### Profile locations

- **569 million** (**65.66%**) of the collected tweets were posted by users who provided a profile location
  - **9.2 million** users (**59.15%**)

---

### Profile locations

.pull-left[
```
Domsühl, Deutschland AUETAL Hitzacker, Germany 58710 Menden Salem, Deutschland Harlingerode Leichlingen Allemagne Taunusstein Bleidenstadt Burtscheid, Deutschland Drögen Groß Leuthen Dotzheim, Wiesbaden 48.478008,13.176936 Wessum Germany, Heinsberg Schiltberg, Bavaria, Germany Tönisvorst, DE Germany, Niedersachsen,
```
]

--

.pull-right[
```
They/Them | Ze/Zir | He/Him Old Man Kelsey's Woods she//her 💓 loving sangie ♡ Seo Soojin is the definition of sexiness and cuteness robworld harley she/they queer 17 exo planet | vivi's mom 🐕 Side View 💥🍸🐇 Minding My Business Blvd. Out here init starving somewhere Avocado Toast she / they ; 18 ; ar 55 (na) To na sua kkj she | ᵃᵇᵇˢ 𝐭|𝐧|𝐯|𝐤|𝐤|𝐥|𝐦|𝐤|𝐚
```
]

---

.pull-left[
```
Alemanha - Berlim Alemanha - Berlin Alemania(Berlin) Alemania, Berlin Alemania/Berlin Allemagne Berlin Allemagne, Berlin almanya berlin almanya, berlin almanya/berlin bärlin bärlin ♥ bÄrlin 🐻Bärlin Bärlin Bärlin 😏
```
]

.pull-right[
```
Berlino, Germania 🏡 Berlino, Germanio Berlino, Germanujo Berlino, Germany Deutschland 🇩🇪, Berlin Deutschland 🇪🇺🇩🇪 Berlin Deutschland Berlin Deutschland, Berlin Deutschland, Berlin. Deutschland,Berlin Deutschland- Berlin Deutschland-Berlin Βερολίνο берлин برلين ベルリン
```
]

--

<br>

- Challenge: generate usable regional indicators from the free text in the profile location

---
class: inverse center middle

# Geocoding

---

[**Nominatim**](https://nominatim.openstreetmap.org): text search engine for [OpenStreetMap](https://openstreetmap.org) data

.smaller[*Demo: Public Nominatim instance at https://nominatim.openstreetmap.org*]

.center[
<iframe src="https://nominatim.openstreetmap.org/ui/search.html" width="96%" height="450px" data-external="1"></iframe>
]

---

### Self-hosted Nominatim database

... to make geocoding more efficient

- Public Nominatim has a rate limit of 1 query per second

  → It would take years to geocode the hundreds of millions of user profile locations

--

- Tailor the OpenStreetMap database used for geocoding to our needs

--

- Flexible access to geodata in the database
  - Geocoding results not limited by API output
  - (Spatial) joins directly in the database

---
class: highlight-last

### Implementation details

- Normalisation of the profile location text

--

- Places with the same name are ordered according to Nominatim search ranking and Wikipedia importance

  Example: New York the US city > New York the hair salon in Munich

--

- Locations matched at the level of street addresses are excluded (except train stations) to protect user privacy

--

- Geocoding output: corresponding **spatial geometry** ("shape") for each profile location string

---

<br>

#### Determine suitable aggregation levels for regional analyses:

- **Spatial join** of the geometries of the profile locations with the geometries of German administrative regions at NUTS-1, NUTS-2, and NUTS-3 level.

  → Match if the geometry of the profile location lies entirely within the geometry of the administrative region ([ST_CoveredBy](https://postgis.net/docs/ST_CoveredBy.html))

.center[
<img src="https://gisgeography.com/wp-content/uploads/2019/04/Spatial-Join-Completely-Within.png" width="400px" style="display: block; margin: auto;" />

.smaller[Source: https://gisgeography.com/spatial-join]
]

---

### Results

|Profile location            | NUTS-1| NUTS-2| NUTS-3|
|:---------------------------|------:|------:|------:|
|Titisee-Neustadt, Germany   |    DE1|   DE13|  DE132|
|fRaNkFuRt                   |    DE7|   DE71|  DE712|
|Deutschland Aalen           |    DE1|   DE11|  DE11D|
|Saarland / Ensheim          |    DEC|   DEC0|  DEC01|
|Hardegsen, Niedersachsen    |    DE9|   DE91|  DE918|
|hh.                         |    DE6|   DE60|  DE600|
|Wartjenstedt                |    DE9|   DE91|  DE91B|
|nrw*                        |    DE2|   *NA*|   *NA*|
|Usingen, Deutschland        |    DE7|   DE71|  DE718|
|Hochzoll-Süd, Augsburg      |    DE2|   DE27|  DE271|

---

### Results

- **229 million** (**26.42%**) tweets

--

- 15 000% increase in the number of geolocated tweets

--

- **970 631** users (**6.23%**)

--

- Explanation: Seemingly, more active users (in our dataset!) are more likely to provide geocodable profile locations

Tweets per user by source of geoinformation:

|Geoinformation     |  Mean| Median|    SD|    Max|
|:------------------|-----:|------:|-----:|------:|
|geocoded by us     | 230.0|      9|  1939| 792298|
|Twitter geotag     |  29.8|      2|  1108| 226900|
|*NA*               |  42.9|      1|   669| 447564|

---
class: full center
background-color: white

<img src="./img/twgeo-vs-bigeo-choro.png" width="944" style="display: block; margin: auto;" />

---
class: inverse center middle

# Evaluation

---
background-color: white

### Accuracy of geocoding

<br>

Gold standard: Twitter GPS geotags<sup>[1]</sup>

.center[
<img src="./img/geocoding-accuracy.png" width="80%" style="display: block; margin: auto;" />
]

.footnote.small[
[1] Tweet geotags are aggregated at user level. Only users whose majority of geotags (at least 2) lie in the same NUTS-3 region are selected. For users with multiple unique geotags, the centroid is calculated.
]

---
background-color: white

### Spatial distribution of geocoded users
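
---

### Sketch: selecting gold-standard users

The footnote on the accuracy slide describes how the gold standard is built from Twitter geotags. The following is a minimal Python sketch of one possible reading of that rule, not the code used for the presentation: a user is kept if their most frequent NUTS-3 region accounts for more than half of their geotags and occurs at least twice. The function name `gold_standard_regions`, the input format and the strict-majority interpretation are assumptions; the centroid step is left out.

```python
from collections import Counter, defaultdict


def gold_standard_regions(geotags, min_count=2):
    """Map each qualifying user to their majority NUTS-3 region.

    geotags: iterable of (user_id, nuts3_code) pairs, one per geotagged tweet.
    A user qualifies if their most frequent region occurs at least
    `min_count` times and covers more than half of their geotags
    (assumed interpretation of "majority").
    """
    regions_by_user = defaultdict(list)
    for user_id, nuts3_code in geotags:
        regions_by_user[user_id].append(nuts3_code)

    gold = {}
    for user_id, regions in regions_by_user.items():
        region, count = Counter(regions).most_common(1)[0]
        if count >= min_count and count > len(regions) / 2:
            gold[user_id] = region
    return gold


# Hypothetical toy input: u1 has a clear majority, u2 has a single geotag,
# u3 is split evenly between two regions.
example_geotags = [
    ("u1", "DE300"), ("u1", "DE300"), ("u1", "DE600"),
    ("u2", "DE132"),
    ("u3", "DE712"), ("u3", "DE600"),
]
print(gold_standard_regions(example_geotags))
```

Only users who pass this filter would then be compared against the NUTS-3 region derived from their profile location.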
---

### Content of the geocoded tweets

<br><br>

Comparison of geolocated vs. non-geolocated tweets as bags of words

--

- Cosine similarity: 0.996
- Jaccard coefficient: 0.935

---
class: inverse center middle

# Example use cases

---
background-color: white

#### Support for the German Green Party<sup>[2]</sup>

.pull-left.w35[
<img src="./img/gruene-hashtags.png" width="446" style="display: block; margin: auto;" />
]

.pull-right.w60[
<br><br><br>

Correlation with Green votes per inhabitant:
$r(35) = 0.058,\; p < 0.001$
.footnote.small[[2] During the one-month period leading up to the 2021 German federal election]
]

---
background-color: white

#### Usage of gender-inclusive language: urban vs rural

.pull-left.w35[
<img src="./img/gendered-speech.png" width="446" style="display: block; margin: auto;" />
]

--

.pull-right.w65[
.center[Regression models of the proportion of users who tweeted with gender-inclusive language<sup>[3]</sup>]

<img src="./img/gendered-speech-lm.png" width="580" style="display: block; margin: auto;" />

.footnote.smaller[[3] Model 4:
$\mathbf{y} = \mathbf{X}\beta + \mathbf{u}$, $\mathbf{u} = \lambda\mathbf{Wu} + \varepsilon$, where $\mathbf{W}$
is the spatial neighbour weights matrix]
]

---
class: inverse center middle

# Discussion

---

<br><br>

- Potential for filling data gaps when analysing small regions
  - Usable for certain research questions

--

- Benefits of standardised output as official region identifiers:
  - Easy to combine with data from other sources
  - Less privacy-sensitive than exact point coding

--

- For the users who don't have a profile location:
  - More complex geocoding methods are available, e.g. using the follower network or tweet content
  - Much more computationally expensive and time-consuming; much more difficult to validate

---

### Open science

- Code will be made public for reproduction
- Geocoding results with tweet IDs can be released for rehydration
- R package [{nutscoder}](https://github.com/long39ng/nutscoder) (https://github.com/long39ng/nutscoder)
  - input: text strings; output: administrative regions (not limited to NUTS regions in Germany)

``` r
library(nutscoder)

nuts_geocode(c("Hamburgo", "هامبورغ", "HH", "Berlin", "🐻Bärlin", "ベルリン"))
#> # A tibble: 6 × 5
#>   location name    nuts_1 nuts_2 nuts_3
#>   <chr>    <chr>   <chr>  <chr>  <chr>
#> 1 Hamburgo Hamburg DE6    DE60   DE600
#> 2 هامبورغ  Hamburg DE6    DE60   DE600
#> 3 HH       Hamburg DE6    DE60   DE600
#> 4 Berlin   Berlin  DE3    DE30   DE300
#> 5 🐻Bärlin Berlin  DE3    DE30   DE300
#> 6 ベルリン Berlin  DE3    DE30   DE300
```

---
class: center middle

## Thank you for your attention!

<br>

Twitter / GitHub: [long39ng](https://twitter.com/long39ng) · Web: https://lo-ng.netlify.app · Email: [long.nguyen@uni-bielefeld.de](mailto:long.nguyen@uni-bielefeld.de)
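
---

### Backup: bag-of-words similarity (sketch)

The evaluation compares geolocated and non-geolocated tweets as bags of words using cosine similarity and the Jaccard coefficient. The following Python sketch shows how the two measures are commonly computed; it is not the code behind the reported values. Whitespace tokenisation, lower-casing, and computing Jaccard on the two vocabularies (rather than on counts) are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Count lower-cased, whitespace-separated tokens over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Share of the combined vocabulary that occurs in both bags."""
    vocab_a, vocab_b = set(a), set(b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Hypothetical toy corpora standing in for the two groups of tweets.
geolocated = bag_of_words(["das ist gut", "das wetter ist schön"])
non_geolocated = bag_of_words(["das ist schlecht"])

print(cosine_similarity(geolocated, non_geolocated))
print(jaccard_coefficient(geolocated, non_geolocated))
```

Cosine similarity weights words by frequency, whereas this Jaccard variant only checks vocabulary overlap, so the two measures can diverge when rare words differ between the groups.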