First steps

Throughout this document we’ll use the rtweet package to retrieve data from the Twitter API. It is quite straightforward to use once the authentication steps are completed.

App creation

All the instructions on how to set up the application and OAuth are gathered in the rtweet documentation.


I have stored my secret access info in a git-ignored script…
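
As a minimal sketch of that kind of setup (the file name and the placeholder values are assumptions, not my actual script), the credentials can live in a small R file that is listed in .gitignore and loaded with source():

```r
# Hypothetical credentials script; in a real project this file would be
# written by hand once and listed in .gitignore so it never gets committed
secrets_file <- file.path(tempdir(), "twitter_secrets.R")
writeLines(c(
  'my_app <- "my_app_name"',
  'my_API_key <- "xxxxxxxx"',
  'my_API_secret <- "yyyyyyyy"'
), secrets_file)

# Load the three objects into the current session
source(secrets_file)

# Check that everything needed for the token is now defined
all_defined <- all(sapply(c("my_app", "my_API_key", "my_API_secret"), exists))
all_defined
```

Environment variables read with Sys.getenv() would be another common way to keep the keys out of the repository.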

library(rtweet)
library(tidyverse)  # pipes, dplyr verbs and purrr::map() used below

twitter_token <- create_token(
    app = my_app,
    consumer_key = my_API_key,
    consumer_secret = my_API_secret)

Now we are ready to start gathering Twitter data.

Search tweets

We can access tweets published during the last 7 days based on a keyword search (the query follows Twitter's standard search operators: terms separated by a space must all match, and OR introduces alternatives).

tib_tweets <- search_tweets("(flat OR house) (sale OR sell OR selling)", n=500)
dim(tib_tweets)
## [1] 468  88

There are no fewer than 88 variables related to these tweets.

The number of tweets returned by default is 100. You can get many more than that if you settle on a broad keyword search. Beware of exceeding the rate limit, though!
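
To get a feel for the rate limit, here is a rough sketch of the arithmetic. It assumes the standard search limit described in the rtweet documentation (18,000 tweets per 15-minute window); windows_needed() is a hypothetical helper, not part of rtweet.

```r
# Assumed limit: 180 requests of up to 100 tweets each per 15-minute window,
# i.e. 18,000 tweets
tweets_per_window <- 180 * 100

# Hypothetical helper: number of 15-minute windows needed to collect n tweets
windows_needed <- function(n) ceiling(n / tweets_per_window)

windows_needed(500)     # fits in a single window
windows_needed(50000)   # spills over into a third window

# With rtweet, asking for more than one window's worth is done with
# retryonratelimit = TRUE, which makes search_tweets() wait and resume:
# tib_big <- search_tweets("(flat OR house)", n = 50000, retryonratelimit = TRUE)
```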

The object returned by search_tweets() is actually a nested tibble, which allows several pieces of information to be stored for a single tweet (multiple URLs, multiple mentions of Twitter users, etc.). This means that some columns are not atomic vectors but lists.
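
To make the idea of list columns concrete, here is a small sketch on a toy tibble. The values are made up and the column names only mimic the ones returned by search_tweets(), so that it runs without calling the API.

```r
library(tibble)
library(purrr)

# Toy stand-in for the output of search_tweets(): two tweets, made-up values
toy <- tibble(
  screen_name = c("user_a", "user_b"),
  hashtags    = list(c("house", "sale"), NA_character_),    # list column
  geo_coords  = list(c(NA_real_, NA_real_), c(51.5, -0.1))  # list column
)

# Which columns are lists rather than atomic vectors?
list_cols <- names(toy)[map_lgl(toy, is.list)]
list_cols

# Number of hashtags stored for each tweet (an NA entry counts as none)
n_hashtags <- map_int(toy$hashtags, function(x) sum(!is.na(x)))
n_hashtags

# Which tweets carry coordinates? Test the first element of each entry
located <- map_lgl(toy$geo_coords, function(x) !is.na(x[1]))
located
```

Functions of the map() family are the natural way to work on such columns, since each cell can hold a vector of any length.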

I select a few relevant, not-too-sparse variables just to display the kind of data we get out of search_tweets():

tib_tweets %>%
  filter(lang=="en") %>% 
  select(screen_name, text, created_at, location)
## # A tibble: 451 x 4
##    screen_name     text                       created_at          location
##    <chr>           <chr>                      <dttm>              <chr>   
##  1 lucy_hdh_Oxford Tickets now on sale for m~ 2018-07-20 15:41:51 Oxford,~
##  2 ma_mortgages    Selling a house? 8 steps ~ 2018-07-20 15:41:43 Massach~
##  3 martyndix       @BrexitCentral @andreajen~ 2018-07-20 15:41:40 ""      
##  4 strong_opinions If it doesn’t have subway~ 2018-07-20 15:41:34 Grand R~
##  5 DmvMusicPlug    "Everyone know MD's @Endl~ 2018-07-20 15:41:07 DMV     
##  6 KulganofCrydee  "Well done to the solicit~ 2018-07-20 15:40:52 Hinckle~
##  7 jawsuhlynn      If you shop at Urban Outf~ 2018-07-20 15:40:52 Houston~
##  8 uchebaby_       If you shop at Urban Outf~ 2018-07-20 15:40:43 Chicago~
##  9 JulieBonebrake  Preparing a house for sal~ 2018-07-20 15:40:43 Frederi~
## 10 ABBUKA          "Learn About Earth's Near~ 2018-07-20 15:40:21 CHICAGO 
## # ... with 441 more rows

# TRUE when a tweet has no coordinates; table() lists FALSE first, then TRUE
map_lgl(tib_tweets$geo_coords, ~is.na(.[1])) %>% 
  table()
## .
##     6   462

Tweets themselves are very seldom geolocated (variable geo_coords): only 6 of the 468 tweets here carry coordinates. On the other hand, Twitter accounts often come with some information regarding location:

tib_tweets %>% 
  select(location) %>% 
  sample_n(30)  # assumed: the final step that produced these 30 rows was lost
## # A tibble: 30 x 1
##    location                
##    <chr>                   
##  1 Chicago, IL             
##  2 Detroit/ Windsor        
##  3 ""                      
##  4 ""                      
##  5 St. John's, Newfoundland
##  6 Leesburg, VA            
##  7 Los Angeles             
##  8 ""                      
##  9 Houston, TX             
## 10 London                  
## # ... with 20 more rows

These location strings might or might not make sense as geographical data. Still, we can run a geocoding function to try to match them to geographical coordinates.
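
Since many profiles have an empty location and popular places come up repeatedly, it can be worth cleaning the strings before geocoding them. The following is only a sketch on made-up values (toy_locations is a hypothetical table, not the data above):

```r
library(dplyr)
library(tibble)

# Made-up profile locations, similar in shape to tib_tweets$location
toy_locations <- tibble(
  location = c("Chicago, IL", "", "London", "Chicago, IL", "  ")
)

# Keep each non-empty string once, so it is sent to the geocoder only once
to_geocode <- toy_locations %>%
  mutate(location = trimws(location)) %>%
  filter(location != "") %>%
  distinct(location)
to_geocode
```

The geocoded result can then be joined back to the full table on the location column.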

Here I use geocode() from the ggmap package, with Data Science Toolkit (“dsk”) as source:

library(ggmap)
geocode("NY",source="dsk", output="all")
## $status
## [1] "OK"
## $results
## $results[[1]]
## $results[[1]]$geometry
## $results[[1]]$geometry$location_type
## $results[[1]]$geometry$location
## $results[[1]]$geometry$location$lng
## [1] -75.4999
## $results[[1]]$geometry$location$lat
## [1] 43.00035
## $results[[1]]$geometry$viewport
## $results[[1]]$geometry$viewport$southwest
## $results[[1]]$geometry$viewport$southwest$lng
## [1] -79.76259
## $results[[1]]$geometry$viewport$southwest$lat
## [1] 40.4774
## $results[[1]]$geometry$viewport$northeast
## $results[[1]]$geometry$viewport$northeast$lng
## [1] -71.77749
## $results[[1]]$geometry$viewport$northeast$lat
## [1] 45.01586
## $results[[1]]$types
## [1] "administrative_area_level_1" "political"                  
## $results[[1]]$address_components
## $results[[1]]$address_components[[1]]
## $results[[1]]$address_components[[1]]$types
## [1] "administrative_area_level_1" "political"                  
## $results[[1]]$address_components[[1]]$short_name
## [1] "New York"
## $results[[1]]$address_components[[1]]$long_name
## [1] "New York, US"
## $results[[1]]$address_components[[2]]
## $results[[1]]$address_components[[2]]$types
## [1] "country"   "political"
## $results[[1]]$address_components[[2]]$short_name
## [1] "US"
## $results[[1]]$address_components[[2]]$long_name
## [1] "United States"
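
Individual values can be pulled out of this nested list by walking down its levels. Here is a minimal sketch with purrr::pluck(), run on a hand-built mock that copies part of the structure above (values rounded) so that it works without calling the geocoding service:

```r
library(purrr)

# Hand-built mock with the same nesting as the geocoding output above
mock <- list(
  status = "OK",
  results = list(list(
    geometry = list(location = list(lng = -75.5, lat = 43.0)),
    types = c("administrative_area_level_1", "political"),
    address_components = list(
      list(types = c("administrative_area_level_1", "political"),
           short_name = "New York", long_name = "New York, US"),
      list(types = c("country", "political"),
           short_name = "US", long_name = "United States")
    )
  ))
)

# Coordinates of the first result
lng <- pluck(mock, "results", 1, "geometry", "location", "lng")
lat <- pluck(mock, "results", 1, "geometry", "location", "lat")

# Long name of the component whose types include "country"
comps <- pluck(mock, "results", 1, "address_components")
is_country <- map_lgl(comps, function(x) "country" %in% x$types)
country <- comps[is_country][[1]]$long_name

# A path that does not exist yields the default instead of an error
absent <- pluck(mock, "results", 2, "geometry", .default = NA)
```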

The raw result of a call to geocode() is quite messy, so I wrote a function that extracts just the few pieces of information I need as a table:

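# Reconstructed from a damaged listing: the function name, its first and last
# statements, and every line marked "assumed" were lost and are best guesses
geocode_locations <- function(tib){
  locs=tib$location                                              # assumed
  # geocode every string; safely() turns a failed call into a NULL result
  locations=locs %>% 
    map(safely(geocode),source="dsk", output="all") %>% 
    map("result") %>% 
    map("results")                                               # assumed
  # lng/lat of the first match, NA when there is none
  coords=locations %>% 
    map(1) %>% 
    map("geometry") %>% 
    map("location") %>% 
    map(unlist) %>% 
    map(function(x){if(is.null(x)) c(lng=NA,lat=NA) else x[c("lng","lat")]})
  coords=do.call(rbind,coords)                                   # assumed
  # one table of address components (types, short_name, long_name) per string
  comp=locations %>% 
    map(1) %>% 
    map("address_components") %>% 
    map(function(x){map(x,safely(as.data.frame),stringsAsFactors=FALSE)}) %>% 
    map(function(x) map(x,"result")) %>% 
    map(bind_rows)                                               # assumed
  # for each string, name of the matching component, NA when absent
  country=comp %>% 
    map(safely(function(x){filter(x,types=="country")})) %>% 
    map("result") %>% 
    map("long_name") %>% 
    map_chr(function(x){if(length(x)==0) NA_character_ else as.character(x[1])})
  locality=comp %>% 
    map(safely(function(x){filter(x,types=="locality")})) %>%    # type assumed
    map("result") %>% 
    map("short_name") %>% 
    map_chr(function(x){if(length(x)==0) NA_character_ else as.character(x[1])})
  area=comp %>% 
    map(safely(function(x){filter(x,types=="administrative_area_level_1")})) %>%  # type assumed
    map("result") %>% 
    map("short_name") %>% 
    map_chr(function(x){if(length(x)==0) NA_character_ else as.character(x[1])})
  tib %>% 
    mutate(lng=coords[,"lng"],lat=coords[,"lat"],
           country=country,locality=locality,area=area)          # assumed
}

This function takes a table with a location column and completes it with coordinates (longitude and latitude), country, locality and area.

tib_trial <- tibble(location=c("NY",
                               "NYC",
                               "California, baby!",
                               "Midland, TX",
                               "Port Harcourt, Nigeria",
                               "All around the world!",
                               "Montmartre",
                               "La butte Montmartre",
                               "Quartier Latin",
                               "20 rue Mérieux Lyon"))
geocode_locations(tib_trial)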
## # A tibble: 10 x 6
##    location                   lng   lat country      locality       area  
##    <chr>                    <dbl> <dbl> <chr>        <chr>          <chr> 
##  1 NY                     - 75.5  43.0  United Stat~ <NA>           New Y~
##  2 NYC                    - 74.0  40.7  United Stat~ New York       <NA>  
##  3 California, baby!      -120    36.2  United Stat~ <NA>           CA    
##  4 Midland, TX            -102    32.0  United Stat~ Midland        TX    
##  5 Port Harcourt, Nigeria    7.01  4.78 Nigeria      Port Harcourt  <NA>  
##  6 All around the world!    55.2  25.2  United Arab~ <NA>           <NA>  
##  7 Montmartre             -103    50.2  Canada       Montmartre     <NA>  
##  8 La butte Montmartre       2.34 48.9  France       Paris 18 Butt~ <NA>  
##  9 Quartier Latin            2.34 48.9  France       <NA>           <NA>  
## 10 20 rue Mérieux Lyon       4.85 45.7  France       Lyon           <NA>
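
The extraction relies on safely(), which keeps one failed geocoding call from stopping the whole run. Here is a minimal sketch of that pattern, with a hypothetical fake_geocode() standing in for geocode() so that it runs offline:

```r
library(purrr)
library(magrittr)

# Hypothetical stand-in for geocode(): fails on empty input
fake_geocode <- function(x) {
  if (x == "") stop("nothing to geocode")
  list(results = list(list(name = toupper(x))))
}

out <- c("lyon", "", "paris") %>%
  map(safely(fake_geocode)) %>%   # each element becomes list(result, error)
  map("result") %>%               # NULL where the call failed
  map("results") %>%
  map(1) %>%
  map("name") %>%
  map_chr(function(x) if (is.null(x)) NA_character_ else x)
out
```

Extracting by name or position with map() returns NULL when a level is missing, so a failure simply propagates down the chain until it is replaced by NA at the end.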

Other types of retrievable info

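Profile info of a particular user, friends and followers

The followers of an account and its friends (the accounts it follows) can be retrieved with get_followers() and get_friends(); both return tables of user IDs: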

## Classes 'tbl_df', 'tbl' and 'data.frame':    778 obs. of  1 variable:
##  $ user_id: chr  "968880811314475008" "939628607470792705" "1854868530" "1018834465055965187" ...
##  - attr(*, "next_cursor")= chr "0"
## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  2 variables:
##  $ user   : chr  "realtyWW" "realtyWW" "realtyWW" "realtyWW"
##  $ user_id: chr  "576056458" "1380451" "15134782" "61268528"
##  - attr(*, "next_cursor")= int 0


We can also retrieve the tweets in the timeline of a given Twitter user. In that case retrieval is not limited to a time window but to a number of tweets: at most the 3,200 most recent ones.

Here we just retrieve 100 (the default number) tweets of user @realtyWW:

timeline <- get_timeline("@realtyWW")
dim(timeline)
## [1] 100  88

Again, the number of variables is quite overwhelming:

str(timeline, max.level=1)
timeline %>% 
  select(screen_name, text, created_at)
## # A tibble: 100 x 3
##    screen_name text                                    created_at         
##  * <chr>       <chr>                                   <dttm>             
##  1 realtyWW    2 Bed Flat To Rent In West London, Lon~ 2018-07-20 14:42:18
##  2 realtyWW    2 Bed Flat To Rent In South London, Lo~ 2018-07-20 14:42:17
##  3 realtyWW    2 Bed Flat To Rent In Colindale, Londo~ 2018-07-20 14:42:16
##  4 realtyWW    Studio To Rent In West London, London ~ 2018-07-20 14:42:15
##  5 realtyWW    Residential For Sale In Markham, Ontar~ 2018-07-20 14:42:14
##  6 realtyWW    2 Bed Flat To Rent In East London, Ess~ 2018-07-20 14:42:13
##  7 realtyWW    2 Bed Flat To Rent In East London, Lon~ 2018-07-20 14:42:12
##  8 realtyWW    2 Bed Flat To Rent In Birmingham, West~ 2018-07-20 14:42:10
##  9 realtyWW    2 Bed Flat To Rent In North London, Lo~ 2018-07-20 14:42:09
## 10 realtyWW    Studio To Rent In Harrow (london Borou~ 2018-07-20 14:42:07
## # ... with 90 more rows