R is a powerful language for data analysis and visualization. For interactive visualization of ActivityInfo data, existing tools such as ActivityInfo's built-in visualization tools, Power BI, or Tableau are often a better fit.
However, some kinds of advanced analysis, such as prediction, statistical modelling, and text mining, are much easier to do in R.
R is a batteries-included language with many built-in functions for statistical analysis. It also has a rich system of external packages; one of the best known is ggplot2, which provides powerful graphics.
library(activityinfo)
Build a regression model
We use a fake example dataset describing building maintenance at high schools in the Netherlands.
We first pull the data from ActivityInfo and then use R functions to analyze it. The schools table below is a saved and cleaned version of the data pulled with the queryTable() function of the ActivityInfo API R client.
pat <- system.file("extdata", "schools.csv", package = "activityinfo")
schools <- read.csv(pat, stringsAsFactors = FALSE)
head(schools)
#> school_name date_of_school_establishment date_of_visit type
#> 1 Amsterdam High School 1982-01-01 2019-12-04 Paint
#> 2 Tilburg High School 1986-01-02 2019-12-05 Brick
#> 3 Breda High School 1882-01-03 2019-12-06 Paint
#> 4 Rotterdam High School 1922-01-04 2019-12-07 Paint
#> 5 Haarlem High School 1860-01-05 2019-12-08 Paint
#> 6 The Hague High School 1982-01-06 2019-12-09 Paint
#> situation_description
#> 1 Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\n sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
#> 2 Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\n sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
#> 3 Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\n sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. U
#> 4 Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\n sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. U
#> 5 Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\n sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. U
#> 6 Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,\n sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. U
#> building_value square_meters_of_school is_painted
#> 1 1100000 1000 1
#> 2 750000 500 0
#> 3 2750000 2000 1
#> 4 1250000 1200 1
#> 5 975000 1000 1
#> 6 850000 550 1
As an example of a more advanced analysis, we predict the building value from the floor area of the school in square meters, using a linear regression model. The square_meters_of_school column holds each school's floor area, and the building_value column holds the value of its building.
library(ggplot2)
ggplot(schools, aes(square_meters_of_school, building_value)) +
  geom_point(aes(color = as.factor(is_painted))) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(labels = scales::dollar_format(
    scale = 1 / 1e6,
    prefix = "\u20ac",
    suffix = "M"
  )) +
  scale_color_discrete(
    name = "The building painted",
    breaks = c(1, 0),
    labels = c("Yes", "No")
  ) +
  labs(
    title = "Building value vs. sqm of building",
    x = "Square meters of building",
    y = "Building value"
  ) +
  theme_minimal()
#> `geom_smooth()` using formula = 'y ~ x'
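The plot draws the fitted line but does not expose the model itself. As a minimal sketch, the same kind of straight-line model can be fitted with lm() from base R and then used for prediction. To keep the example self-contained, it rebuilds only the six rows printed by head(schools) above, in a data frame called schools_head (a name introduced here for illustration). With the full schools table the coefficients will differ.

```r
# Stand-in data: the six rows shown by head(schools) above.
schools_head <- data.frame(
  square_meters_of_school = c(1000, 500, 2000, 1200, 1000, 550),
  building_value = c(1100000, 750000, 2750000, 1250000, 975000, 850000)
)

# Fit building value as a linear function of floor area.
fit <- lm(building_value ~ square_meters_of_school, data = schools_head)

# Intercept and slope (the slope is the value added per extra square meter).
coef(fit)

# Predict the value of a hypothetical 1,500 square meter school.
new_school <- data.frame(square_meters_of_school = 1500)
predicted <- predict(fit, newdata = new_school)
predicted
```

Calling summary(fit) additionally reports standard errors and R-squared. With only a handful of rows these figures are rough and should be read as an illustration rather than a result.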
Text analysis
We count the number of words in the description field, a multi-line narrative field in ActivityInfo, and plot the counts as a bar chart.
# Requires the quanteda package: install.packages("quanteda")
library(quanteda)
# Lowercase the text, tokenize it without punctuation, then count tokens per description
n_tokens <- ntoken(tokens(char_tolower(schools$situation_description), remove_punct = TRUE))
n_tokens <- as.integer(n_tokens)
df <- data.frame(
  school = schools$school_name,
  n_tokens = n_tokens,
  stringsAsFactors = FALSE
)
df
# Horizontal bar chart of word counts, ordered from fewest to most words
ggplot(df, aes(x = reorder(school, n_tokens), y = n_tokens)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal()
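If quanteda is not available, a rough word count can be sketched in base R. The helper count_words() below is hypothetical, not part of quanteda or the activityinfo package, and its rules are simpler than quanteda's tokenizer: it strips punctuation and splits on whitespace, so counts may differ for hyphenated words, numbers, or URLs. The sample descriptions are made up for illustration.

```r
# Hypothetical helper: count whitespace-separated words after removing punctuation.
count_words <- function(text) {
  text[is.na(text)] <- ""                               # treat missing text as empty
  cleaned <- gsub("[[:punct:]]", "", tolower(text))     # drop punctuation
  words <- strsplit(trimws(cleaned), "[[:space:]]+")    # split on runs of whitespace
  vapply(words, function(w) sum(nzchar(w)), integer(1)) # count non-empty pieces
}

# Made-up sample narratives, including a multi-line entry and a missing one.
descriptions <- c(
  "Lorem ipsum dolor sit amet,\n consectetur adipiscing elit,",
  "Sed do eiusmod tempor.",
  NA
)
word_counts <- count_words(descriptions)
word_counts
```

The resulting integer vector can be used in place of n_tokens when building the data frame for the bar chart.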
For more on how textual data stored in ActivityInfo can be analyzed, see the QualMiner project.