Discussion:
[RFD PATCH 00/10] cpuidle: Predict the next events with the IO latencies
Daniel Lezcano
2014-10-22 13:57:43 UTC
This patchset is not intended to be merged upstream as it is. It is a
proof of concept illustrating the rough idea of the approach.

In the discussions on how to make the scheduler energy aware, we tried
to make the different PM subsystems communicate with the scheduler.

We realized that some code is duplicated across the PM subsystems and
the scheduler, leading to an inconsistent way of integrating the PM
information.

Ingo Molnar drew a line in the sand [1] and clearly stated that no more
PM code will be integrated into the scheduler until the different PM
blocks are redesigned to be part of the scheduler.

This patchset is one part of this integration work: how do we integrate
cpuidle with the scheduler?

The idea is to get rid of the governors and let the scheduler tell
the cpuidle framework: "I expect to sleep <x> nsec and I have a <y>
nsec latency requirement", as stated by Peter Zijlstra [2].
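
In this series, that contract ends up looking roughly like the following
(see the cpuidle_times structure introduced later in the series; the fields
are in usec in the implementation):

  struct cpuidle_times {
          unsigned int latency_req;      /* PM QoS latency constraint */
          unsigned int next_timer_event; /* next timer expiry */
          unsigned int next_io_event;    /* guessed next IO completion */
  };

  int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                     struct cpuidle_times *times);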

How to achieve this?

We want to avoid just moving code around and putting the prediction of
the next event inside the scheduler as a copy of the legacy menu
governor. After investigating, it appears the menu governor does not
behave in a stable way; it is erratic. Using the IO latencies plus
the timers gives much better results than the menu governor, which
mixes all the wakeup sources together [3].

After the discussions at LPC 2014 in Dusseldorf, it appears the idea is
good but the approach is wrong. The latency tracking must be done at
the device level, per device, and not per task as this patchset does.

Any comment, suggestion or help is welcome!

If I missed anyone who may be interested in this feature, please let
me know.

Thanks.

-- Daniel

[1] http://lwn.net/Articles/552885/
[2] https://lkml.org/lkml/2013/11/11/353
[3] http://events.linuxfoundation.org/sites/events/files/slides/IOlatencyPrediction.pdf

Daniel Lezcano (10):
sched: add io latency framework
cpuidle: Checking the zero latency inside the governors does not make
sense.
sched: idle: cpuidle: Pass the latency req from idle.c
sched: idle: Compute next timer event and pass it to the cpuidle
framework
cpuidle: Remove unused headers for tick
sched: idle: Add io latency information for the next event
cpuidle: Add a simple select governor
cpuidle: select: hack - increase rating to have this governor as
default
cpuidle: sysfs: Add per cpu idle state prediction statistics
sched: io_latency: Tracking via buckets

drivers/cpuidle/Kconfig | 4 +
drivers/cpuidle/cpuidle.c | 17 +-
drivers/cpuidle/governors/Makefile | 1 +
drivers/cpuidle/governors/ladder.c | 11 +-
drivers/cpuidle/governors/menu.c | 15 +-
drivers/cpuidle/governors/select.c | 55 +++++
drivers/cpuidle/sysfs.c | 156 +++++++++++++
include/linux/cpuidle.h | 22 +-
include/linux/sched.h | 21 ++
init/Kconfig | 11 +
kernel/exit.c | 1 +
kernel/fork.c | 5 +-
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 7 +
kernel/sched/idle.c | 27 ++-
kernel/sched/io_latency.c | 441 +++++++++++++++++++++++++++++++++++++
kernel/sched/io_latency.h | 38 ++++
17 files changed, 800 insertions(+), 33 deletions(-)
create mode 100644 drivers/cpuidle/governors/select.c
create mode 100644 kernel/sched/io_latency.c
create mode 100644 kernel/sched/io_latency.h
--
1.9.1

Daniel Lezcano
2014-10-22 13:57:44 UTC
In order to have a good prediction of when the next event will occur,
the cpuidle menu governor keeps statistics about the occurrences of the
events waking up a cpu. For more details, refer to the header of menu.c
located in drivers/cpuidle/governors.

Part of the prediction takes into account the number of pending IOs on
the cpu and, depending on this number, uses a 'magic' number to force
the selection of shallow states. It makes sense and provided a good
improvement in terms of system latencies for servers. Unfortunately
this approach has some drawbacks. The first one is that it is
empirical: the magic 'performance multiplier' is based on measurements
for a specific hardware and architecture, and may not fit a different
architecture or newer hardware which evolves over time. The second one
is that mixing all the wakeup sources makes it impossible to track a
task when it is migrated across the cpus. And the last one is the lack
of a correct tracking of what is actually happening on the system.

In order to improve that, we can classify three kinds of events:

1. totally predictable events: this is the case for the timers

2. partially predictable events: for example hard disk accesses, whose
sleep times fall more or less inside a reasonable interval; the same
goes for SSDs or SD-cards

3. difficult to predict events: incoming network packets, keyboard or
mouse events. These ones need a statistical approach.

At this moment, 1., 2. and 3. are all mixed in the governor's
statistics.

This patchset provides a simplified version of an io latency tracking
mechanism in order to separate out and improve category 2, that is the
partially predictable events. As the scheduler is a good place to
measure how long a task is blocked on an IO, all the code of this
patchset is tied to it.
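
Concretely, the measurement hooks are placed around the schedule() call in
io_schedule() and io_schedule_timeout() (see the kernel/sched/core.c hunks
below):

  io_latency_begin(rq, current); /* record start_time, insert into the per-cpu rb tree */
  schedule();
  io_latency_end(rq, current);   /* record end_time, remove from the tree, update the average */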

The sched entity and the io latency tracking share the same design:

There is an rb tree per cpu. Each time a task is blocked on an IO, it
is inserted into the tree. When the IO completes and the task is woken
up, its average latency is updated with the time spent waiting for the
IO, and the task is removed from the tree. It will be inserted into the
tree again the next time it goes through io_schedule().

If there are several tasks blocked on an IO, the left-most node of the
tree has the minimal latency; in other words, it gives the next
expected IO event. This information may need to be weighted by the
number of pending IOs (the more the disk is being accessed, the slower
it is).
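
Roughly, the expected remaining sleep time on behalf of the IOs is derived
from the left-most node; this is what io_latency_get_sleep_length() below
does:

  node = latency_tree->left_most;  /* task with the smallest expected latency */
  if (!node)
          return 0;                /* no IO pending on this cpu */
  elapsed = ktime_to_us(ktime_sub(ktime_get(), node->start_time));
  remaining = node->avg_latency - elapsed;
  return remaining > 0 ? remaining : 0;  /* negative means the estimation was wrong */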

By splitting the events into these three categories we can guess the
next event on the system more accurately, in conjunction with the next
timer and some simplified statistics from the menu governor.
Furthermore, that has the benefit of taking task migration into
account, as the information is carried by the task itself.

This is a simplified version because the tracking could be greatly
improved by a polynomial regression like the sched entity tracking, and
the latencies should also be tracked per device, but that implies a
bigger impact as the different callers of io_schedule() would have to
provide a dev_t parameter. A too complex patch won't help the
understanding.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
include/linux/sched.h | 13 ++++
init/Kconfig | 11 +++
kernel/fork.c | 4 +-
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 6 ++
kernel/sched/io_latency.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/io_latency.h | 38 ++++++++++
7 files changed, 246 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/io_latency.c
create mode 100644 kernel/sched/io_latency.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c2c885..6af032b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1221,6 +1221,16 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};

+
+#ifdef CONFIG_SCHED_IO_LATENCY
+struct io_latency_node {
+ struct rb_node node;
+ unsigned int avg_latency;
+ ktime_t start_time;
+ ktime_t end_time;
+};
+#endif
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1644,6 +1654,9 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_SCHED_IO_LATENCY
+ struct io_latency_node io_latency;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..b6d2387 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1228,6 +1228,17 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.

+config SCHED_IO_LATENCY
+ bool "IO latency tracking for the scheduler"
+ depends on SMP
+ help
+ This option tracks, for each task, the average time the task has
+ been blocked on IO. It helps to have more information regarding
+ how long a cpu will be idle and to take better scheduling
+ decisions.
+
+ If unsure, say Y.
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..7201bc4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -345,7 +345,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
#endif
tsk->splice_pipe = NULL;
tsk->task_frag.page = NULL;
-
+#ifdef CONFIG_SCHED_IO_LATENCY
+ tsk->io_latency.avg_latency = 0;
+#endif
account_kernel_stack(ti, 1);

return tsk;
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..4197807 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_SCHED_IO_LATENCY) += io_latency.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..64181f6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -84,6 +84,7 @@
#endif

#include "sched.h"
+#include "io_latency.h"
#include "../workqueue_internal.h"
#include "../smpboot.h"

@@ -4338,7 +4339,9 @@ void __sched io_schedule(void)
atomic_inc(&rq->nr_iowait);
blk_flush_plug(current);
current->in_iowait = 1;
+ io_latency_begin(rq, current);
schedule();
+ io_latency_end(rq, current);
current->in_iowait = 0;
atomic_dec(&rq->nr_iowait);
delayacct_blkio_end();
@@ -4354,7 +4357,9 @@ long __sched io_schedule_timeout(long timeout)
atomic_inc(&rq->nr_iowait);
blk_flush_plug(current);
current->in_iowait = 1;
+ io_latency_begin(rq, current);
ret = schedule_timeout(timeout);
+ io_latency_end(rq, current);
current->in_iowait = 0;
atomic_dec(&rq->nr_iowait);
delayacct_blkio_end();
@@ -7030,6 +7035,7 @@ void __init sched_init(void)
#endif
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
+ io_latency_init(rq);
}

set_load_weight(&init_task);
diff --git a/kernel/sched/io_latency.c b/kernel/sched/io_latency.c
new file mode 100644
index 0000000..2d56a38
--- /dev/null
+++ b/kernel/sched/io_latency.c
@@ -0,0 +1,174 @@
+/*
+ * Copyright (c) 2014 ARM/Linaro
+ *
+ * Author: Daniel Lezcano <***@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+#include <linux/percpu.h>
+#include <linux/ktime.h>
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include "sched.h"
+
+struct io_latency_tree {
+ spinlock_t lock;
+ struct rb_root tree;
+ struct io_latency_node *left_most;
+};
+
+static DEFINE_PER_CPU(struct io_latency_tree, latency_trees);
+
+/**
+ * io_latency_init : initialization routine to be called for each possible cpu.
+ *
+ * @rq: the runqueue associated with the cpu
+ *
+ */
+void io_latency_init(struct rq *rq)
+{
+ int cpu = rq->cpu;
+ struct io_latency_tree *latency_tree = &per_cpu(latency_trees, cpu);
+ struct rb_root *root = &latency_tree->tree;
+
+ spin_lock_init(&latency_tree->lock);
+ latency_tree->left_most = NULL;
+ root->rb_node = NULL;
+}
+
+/**
+ * io_latency_get_sleep_length: compute the expected sleep time
+ *
+ * @rq: the runqueue associated with the cpu
+ *
+ * Returns the minimal estimated remaining sleep time for the pending IOs
+ */
+s64 io_latency_get_sleep_length(struct rq *rq)
+{
+ int cpu = rq->cpu;
+ struct io_latency_tree *latency_tree = &per_cpu(latency_trees, cpu);
+ struct io_latency_node *node;
+ ktime_t now = ktime_get();
+ s64 diff;
+
+ node = latency_tree->left_most;
+
+ if (!node)
+ return 0;
+
+ diff = ktime_to_us(ktime_sub(now, node->start_time));
+ diff = node->avg_latency - diff;
+
+ /* Estimation was wrong, return 0 */
+ if (diff < 0)
+ return 0;
+
+ return diff;
+}
+
+/**
+ * io_latency_avg: compute the io latency sliding average value
+ *
+ * @node: a rb tree node belonging to a task
+ *
+ */
+static void io_latency_avg(struct io_latency_node *node)
+{
+ /* MA[i] = MA[i-1] + (X[i] - MA[i-1]) / N, with N = 64 */
+ s64 latency = ktime_to_us(ktime_sub(node->end_time, node->start_time));
+ s64 diff = latency - node->avg_latency;
+
+ node->avg_latency = node->avg_latency + (diff >> 6);
+}
+
+/**
+ * io_latency_begin - insert the node in the rb tree
+ *
+ * @rq: the runqueue the task is running on
+ * @task: the task being blocked on an IO
+ *
+ * Inserts the node in the rbtree in an ordered manner. If this task
+ * has the minimal io latency of all the tasks blocked on IO, it falls
+ * at the left most node and a shortcut is used. Stores the start
+ * time of the io schedule.
+ *
+ */
+int io_latency_begin(struct rq *rq, struct task_struct *tsk)
+{
+ int cpu = rq->cpu;
+ struct io_latency_tree *latency_tree = &per_cpu(latency_trees, cpu);
+ struct rb_root *root = &latency_tree->tree;
+ struct io_latency_node *node = &tsk->io_latency;
+ struct rb_node **new = &root->rb_node, *parent = NULL;
+ struct io_latency_node *lat;
+ int leftmost = 1;
+
+ node->start_time = ktime_get();
+
+ spin_lock(&latency_tree->lock);
+
+ while (*new) {
+ lat = rb_entry(*new, struct io_latency_node, node);
+
+ parent = *new;
+
+ if (lat->avg_latency > node->avg_latency)
+ new = &parent->rb_left;
+ else {
+ new = &parent->rb_right;
+ leftmost = 0;
+ }
+ }
+
+ if (leftmost)
+ latency_tree->left_most = node;
+
+ rb_link_node(&node->node, parent, new);
+ rb_insert_color(&node->node, root);
+
+ spin_unlock(&latency_tree->lock);
+
+ return 0;
+}
+
+/**
+ * io_latency_end - Removes the node from the rb tree
+ *
+ * @rq: the runqueue the task belongs to
+ * @tsk: the task woken up after an IO completion
+ *
+ * Removes the node for the rb tree for this cpu. Update the left most
+ * node with the next node if itself it is the left most
+ * node. Retrieves the end time after the io has complete and update
+ * the io latency average time
+ */
+void io_latency_end(struct rq *rq, struct task_struct *tsk)
+{
+ int cpu = rq->cpu;
+ struct io_latency_tree *latency_tree = &per_cpu(latency_trees, cpu);
+ struct rb_root *root = &latency_tree->tree;
+ struct io_latency_node *old = &tsk->io_latency;
+
+ old->end_time = ktime_get();
+
+ spin_lock(&latency_tree->lock);
+
+ if (latency_tree->left_most == old) {
+ struct rb_node *next_node =
+ rb_next(&latency_tree->left_most->node);
+ latency_tree->left_most =
+ rb_entry(next_node, struct io_latency_node, node);
+ }
+
+ rb_erase(&old->node, root);
+
+ spin_unlock(&latency_tree->lock);
+
+ io_latency_avg(old);
+}
diff --git a/kernel/sched/io_latency.h b/kernel/sched/io_latency.h
new file mode 100644
index 0000000..62ece7c
--- /dev/null
+++ b/kernel/sched/io_latency.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright (c) 2014 ARM/Linaro
+ *
+ * Author: Daniel Lezcano <***@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Maintainer: Daniel Lezcano <***@linaro.org>
+ */
+
+#ifdef CONFIG_SCHED_IO_LATENCY
+extern void io_latency_init(struct rq *rq);
+extern int io_latency_begin(struct rq *rq, struct task_struct *tsk);
+extern void io_latency_end(struct rq *rq, struct task_struct *tsk);
+extern int io_latency_get_sleep_length(struct rq *rq);
+#else
+static inline void io_latency_init(struct rq *rq)
+{
+ ;
+}
+
+static inline int io_latency_begin(struct rq *rq, struct task_struct *tsk)
+{
+ return 0;
+}
+
+static inline void io_latency_end(struct rq *rq, struct task_struct *tsk)
+{
+ ;
+}
+
+static inline int io_latency_get_sleep_length(struct rq *rq)
+{
+ return 0;
+}
+#endif
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:49 UTC
As we want to improve the sleep duration estimation, the expected IO latency
duration is passed to the cpuidle framework. The governors will have to deal
with it if they are interested in this information.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/cpuidle.c | 4 ++--
drivers/cpuidle/governors/ladder.c | 3 ++-
drivers/cpuidle/governors/menu.c | 5 +++--
include/linux/cpuidle.h | 12 +++++++++---
kernel/sched/idle.c | 27 ++++++++++++++++-----------
5 files changed, 32 insertions(+), 19 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 1abd5a0..bf42e17 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -155,7 +155,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
* Returns the index of the idle state.
*/
int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
- int latency_req, int next_event)
+ struct cpuidle_times *times)
{
if (off || !initialized)
return -ENODEV;
@@ -166,7 +166,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (unlikely(use_deepest_state))
return cpuidle_find_deepest_state(drv, dev);

- return cpuidle_curr_governor->select(drv, dev, latency_req, next_event);
+ return cpuidle_curr_governor->select(drv, dev, times);
}

/**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 17381c3..a84993c 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -65,10 +65,11 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
*/
static int ladder_select_state(struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- int latency_req, int next_event)
+ struct cpuidle_times *times)
{
struct ladder_device *ldev = &__get_cpu_var(ladder_devices);
struct ladder_device_state *last_state;
+ int latency_req = times->latency_req;
int last_residency, last_idx = ldev->last_state_idx;

last_state = &ldev->states[last_idx];
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 9da11ce..88382d5 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -284,9 +284,10 @@ again:
* @dev: the CPU
*/
static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
- int latency_req, int next_event)
+ struct cpuidle_times *times)
{
struct menu_device *data = &__get_cpu_var(menu_devices);
+ int latency_req = times->latency_req;
int i;
unsigned int interactivity_req;
unsigned long nr_iowaiters, cpu_load;
@@ -299,7 +300,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;

/* determine the expected residency time, round up */
- data->next_timer_us = next_event;
+ data->next_timer_us = times->next_timer_event;

get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index b379ae5..e99823f 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -83,6 +83,12 @@ struct cpuidle_device {
#endif
};

+struct cpuidle_times {
+ unsigned int latency_req;
+ unsigned int next_timer_event;
+ unsigned int next_io_event;
+};
+
DECLARE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
DECLARE_PER_CPU(struct cpuidle_device, cpuidle_dev);

@@ -123,7 +129,7 @@ extern void disable_cpuidle(void);

extern int cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- int latency_req, int next_event);
+ struct cpuidle_times *times);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -152,7 +158,7 @@ extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev)
static inline void disable_cpuidle(void) { }
static inline int cpuidle_select(struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- int latency_req, int next_event)
+ struct cpuidle_times *times)
{return -ENODEV; }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
@@ -208,7 +214,7 @@ struct cpuidle_governor {

int (*select) (struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- int latency_req, int next_event);
+ struct cpuidle_times *times);
void (*reflect) (struct cpuidle_device *dev, int index);

struct module *owner;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 7b6a148..6057020 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -4,7 +4,7 @@
#include <linux/sched.h>
#include <linux/cpu.h>
#include <linux/cpuidle.h>
-#include <linux/tick.h>
+#include <linux/ktime.h>
#include <linux/pm_qos.h>
#include <linux/mm.h>
#include <linux/stackprotector.h>
@@ -14,6 +14,7 @@
#include <trace/events/power.h>

#include "sched.h"
+#include "io_latency.h"

static int __read_mostly cpu_idle_force_poll;

@@ -79,9 +80,9 @@ static void cpuidle_idle_call(void)
{
struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
- struct timespec t;
- int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
- unsigned int broadcast;
+ struct cpuidle_times times;
+ int next_state, entered_state;
+ bool broadcast;

/*
* Check if the idle task must be rescheduled. If it is the
@@ -105,25 +106,29 @@ static void cpuidle_idle_call(void)
*/
rcu_idle_enter();

+ times.latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
/*
* The latency requirement does not allow any latency, jump to
- * the default idle function
+ * the default idle function without entering the cpuidle code
*/
- if (latency_req == 0)
+ if (times.latency_req == 0)
goto use_default;

- t = ktime_to_timespec(tick_nohz_get_sleep_length());
+ /*
+ * Retrieve the next timer event
+ */
+ times.next_timer_event = ktime_to_us(tick_nohz_get_sleep_length());

- /*
- * The next timer event for this in us
+ /*
+ * Retrieve the next IO guessed event
*/
- next_event = t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
+ times.next_io_event = io_latency_get_sleep_length(this_rq());

/*
* Ask the cpuidle framework to choose a convenient idle state.
* Fall back to the default arch idle method on errors.
*/
- next_state = cpuidle_select(drv, dev, latency_req, next_event);
+ next_state = cpuidle_select(drv, dev, &times);
if (next_state < 0) {
use_default:
/*
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:47 UTC
Following the logic of the previous patch, retrieve from the idle task the
expected timer sleep duration and pass it to the cpuidle framework.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/cpuidle.c | 4 ++--
drivers/cpuidle/governors/ladder.c | 3 ++-
drivers/cpuidle/governors/menu.c | 8 ++------
include/linux/cpuidle.h | 8 +++++---
kernel/sched/idle.c | 11 +++++++++--
5 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 372c36f..658be9f 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -159,7 +159,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
* Returns the index of the idle state.
*/
int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
- int latency_req)
+ int latency_req, int next_event)
{
if (off || !initialized)
return -ENODEV;
@@ -170,7 +170,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (unlikely(use_deepest_state))
return cpuidle_find_deepest_state(drv, dev);

- return cpuidle_curr_governor->select(drv, dev, latency_req);
+ return cpuidle_curr_governor->select(drv, dev, latency_req, next_event);
}

/**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 18f0da9..17381c3 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -64,7 +64,8 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
* @dev: the CPU
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev, int latency_req)
+ struct cpuidle_device *dev,
+ int latency_req, int next_event)
{
struct ladder_device *ldev = &__get_cpu_var(ladder_devices);
struct ladder_device_state *last_state;
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 96f8fb0..9da11ce 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -13,10 +13,6 @@
#include <linux/kernel.h>
#include <linux/cpuidle.h>
#include <linux/pm_qos.h>
-#include <linux/time.h>
-#include <linux/ktime.h>
-#include <linux/hrtimer.h>
-#include <linux/tick.h>
#include <linux/sched.h>
#include <linux/math64.h>
#include <linux/module.h>
@@ -288,7 +284,7 @@ again:
* @dev: the CPU
*/
static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
- int latency_req)
+ int latency_req, int next_event)
{
struct menu_device *data = &__get_cpu_var(menu_devices);
int i;
@@ -303,7 +299,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;

/* determine the expected residency time, round up */
- data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());
+ data->next_timer_us = next_event;

get_iowait_load(&nr_iowaiters, &cpu_load);
data->bucket = which_bucket(data->next_timer_us, nr_iowaiters);
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index fb465c1..b379ae5 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -122,7 +122,8 @@ struct cpuidle_driver {
extern void disable_cpuidle(void);

extern int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev, int latency_req);
+ struct cpuidle_device *dev,
+ int latency_req, int next_event);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -150,7 +151,8 @@ extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev)
#else
static inline void disable_cpuidle(void) { }
static inline int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev, int latency_req)
+ struct cpuidle_device *dev,
+ int latency_req, int next_event)
{return -ENODEV; }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
@@ -206,7 +208,7 @@ struct cpuidle_governor {

int (*select) (struct cpuidle_driver *drv,
struct cpuidle_device *dev,
- int latency_req);
+ int latency_req, int next_event);
void (*reflect) (struct cpuidle_device *dev, int index);

struct module *owner;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 72955e9..7b6a148 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -79,8 +79,8 @@ static void cpuidle_idle_call(void)
{
struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ struct timespec t;
int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
- int next_state, entered_state;
unsigned int broadcast;

/*
@@ -112,11 +112,18 @@ static void cpuidle_idle_call(void)
if (latency_req == 0)
goto use_default;

+ t = ktime_to_timespec(tick_nohz_get_sleep_length());
+
+ /*
+ * The next timer event for this in us
+ */
+ next_event = t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
+
/*
* Ask the cpuidle framework to choose a convenient idle state.
* Fall back to the default arch idle method on errors.
*/
- next_state = cpuidle_select(drv, dev, latency_req);
+ next_state = cpuidle_select(drv, dev, latency_req, next_event);
if (next_state < 0) {
use_default:
/*
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:53 UTC
The io latency tracking mechanism was recently added to the energy aware
scheduler kernel tree. The purpose of this framework is to provide a way to
predict the IO latencies, in other words to try to guess how long we will be
sleeping while waiting for an IO. When the cpu goes idle, we know the sleep
duration as far as the timer is concerned, but for the other wakeup sources
we rely on some statistics in the menu governor, which is part of the
cpuidle framework.

The io latency tracking provides additional information about the length of
the expected sleep time which, combined with the timer duration, should give
us a more accurate prediction.

The first step of the io latency tracking was simply a sliding average of
the values, which is not really accurate as it is not immune to IO
ping-pong or large variations.
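
For reference, that first version used the classical exponential moving
average, with N = 64 (implemented as a shift by 6 in io_latency_avg()):

  avg[i] = avg[i-1] + (latency[i] - avg[i-1]) / N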

In order to improve that, each latency is grouped into a bucket, which
represents an interval of latencies, and a sliding average is computed per
bucket.

Why? Because we don't want to keep all the individual latencies and compute
statistics on them: it does not make sense, it takes a lot of memory and
computation time, and in the end the result cannot really be resolved
mathematically. It is better to use intervals to group the small variations
of the latencies. For example, 186us, 123us and 134us can all fall into the
bucket [100 - 199].

The size of the bucket is the bucket interval and represents the resolution
of the statistical model. E.g. with a bucket interval of 1us, we end up
doing statistics on every individual value, with of course a bad prediction
because the number of distinct latencies is large. A big interval gives
better statistics, but can lead to a misprediction as the interval is
larger.

Choosing the size of the bucket interval vs the idle sleep time is the
tradeoff to find. With a 200us bucket interval, the measurements show we
still have good predictions, fewer mispredictions, and we cover the idle
states' target residencies.

The buckets are dynamically created and stored in a list. A new bucket is
added at the end of the list.

This list keeps being reordered depending on the number of successive hits
a bucket gets. The more a bucket is successively hit, the more likely it is
to be the first element of the list.

The guessed next latency, which is a bucket (understand: it will be between
e.g. 200us and 300us with a bucket interval of 100us), is retrieved from the
list. Each bucket present in the list gets a score: the more hits a bucket
has, the bigger its score. *But* this is weighted by the position in the
list; the first elements have more weight than the last ones. This position
changes dynamically when a bucket is hit several times in a row.

Example with the following latencies:
10, 100, 100, 100, 100, 100, 10, 10

We will have two buckets: 0 and 1.

10 => bucket0(1)
100 => bucket0(1), bucket1(1)
100 => bucket0(1), bucket1(2)
100 => bucket0(1), bucket1(3)
100 => bucket0(1), bucket1(4)
* 100 => bucket1(5), bucket0(1)
10 => bucket1(5), bucket0(2)
10 => bucket1(5), bucket0(3)

At (*), bucket1 reached 5 successive hits and has been moved to the
beginning of the list, and bucket0 became the second one.
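
Applying the weighting used by io_latency_guessed_bucket() below
(score = hits / (2 * position + 1), the head of the list being position 0)
to the final state of the list:

  bucket1: 5 hits, position 0 -> score = 5 / 1 = 5
  bucket0: 3 hits, position 1 -> score = 3 / 3 = 1

bucket1 wins, so its per-bucket sliding average is used as the guessed next
latency.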

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
include/linux/sched.h | 8 ++
kernel/exit.c | 1 +
kernel/fork.c | 1 +
kernel/sched/core.c | 3 +-
kernel/sched/io_latency.c | 309 ++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/io_latency.h | 8 +-
6 files changed, 304 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6af032b..9652ad6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1228,7 +1228,15 @@ struct io_latency_node {
unsigned int avg_latency;
ktime_t start_time;
ktime_t end_time;
+ struct list_head bucket_list;
};
+
+void exit_io_latency(struct task_struct *tsk);
+#else
+static inline void exit_io_latency(struct task_struct *tsk)
+{
+ ;
+}
#endif

struct task_struct {
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..3413fbe 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -757,6 +757,7 @@ void do_exit(long code)
exit_task_namespaces(tsk);
exit_task_work(tsk);
exit_thread();
+ exit_io_latency(tsk);

/*
* Flush inherited counters to the parent - before the parent
diff --git a/kernel/fork.c b/kernel/fork.c
index 7201bc4..d4e7ecc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -347,6 +347,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
tsk->task_frag.page = NULL;
#ifdef CONFIG_SCHED_IO_LATENCY
tsk->io_latency.avg_latency = 0;
+ INIT_LIST_HEAD(&tsk->io_latency.bucket_list);
#endif
account_kernel_stack(ti, 1);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64181f6..96403f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6961,6 +6961,8 @@ void __init sched_init(void)
autogroup_init(&init_task);

#endif /* CONFIG_CGROUP_SCHED */
+
+ io_latency_init();

for_each_possible_cpu(i) {
struct rq *rq;
@@ -7035,7 +7037,6 @@ void __init sched_init(void)
#endif
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
- io_latency_init(rq);
}

set_load_weight(&init_task);
diff --git a/kernel/sched/io_latency.c b/kernel/sched/io_latency.c
index 2d56a38..5f6bd50 100644
--- a/kernel/sched/io_latency.c
+++ b/kernel/sched/io_latency.c
@@ -23,23 +23,280 @@ struct io_latency_tree {
struct io_latency_node *left_most;
};

+/*
+ * That represents the resolution of the statistics in usec, the latency
+ * for a bucket is BUCKET_INTERVAL * index.
+ * The finer the resolution, the worse the prediction rate will be.
+ * Some measurements:
+ *
+ * For 1ms:
+ * SSD 6Gb/s : 99.7%
+ * SD card class 10: 97.7%
+ * SD card class 4 : 54.3%
+ * HDD on USB : 93.6%
+ *
+ * For 500us:
+ * SSD 6Gb/s : 99.9%
+ * SD card class 10 : 96.8%
+ * SD card class 4 : 55.8%
+ * HDD on USB : 86.3%
+ *
+ * For 200us:
+ * SSD 6Gb/s : 99.7%
+ * SD card class 10 : 95.5%
+ * SD card class 4 : 29.5%
+ * HDD on USB : 66.3%
+ *
+ * For 100us:
+ * SSD 6Gb/s : 85.7%
+ * SD card class 10 : 67.63%
+ * SD card class 4 : 31.4%
+ * HDD on USB : 44.97%
+ *
+ * Aiming at 100% is not necessarily good because we want to hit the correct
+ * idle state. Setting a low resolution will group the different latencies
+ * into a big interval which may overlap with the cpuidle state target
+ * residency.
+ *
+ */
+#define BUCKET_INTERVAL 200
+
+/*
+ * Number of successive hits for the same bucket. That is the threshold
+ * triggering the move of the element at the beginning of the list, so
+ * becoming more weighted for the statistics when guessing for the next
+ * latency.
+ */
+#define BUCKET_SUCCESSIVE 5
+
+/*
+ * What is a bucket ?
+ *
+ * A bucket is an interval of latency. This interval is defined with the
+ * BUCKET_INTERVAL. The bucket index gives what latency interval we have.
+ * For example, if you have an index 2 and a bucket interval of 1000usec,
+ * then the bucket contains the latencies between 2000 and 2999 usec.
+ *
+ */
+struct bucket {
+ int hits;
+ int successive_hits;
+ int index;
+ int average;
+ struct list_head list;
+};
+
+static struct kmem_cache *bucket_cachep;
+
static DEFINE_PER_CPU(struct io_latency_tree, latency_trees);

/**
- * io_latency_init : initialization routine to be called for each possible cpu.
+ * io_latency_bucket_find - Find a bucket associated with the specified index
*
- * @rq: the runqueue associated with the cpu
+ * @index: the index of the bucket to find
+ * @tsk: the task to retrieve the task list
*
+ * Returns the bucket associated with the index, NULL if no bucket is found
*/
-void io_latency_init(struct rq *rq)
+static struct bucket *io_latency_bucket_find(struct task_struct *tsk, int index)
{
- int cpu = rq->cpu;
- struct io_latency_tree *latency_tree = &per_cpu(latency_trees, cpu);
- struct rb_root *root = &latency_tree->tree;
+ struct list_head *list;
+ struct bucket *bucket = NULL;
+ struct list_head *bucket_list = &tsk->io_latency.bucket_list;

- spin_lock_init(&latency_tree->lock);
- latency_tree->left_most = NULL;
- root->rb_node = NULL;
+ list_for_each(list, bucket_list) {
+
+ bucket = list_entry(list, struct bucket, list);
+
+ if (bucket->index == index)
+ return bucket;
+ }
+
+ return NULL;
+}
+
+/**
+ * io_latency_bucket_alloc - Allocate a bucket
+ *
+ * @index: index of the bucket to allocate
+ *
+ * Allocate and initialize a bucket structure
+ *
+ * Returns a pointer to a bucket or NULL if the allocation failed
+ */
+static struct bucket *io_latency_bucket_alloc(int index)
+{
+ struct bucket *bucket;
+
+ bucket = kmem_cache_alloc(bucket_cachep, GFP_KERNEL);
+ if (bucket) {
+ bucket->hits = 0;
+ bucket->successive_hits = 0;
+ bucket->index = index;
+ bucket->average = 0;
+ INIT_LIST_HEAD(&bucket->list);
+ }
+
+ return bucket;
+}
+
+/**
+ * io_latency_guessed_bucket - try to predict the next bucket
+ *
+ * @tsk: the task to get the bucket list
+ *
+ * The list is ordered by history. The first element is the one with
+ * the more *successive* hits. This function is called each time a new
+ * latency is inserted. The algorithm is pretty simple here: As the
+ * first element is the one which more chance to occur next, its
+ * weight is the bigger, the second one has less weight, etc ...
+ *
+ * The bucket which has the maximum score (number of hits weighted by
+ * its position in the list) is the next bucket which has more chances
+ * to occur.
+ *
+ * Returns a pointer to the bucket structure, NULL if there are no
+ * buckets in the list
+ */
+static struct bucket *io_latency_guessed_bucket(struct task_struct *tsk)
+{
+ int weight = 0;
+ int score, score_max = 0;
+ struct bucket *bucket, *winner = NULL;
+ struct list_head *list = NULL;
+ struct list_head *bucket_list = &tsk->io_latency.bucket_list;
+
+ if (list_empty(bucket_list))
+ return NULL;
+
+ list_for_each(list, bucket_list) {
+
+ bucket = list_entry(list, struct bucket, list);
+
+ /*
+ * The list is ordered by history, the first element has
+ * more weight the next one
+ */
+ score = bucket->hits / ((2 * weight) + 1);
+
+ weight++;
+
+ if (score < score_max)
+ continue;
+
+ score_max = score;
+ winner = bucket;
+ }
+
+ return winner;
+}
+
+/*
+ * io_latency_bucket_index - Returns the bucket index for the specified latency
+ *
+ * @latency: the latency for which to compute the bucket index
+ *
+ * Returns an integer for the bucket's index
+ */
+static int io_latency_bucket_index(int latency)
+{
+ return latency / BUCKET_INTERVAL;
+}
+
+/*
+ * io_latency_bucket_fill - Compute and fill the bucket list
+ *
+ * @tsk: the task completing an IO
+ * @latency: the latency of the IO
+ *
+ * The dynamic of the list is the following.
+ * - Each new element is inserted at the end of the list
+ * - Each element passing <BUCKET_SUCCESSIVE> times in this function
+ * is elected to be moved at the beginning at the list
+ *
+ * Returns 0 on success, -1 if a bucket allocation failed
+ */
+static int io_latency_bucket_fill(struct task_struct *tsk, int latency)
+{
+ int diff, index = io_latency_bucket_index(latency);
+ struct bucket *bucket;
+
+ /*
+ * Find the bucket associated with the index
+ */
+ bucket = io_latency_bucket_find(tsk, index);
+ if (!bucket) {
+ bucket = io_latency_bucket_alloc(index);
+ if (!bucket)
+ return -1;
+
+ list_add_tail(&bucket->list, &tsk->io_latency.bucket_list);
+ }
+
+ /*
+ * Increase the number of times this bucket has been hit
+ */
+ bucket->hits++;
+ bucket->successive_hits++;
+
+ /*
+ * Compute a sliding average for latency in this bucket
+ */
+ diff = latency - bucket->average;
+ bucket->average += (diff >> 6);
+
+ /*
+ * We hit a successive number of times the same bucket, move
+ * it at the beginning of the list
+ */
+ if (bucket->successive_hits == BUCKET_SUCCESSIVE) {
+ list_move(&bucket->list, &tsk->io_latency.bucket_list);
+ bucket->successive_hits = 1;
+ }
+
+ return 0;
+}
+
+/*
+ * exit_io_latency - free resources when the task exits
+ *
+ * @tsk : the exiting task
+ *
+ */
+void exit_io_latency(struct task_struct *tsk)
+{
+ struct list_head *bucket_list = &tsk->io_latency.bucket_list;
+ struct list_head *tmp, *list;
+ struct bucket *bucket;
+
+ list_for_each_safe(list, tmp, bucket_list) {
+
+ list_del(list);
+ bucket = list_entry(list, struct bucket, list);
+ kmem_cache_free(bucket_cachep, bucket);
+ }
+}
+
+/**
+ * io_latency_init : initialization routine
+ *
+ * Initializes the cache pool and the io latency rb trees.
+ */
+void io_latency_init(void)
+{
+ int cpu;
+ struct io_latency_tree *latency_tree;
+ struct rb_root *root;
+
+ bucket_cachep = KMEM_CACHE(bucket, SLAB_PANIC);
+
+ for_each_possible_cpu(cpu) {
+ latency_tree = &per_cpu(latency_trees, cpu);
+ latency_tree->left_most = NULL;
+ spin_lock_init(&latency_tree->lock);
+ root = &latency_tree->tree;
+ root->rb_node = NULL;
+ }
}

/**
@@ -54,18 +311,20 @@ s64 io_latency_get_sleep_length(struct rq *rq)
int cpu = rq->cpu;
struct io_latency_tree *latency_tree = &per_cpu(latency_trees, cpu);
struct io_latency_node *node;
- ktime_t now = ktime_get();
- s64 diff;
+ s64 diff, next_event, now;

node = latency_tree->left_most;
-
if (!node)
return 0;

- diff = ktime_to_us(ktime_sub(now, node->start_time));
- diff = node->avg_latency - diff;
+ next_event = ktime_to_us(node->start_time) + node->avg_latency;
+ now = ktime_to_us(ktime_get());
+ diff = next_event - now;

- /* Estimation was wrong, return 0 */
+ /* Estimation was wrong, so the next io event should have
+ * already occurred but it actually didn't, so we have a
+ * negative value, return 0 in this case as it is considered
+ * by the caller as an invalid value */
if (diff < 0)
return 0;

@@ -78,13 +337,17 @@ s64 io_latency_get_sleep_length(struct rq *rq)
* @node: a rb tree node belonging to a task
*
*/
-static void io_latency_avg(struct io_latency_node *node)
+static void io_latency_avg(struct task_struct *tsk)
{
- /* MA[i] = MA[i-1] + (X[i] - MA[i-1]) / N, with N = 64 */
+ struct io_latency_node *node = &tsk->io_latency;
s64 latency = ktime_to_us(ktime_sub(node->end_time, node->start_time));
- s64 diff = latency - node->avg_latency;
+ struct bucket *bucket;
+
+ io_latency_bucket_fill(tsk, latency);

- node->avg_latency = node->avg_latency + (diff >> 6);
+ bucket = io_latency_guessed_bucket(tsk);
+ if (bucket)
+ node->avg_latency = bucket->average;
}

/**
@@ -118,7 +381,11 @@ int io_latency_begin(struct rq *rq, struct task_struct *tsk)

parent = *new;

- if (lat->avg_latency > node->avg_latency)
+ /*
+ * Check *when* will occur the next event
+ */
+ if (ktime_to_us(lat->start_time) + lat->avg_latency >
+ ktime_to_us(node->start_time) + node->avg_latency)
new = &parent->rb_left;
else {
new = &parent->rb_right;
@@ -170,5 +437,5 @@ void io_latency_end(struct rq *rq, struct task_struct *tsk)

spin_unlock(&latency_tree->lock);

- io_latency_avg(old);
+ io_latency_avg(tsk);
}
diff --git a/kernel/sched/io_latency.h b/kernel/sched/io_latency.h
index 62ece7c..c54de4d 100644
--- a/kernel/sched/io_latency.h
+++ b/kernel/sched/io_latency.h
@@ -11,12 +11,12 @@
*/

#ifdef CONFIG_SCHED_IO_LATENCY
-extern void io_latency_init(struct rq *rq);
+extern void io_latency_init(void);
extern int io_latency_begin(struct rq *rq, struct task_struct *tsk);
extern void io_latency_end(struct rq *rq, struct task_struct *tsk);
-extern int io_latency_get_sleep_length(struct rq *rq);
+extern s64 io_latency_get_sleep_length(struct rq *rq);
#else
-static inline void io_latency_init(struct rq *rq)
+static inline void io_latency_init(void)
{
;
}
@@ -31,7 +31,7 @@ static inline void io_latency_end(struct rq *rq, struct task_struct *tsk)
;
}

-static inline int io_latency_get_sleep_length(struct rq *rq)
+static inline s64 io_latency_get_sleep_length(struct rq *rq)
{
return 0;
}
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:52 UTC
Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/cpuidle.c | 8 +++
drivers/cpuidle/sysfs.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/cpuidle.h | 7 ++-
3 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index bf42e17..66231a40 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -139,6 +139,14 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
*/
dev->states_usage[entered_state].time += dev->last_residency;
dev->states_usage[entered_state].usage++;
+ if (diff < dev->last_residency)
+ atomic_inc(&dev->over_estimate);
+ else if (entered_state < (drv->state_count - 1) &&
+ dev->last_residency <
+ drv->states[entered_state + 1].target_residency)
+ atomic_inc(&dev->under_estimate);
+ else
+ atomic_inc(&dev->right_estimate);
} else {
dev->last_residency = 0;
}
diff --git a/drivers/cpuidle/sysfs.c b/drivers/cpuidle/sysfs.c
index 97c5903..f446bd0 100644
--- a/drivers/cpuidle/sysfs.c
+++ b/drivers/cpuidle/sysfs.c
@@ -439,6 +439,154 @@ static void cpuidle_remove_state_sysfs(struct cpuidle_device *device)
cpuidle_free_state_kobj(device, i);
}

+#define kobj_to_stats_kobj(k) container_of(k, struct cpuidle_stats_kobj, kobj)
+#define attr_to_stats_attr(a) container_of(a, struct cpuidle_stats_attr, attr)
+
+#define define_show_stats_function(_name) \
+ static ssize_t show_stats_##_name(struct cpuidle_device *dev, \
+ char *buf) \
+ { \
+ return sprintf(buf, "%d\n", atomic_read(&dev->_name)); \
+ }
+
+#define define_store_stats_function(_name) \
+ static ssize_t store_stats_##_name(struct cpuidle_device *dev, \
+ const char *buf, size_t size) \
+ { \
+ unsigned long long value; \
+ int err; \
+ if (!capable(CAP_SYS_ADMIN)) \
+ return -EPERM; \
+ err = kstrtoull(buf, 0, &value); \
+ if (err) \
+ return err; \
+ \
+ atomic_set(&dev->_name, value); \
+ return size; \
+ }
+
+#define define_one_stats_rw(_name, show, store) \
+ static struct cpuidle_stats_attr attr_stats_##_name = \
+ __ATTR(_name, 0644, show, store)
+
+struct cpuidle_stats_kobj {
+ struct cpuidle_device *dev;
+ struct completion kobj_unregister;
+ struct kobject kobj;
+};
+
+struct cpuidle_stats_attr {
+ struct attribute attr;
+ ssize_t (*show)(struct cpuidle_device *, char *);
+ ssize_t (*store)(struct cpuidle_device *, const char *, size_t);
+};
+
+static void cpuidle_stats_sysfs_release(struct kobject *kobj)
+{
+ struct cpuidle_stats_kobj *stats_kobj = kobj_to_stats_kobj(kobj);
+ complete(&stats_kobj->kobj_unregister);
+}
+
+static ssize_t cpuidle_stats_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ int ret = -EIO;
+ struct cpuidle_stats_kobj *stats_kobj = kobj_to_stats_kobj(kobj);
+ struct cpuidle_stats_attr *dattr = attr_to_stats_attr(attr);
+
+ if (dattr->show)
+ ret = dattr->show(stats_kobj->dev, buf);
+
+ return ret;
+}
+
+static ssize_t cpuidle_stats_store(struct kobject *kobj,
+ struct attribute *attr,
+ const char *buf, size_t size)
+{
+ int ret = -EIO;
+ struct cpuidle_stats_kobj *stats_kobj = kobj_to_stats_kobj(kobj);
+ struct cpuidle_stats_attr *dattr = attr_to_stats_attr(attr);
+
+ if (dattr->store)
+ ret = dattr->store(stats_kobj->dev, buf, size);
+
+ return ret;
+}
+
+define_show_stats_function(right_estimate);
+define_store_stats_function(right_estimate);
+
+define_show_stats_function(under_estimate);
+define_store_stats_function(under_estimate);
+
+define_show_stats_function(over_estimate);
+define_store_stats_function(over_estimate);
+
+define_one_stats_rw(right_estimate,
+ show_stats_right_estimate,
+ store_stats_right_estimate);
+
+define_one_stats_rw(under_estimate,
+ show_stats_under_estimate,
+ store_stats_under_estimate);
+
+define_one_stats_rw(over_estimate,
+ show_stats_over_estimate,
+ store_stats_over_estimate);
+
+static const struct sysfs_ops cpuidle_stats_sysfs_ops = {
+ .show = cpuidle_stats_show,
+ .store = cpuidle_stats_store,
+};
+
+static struct attribute *cpuidle_stats_default_attrs[] = {
+ &attr_stats_right_estimate.attr,
+ &attr_stats_under_estimate.attr,
+ &attr_stats_over_estimate.attr,
+ NULL
+};
+
+static struct kobj_type ktype_stats_cpuidle = {
+ .sysfs_ops = &cpuidle_stats_sysfs_ops,
+ .default_attrs = cpuidle_stats_default_attrs,
+ .release = cpuidle_stats_sysfs_release,
+};
+
+static int cpuidle_add_stats_sysfs(struct cpuidle_device *dev)
+{
+ struct cpuidle_stats_kobj *kstats;
+ struct cpuidle_device_kobj *kdev = dev->kobj_dev;
+ int ret;
+
+ kstats = kzalloc(sizeof(*kstats), GFP_KERNEL);
+ if (!kstats)
+ return -ENOMEM;
+
+ kstats->dev = dev;
+ init_completion(&kstats->kobj_unregister);
+
+ ret = kobject_init_and_add(&kstats->kobj, &ktype_stats_cpuidle,
+ &kdev->kobj, "stats");
+ if (ret) {
+ kfree(kstats);
+ return ret;
+ }
+
+ kobject_uevent(&kstats->kobj, KOBJ_ADD);
+ dev->kobj_stats = kstats;
+
+ return ret;
+}
+
+static void cpuidle_remove_stats_sysfs(struct cpuidle_device *dev)
+{
+ struct cpuidle_stats_kobj *kstats = dev->kobj_stats;
+ kobject_put(&kstats->kobj);
+ wait_for_completion(&kstats->kobj_unregister);
+ kfree(kstats);
+}
+
#ifdef CONFIG_CPU_IDLE_MULTIPLE_DRIVERS
#define kobj_to_driver_kobj(k) container_of(k, struct cpuidle_driver_kobj, kobj)
#define attr_to_driver_attr(a) container_of(a, struct cpuidle_driver_attr, attr)
@@ -589,6 +737,13 @@ int cpuidle_add_device_sysfs(struct cpuidle_device *device)
ret = cpuidle_add_driver_sysfs(device);
if (ret)
cpuidle_remove_state_sysfs(device);
+
+ ret = cpuidle_add_stats_sysfs(device);
+ if (ret) {
+ cpuidle_remove_driver_sysfs(device);
+ cpuidle_remove_state_sysfs(device);
+ }
+
return ret;
}

@@ -598,6 +753,7 @@ int cpuidle_add_device_sysfs(struct cpuidle_device *device)
*/
void cpuidle_remove_device_sysfs(struct cpuidle_device *device)
{
+ cpuidle_remove_stats_sysfs(device);
cpuidle_remove_driver_sysfs(device);
cpuidle_remove_state_sysfs(device);
}
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index e99823f..c3b9bdd 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -44,7 +44,6 @@ struct cpuidle_state {
int power_usage; /* in mW */
unsigned int target_residency; /* in US */
bool disabled; /* disabled on all CPUs */
-
int (*enter) (struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index);
@@ -62,6 +61,7 @@ struct cpuidle_state {
struct cpuidle_device_kobj;
struct cpuidle_state_kobj;
struct cpuidle_driver_kobj;
+struct cpuidle_stats_kobj;

struct cpuidle_device {
unsigned int registered:1;
@@ -74,8 +74,13 @@ struct cpuidle_device {
struct cpuidle_state_kobj *kobjs[CPUIDLE_STATE_MAX];
struct cpuidle_driver_kobj *kobj_driver;
struct cpuidle_device_kobj *kobj_dev;
+ struct cpuidle_stats_kobj *kobj_stats;
struct list_head device_list;

+ atomic_t right_estimate;
+ atomic_t under_estimate;
+ atomic_t over_estimate;
+
#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
int safe_state_index;
cpumask_t coupled_cpus;
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:50 UTC
This simple governor takes into account the predictable events: the timer
sleep duration and the next expected IO sleep duration. By mixing both it
deduces which idle state fits best. This governor must be extended with a
statistical approach to predict all the other events.
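
For example (made-up numbers): with a next timer event in 10000us, a guessed
next IO completion in 700us and a 150us PM QoS latency constraint, the
governor computes

  next_event = min(700, 10000) = 700us

and picks the deepest enabled state whose target_residency is below 700us
and whose exit_latency is below 150us.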

The main purpose of this governor is to handle the guessed next events in a
categorized way:
1. deterministic events: timers
2. guessed events: IOs
3. statistically predicted events: keystrokes, incoming network packets, ...

This governor is intended to be moved later near the scheduler, so it can
inspect/inject more information and act proactively rather than reactively.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/Kconfig | 4 +++
drivers/cpuidle/governors/Makefile | 1 +
drivers/cpuidle/governors/select.c | 55 ++++++++++++++++++++++++++++++++++++++
3 files changed, 60 insertions(+)
create mode 100644 drivers/cpuidle/governors/select.c

diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 32748c3..11e9781 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -25,6 +25,10 @@ config CPU_IDLE_GOV_MENU
bool "Menu governor (for tickless system)"
default y

+config CPU_IDLE_GOV_SELECT
+ bool "Select governor (for tickless system)"
+ default y
+
menu "ARM CPU Idle Drivers"
depends on ARM
source "drivers/cpuidle/Kconfig.arm"
diff --git a/drivers/cpuidle/governors/Makefile b/drivers/cpuidle/governors/Makefile
index 1b51272..fa45520 100644
--- a/drivers/cpuidle/governors/Makefile
+++ b/drivers/cpuidle/governors/Makefile
@@ -4,3 +4,4 @@

obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
+obj-$(CONFIG_CPU_IDLE_GOV_SELECT) += select.o
diff --git a/drivers/cpuidle/governors/select.c b/drivers/cpuidle/governors/select.c
new file mode 100644
index 0000000..0de7095
--- /dev/null
+++ b/drivers/cpuidle/governors/select.c
@@ -0,0 +1,55 @@
+/*
+ * select.c - the select governor
+ *
+ * Copyright (C) 2014 Daniel Lezcano <***@linaro.org>
+ *
+*/
+
+#include <linux/cpuidle.h>
+
+static int select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ struct cpuidle_times *times)
+{
+ int i, index = 0, latency_req = times->latency_req;
+ unsigned int next_event;
+
+ /*
+ * If the guessed IO next event is zero, that means there is no IO
+ * pending, so we ignore it in the equation
+ */
+ next_event = times->next_io_event ?
+ min(times->next_io_event, times->next_timer_event) :
+ times->next_timer_event;
+
+ for (i = 0; i < drv->state_count; i++) {
+
+ struct cpuidle_state *s = &drv->states[i];
+ struct cpuidle_state_usage *su = &dev->states_usage[i];
+
+ if (s->disabled || su->disable)
+ continue;
+ if (s->target_residency > next_event)
+ continue;
+ if (s->exit_latency > latency_req)
+ continue;
+
+ index = i;
+ }
+
+ return index;
+}
+
+static struct cpuidle_governor select_governor = {
+ .name = "select",
+ .rating = 10,
+ .select = select,
+ .owner = THIS_MODULE,
+};
+
+static int __init select_init(void)
+{
+ return cpuidle_register_governor(&select_governor);
+}
+
+postcore_initcall(select_init);
+
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:51 UTC
Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/governors/select.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/governors/select.c b/drivers/cpuidle/governors/select.c
index 0de7095..2193b78 100644
--- a/drivers/cpuidle/governors/select.c
+++ b/drivers/cpuidle/governors/select.c
@@ -41,7 +41,7 @@ static int select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

static struct cpuidle_governor select_governor = {
.name = "select",
- .rating = 10,
+ .rating = 30,
.select = select,
.owner = THIS_MODULE,
};
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:48 UTC
Moving around the different functions dealing with time made the time-related
headers no longer necessary in cpuidle.c.

Remove them.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/cpuidle.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 658be9f..1abd5a0 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -8,16 +8,12 @@
* This code is licenced under the GPL.
*/

-#include <linux/clockchips.h>
#include <linux/kernel.h>
#include <linux/mutex.h>
-#include <linux/sched.h>
#include <linux/notifier.h>
#include <linux/pm_qos.h>
#include <linux/cpu.h>
#include <linux/cpuidle.h>
-#include <linux/ktime.h>
-#include <linux/hrtimer.h>
#include <linux/module.h>
#include <trace/events/power.h>
--
1.9.1
Daniel Lezcano
2014-10-22 13:57:46 UTC
As we get the latency_req from cpuidle_idle_call(), just pass it to the
cpuidle layer instead of duplicating the code across the governors.

That has the benefit of moving, little by little, the different timings we
want to integrate with the scheduler closer to it.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/cpuidle.c | 5 +++--
drivers/cpuidle/governors/ladder.c | 3 +--
drivers/cpuidle/governors/menu.c | 4 ++--
include/linux/cpuidle.h | 7 ++++---
kernel/sched/idle.c | 3 ++-
5 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index ee9df5e..372c36f 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -158,7 +158,8 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
*
* Returns the index of the idle state.
*/
-int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ int latency_req)
{
if (off || !initialized)
return -ENODEV;
@@ -169,7 +170,7 @@ int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
if (unlikely(use_deepest_state))
return cpuidle_find_deepest_state(drv, dev);

- return cpuidle_curr_governor->select(drv, dev);
+ return cpuidle_curr_governor->select(drv, dev, latency_req);
}

/**
diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index a7f18e7..18f0da9 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -64,12 +64,11 @@ static inline void ladder_do_selection(struct ladder_device *ldev,
* @dev: the CPU
*/
static int ladder_select_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, int latency_req)
{
struct ladder_device *ldev = &__get_cpu_var(ladder_devices);
struct ladder_device_state *last_state;
int last_residency, last_idx = ldev->last_state_idx;
- int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);

last_state = &ldev->states[last_idx];

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 4ed3915..96f8fb0 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -287,10 +287,10 @@ again:
* @drv: cpuidle driver containing state data
* @dev: the CPU
*/
-static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ int latency_req)
{
struct menu_device *data = &__get_cpu_var(menu_devices);
- int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
int i;
unsigned int interactivity_req;
unsigned long nr_iowaiters, cpu_load;
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 25e0df6..fb465c1 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -122,7 +122,7 @@ struct cpuidle_driver {
extern void disable_cpuidle(void);

extern int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev, int latency_req);
extern int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index);
extern void cpuidle_reflect(struct cpuidle_device *dev, int index);
@@ -150,7 +150,7 @@ extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev)
#else
static inline void disable_cpuidle(void) { }
static inline int cpuidle_select(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, int latency_req)
{return -ENODEV; }
static inline int cpuidle_enter(struct cpuidle_driver *drv,
struct cpuidle_device *dev, int index)
@@ -205,7 +205,8 @@ struct cpuidle_governor {
struct cpuidle_device *dev);

int (*select) (struct cpuidle_driver *drv,
- struct cpuidle_device *dev);
+ struct cpuidle_device *dev,
+ int latency_req);
void (*reflect) (struct cpuidle_device *dev, int index);

struct module *owner;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 3042b924..72955e9 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -5,6 +5,7 @@
#include <linux/cpu.h>
#include <linux/cpuidle.h>
#include <linux/tick.h>
+#include <linux/pm_qos.h>
#include <linux/mm.h>
#include <linux/stackprotector.h>

@@ -115,7 +116,7 @@ static void cpuidle_idle_call(void)
* Ask the cpuidle framework to choose a convenient idle state.
* Fall back to the default arch idle method on errors.
*/
- next_state = cpuidle_select(drv, dev);
+ next_state = cpuidle_select(drv, dev, latency_req);
if (next_state < 0) {
use_default:
/*
--
1.9.1

Daniel Lezcano
2014-10-22 13:57:45 UTC
If a zero latency is required, we don't want to invoke any cpuidle code at
all.

Move the check out of the governors and do it before selecting the state, in
order to fall back to the default idle function.

Signed-off-by: Daniel Lezcano <***@linaro.org>
---
drivers/cpuidle/governors/ladder.c | 6 ------
drivers/cpuidle/governors/menu.c | 4 ----
kernel/sched/idle.c | 8 ++++++++
3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/cpuidle/governors/ladder.c b/drivers/cpuidle/governors/ladder.c
index 044ee0d..a7f18e7 100644
--- a/drivers/cpuidle/governors/ladder.c
+++ b/drivers/cpuidle/governors/ladder.c
@@ -71,12 +71,6 @@ static int ladder_select_state(struct cpuidle_driver *drv,
int last_residency, last_idx = ldev->last_state_idx;
int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);

- /* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0)) {
- ladder_do_selection(ldev, last_idx, 0);
- return 0;
- }
-
last_state = &ldev->states[last_idx];

if (drv->states[last_idx].flags & CPUIDLE_FLAG_TIME_VALID) {
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 34db2fb..4ed3915 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -302,10 +302,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)

data->last_state_idx = CPUIDLE_DRIVER_STATE_START - 1;

- /* Special case when user has set very strict latency requirement */
- if (unlikely(latency_req == 0))
- return 0;
-
/* determine the expected residency time, round up */
data->next_timer_us = ktime_to_us(tick_nohz_get_sleep_length());

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 11e7bc4..3042b924 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -78,6 +78,7 @@ static void cpuidle_idle_call(void)
{
struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
int next_state, entered_state;
unsigned int broadcast;

@@ -104,6 +105,13 @@ static void cpuidle_idle_call(void)
rcu_idle_enter();

/*
+ * The latency requirement does not allow any latency, jump to
+ * the default idle function
+ */
+ if (latency_req == 0)
+ goto use_default;
+
+ /*
* Ask the cpuidle framework to choose a convenient idle state.
* Fall back to the default arch idle method on errors.
*/
--
1.9.1