Mobile devices technical review methodology

Last update: 28 February 2015

Our mobile devices technical review methodology describes in some detail how we evaluate the accessibility and usability of smartphones and tablets for blind, low vision and long-sighted users.

Note: This methodology is currently subject to modifications.

The utility of a smartphone or a tablet for a user with some kind of vision impairment (from age-related long-sightedness to blindness) depends on a number of aspects, for example, the operation of the physical device via physical buttons and on-screen controls; the screen size and brightness; the default text size; the availability and performance of functions for scaling up the text and zooming in to content; or the availability and quality of a built-in screen reader for non-visual use. A complementary approach is to conduct user tests with mobile devices; the two approaches supplement each other.

In a technical review, we investigate smartphone features that are known to be important for long-sighted, low vision and blind users. The review of a particular feature can include both points that can be directly measured, such as font size, contrast, or the availability of specific settings, as well as points that involve qualitative judgements, such as the haptic quality of a button or the ease of activating a command using a particular gesture.

User groups

We tailor our test results according to different user groups with different requirements. A test may focus on a particular group (such as long-sighted users or blind users) or provide results for several groups.

  1. Long-sighted users (hyperopia): Users with slight vision impairments (often elderly users) needing a larger text size but rarely using zoom magnification
  2. Zoom users not using the screen reader: Users with stronger vision impairments requiring enlarged text and/or zoom magnification or contrast modes, but usually without screen reader support
  3. Zoom users using the screen reader in addition: Users who require strong zoom magnification with additional / optional screen reader output
  4. Screen reader users: Non-visual users who fully depend on the screen reader

Obviously, many users will fall somewhere between these groups. Future tests may define subgroups or new groups that focus on particular conditions such as colour blindness. We recommend that readers of test results check those sections / categories that are particularly important to them.

Accessibility requirements

Many aspects of our technical review draw on general accessibility requirements as defined in recommendations and standards such as the Web Content Accessibility Guidelines (WCAG) 2.0 (ISO/IEC 40500). This means that many of our checks can be mapped to WCAG success criteria.

Depending on the user group addressed in our technical review, we may focus on subsets of WCAG requirements. For low vision users, for example, the critical one among the four POUR principles of WCAG 2.0 (Perceivable, Operable, Understandable, Robust) is P (Perceivable). Operation then works in the same way as for sighted users, and, assuming that low vision users have no cognitive impairment, success criteria related to understanding are not relevant.

Often, we do not include all potentially relevant success criteria, but focus on a few key ones. For low vision users, we usually focus on checking content against the WCAG success criteria 1.4.3 Contrast (Minimum) and 1.4.4 Resize Text, as these are prerequisites for content to be perceived at all.

When we address group 3 (low vision users who also use the screen reader) or group 4 (blind screen reader users), the review will also include checks mapping onto success criteria under the WCAG principles O (Operable) and R (Robust), mainly because using the screen reader usually means a completely different touch interaction mode on mobile devices. Additional criteria then include 1.1.1 Non-text Content, 3.3.2 Labels or Instructions, 4.1.2 Name, Role, Value, and 3.1.2 Language of Parts.

Determining the criteria included in a technical review

A number of criteria cannot be checked with a simple measurement since they are only revealed in an interaction sequence. Examples are 2.2.1 Timing Adjustable, 2.4.3 Focus Order (applicable when using an external keyboard or turning on a screen reader), and 3.3.1 Error Identification. Our spot checks for text size, contrast or the accessible naming of controls can be extended to include more success criteria. Determining which criteria to include is a question of the user group(s) addressed and, ultimately, of the effort expended in our testing.

We determine the number of criteria included in a technical review by considering the aim of the test. In a comparative test that aims to establish the utility of smartphones across a wide range of categories, detailed checks of all potentially relevant criteria must be ruled out simply because the test would never be finished in time to present current results. In such a case, the focus on a few critical criteria is useful and usually sufficient. By contrast, in reviews with a much narrower scope, for example, one particular app or class of apps, including more criteria is sensible.

The difference between our technical review and a conformance test

WCAG has been developed with a focus on web content. It has been mapped to information and communication technology in general (see WCAG2ICT), and work is under way to define how it should be applied to mobile content. Other aspects, like the physical characteristics of devices, are not covered by WCAG at all.

In our review, there are instances where we go beyond WCAG: we also compare default text sizes and look at whether magnification can be achieved via the system text size settings, or whether activating the system zoom function would be necessary, which is arguably worse. WCAG merely requires text to be resizable to 200% of its original size, however tiny that size may be.

Another example is the assessment of graphical icons: do they clearly convey the function behind them? There is no WCAG success criterion for this, perhaps because a pass/fail rating is clearly inappropriate here. Nevertheless, this factor can be important in assessing the utility of an interface element on the app or system level.

To provide a third example where WCAG success criteria alone are insufficient for an adequate assessment, consider the contrast of graphics such as rulers or grids, e.g., in calendars. We know from user tests that weak contrast of a grid can make a calendar hard or impossible to use for low vision people. We therefore consider this aspect even though WCAG does not require good contrast of graphics beyond images of text.

The fact that we focus on subsets of WCAG criteria and include criteria not defined within WCAG means that our technical reviews are not WCAG conformance tests. In a conformance test, all WCAG success criteria would have to be included.

Another important difference from a WCAG-based approach is that we do not rate the conformance of content to success criteria like 1.4.4 Resize Text in WCAG's pass/fail fashion. Instead, we establish degrees of compliance with the defined criteria using a five-point rating scale (see the section on the rating scheme below).

Categories defined for technical reviews

To reflect the different aspects of the utility of a device, we have defined fourteen categories for technical reviews, covering both content areas and functional areas.

  1. Physical device characteristics
  2. Default system text size and text resizing
  3. Contrast modes
  4. Built-in zoom function
  5. Home screen
  6. Dial pad
  7. Virtual keyboard
  8. Default email client
  9. Default calendar
  10. Default browser
  11. Speech input
  12. Screen reader
  13. Speak screen function
  14. Support of peripherals (e.g., Bluetooth keyboards)

Note that not all of the categories listed here may be used in a particular review.

Categories such as the physical device characteristics, the home screen, the dial pad, and the virtual keyboard are important for all users. Three default apps - mail, calendar and browser - are also considered relevant for all users and constitute separate content categories.

Some functional categories are relevant only for particular users. Fully blind users, for example, have no use for the built-in zoom function, contrast modes, or the default system text size and text resizing options. Conversely, long-sighted users and those low vision users who never use the built-in screen reader do not care whether the screen reader works well, whether it works in tandem with the zoom function, or whether controls in apps have accessible names.

A weighting of results on the level of features within categories and of the categories themselves is applied to reflect the different needs of different user groups.

Measuring and documenting criteria

Where possible, we measure and document individual criteria according to points of comparison that can be objectively measured. For example, we measure the font size with a typometer, a transparent ruler that allows us to determine the actual font size on a particular device display. The size measured often differs from the font size according to the settings. Our typometer measurements are accurate to within about 0.5 pt.

When measuring contrast, we take screenshots, import them to a PC, and then determine the foreground-to-background contrast using a colour contrast analyser tool. When dimensions or distances are reported (e.g., the distance between a control and its label), these are usually given in mm.
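
For reference, the contrast ratio that such tools report is defined in WCAG 2.0 in terms of the relative luminance of the two colours. A minimal Python sketch of the computation (the example colour values are ours, not taken from a tested device):

    # WCAG 2.0 contrast ratio between two sRGB colours, following the
    # definitions of "relative luminance" and "contrast ratio" in the
    # WCAG 2.0 glossary.

    def linearise(channel):
        # channel is an 8-bit sRGB value (0-255)
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    def relative_luminance(rgb):
        r, g, b = (linearise(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b

    def contrast_ratio(foreground, background):
        l1 = relative_luminance(foreground)
        l2 = relative_luminance(background)
        lighter, darker = max(l1, l2), min(l1, l2)
        return (lighter + 0.05) / (darker + 0.05)

    # Example: grey text (#777777) on white narrowly fails the 4.5:1
    # minimum of success criterion 1.4.3 Contrast (Minimum).
    print(round(contrast_ratio((119, 119, 119), (255, 255, 255)), 2))  # 4.48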

When determining the availability of accessible names we turn on the built-in screen reader and use touch-explore and/or swipe gestures to focus interface controls.

In comparative reviews, we also frequently take photos of the devices positioned next to each other to demonstrate differences in layout and to provide a way for users to compare criteria beyond our measurements.

Several individual points of comparison such as font size, contrast, layout, and the discernibility of icons can be aggregated in an overall rating for a particular criterion within a category. For example, the category "default mail app" has the criterion "new mail view", which is rated according to several points of comparison: default font size, resizability via system font size settings, and contrast. Some points of comparison may be grouped in a separate criterion, especially when they are important. In the case of the new mail view, the availability of pinch zoom on the mail text and the question whether the text reflows when zooming in via pinching would be grouped in a separate criterion.

Weighting of particular criteria within categories

A weighting of criteria within categories is applied to reflect the relative importance of a criterion within a given category, and to fine-tune this weight for the respective user group.

The weights given to individual criteria within a category always add up to 100%. Different weightings per user group are applied to reflect the particular needs and preferences of that group. To give an example, within the functional category "Built-in zoom function", the two criteria "zoom function and screen reader can be used together" and "screen reader focus visible" are only relevant for low vision users who also use the screen reader. These two criteria therefore carry no weight in groups 1 and 2 (long-sighted users and zoom users who do not use the screen reader at all). By the same token, the importance of these two criteria for the group of low vision users who do use the screen reader means that, for this group, other criteria in this category are assigned a lower weight, since all criteria together have to sum up to 100%.

Another example: the default text size is very important for long-sighted users, who will usually not want to use the zoom function or significantly scale the text on the system level. Within the category "Default system text size and text resizing", the criterion "default size" has therefore been given a weight of 60%. For groups 2 and 3 of low vision users, most texts are too small to be read without the zoom function anyway, so the default text size is somewhat less important (reflected in a weight of 45%) and the criterion of resizability carries a higher relative weight.
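
To make the mechanics of this weighting concrete, here is a minimal sketch. The criterion names and the 60% / 45% figures mirror the examples above; the complementary weights for resizability are placeholders we have filled in so that each group's weights sum to 100%, not figures from an actual review:

    # Illustrative per-group weight tables for the criteria of the
    # category "Default system text size and text resizing".
    # A weight of 0 would mark a criterion as irrelevant for a group.

    WEIGHTS = {
        "group 1 (long-sighted users)": {
            "default size": 60,
            "resizability": 40,   # placeholder value
        },
        "group 2 (zoom users)": {
            "default size": 45,
            "resizability": 55,   # placeholder value
        },
    }

    def check_weights(weights):
        # The weights for each user group must sum to exactly 100%.
        for group, criteria in weights.items():
            total = sum(criteria.values())
            assert total == 100, f"{group}: weights sum to {total}%, not 100%"

    check_weights(WEIGHTS)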

We invite readers to give us feedback on our weighting choices, and to 'correct' for themselves the weighting applied in cases where their particular needs suggest different priorities.

The rating scheme

For our rating of individual features, we use a five-point Likert scale which runs from ++ (very good) to -- (very bad / unusable, or feature does not exist). The actual rating for a particular feature often combines several points of comparison. For example, when rating the reader mode of a browser, the degree to which the text size can be scaled up is one point of comparison. Another is whether larger text reflows so that users do not need to pan horizontally in order to read enlarged text. The rating is oriented towards a theoretically achievable optimum, not just towards the merits of the implementations under test. For example, in all smartphones tested so far, the character size in virtual keyboards could be larger (often significantly so) within the given space of individual keys. A system-level setting for larger keyboard characters is conceivable but does not currently exist. In a comparison of virtual keyboards, the maximum rating of ++ has therefore not yet been assigned.

The aggregation and weighting of category ratings

Individual ratings are then translated into an aggregated percentage value that applies to the respective device and the respective category. If all aspects are rated ++ (very good), the percentage result for the category is 100%. If all aspects are rated -- (very bad / unusable, or feature does not exist), the result is 0%.

The same aggregation approach also applies to categories to arrive at an overall result for the utility of a device for a particular user group. The weighting of categories reflects their relative importance for the given user group. Here, we use a weight of 0% for categories that are irrelevant for the respective user group and should therefore not be included.
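
As an illustration of how individual ratings translate into a percentage value, consider the following sketch. Only the endpoints of the scale (++ = 100%, -- = 0%) are fixed above; the symbols and linearly spaced values for the three intermediate steps are our assumption:

    # Map the five-point scale to fractions of the optimum. Only the
    # endpoints are fixed by the rating scheme; the intermediate values
    # assume a linear scale (an assumption made for illustration).
    SCALE = {"++": 1.00, "+": 0.75, "o": 0.50, "-": 0.25, "--": 0.00}

    def category_result(ratings, weights):
        # ratings: criterion -> scale symbol; weights: criterion -> percent.
        assert sum(weights.values()) == 100
        # Because the weights sum to 100, the weighted sum of the scale
        # values is already a percentage.
        return sum(SCALE[ratings[c]] * weights[c] for c in weights)

    # Example: two criteria rated + and o, weighted 60% and 40%.
    print(category_result({"default size": "+", "resizability": "o"},
                          {"default size": 60, "resizability": 40}))  # 65.0

The overall result for a device is then obtained analogously, by taking the weighted average of the category percentages with the category weights defined for the respective user group.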

Reliability and limits of technical reviews

Even with results tailored to defined user groups, an aggregate value for a particular device is no more than a broad indication of its potential utility for any particular user. It is important to look closely at those categories that are particularly relevant for the individual user.

A technical review, especially one focusing on a particular user group, will not always investigate all aspects that are potentially relevant for the overall accessibility of a device. It will also usually be restricted to particular representative features and often not cover the entire range of views, options or settings that are available on a device, in an operating system, or in individual apps. This focus on representative features is necessary to keep the overall review manageable. It also means, however, that there can be important deficiencies in devices, operating systems and apps that are not discovered in our technical review.

For example, the test of the selected features may establish that some controls in a particular app lack meaningful accessible names, without determining for all app views whether all controls are accessibly labelled. The actual deficiencies discovered then serve as the basis for our rating. Deficiencies in features that are not included do not enter our rating. This means that, while the deficiencies found are reliable comparative indicators of accessibility, a positive result based on the selected features may warrant a revision as soon as features are discovered that were not included but are of practical relevance for users.

We invite all users to report any deficiencies not captured in our technical reviews so that we can include the corresponding features in future reviews.