A day in the life of BBCi Search

Ten years ago today I published an essay entitled “A day in the life of BBCi Search” on my fledgling blog. I don’t think it is an exaggeration to say that it changed my life.

Reading blogs by Matt Jones, Tom Coates, Lee Harker, Tom Dolan and Euan Semple had inspired me to start my own blog in December 2002. And tacit approval from within the BBC by people like Jem Stone and Tom Loosemore for using staff blogs as a form of transparency at the New Media department had encouraged me to write about my work.

The article was a lengthy write-up of a presentation I’d given to BBC staff at Bush House about my findings during an exercise in search log analysis on the BBC website. I hadn’t really appreciated just how much external interest there would be in that data. Within a few days of publishing it the article had been linked to by some of the leading lights of the information architecture world, and people were beginning to invite me to talk at conferences about it.

First, a bit of context about the talk itself.

The BBC had introduced a box that defaulted to searching the web, rather than bbc.co.uk, on their homepage. The thinking was that if you added a family-friendly search engine with a UK focus to the wealth of content the BBC produced, you’d have a page that could compete with Yahoo!, MSN and Google as the “start page” for people’s experience of the internet. It wasn’t a popular move within the BBC — many content producers were outraged that people searching from the BBC homepage weren’t being directed to their stuff anymore. The search log analysis I carried out was partly to demonstrate to other areas of the BBC that the kind of things people were looking for were so diverse that the BBC couldn’t possibly ever hope to answer every query.

So set your time travel circuits for 2003, as we return to “A day in the life of BBCi Search”…

A day in the life of BBCi Search

Since BBCi launched in November 2001, the improved search offering has been collecting data on the way that BBC website users search both the BBC’s website, and through the homepage Websearch, the whole wide web.

Given such a mass of data, the easiest way to aggregate and make sense of it has been to measure the search terms that are most popular. Indeed, the BBCi homepage has a panel displaying the three most popular search terms of the moment, and an editorial and taxonomy team at the BBC constantly monitor the searches gaining high volume, in order to match the correct content to them.

The BBCi homepage in 2002

The team use reports that are generated hourly, daily and weekly to monitor the activity of the users. An hourly email alert identifies developing trends in the search terms, and specialist reports focus on trends within searches that have been generated specifically on the BBC News & BBC Sport sites. Daily lists of the most popular search terms from the site as a whole and the homepage websearch are generated, whilst weekly summaries focus on searches that originate in specific content areas of the site like Food or Cult TV.

BBCi Search alert email

BBC News search alert email

BBCi Search top 500 report

However, it became clear to me that the searches that make it into the top 500 searches of the day are not necessarily representative of search behaviour as a whole. The majority of users on BBCi put something unique into the search box, and 80% of the users of the service put in search terms that never appear on any of the statistical reports, because they only happen once or twice during the course of a day.

I therefore wanted to find out what it was that this vast majority of users were actually doing on the service, and had to find a way of analysing their behaviour without relying on our existing model of aggregating popular search terms.

Methodology

One way to go about this was to isolate one individual day, and to analyse in depth the searches that had been made. The log files collected by the search service contain information not only on the terms used, but on the time the search took place, and the area of the site that the search originated from.

A sample from the BBC’s search log files

I chose Wednesday December 11th, as it was a weekday, during UK school terms, and there were no major breaking news stories or broadcast events to dominate results. A school term weekday is the most typical day of the year, and so the most typical use of the service — as the school calendar affects traffic to BBCi web services.

I also know from experience that search behaviour is affected by large breaking news stories, for example the loss of the space shuttle Columbia, or major UK broadcast events, like Test The Nation or the launch of BBC3.

BBC News coverage of the Columbia shuttle disaster

To analyse the search terms I took 10 separate 6 minute samples from the log files, at different times of day, from 1am to 10pm. This was still too much information to classify, so I reduced the information to searches that had been made from the BBCi homepage at www.bbc.co.uk, and the searches that were made from the 404 error page. These are the most context neutral pages on the site, and this reduced the amount of information I had to deal with down to a considerable, but manageable, 15,000 search terms.
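
As a rough illustration, the sampling and filtering described above might look like this in Python. The original analysis used Perl scripts and spreadsheets, and the real log format is not reproduced here, so the record layout and the names `SearchRecord`, `NEUTRAL_ORIGINS`, `window` and `sample` are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta

@dataclass
class SearchRecord:
    when: time    # time of day the search was made
    origin: str   # area of the site the search came from (hypothetical labels)
    query: str    # the search term as typed

# Hypothetical labels for the two context-neutral origins.
NEUTRAL_ORIGINS = {"homepage", "404"}

def window(start, minutes=6):
    """Return (start, end) for a sample window of the given length."""
    end = (datetime.combine(date(2002, 12, 11), start)
           + timedelta(minutes=minutes)).time()
    return start, end

def sample(records, starts, minutes=6):
    """Keep records from neutral origins that fall inside any sample window."""
    windows = [window(s, minutes) for s in starts]
    return [
        r for r in records
        if r.origin in NEUTRAL_ORIGINS
        and any(lo <= r.when < hi for lo, hi in windows)
    ]
```

Calling `sample(records, [time(1, 0), time(13, 0)])` would keep only homepage and 404 searches made in the six minutes after 1am and after 1pm.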

I then took further 1 minute samples across the whole service and classified an additional 3,000 search terms as a control sample, to check that searches from the homepage and the 404 error page were representative of the usage of the service as a whole.

I measured the search activity on the day both in quantities using Perl scripts and spreadsheets, and by the hand-classification of individual search terms. All the examples given in this article are genuine search queries submitted to the BBC website on Wednesday December 11th 2002.

I’ll go on to look at the findings in turn, but my main conclusions were that:

  • There are 200,000 unique searches every day
  • 4 out of 5 searches are not specifically about the BBC
  • 2 in 5 searches are specifically about the UK
  • 1 in 12 searches have incorrect spelling
  • 1 in 5 attempts to use advanced search fail

UK-ness

The first quality I looked for was the UK-ness of a given search term. The BBC is funded by everybody who has a television in the UK paying a Licence Fee, and so in an international medium like the web, it is interesting to see the extent to which usage is tailored to that market.

Some searches are clearly specifically about the UK, for example: “dvla”, “edinburgh fire” and “first aid courses london”. On the other hand, some searches can’t be about anything other than world wide events or interests, for example: “estate agents in south africa”, “hotels world wide” or “nyc transit strike”.

There is a third class of searches, which I have termed ‘fuzzy UK’ searches. These are searches where I suspect the user was UK based and looking for UK material, but they have not provided a specific enough search term to be sure; examples of these include “disability discrimination”, “newspapers” or “film news”.

For the purposes of this classification I only counted searches that were unequivocally UK focussed.

I found that 40% of the searches on the service were specifically and unequivocally looking for UK based information, either in the language they used, the topic they searched for, or the geographical qualifiers added to a search. Further examples of these included: “lanark medical centre”, “hattersley crime prevention centre”, or “david blunkett”.

In addition, many of the search terms fell into the ambiguous category, where I felt the intention was to find UK based information, but the user was not explicit in their search. Examples of these include “engineering recruitment”, “hysterectomy support”, and “stationery mail order”. In nearly all of these cases I believe the intention of the user was to find UK specific information.

Regional searches

I also classified a selection of 2,000 search terms used on the regional sections of the BBCi site, from the national sites of Scotland, Wales and Northern Ireland, to the smaller ‘Where I Live’ sites covering areas of England.

I found that the searches were not any more explicitly regional than on the service as a whole, but because of the smaller data set I was also able to quantify the ‘fuzzy UK’ element.

I found that 39% of searches on these services were specifically UK focussed – and that an additional 11% of searches were of a ‘fuzzy UK’ nature.

50% of searches taking place on these national and regional areas of the BBC site were not geographically or culturally focussed on the UK at all.

Advanced search

‘Advanced Search’ is a general term describing different types of search syntax, including using quote-marks to force exact phrase matching, using ‘+’ or ‘-’ modifiers on search terms, or using Boolean constructions with AND, OR or NOT.

BBCi Search does not offer an explicit advanced search option, except on the BBC News area of the site. There are two reasons for this.

Firstly, the mix of technologies used on the site makes it difficult to provide a consistent advanced search option. When a search is made on the BBCi website, software internally known as “the wrapper” uses the referring URL to determine which of the search technologies employed by the BBC is required to answer the query. In some cases, the wrapper can make as many as four separate calls to different back-end technologies. All of these behave in different ways and understand different advanced search syntaxes, making it impossible to provide uniform functionality across the different indexes of the site. BBCi Search supports the advanced search syntax of each of its separate technologies, so the techniques can be successfully used on the site — but it is not made explicit to the users.
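
To make the idea of “the wrapper” concrete, here is a minimal sketch of choosing back-ends from the referring URL. It is not the BBC’s implementation: the route table, the back-end names and the function `backends_for` are invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from referring path prefix to back-end names.
# The first matching prefix wins; "/" is the fallback for the rest of the site.
ROUTES = [
    ("/news", ["news_index", "best_links"]),
    ("/sport", ["sport_index", "best_links"]),
    ("/", ["site_index", "best_links"]),
]

def backends_for(referrer):
    """Pick the back-ends to query, based only on the referring URL's path."""
    path = urlparse(referrer).path or "/"
    for prefix, backends in ROUTES:
        # Match the prefix itself or anything beneath it, so that
        # "/newsround" is not mistaken for "/news".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            return backends
    return ROUTES[-1][1]
```

Because each back-end may understand a different advanced syntax, a query passed through unchanged can behave differently depending on which route it takes.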

BBCi Search architecture diagram

Secondly, the target audience for the service have shown in user-testing that they are unlikely to use such an advanced search service, and even find the provision of it confusing.

BBCi Search is aimed at novice and inexperienced internet users, and its goal is to provide the best results with the greatest simplicity for the user. The label “Advanced Search” implies that you have to be an ‘expert’ to use it, which is off-putting, and it also implies that because you are using the ‘non-advanced’ interface, you are somehow missing out.

Nevertheless we find that through prior learned behaviour, a proportion of BBCi users enter search strings that contain advanced search syntax. Analysing this behaviour shows that users were most likely to attempt to use advanced search between 6am and 9am in the morning, and between 3pm and 10pm in the evening, when around 5% of searches showed some attempt to use advanced search. Notably at the peak time for site usage, over the lunch period, users were only half as likely to use advanced techniques – around 2.5% of searches.

Further analysis of these searches themselves revealed that 1 in 5 attempts to use advanced search fail. This can be because the users have misunderstood advanced search syntax, for example, enclosing a one word search within quotes. Or it is because users have failed to spell all the words of their search correctly.
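
A simple way to flag attempts at advanced search is sketched below. The patterns are assumptions for illustration, not the rules used in the original analysis; in particular, only upper-case AND, OR and NOT are treated as Boolean operators here, and the function names are hypothetical.

```python
import re

BOOLEAN = re.compile(r"\b(AND|OR|NOT)\b")   # upper-case operators only (assumption)
MODIFIER = re.compile(r"(^|\s)[+-]\w")      # '+term' or '-term' at the start of a word
QUOTED = re.compile(r'"([^"]*)"')           # text between a pair of double quotes

def uses_advanced_syntax(query):
    """True if the query contains quotes, +/- modifiers or Boolean operators."""
    return bool('"' in query or MODIFIER.search(query) or BOOLEAN.search(query))

def quotes_single_word(query):
    """One misuse mentioned in the text: quotes around a one-word search."""
    return any(len(phrase.split()) == 1 for phrase in QUOTED.findall(query))
```

A hyphenated term such as “family-friendly” is not flagged, because the modifier pattern only matches a sign at the start of a word.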

Spelling

The spelling of search terms presents perhaps the biggest challenge to the BBCi Search team, and to the process of information retrieval on the web as a whole, in bringing back relevant and targeted results to the user.

When a search is made on the BBCi site, effectively just two pieces of information are passed to the search technology — the search query itself, and the referring page. With these two pieces of information search is able to provide results that are contextualised in places where this is appropriate — for example, different top results for the search term ‘china’ depending on whether you are on the BBC News site, or on the Antiques site.

However, an analysis of search terms shows that 1 in 12 feature incorrect spellings. On December 11th this added up to over 30,000 search queries with an incorrect spelling. The combination of having to spell words and having to type is clearly a barrier to information access on the web to a large section of the online community.

For example, in the dataset analysed, the following spelling variations on flagship soap opera “EastEnders” were observed: “eastendersd”, “easteder”, “eastend”, “eastendeurs”, “eatenders” and “eastnders”.

To provide a useful result set when one of the only two pieces of information you have about the search is wrong is a formidable task. BBCi Search employs two mechanisms to combat this.

Firstly, 66% of searches with misspellings on the site were offered a “Did you mean?” spelling correction. This uses a spelling dictionary to suggest to the user that they may have meant a different word. Selecting the spelling correction link re-runs the search query with the new spelling.
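
The article does not say how the spelling dictionary was matched against queries. As a stand-in, this sketch uses approximate string matching from Python’s standard `difflib` module; the tiny `DICTIONARY`, the 0.8 cutoff and the function `did_you_mean` are assumptions for the example.

```python
import difflib

# Tiny stand-in dictionary; a real one would be far larger.
DICTIONARY = ["eastenders", "cbeebies", "newsround", "gardening"]

def did_you_mean(query):
    """Suggest a dictionary word for a misspelt query, or None."""
    word = query.lower().strip()
    if word in DICTIONARY:
        return None  # spelt correctly, nothing to suggest
    matches = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=0.8)
    return matches[0] if matches else None
```

With this dictionary, `did_you_mean("eastnders")` suggests “eastenders”, while a query with no close match gets no prompt at all.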

BBCi search result page with spelling prompt

Secondly, the editorial team at BBCi have also used misspellings as synonyms within their taxonomy, and because of this 96% of the searches for BBC channels, stations, programmes and brands with incorrect spellings on December 11th got the intended search as their top ‘BBCi Best Link’. The ‘BBCi Best Links’ and ‘BBCi Recommended Websites’ are hand-chosen, and the editorial team have the ability to add synonyms to individual URLs. This means that the CBeebies homepage can be returned as the top result even when the search query has been “cbbebies”, “cbbies” or “ceebeebies”.
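
In outline, synonyms attached to a hand-chosen URL amount to a lookup table from query text to link. The sketch below assumes a simple in-memory structure; the URL shown and the names `BEST_LINKS`, `SYNONYM_INDEX` and `best_link` are illustrative, not taken from the BBC’s taxonomy tool.

```python
# Hypothetical taxonomy entry: one hand-chosen URL plus its synonyms,
# including the misspellings quoted in the text.
BEST_LINKS = {
    "http://www.bbc.co.uk/cbeebies/": {
        "cbeebies", "cbbebies", "cbbies", "ceebeebies",
    },
}

# Invert the entries into a lookup table from search term to URL.
SYNONYM_INDEX = {
    term: url for url, terms in BEST_LINKS.items() for term in terms
}

def best_link(query):
    """Return the hand-chosen URL for a query, or None if no synonym matches."""
    normalised = " ".join(query.lower().split())
    return SYNONYM_INDEX.get(normalised)
```

The lookup is exact after normalising case and whitespace, so every misspelling has to be added by a person who has seen it in the logs.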

CBeebies node in the BBC taxonomy tool

Questions & URLs

Two other types of search I examined were where users had entered either natural language questions or URLs into the search box on the BBCi site. I found that although it was a regular occurrence, it was not a significant proportion of searches – URLs made up around 3% of searches, and questions just over 0.5% of searches.

The questions tended to look like essay titles, or focus on questions of interest to children. There are some areas of the BBCi site (SOS Teacher & Ask Bruce) where the input of natural language queries is encouraged. Although not using a specific natural language parsing engine, the removal of stop words like ‘how’ and ‘why’ allows the websearch technology to return relevant results to some of these queries.
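
Stop-word removal of this kind can be sketched in a few lines. The stop-word list actually used is not given in the article, so `STOP_WORDS` below is an invented example, as is the function name `strip_stop_words`.

```python
# Illustrative stop-word list; the real list is not given in the article.
STOP_WORDS = {
    "how", "why", "what", "when", "who",
    "is", "a", "an", "the", "do", "does", "i",
}

def strip_stop_words(question):
    """Lower-case a question, drop a trailing '?' and remove stop words."""
    words = question.lower().rstrip("?").split()
    return " ".join(w for w in words if w not in STOP_WORDS)
```

For example, “How do mobile phones work?” is reduced to “mobile phones work”, which an ordinary keyword search can handle.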

BBCi search report - questions

More importantly, study of the questions entered by users in these areas of the site also informs the content creation process. In this way the activity of the users contributes directly to the improvement of the service, as their requests for information shape the nature of the information subsequently provided.

BBCi Search reports - URLs

Searches for URLs tended to be variations on BBC web addresses, and the URLs of high profile websites outside of the BBC. In the case of the latter it seems that people are using the BBCi Web Search offering to navigate to other sites (e.g. Friends Reunited or Hotmail), which are constantly near the top of reports on the URLs that have been entered.

For variations on BBC addresses, again the BBCi Search team has used this feedback on search behaviour to set up synonyms, so that users typing in www.eastenders.co.uk will get to the BBCi EastEnders homepage, providing a better and more relevant result than a search technology could by itself if strictly looking for pages with the text ‘www.eastenders.co.uk’ on the page.

Word count

I also measured the different number of words users on the BBCi site employed when making searches.

I found that 36% of searches consisted of just one word and 35% of searches used just two words. This is a vital point. Given the opportunity of searching over the whole of the BBC site, or indeed the whole of the web, the user’s understanding or trust of search technology is such that they believe that a limited one or two word search term will achieve their goals. When we consider that the quantity of documents indexed for websearch is counted in the thousands of millions, this is a formidable task.

It is another reinforcement of the need for human intervention in search technology — to maximise the chances of these searches getting the right or relevant results. The BBCi Search editorial team are able to ensure that one word searches for “travel” or “sport” will get a top return of the best UK-based travel website, or the BBC Sport site, rather than a search result return based on the frequency of the appearance of those words within a document, or the number of links pointing at them.

Of the remaining searches, 16% contained 3 words, 7% contained 4 words, 3% contained 5 words, and the remaining 3% consisted of six or more words.
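
A breakdown like this is straightforward to compute from a list of queries. The following is a minimal sketch, assuming words are separated by whitespace and grouping everything of six or more words together; the function name `word_count_breakdown` is hypothetical.

```python
from collections import Counter

def word_count_breakdown(queries, cap=6):
    """Percentage of queries by word count; counts of `cap` or more share a bucket."""
    counts = Counter(
        min(len(q.split()), cap) for q in queries if q.strip()
    )
    total = sum(counts.values())
    return {n: round(100 * counts[n] / total, 1) for n in sorted(counts)}
```

Applied to the four example queries “dvla”, “edinburgh fire”, “david blunkett” and “first aid courses london”, it reports 25% one-word, 50% two-word and 25% four-word searches.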

Search query word count break down

The value of a Taxonomy

As I mentioned, BBCi Search has a team constantly monitoring the search activity on the site, and attempting to match the searches being made with the best possible content available, both on the BBCi site and on the web as a whole.

This role is crucial for a site the size of bbc.co.uk with an index which consists of in excess of 2,500,000 documents, without including the BBC News and BBC Sport content. It is the only way that the language of the users can be mapped to the taxonomical conventions of the organisation.

One example of this is users searching on the BBCi Science site for information on planets. An examination of the search terms used on the site shows that “planets” is consistently one of the most used search terms. However the BBCi Science homepage does not feature the word planets at all. The site has plenty of content about the solar system, but it is described as ‘the solar system’, and branded “Space”, to tie in with a television programme broadcast some 18 months ago.

The consequence of this is that a search for “planets” that relies purely on a technological word matching solution returns as its top results information about The Blue Planet television programme, ironically probably the one planet in the solar system the user was least likely to be wanting information about.

In the absence of search technology with a better semantic understanding of the English language, the only way to align the vocabulary of the site with the vocabulary of users is to intervene, by providing ‘best bet’ results that originate from a taxonomical mapping of the content of the BBC site. It is only a human who can look at that search, within that context, and decide that it equates to an individual piece of web content that the search technology would otherwise fail to return.

Another strong advantage of this system is the ability of the editorial team and taxonomists to assign new synonyms, best bet URLs, or change descriptions in real-time, in response to the actual recorded user behaviour.

A recent example of this was with the loss of the NASA Space Shuttle Columbia. The BBCi Search results pages include a news headline feed, if the query produces results from the BBC News or BBC Sport site that have been published within the last three days and cross a specific relevancy threshold. This worked fine if users were searching for “space shuttle” or “Columbia”.

However we also saw, within three hours of the accident, that there had been a considerable rise in searches for the country “Colombia”. Whilst it was conceivable that there was a simultaneous breaking news story about Colombia, it was obvious that these were searches aimed at finding information about the space shuttle from users who were unaware of the correct name.

The result set they received was about the country, and did not produce any headlines about the space shuttle. Through the use of synonyming we were able to provide a result set that contained links to the latest news stories about the shuttle, even when people were unintentionally searching for the country. Again this is something that would be impossible with a reliance on technology alone.

Change of use over time

Part of the information logged by the search system is the area of the website the search originated from. By examining the pattern of search usage over different periods of the day, we are able to build up a picture of how the focus of BBCi users changes during the day.

Not surprisingly, there is a marked increase in activity from 9am, when the working day is underway in the UK, and the peak of search activity overall is around lunchtime. However, there are notable differences in specific areas of the site.

In the educational areas at www.bbc.co.uk/schools and www.bbc.co.uk/learning — activity dips sharply at lunchtime, and then has its peak in the 2pm to 4pm slot of the afternoon. There is then a lull in activity for a couple of hours, and a secondary peak between 6pm and 9pm, when presumably the nation’s children settle down to researching their homework or doing revision.

In informational areas of the site, like www.bbc.co.uk/health or www.bbc.co.uk/gardening, the peak level of search activity is in the evening, between 7pm and 9pm — with a secondary peak around lunchtime.

For entertainment areas of the site, covering the radio stations and the programme genre areas, the peak use of search is between 4pm and 7pm.

Overall, this gives us a picture of where the focus of our users is at different times of the day. In the early afternoon it is on ‘educational’ types of searches. By 4pm the attention of BBCi users has switched to the areas of the site devoted to entertaining them. By the evening, the audience seems divided in two, between children looking for educational material and adults looking for informative material.
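
Finding the busiest hour for each area of the site is a simple grouping exercise once each search has been reduced to an hour and an area. This sketch assumes that reduction has already been done; the `(hour, area)` pair format and the function `peak_hours` are assumptions, not the format of the real logs.

```python
from collections import defaultdict

def peak_hours(records):
    """records: iterable of (hour, area) pairs. Returns the busiest hour per area."""
    per_area = defaultdict(lambda: defaultdict(int))
    for hour, area in records:
        per_area[area][hour] += 1
    # For each area, pick the hour with the highest search count.
    return {
        area: max(hours, key=hours.get) for area, hours in per_area.items()
    }
```

Secondary peaks, such as the evening homework slot, would need the full hourly counts rather than just the maximum.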

This understanding of user behaviour can be used to inform the placement and timing of promotional slots on the website, and on the editorial content of the BBCi homepage, and the genre sub-pages.

The most visible representation of this on the site is the ‘popular links’ panel on the homepage, which responds to trends within the search terms. The homepage team noticed that “fantasy football” consistently appeared on the homepage around lunchtime on weekdays, and so have experimented with tailoring the homepage content to suit those users at that time of day.

Conclusions

The overall conclusions that I have drawn from this study of one day’s activity on the BBCi Search service are:

  • There are 200,000 unique searches every day
  • 4 out of 5 searches are not specifically about the BBC
  • 2 in 5 searches are specifically about the UK
  • 1 in 12 searches have incorrect spelling
  • 1 in 5 attempts to use advanced search fail

I also believe that without the provision of taxonomical matching of content to keywords, by human intervention, the results served to the users of the site would be poorer.

Originally published on currybetdotnet, March 27, 2003

Back in 2003, I always used to finish talks with a slide that showed some natural language questions that had been asked of the BBC’s website recently. Here were the examples I included in the slides for this talk.

  • How do mobile phones work?
  • When is Wallace and Gromit on the television?
  • How do I wire a light switch?
  • What are thermometers and why are they used?
  • What does a swimming cucumber look like?
  • What is a zealot?
  • What does GCSE mean?
  • How do I roast a chicken?
  • What does empowerment mean?
  • What is WWW?
  • Who is sponsored in sport?
  • What was on?
  • What is the fire strike about?
  • What is robot as air stewards?
  • What do you know about Germany BBC?
  • What is osmosis in a boat?
  • How do I find out what happened in my house?
  • What do you know?
  • What is light?
  • How do I talk to the actors of EastEnders on the computer with out the internet?
  • Who is their God?
  • What is the weather?
  • When will it snow this year?
  • How does the drier work in paint?
  • What does the Queen do?
  • How do I felt a flat roof?
  • What is a computer?
  • Who are you to judge me?
  • How do you reconfigure certain settings on a computer?
  • What is the David Beckham rumour?
  • How does the BBC use computers?
  • What will happen with Iraq?
  • How does the British democratic process work?
  • What are the signs that a guy likes you?
  • How do internet search engines optimize results?
