1
00:00:08,466 --> 00:00:13,946
Every day, all over the world, people use
Google as their first choice of search engine.

2
00:00:18,736 --> 00:00:20,076
What is the deal with Google?

3
00:00:20,076 --> 00:00:26,286
It's well and truly a household name. It's even
a verb, to Google something. Why is it so popular?

4
00:00:26,586 --> 00:00:27,916
Let's go find out.

5
00:00:30,436 --> 00:00:33,516
One of the main reasons is their
clever mathematical software.

6
00:00:35,016 --> 00:00:38,966
PageRank is a link analysis
algorithm, named after Larry Page.

7
00:00:39,756 --> 00:00:44,536
It's used by the Google web search engine and
assigns a numerical weighting to each element

8
00:00:44,536 --> 00:00:46,546
of a hyperlinked set of documents,

9
00:00:46,806 --> 00:00:49,956
with the purpose of measuring its
relative importance within the set.

10
00:00:53,446 --> 00:00:58,346
Maths is really the core of computer science,
so it's vital to what we do here at Google.

11
00:00:58,726 --> 00:01:03,606
So if you come up with the solution or an
algorithm to a problem, then maths is the tool

12
00:01:03,606 --> 00:01:08,436
that lets you understand how well that's gonna
run, how much time or space it's gonna take

13
00:01:08,546 --> 00:01:12,736
and it's gonna let you then prove
whether that's the optimal way of solving

14
00:01:12,736 --> 00:01:14,506
that problem, whether there are better ways.

15
00:01:15,626 --> 00:01:19,916
Or even sometimes if you have a problem, maths
can show you is it possible for a computer

16
00:01:19,916 --> 00:01:25,256
to solve that problem at all. At the heart of
Google software is a system called PageRank.

17
00:01:27,086 --> 00:01:31,176
Which basically gives every site on
the Internet a rank between 0 and 1.

18
00:01:32,036 --> 00:01:33,316
So how is this calculated?

19
00:01:34,036 --> 00:01:39,316
Well, the PageRank of your site is
determined by the links to your web site.

20
00:01:39,796 --> 00:01:45,666
Each time somebody adds a link to your web site,
Google interprets this as a vote for your site.

21
00:01:46,156 --> 00:01:49,166
The more links you have to your
site, the more votes you get.

22
00:01:49,516 --> 00:01:53,726
With PageRank what we like to do
is think of all the different pages

23
00:01:53,916 --> 00:01:56,986
on the internet as different nodes in the graph.

24
00:01:57,246 --> 00:02:01,576
So if you have Page A and it has a link
pointing to Page B then we have those

25
00:02:01,576 --> 00:02:04,786
as two separate nodes with an
edge joining them together.

26
00:02:05,046 --> 00:02:07,006
And what we have is when
you have all those nodes

27
00:02:07,006 --> 00:02:10,546
and edges together that's what
we call in mathematics, a graph.

28
00:02:10,826 --> 00:02:14,616
And once we can understand the
internet's connectivity as a graph,

29
00:02:14,616 --> 00:02:19,196
then we can use a whole heap of
powerful mathematics to understand it.

30
00:02:19,196 --> 00:02:21,716
Web pages are all linked,
they're part of a network.

31
00:02:21,806 --> 00:02:26,596
So an inlink from my page to your
page is an endorsement of your page.

32
00:02:27,076 --> 00:02:30,486
The more inlinks your page
has, the more important it is.

33
00:02:31,336 --> 00:02:37,176
But if your page has an inlink from a page
with many outlinks, then that endorsement

34
00:02:37,336 --> 00:02:42,986
of your page is of lesser value as it is
coming from a page that is less discerning.
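
The idea that an endorsement is diluted by the number of outlinks on the endorsing page can be sketched in a few lines of Python. This is a simplified illustration, not Google's implementation: the page names and the `vote_weights` helper are hypothetical, and each page is assumed to split a single vote equally among its outlinks. The full PageRank calculation also scales each vote by the rank of the page casting it.

```python
# Hypothetical link graph: each key links to the pages in its list.
links = {
    "A": ["B"],            # A is discerning: a single outlink
    "C": ["B", "D", "E"],  # C spreads its endorsement over three outlinks
    "B": [],
    "D": [],
    "E": [],
}


def vote_weights(links):
    """Total vote share each page receives when every page splits
    one vote equally among its outlinks (an assumed simplification)."""
    received = {page: 0.0 for page in links}
    for source, targets in links.items():
        for target in targets:
            received[target] += 1.0 / len(targets)
    return received


weights = vote_weights(links)
for page in sorted(weights):
    print(page, round(weights[page], 3))
```

Here B collects a full vote from A but only a third of a vote from C, so the link from the less discerning page counts for less.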

35
00:02:43,526 --> 00:02:45,956
Ranking anything, really, where do you start?

36
00:02:46,736 --> 00:02:51,456
Well, one of my jobs is to go through lists
of thousands of products and try to identify

37
00:02:51,526 --> 00:02:52,946
which ones are more important than others.

38
00:02:52,946 --> 00:02:58,266
A good place to start is to use the variables
as signals of importance, like product price

39
00:02:58,266 --> 00:03:00,026
or how many times a product is viewed.

40
00:03:00,576 --> 00:03:04,896
Then, like Google, use an algorithm to sort
the products from most to least important.

41
00:03:05,956 --> 00:03:11,006
Google PageRank is a probability
distribution used to represent the likelihood

42
00:03:11,126 --> 00:03:15,606
that a person randomly clicking on links
will arrive at any particular page.

43
00:03:15,986 --> 00:03:21,326
A probability is expressed as a
numeric value between 0 and 1.

44
00:03:23,236 --> 00:03:27,326
Converting this idea into a formula
that can be calculated for each

45
00:03:27,376 --> 00:03:31,936
of the 14 billion web pages
is an amazing achievement.

46
00:03:33,506 --> 00:03:37,416
And Google guarantees an
accuracy between 3 and 7 digits.

47
00:03:38,056 --> 00:03:43,656
Awesome. But let's bring this down
to manageable numbers for a second.

48
00:03:43,656 --> 00:03:48,196
Using this small-scale set of 6
people to represent just 6 web pages.

49
00:03:48,196 --> 00:03:52,796
As with any network, some might
know each other. I know you.

50
00:03:52,796 --> 00:03:54,166
How do I know you?

51
00:03:54,306 --> 00:03:55,586
I've seen you around.

52
00:03:55,586 --> 00:03:56,426
Yeah, you look familiar.

53
00:03:56,426 --> 00:03:58,006
And so have a link.

54
00:03:59,296 --> 00:04:01,636
Some don't. No, I don't know you, I'm sorry.

55
00:04:01,636 --> 00:04:02,666
Do you work online?

56
00:04:02,746 --> 00:04:03,596
No, I don't know you.

57
00:04:03,786 --> 00:04:06,126
Maybe there's just one that they all know of.

58
00:04:06,126 --> 00:04:08,306
Oh yeah. I think I know that guy.

59
00:04:08,306 --> 00:04:09,406
Yes, I know him.

60
00:04:09,856 --> 00:04:10,526
Everyone knows him.

61
00:04:10,526 --> 00:04:12,646
Doesn't everyone know Red?

62
00:04:12,646 --> 00:04:17,276
That one then is the most popular and
so will get the highest PageRank.

63
00:04:19,316 --> 00:04:24,786
In the small example shown here, you
can see that P6 has the strongest links.

64
00:04:24,826 --> 00:04:27,286
P5 and P4 link directly to P6.

65
00:04:27,286 --> 00:04:30,736
And there are paths in the
graph from P1 and P3 to P6.

66
00:04:30,736 --> 00:04:36,156
Even though we would guess that
P6 has the highest PageRank,

67
00:04:36,156 --> 00:04:40,716
it is not at all clear what
ranks the other pages would have.

68
00:04:40,756 --> 00:04:44,936
This is where the Google
PageRank algorithm comes in.

69
00:04:44,936 --> 00:04:48,666
The first step in calculating
the Google PageRank of each page,

70
00:04:48,666 --> 00:04:51,256
is to represent the graph in a table.

71
00:04:51,576 --> 00:04:57,896
The table shows, for example, that
there are links from P1 to P2 and to P3.

72
00:04:57,896 --> 00:05:00,326
A neat way to record the results.

73
00:05:01,166 --> 00:05:09,436
The next step is to replace the table
by the corresponding 6 by 6 matrix A.
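
One way to picture the step from table to matrix is the minimal Python sketch below. Only the links mentioned in the narration (P1 to P2 and P3, P4 and P5 to P6) are filled in; the rest of the table is not given in the transcript, so the other entries stay at zero here. The names `link_table` and `adjacency_matrix` are assumptions made for illustration, as is the choice to let rows stand for the linking page.

```python
# Links stated in the narration; the full table is not reproduced here.
link_table = {
    1: [2, 3],  # P1 links to P2 and P3
    4: [6],     # P4 links to P6
    5: [6],     # P5 links to P6
}


def adjacency_matrix(table, n=6):
    """Build an n-by-n matrix A with A[i][j] = 1 when page i+1 links to page j+1."""
    A = [[0] * n for _ in range(n)]
    for source, targets in table.items():
        for target in targets:
            A[source - 1][target - 1] = 1
    return A


A = adjacency_matrix(link_table)
for row in A:
    print(row)
```

Each row of the printed matrix corresponds to one row of the table, with a 1 wherever that page links out.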

74
00:05:09,436 --> 00:05:15,096
After several manipulations we arrive at
the Google page ranking matrix G. Rather

75
00:05:15,096 --> 00:05:26,616
than a tiny 6 by 6 matrix as in our example,
the Google matrix G is 14 billion by 14 billion.

76
00:05:26,616 --> 00:05:31,116
The average number of outlinks
on a webpage is 10.

77
00:05:31,116 --> 00:05:36,096
This makes it computationally possible for
Google to process the matrix multiplication,

78
00:05:36,276 --> 00:05:40,006
to solve the equations and yield
a final list of page ranks.
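
Because each page has only a handful of outlinks, the matrix is mostly zeros, and one pass over the links is enough to carry out a matrix multiplication. The sketch below repeats that pass until the ranks settle, a technique known as power iteration. It is an illustration under stated assumptions, not Google's production method: the damping factor of 0.85, the equal sharing of rank from pages with no outlinks, and every link other than P1 to P2 and P3, P4 to P6 and P5 to P6 are assumptions, since the transcript gives neither the full graph nor the manipulations that lead to G.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration over an adjacency list. The cost of each pass is
    proportional to the number of links, not to the square of the page count."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outlinks is shared equally by all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical six-page graph. Only P1->P2, P1->P3, P4->P6 and P5->P6 come
# from the narration; the other links are invented for illustration.
links = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P5"],
    "P4": ["P6"],
    "P5": ["P6"],
    "P6": [],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this made-up graph P6 comes out on top, in line with the narration's guess, and the scores add up to 1, as a probability distribution should.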

79
00:05:40,166 --> 00:05:40,956
So cool!

80
00:05:43,046 --> 00:05:45,806
It's easy enough finding out who
knows who with just six people.

81
00:05:45,806 --> 00:05:49,116
But imagine doing this 14 billion times.

82
00:05:49,406 --> 00:05:54,026
That's twice the population of the world.

83
00:05:54,026 --> 00:05:57,086
The thing about Google is that everything
we do, we do at a massive scale.

84
00:05:57,426 --> 00:06:00,456
So we are handling billions of
pages and billions of emails,

85
00:06:00,676 --> 00:06:05,686
billions of images, you name it, there's
a lot of it, there's a lot of data

86
00:06:05,856 --> 00:06:09,166
and so the maths becomes critical
at every single part

87
00:06:09,586 --> 00:06:12,676
of the product because without it we'd be lost.

88
00:06:23,416 --> 00:06:24,966
Oh great, thanks!

