At the end of the 1980s, when Tim Berners-Lee was working at CERN on the development of HyperText Markup Language, or HTML (more on that another time), no one could have anticipated what a huge impact it would have. The first web page, which appeared at the beginning of the 1990s, set off a massive ‘informatisation’ of the world. So, what exactly is a web page? And why are some pages easy to find, while others definitely are not?
Essentially, the World Wide Web can be defined as the best-known service provided by the global internet network. It is based on the principle of linking web pages (with text, photographs, videos, etc.) using so-called hyperlinks; each page is identified by a web address, or URL (Uniform Resource Locator).
For example, if you click on this URL, it will take you to a profile of the inventor of the web, Tim Berners-Lee, at the W3 Consortium. It is precisely this principle that Berners-Lee implemented when creating the web in its current form. Put more simply, we are talking about the linking of content using hyperlinks.
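To make the idea of a web address a little more concrete, here is a minimal Python sketch that splits a URL into its parts and then resolves a hyperlink against the page it appears on. The address `www.example.org` and all of its paths are made up for illustration; only the standard library module `urllib.parse` is used.

```python
from urllib.parse import urlsplit, urljoin

# Hypothetical address, used only for illustration.
url = "https://www.example.org/people/inventor?lang=en#biography"

parts = urlsplit(url)
print(parts.scheme)    # protocol used to fetch the page
print(parts.netloc)    # server that hosts the page
print(parts.path)      # location of the page on that server
print(parts.query)     # optional parameters
print(parts.fragment)  # optional position within the page

# A hyperlink inside a page is often written relative to that page.
# It is resolved against the current address before it is followed.
link_in_page = "../history/first-page.html"
target = urljoin(url, link_in_page)
print(target)
```

The second half is the part that matters for the web as a whole: every hyperlink, once resolved, is simply another URL, and that is what ties individual pages together.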
Try and find what the very first web page ever created looked like. It is still available online.
The web pages that you look at through your browser every day (news, e-shops, company profiles, Wikipedia, …) make up the surface web. Why is it called this? You probably think that it has something to do with the information on these pages being superficial, and this is partially true. The surface web is formed by special web crawlers (or spiders) that collect and index information. Google has its crawlers, as do Bing and other search engines. Their purpose is simply to go through newly created web pages (or updated information on existing pages) and make them searchable for users.
The quality of the information is, naturally enough, not all that high, because the surface web contains a large number of pages with spam and advertising. It is up to users to judge the relevance of the information, i.e. the degree to which it meets their information needs, and also its accuracy. In this sense, the surface web is a very complex environment, because verification of information is almost entirely the responsibility of the user. At present, this is particularly the case with social media, which generates an enormous amount of both information and disinformation, which then spreads through society like an avalanche. Incidentally, some excellent work is being done in this field by the hoax.cz project, which is the largest database of nonsense, hoaxes, and alarmist and fake news (Džubák, 2000).
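The crawling and indexing described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Google or Bing actually work: the "web" is a small in-memory dictionary of made-up pages on the hypothetical host `site.example`, so nothing is fetched over the network, and the names `PAGES`, `PageParser` and `crawl_and_index` are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A tiny in-memory "web": URL -> HTML. All addresses are made up.
PAGES = {
    "http://site.example/": '<h1>Home</h1> <a href="/news.html">News</a> <a href="/shop.html">Shop</a>',
    "http://site.example/news.html": '<p>Latest web news</p> <a href="/">Home</a>',
    "http://site.example/shop.html": '<p>Web shop offers</p> <a href="/news.html">News</a>',
}

class PageParser(HTMLParser):
    """Collects the hyperlinks and the words found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl_and_index(start_url):
    """Follow hyperlinks from start_url and build a word -> pages index."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # unknown address, nothing to read
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            target = urljoin(url, href)  # turn a relative link into a full URL
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index

index = crawl_and_index("http://site.example/")
print(sorted(index["web"]))
```

Searching then amounts to looking a word up in `index`. Note that the crawler only ever discovers a page because some other page links to it.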
The problem is that web crawlers only skim the surface of the web; they do not get down into the deep web. This, in contrast to the surface web, is made up of content that is difficult or impossible for crawlers to access (e.g. scientific articles, studies, patent documents, standards, business information, …). You might object that Google Scholar, for example, does make scientific articles and patent information available. Yes, that’s true, but firstly, libraries and universities deliberately make this material available to crawlers, and secondly, we’re only talking about a fraction of the actual number of these types of documents. And this brings us to our main point: in 2014, the surface web exceeded one billion websites. In comparison to the deep web, however, this is just the tip of the iceberg.
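One simplified way to picture the difference is as a question of reachability: a crawler can only find what it can reach by following links. The sketch below assumes a made-up link graph (the page names and the `LINKS` dictionary are hypothetical) in which some documents exist on a server but are only returned in response to a search form, so no page links to them. Real deep-web content is hidden for other reasons as well, such as logins or paywalls, which this toy model ignores.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["news", "catalogue-search-form"],
    "news": ["home"],
    "catalogue-search-form": [],  # results appear only after a query is submitted
    "article-1234": [],           # returned by the form; no page links to it
    "patent-5678": [],            # returned by the form; no page links to it
}

def reachable(start, links):
    """Return every page that can be reached from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

surface = reachable("home", LINKS)
deep = set(LINKS) - surface
print(sorted(surface))
print(sorted(deep))
```

In this model the two documents end up in `deep`: they are online, but a link-following crawler starting from the home page never sees them.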
This is a good place to illustrate these two very different web environments with a picture from the end of the 1990s, when the BrightPlanet company attempted to map all the information sources of the deep web:
Fishing boats have to trawl deep below the surface of the ocean to catch the best fish. The same is true for the internet. We can search for high quality, valuable information in the deep web, and we can get there by using our knowledge of the relevant information sources and specific search techniques. NASA illustrated the boundary between surface and deep web as follows:
Most of the Internet is hidden in the #DeepWeb. We’re making tools to search it http://t.co/dCPmMr9Wri @DARPA #MEMEX pic.twitter.com/CU5TSrwE2s
NASA JPL (@NASAJPL), 22 May 2015
However, this is still not everything. Beyond the borders of the deep web we find the dark web, also known as the dark internet. This is the part of the web that is accessible to the public, but only with the help of sophisticated tools (browsers) whose primary function is anonymity and anonymous access. This makes sense, as the dark web is primarily a place for illegal activity and business.
That’s enough for our brief introduction to the web, so let’s summarise what we’ve learnt so far:
- The web is a service provided by the global internet network.
- The primary means of navigating content is through the use of hyperlinks.
- We distinguish three levels of the web: surface, deep, and dark.
- The surface web is created by web crawlers indexing information.
- The deep web is made up of unindexed and hard-to-access information.
- The dark web got its name from the illegal activity that typically takes place within it.
DŽUBÁK, Josef, 2000. Hoax : podvodné a řetězové e-maily, poplašné zprávy, phishing, scam [online] [Accessed: 2017-02-11]. Available from: http://www.hoax.cz [in Czech]