 How to make an Internet Search Engine 
Making a search engine seems like a daunting task, but the search part of it is actually easy. The hard part is that you basically need to index the entire internet word for word, and if you want to search images by more than just metadata or file name, you will need scripts that identify objects, characters and colors in the images, which means delving into the binary and searching for color values inside the image data.


The only real problems stopping an individual from doing this are bandwidth and storage space for the data.

How does it work?

First you make a script to ping all IP addresses, looking for a response with a website address. Then you save these as a list, similar to the hosts file. You will need to redo this periodically, and this file alone could be a few GB. For the average computer this presents a batching problem, as there are more IP addresses than the average machine can comfortably work through in one go, so you'll need to find your practical limit and program the script to handle the addresses in batches.
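
Here is a minimal sketch of one batch, assuming you use a reverse-DNS lookup (socket.gethostbyaddr()) to turn a live address into a site name; the prefix and batch size are just illustrative:

import socket

# Scan one small batch (a /24 block here) and reverse-DNS anything that answers.
def scan_batch(prefix="198.51.100"):                # hypothetical prefix
    found = []
    for host in range(256):
        ip = prefix + "." + str(host)
        try:
            name, _, _ = socket.gethostbyaddr(ip)   # reverse DNS lookup
            found.append((ip, name))
        except OSError:
            pass                                    # no name registered: skip it
    return found

for ip, name in scan_batch():
    print(ip, name)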

The basic syntax is a for loop.

Again you will have to set limits: each of the four octets runs from 0 to 255, so it's 256^4, or 4,294,967,296 different addresses, where a batch your average computer can comfortably handle is between a few thousand and a few hundred thousand. Keep in mind you won't have to log all these IP addresses; many will be private IP addresses with no site name, many will identify as Google (no use re-searching those), and some will actually be unused.

To set limits you code in a range as your starting argument, like:

for batch in range(1024, 10241):   (range() does not actually include the last number)

firstpart = the first octet of your IP address as a string "xxx", i.e. xxx.000.000.000, where xxx runs from 0 to 255
secondpart = 000.xxx.000.000
thirdpart = 000.000.xxx.000
fourthpart = 000.000.000.xxx

Basically you set each one to the number you want to start at.

You will need int() to turn each part into an integer and str() to turn it back into a string. int() is applied to each part individually before math (+ or -) is done on it; str() converts the parts back to strings so they can be attached to each other with + "." + between each to make the final IP address.

Your next block of code will be:


firstpart, secondpart, thirdpart, fourthpart = "10", "0", "0", "0"

for number in range(1024, 10241):
    ip = firstpart + "." + secondpart + "." + thirdpart + "." + fourthpart
    # ... probe ip here (next step) ...
    if int(fourthpart) < 255:                    # usual case: bump the last octet
        fourthpart = str(int(fourthpart) + 1)
    else:                                        # last octet rolls over to 0
        fourthpart = "0"
        if int(thirdpart) < 255:
            thirdpart = str(int(thirdpart) + 1)
        else:
            thirdpart = "0"
            if int(secondpart) < 255:
                secondpart = str(int(secondpart) + 1)
            else:
                secondpart = "0"
                firstpart = str(int(firstpart) + 1)
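
If you would rather not hand-roll that odometer, a shorter sketch using the standard library's ipaddress module does the same counting for you:

import ipaddress

# Same iteration, letting ipaddress handle the rollover arithmetic.
start = int(ipaddress.IPv4Address("10.0.0.0"))      # illustrative start address
for n in range(start, start + 10):                  # 10 addresses as a demo
    print(str(ipaddress.IPv4Address(n)))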


Next you will need a socket to "try:" sending some data to each address (usually just an empty/"null" probe) to get a response containing the site name (URL), if any, then log it with open(file, "a") as f: f.write("your formatted information here") whenever the response is not "" or "null".
You could even maintain an archive by opening the file with open(file, "r") to read it back and search-and-replace in it as a string, or eval() its contents (ast.literal_eval() is the safer version) to get a list [array] or dictionary {key: value} back. Close the file when you are done; a with block does that for you automatically.
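
A sketch of that probe-and-log step, assuming a plain TCP connection on port 80 and a hypothetical hosts_found.txt log file:

import socket

# Probe one address on port 80; append the first response line to a log file.
def probe(ip, logfile="hosts_found.txt"):
    try:
        with socket.create_connection((ip, 80), timeout=3) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            reply = s.recv(4096).decode("latin-1", "replace")
    except OSError:
        return                              # unreachable, closed, or timed out
    if reply:
        with open(logfile, "a") as f:
            f.write(ip + "\t" + reply.splitlines()[0] + "\n")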

This is the beginning of what we call a webspider or webcrawler, the technology search engines are built on.

Once the scan completes, you now have a full list of the available Core internet. (You may want a central script that re-launches the scanner every time a batch ends, so you don't have to start it manually for each of the many hundreds of thousands or millions of runs.)
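
A minimal sketch of that central script, assuming the scanner lives in a hypothetical scanner.py that takes a batch number as its argument:

import subprocess

# Re-launch the scanner once per batch so nothing has to be started by hand.
for batch in range(4096):                   # illustrative batch count
    subprocess.run(["python", "scanner.py", str(batch)], check=True)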

What is the Core internet? Well, Google might host a bunch of websites underneath its IP addresses, and ISPs might do the same for their users. This means Google or an ISP doing this has its own local identification addresses. For example, with something.neocities.com the viewer only gets an IP address for Neocities and tells that server it wants the "something" site; Neocities routes the user to the specified page.
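
A sketch of what that looks like on the wire: the connection goes to one IP, and the Host header picks which site under that IP you actually get (the arguments here are whatever your scan found):

import socket

# One IP can serve many sites; the Host header selects which one you get.
def fetch(ip, hostname):
    with socket.create_connection((ip, 80), timeout=3) as s:
        request = "GET / HTTP/1.0\r\nHost: " + hostname + "\r\n\r\n"
        s.sendall(request.encode())
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:                    # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)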

So your initial scan of the live internet may return a short list compared to what's really out there; there are more sites sitting below those addresses that your crawler won't pick up on initially.

This is what the second scan is for, aka *.url.whatever/. But there's an easier way: these Core websites usually keep a master index of all their pages somewhere. Your goal is to program a script that searches their main page and its subsequent pages until it finds an index or directory of subpages.
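
One convention worth trying first (an assumption; not every site publishes one) is the /sitemap.xml master index:

import re
import urllib.request

# Pull a site's sitemap, if it has one, and list the page URLs inside it.
def site_index(base="https://example.com"):         # illustrative site
    with urllib.request.urlopen(base + "/sitemap.xml", timeout=10) as r:
        xml = r.read().decode("utf-8", "replace")
    return re.findall(r"<loc>(.*?)</loc>", xml)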

These pages need to be added to a master internet directory document you make. Again, it could be very large.

After that you will need keyword indexes, where you search the content of each page's HTML for varied search terms and return the pages with the given text, keeping lists [array] of these page URLs in a dictionary under the search term.
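
A sketch of that dictionary-of-lists keyword index, assuming you already have a {url: html} mapping from your crawl:

import re

# Map every word to the list of page URLs whose HTML contains it.
def build_index(pages):                     # pages: {url: html_text}
    index = {}
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html).lower()    # strip the tags
        for word in set(re.findall(r"[a-z0-9]+", text)):
            index.setdefault(word, []).append(url)
    return index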

This is a basic internet search, and each of these documents will need to be updated periodically to ensure the information in them is up to date.

Next is search relevancy. If you were to just display results in order you'd get a numeric-alphabetic ordered list, but that's hardly telling you which is the most relevant result,

and you can't have your server bogged down searching the entire internet every time someone enters a new search sentence.

So you need to omit things like double spaces, and you need to search each list in your master internet search-term dictionary and compare the results that contain the given words or phrases from the search term exactly. Organize them so pages that say it all in the exact order, together as one phrase, and say it the most come first, run down in numeric-alphabetic order; follow that with results where the phrase is intact but less frequent (fewer occurrences); follow that with results where the search words are broken up amongst other text but are all there and in the same sentences; and finally results where the phrase is incomplete and some of the search-term words are not there at all.
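
Those tiers boil down to a sort key. A sketch, assuming results maps each URL to its page text:

# Rank pages: intact phrase (most occurrences first), then all words present,
# then partial matches; numeric-alphabetic order inside each tier.
def rank(results, query):                   # results: {url: page_text}
    words = query.lower().split()           # split() also eats double spaces
    phrase = " ".join(words)

    def key(url):
        text = results[url].lower()
        if phrase in text:
            return (0, -text.count(phrase), url)    # tier 1: exact phrase
        hits = sum(w in text for w in words)
        if hits == len(words):
            return (1, 0, url)                      # tier 2: all the words
        return (2, -hits, url)                      # tier 3: incomplete

    return sorted(results, key=key)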

That is a bunch of searching. So that's it, right?

Nope, there's one more step: Search Term Decomposition, where you break each word down into individual ASCII characters, keep a log of the character order, and attempt to find similar words. These are "distant results".
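
The standard library already has a character-order comparison you can lean on for this; a sketch with difflib:

import difflib

# Find indexed words whose spelling is close to the search term.
def distant_results(term, index_words):
    return difflib.get_close_matches(term, index_words, n=5, cutoff=0.8)

print(distant_results("windwos", ["windows", "winds", "widows", "linux"]))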

Then there are contextual results, where you might make an archive for a specific brand or OS version, so that people who search windows 7 don't get bombarded with windows 10 results. This list can be organized by the programmer using a script that searches through his internet search dictionary and finds specifications to include or omit when a user searches a specific phrase; this creates a special prioritized search dictionary key with a special prioritized list [array] of pages.
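
A sketch of such a prioritized key (the URL is obviously made up), consulted before the general dictionary:

# Hand-curated priority lists, checked before the ordinary keyword index.
PRIORITY = {
    "windows 7": ["https://example.com/windows-7-archive"],    # illustrative
}

def contextual_search(query, index):
    q = query.lower()
    return PRIORITY.get(q, []) + index.get(q, [])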

Places like Google abuse this last feature, using the same system to FORCE unwanted results from monetization scripts:
you search windows 7 == display the windows 10 results list, because we get paid to advertise it.

They are Assholes.

But your own search can use this to give legitimate, spot-on results.

When using these search logic systems you can develop an internet search that could actually break the internet by making microcuck/microsoftie, goorble and the rest go bankrupt, purely because your search is so accurate and theirs sux ass and omits results on purpose.

:errg :errg :errg


