Going from an IPv4 address to an ASN in Python 2 with Unix brute force
For reasons, I've reached the point where I would like to be able to map IPv4 addresses into the organizations responsible for them, which is to say their Autonomous System Number (ASN), for use in DWiki, the blog engine of Wandering Thoughts. So today on the Fediverse I mused:
Current status: wondering if I can design an on-disk (read only) data structure of some sort that would allow a Python 2 program to efficiently map an IP address to an ASN. There are good in-memory data structures for this but you have to load the whole thing into memory and my Python 2 program runs as a CGI so no, not even with pickle.
(Since this is Python 2, about all I have access to is gdbm or rolling my own direct structure.)
Mapping IP addresses to ASNs comes up a lot in routing Internet traffic, so there are good in-memory data structures that are designed to let you efficiently answer these questions once you have everything loaded. But I don't think anyone really worries about on-disk versions of this information, while it's the case that I care about, although I only care about some ASNs (a detail I forgot to put in the Fediverse post).
Then I had a realization:
If I'm willing to do this by /24 (and I am) and represent the ASNs by 16-bit ints, I guess you can do this with a 32 Mbyte sparse file of two-byte blocks. Seek to a 16-byte address determined by the first three octets of the IP, read two bytes, if they're zero there's no ASN mapping we care about, otherwise they're the ASN in some byte order I'd determine.
If I don't care about the specific ASN, just a class of ASNs of interest of which there are at most 255, it's only 16 Mbytes.
(And if all I care about is a yes or know answer, I can represent each /24 by a bit, so the storage required drops even more, to only 2 Mbytes.)
This Fediverse post has a mistake. I thought ASNs were 16-bit numbers, but we've gone well beyond that by now. So I would want to use the one-byte 'class of ASN' approach, with ASNs I don't care about mapping to a class of zero. Alternately I could expand to storing three bytes for every /24, or four bytes to stay aligned with filesystem blocks.
That storage requirement is 'at most' because this will be a Unix
sparse file, where filesystem blocks that aren't written to aren't
stored on disk; when read, the data in them is all zero. The lookup
is efficient, at least in terms of system calls; I'd open the file,
lseek() to the position, and read two bytes (causing the system to
read a filesystem block, however big that is). Python 2 doesn't
have access to pread() or we could do it in one system call.
Within the OS this should be reasonably efficient, because if things are active much of the important bits of the mapping file will be cached into memory and won't have to be read from disk. 32 Mbytes is nothing these days, at least in terms of active file cache, and much of the file will be sparse anyway. The OS obviously has reasonably efficient random access to the filesystem blocks of the file, whether in memory or on disk.
This is a fairly brute force approach that's only viable if you're typically making a single query in your process before you finish. It also feels like something that is a good fit for Unix because of sparse files, although 16 Mbytes isn't that big these days even for a non-sparse file.
Realizing the brute force approach feels quite liberating. I've been turning this problem over in my mind for a while but each time I thought of complicated data structures and complicated approaches and it was clear to me that I'd never implement them. This way is simple enough that I could actually do it and it's not too impractical.
PS: I don't know if I'll actually build this, but every time a horde of crawlers descends on Wandering Thoughts from a cloud provider that has a cloud of separate /24s and /23s all over the place, my motivation is going to increase. If I could easily block all netblocks of certain hosting providers all at once, I definitely would.
(To get the ASN data there's pyasn (also). Conveniently it has a simple on-disk format that can be post-processed to go from a set of CIDRs that map to ASNs to a data file that maps from /24s to ASN classes for ASNs (and classes) that I care about.)
Update: After writing most of this entry I got enthused and wrote a stand-alone preliminary implementation (initially storing full ASNs in four-byte records), which can both create the data file and query it. It was surprisingly straightforward and not very much code, which is probably what I should have expected since the core approach is so simple. With four-byte records, a full data file of all recent routes from pyasn is about 53 Mbytes and the data file can be created in less than two minutes, which is pretty good given that the code writes records for about 16.5 million /24s.
(The whole thing even appears to work, although I haven't strongly tested it.)
