ChatGPT for Data Mining
First things first, if you want to see the results, you can visit the website I made that shows events in Bremen, bremap.de (in German).
The idea#
I’ve had this idea for a long time, to find events like concerts, exhibitions, flea markets or sports events on websites of local event venues (theaters, museums, bars etc.) and display the results on a map.
I don’t know when exactly I had the idea, but every time I discovered an interesting event by chance, I thought to myself that there must be a better way for event discovery. There could be the most interesting events going on next door without you ever knowing. FOMO probably also plays a role here, but I’d like to think it’s more about inspiration and curiosity than the fear of missing out.
Getting the data#
Collecting a list of websites#
Getting the data is actually pretty simple thanks to OpenStreetMap, which has crowd-sourced locations tagged with labels like “theater” or “museum”, often including a link to the venue’s website. Using that data, I was able to extract a list of websites that potentially list events.
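One way to pull those venues out of OpenStreetMap is a query against the Overpass API. The sketch below is illustrative rather than the exact query I use: the tag selection and the Bremen area filter are assumptions, but it shows the idea of collecting venue websites.

```python
import requests

# Minimal sketch: ask the Overpass API for venues in Bremen that carry a
# "website" tag. The venue tags chosen here are illustrative, not exhaustive.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"

query = """
[out:json][timeout:60];
area["name"="Bremen"]["boundary"="administrative"]->.a;
(
  nwr["amenity"="theatre"]["website"](area.a);
  nwr["tourism"="museum"]["website"](area.a);
  nwr["amenity"="bar"]["website"](area.a);
);
out tags;
"""

response = requests.post(OVERPASS_URL, data={"data": query})
response.raise_for_status()

# Every matched element has a "website" tag because of the filter above
websites = {element["tags"]["website"] for element in response.json()["elements"]}
print(f"Found {len(websites)} candidate websites")
```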
Crawling event websites#
Scrapy was the way to go here, since it takes care of all the complicated stuff like concurrency, rate limiting and handling duplicate requests. With it, I was able to save pages that potentially contain event data. But how do we get the actual event data from there?
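A minimal spider is enough to get started. The one below is only a sketch: the start URL is a placeholder (in practice the list comes from the OpenStreetMap step), and the keyword filter for follow-up links is an assumption about how venue sites name their event pages.

```python
import scrapy

class VenueSpider(scrapy.Spider):
    """Sketch of a spider that crawls venue websites collected from OSM."""
    name = "venues"
    # Placeholder: in practice this list is filled from the OSM extraction step
    start_urls = ["https://example-venue.example/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,                 # polite rate limiting
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,   # don't hammer small venue sites
    }

    def parse(self, response):
        # Store the raw page for the later event-extraction step
        yield {"url": response.url, "html": response.text}

        # Follow internal links that look like they lead to event listings
        for href in response.css("a::attr(href)").getall():
            if any(k in href.lower() for k in ("event", "programm", "kalender")):
                yield response.follow(href, callback=self.parse)
```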
Extracting event data#
The first time I looked into it - before the days of ChatGPT - the options for implementing this idea were actually pretty limited. The main difficulty is that venues present their events in very different ways. You can have all kinds of lists, calendars, tables, plain text, or combinations of images and text, with equally varied ways of indicating the date and time.
Manually instructing a program on how to extract events for all the different websites was never an option, so I did some research into ML-powered approaches.
Research#
One attempt that stood out to me was the paper WebFormer: The Web-page Transformer for Structure Information Extraction (PDF link). The idea of using the HTML markup as a training signal seemed very promising, and the examples shown in the paper looked convincing:

The problem was that they released neither a trained model nor the dataset, so the whole approach was unfortunately a non-starter for me.
Another approach would have been to fine-tune an extractive model (like BERT) using supervised learning, which would mean having a large dataset of websites and the corresponding sequences containing the relevant data. There were two problems with that approach: first, I didn’t have such a dataset, and second, the target sequences (the date, time and name of an event) are usually dispersed throughout the page, meaning the model would have to return multiple sequences.
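To make that second problem concrete: an off-the-shelf extractive model answers with one contiguous span from the page per question, which is exactly what breaks down when name, date and time live in different corners of the markup. The snippet below is just an illustration of that limitation; the model choice and the example text are my own, not part of the project.

```python
from transformers import pipeline

# Extractive QA returns a single contiguous span of the context per question.
# Since event name, date and time are usually scattered across the page, you
# would need one question per field and still hope each answer is a clean span.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

page_text = (
    "Jazz Night im Kulturzentrum. "
    "Einlass ab 19:00 Uhr, Beginn 20:00 Uhr, Freitag 12. Mai. Eintritt frei."
)

for question in ["What is the event called?", "When does it start?"]:
    answer = qa(question=question, context=page_text)
    print(question, "->", answer["answer"])
```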
A third approach would be classic NLP named entity recognition (NER), using tools like scikit-learn, to tag the relevant pieces of data in the document. First experiments weren’t promising though, and I quickly abandoned this approach.
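For context, the quickest way to see what generic NER gives you on event text is a pretrained pipeline. The sketch below uses spaCy rather than scikit-learn, purely because it ships a ready-made German model; the example sentence is made up.

```python
import spacy

# Sketch of the NER idea: run a pretrained German pipeline over event text
# and inspect which entities it tags.
nlp = spacy.load("de_core_news_sm")

doc = nlp("Jazz Night im Kulturzentrum Schlachthof, Freitag 12. Mai, 20:00 Uhr")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Generic labels like LOC or MISC come back, but not the structured
# (name, date, time) record an event listing actually needs.
```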
