ChatGPT for Data Mining

First things first, if you want to see the results, you can visit the website I made that shows events in Bremen, bremap.de (in German).

[Image: events in Bremen]

I’ve had this idea for a long time: find events like concerts, exhibitions, flea markets or sports events on the websites of local event venues (theaters, museums, bars etc.) and display the results on a map.

I don’t know when exactly I had the idea, but every time I discovered an interesting event by chance, I thought to myself that there must be a better way to discover events. The most interesting events could be going on next door without you ever knowing. FOMO probably also plays a role here, but I’d like to think it’s more about inspiration and curiosity than the fear of missing out.

The first time I looked into it - before the days of ChatGPT - the options for implementing this idea were pretty limited. The main difficulty is that venues display events in very different ways. You can have all kinds of lists, calendars, tables, plain text or combinations of images and text, with very different styles of indicating the date and time as well.

Manually instructing a program on how to extract events for all the different websites was never an option, so I did some research into ML-powered approaches.
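To give a feeling for why the manual route doesn't scale, here is a minimal sketch of hand-written date parsing. The three sample strings are made up by me, as stand-ins for how three different venue sites might print the same date. They are not taken from real websites, and the function names are purely illustrative.

```python
import re
from datetime import datetime

# Hypothetical date strings (assumed for illustration, not scraped from real sites).
SAMPLES = [
    "24.12.2023, 20 Uhr",
    "Sa, 24. Dezember 2023 20:00",
    "2023-12-24T20:00",
]

GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10, "November": 11,
    "Dezember": 12,
}


def parse_numeric(text):
    """Handles e.g. '24.12.2023, 20 Uhr'."""
    m = re.fullmatch(r"(\d{1,2})\.(\d{1,2})\.(\d{4}), (\d{1,2}) Uhr", text)
    if not m:
        return None
    day, month, year, hour = map(int, m.groups())
    return datetime(year, month, day, hour)


def parse_month_name(text):
    """Handles e.g. 'Sa, 24. Dezember 2023 20:00'."""
    m = re.fullmatch(r"\w{2}, (\d{1,2})\. (\w+) (\d{4}) (\d{1,2}):(\d{2})", text)
    if not m or m.group(2) not in GERMAN_MONTHS:
        return None
    return datetime(int(m.group(3)), GERMAN_MONTHS[m.group(2)],
                    int(m.group(1)), int(m.group(4)), int(m.group(5)))


def parse_iso(text):
    """Handles e.g. '2023-12-24T20:00'."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


# Every new date style found in the wild needs another entry here.
PARSERS = [parse_numeric, parse_month_name, parse_iso]


def parse_event_date(text):
    """Try each hand-written parser in turn; return None if none matches."""
    for parser in PARSERS:
        result = parser(text)
        if result is not None:
            return result
    return None


for sample in SAMPLES:
    print(sample, "->", parse_event_date(sample))
```

And this covers only the date. Each site would additionally need its own rules for locating the title, the venue and the description in the markup, and those rules break whenever a site changes its layout.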

Research

One attempt that stood out to me was the paper "WebFormer: The Web-page Transformer for Structure Information Extraction". The idea of using the HTML markup as a training goal seemed very promising, and the examples shown in the paper were convincing:

[Image: extraction examples from the WebFormer paper]

The problem was that the authors released neither a trained model nor the dataset, so the whole approach was unfortunately a non-starter for me.