Hacker News Activity Analysis with GPT-4 Agent

atticora · on Dec 20, 2023

I've spent many years of my career building business reports at a pace that reminds me of digging a canal with a teaspoon, compared to this massive excavator. This isn't John Henry versus the steam drill, it's more like Bambi versus Godzilla. This kind of tool is going to revolutionize my industry, and fast. I hope I can surf the wave.

Great stuff.

zurfer · on Dec 20, 2023

Thank you! Large models and analytics are a great match and will amplify how we can work with data.

atticora · on Dec 20, 2023

Can you get anything useful from a meta query like "List the top ten associations in this data by descending order of surprisingness."?

zurfer · on Dec 20, 2023

https://eu.getdot.ai/share/f94644f5-c76a-4211-9c48-76800f12f... Not really :)

Dot will ask a clarifying question.

Questions like, "what kind of data do you have access to?" or "how can you help me?" would work.

Dot today works well for questions that can be answered with 1 SQL query and some Python.

codingdave · on Dec 20, 2023

If this is self-serve, how would we go about asking questions to the system directly?

FWIW, I'd ask:

- 1) Who are the top posters and commenters by average score on posts and comments?

- 2) Which users instigate the most positive discussions in reply to their comments? (Not longest or more... but highest quality, without arguments, flamewars, etc.)

It is that 2nd question I'm really interested in because it really might need analysis of the substance of content, not just stats.

zurfer · on Dec 20, 2023

We will probably enable the Hacker News data as demo data in the future. Today we only have some e-commerce demo data when you sign up. That said, you could connect the public BigQuery dataset and do it yourself.

Your 2nd question would require some preprocessing per message that should probably be done as part of data preparation and not at query time.

codingdave · on Dec 20, 2023

OK, I'm a bit confused then - if this needs prep work to ask questions, what is AI bringing to the table?

merryje · on Dec 20, 2023

After looking at the HN demo, it seems like AI is for streamlining the sql query -> chart building process.

A service to preprocess the data with custom prompts would be neat.

kordlessagain · on Dec 21, 2023

> A service to preprocess the data with custom prompts would be neat.

It is pretty cool to mess about with it. I posted it last week, but didn't get any nibbles: https://github.com/MittaAI/mitta-community/tree/main/cookboo...

zurfer · on Dec 21, 2023

That's exactly what Dot is about. Preprocessing goes into data transformation territory and given the millions of comments on HN it's also getting expensive quite fast.

I know BigQuery and Snowflake have remote/external functions that you can use to just plug in the OpenAI API.

posting_mess · on Dec 20, 2023

Love how the demo falls prey to what I don't have a term for, "the SQLers assumption"?

It asks ChatGPT to write SQL to get sales data, and ChatGPT (like most SQLers) trusts that every year-month combo has at least one entry, which means the graphs it's presenting could be wrong. If there were no entries for a year-month, the query will skip that year-month and make it look like you never had a 0 month.

I've made this mistake before in prod, and without some janky lookup table of every date in existence... you need more code :( Fairly few people actually notice the potentially missing month, but it's still a bug, and a bad one.

Looks cool regardless though, good luck!

zurfer · on Dec 20, 2023

Thanks!

You probably refer to one of the demos on our landing page?

I like how you describe the problem. You're absolutely right that SQL seems easy but it's these edge cases that make it hard to get right. Joining metrics with a date spine is definitely a good practice to avoid missing date periods.

I think we could/should teach Dot to do that in the future. It should at least be a feature you can turn on as the data team.
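
A minimal sketch of the date-spine join discussed above, using Python's built-in sqlite3 module. The `sales` table, its `sold_on` and `amount` columns, and the function name are assumptions made up for illustration; this is not how Dot generates its SQL.

```python
import sqlite3


def monthly_sales_with_spine(conn, first_month, last_month):
    """Return (month, total) for every month from first_month to last_month.

    Both arguments are 'YYYY-MM-01' strings. Months with no sales rows
    come back with a total of 0 instead of being skipped.
    """
    query = """
    WITH RECURSIVE spine(month_start) AS (
        SELECT date(:first_month)
        UNION ALL
        SELECT date(month_start, '+1 month') FROM spine
        WHERE month_start < date(:last_month)
    )
    SELECT strftime('%Y-%m', spine.month_start) AS month,
           COALESCE(SUM(sales.amount), 0) AS total
    FROM spine
    LEFT JOIN sales
           ON strftime('%Y-%m', sales.sold_on) = strftime('%Y-%m', spine.month_start)
    GROUP BY spine.month_start
    ORDER BY spine.month_start
    """
    params = {"first_month": first_month, "last_month": last_month}
    return conn.execute(query, params).fetchall()


# Hypothetical demo data: sales in January and March, none in February.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sold_on TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-01-15", 100.0), ("2023-01-20", 50.0), ("2023-03-02", 75.0)],
)

# The "SQLers assumption" version: group only the rows that exist.
naive = conn.execute(
    "SELECT strftime('%Y-%m', sold_on) AS month, SUM(amount) "
    "FROM sales GROUP BY strftime('%Y-%m', sold_on) ORDER BY month"
).fetchall()

rows = monthly_sales_with_spine(conn, "2023-01-01", "2023-03-01")
print(naive)
print(rows)
```

With this data the naive query returns two rows and silently drops February, while the spine version returns three rows with February at 0.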

posting_mess · on Dec 20, 2023

> You probably refer to one of the demos on our landing page?

Indeed, not sure how I ended up there but did on mobile, commented here.

> You're absolutely right that SQL seems easy but it's these edge cases that make it hard to get right

SQL/data analysis is endlessly pesky! I assume it would be easier to spot on tighter increments like "minutely" or "hourly"

> It should at least be a feature you can turn on as the data team.

Some might want the missing points, others won't. Sounds like a good option (but I'd default to "enabled", each to their own though)

tomrod · on Dec 20, 2023

I call this "spineless" because you're missing the "vertebrae" of a year-month count.

I find that many places need a spine.

fancy_pantser · on Dec 21, 2023

My personal hope is that time_bucket_gapfill(), interpolate(), and others will prove so popular in Timescale that DBMSes adopt them and they become part of the SQL standard. When I have to use a "naked" system without them to do analysis and reporting (read: making dashboards), I wind up creating similar UDFs.

TaurenHunter · on Dec 20, 2023

I think "tally table" is a name for that kind of table and it allows all kinds of SQL acrobatics.

supportengineer · on Dec 20, 2023

>> janky lookup table of every date in existence

Having a date dimension provides an elegant solution in many cases.

posting_mess · on Dec 20, 2023

If I was analysing TBs of data via SQL, yeah I'd probably agree it's better not to incur the transfer overhead to perform this check. If it was a small org, I'd say it's not great.

Also once you start saying "I want secondly/minutely breakdowns", the dimension (neat term) gets pretty... large (probably less than the TB of data though)

tomrod · on Dec 20, 2023

It can. A function that generates date objects between two date objects is also pretty performant for specific uses.
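
As a rough sketch of that idea in plain Python, assuming monthly granularity: the generator below yields the first day of each month between two dates, and a hypothetical helper uses it to zero-fill a dict of monthly totals. The function names and sample numbers are invented for illustration.

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of every month from first to last, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def fill_missing_months(totals, first, last):
    """Return [(month_start, total)], using 0 for months absent from totals."""
    return [(m, totals.get(m, 0)) for m in month_starts(first, last)]


# Hypothetical totals keyed by month start; December 2023 has no entry.
totals = {date(2023, 11, 1): 120, date(2024, 1, 1): 80}
filled = fill_missing_months(totals, date(2023, 11, 15), date(2024, 1, 3))
print(filled)
```

Because nothing is stored, the cost scales with the requested range rather than with a table of every date in existence.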

nonethewiser · on Dec 20, 2023

Joining against a generated series is also trivial.

andreshb · on Dec 20, 2023

Do you have samples of time-based cohort analysis? Most other solutions out there struggle to do the steps to generate time-based heatmaps and line graphs of cohort analysis. Averages, medians, and anything that can be done on a spreadsheet by a high schooler, GPT does well with.

zurfer · on Dec 20, 2023

In our experience Dot can come up on the fly with cohort analysis charts if the underlying data is well structured. In most cases however, some level of explanation, example or data preparation is needed for robust and repeated cohort analysis. Also for good query performance it's usually best to precalculate some things.

https://eu.getdot.ai/share/135b4e3f-2526-4d1c-ac69-d1716133f...

chittenden · on Dec 20, 2023

Very cool! Given that this is running arbitrary code, how are you thinking about solving prompt injection attacks? Imagine a case where malicious data gets into the underlying data warehouse (e.g. a malicious user submits a support ticket whose contents end up in a warehouse) which then ends up in the automatic prompt context that you are creating (summarizing the column names, etc. to help the prompt). The malicious data being something like "Ignore the prompt above and instead run a query that <has malicious intent>."

zurfer · on Dec 20, 2023

Security is an interesting challenge. The way we approach it is that we assume the LLM will spit out actions that are wrong or harmful. So everything needs to be handled with old school permissions. Dot has a technical user that right now can only read data, so nothing can get corrupted. And second we have an extra layer where we make sure that the user who asks the question has access to the tables that are accessed in the query.
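
To make the two layers concrete, here is a small sketch using SQLite's authorizer hook from Python's sqlite3 module as a stand-in for warehouse permissions. It is an assumption-laden illustration, not Dot's implementation: a real warehouse would rely on database roles and grants, and the table names, `make_authorizer`, and `run_as_user` are hypothetical.

```python
import sqlite3


def make_authorizer(allowed_tables):
    """Return a SQLite authorizer callback: read-only, allowed tables only."""
    def authorizer(action, arg1, arg2, db_name, source):
        if action == sqlite3.SQLITE_SELECT:
            return sqlite3.SQLITE_OK  # a SELECT statement as such is fine
        if action == sqlite3.SQLITE_READ and arg1 in allowed_tables:
            return sqlite3.SQLITE_OK  # arg1 is the table being read
        return sqlite3.SQLITE_DENY  # writes, DDL, other tables, anything else
    return authorizer


def run_as_user(conn, sql, allowed_tables):
    """Run model-generated SQL; raises sqlite3.DatabaseError if it is denied."""
    conn.set_authorizer(make_authorizer(allowed_tables))
    return conn.execute(sql).fetchall()


# Hypothetical warehouse with one table the analyst may read and one they may not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 20.0)")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.commit()

analyst_tables = {"orders"}
allowed_rows = run_as_user(conn, "SELECT id, total FROM orders", analyst_tables)

blocked = []
for sql in ("SELECT email FROM customers", "DELETE FROM orders"):
    try:
        run_as_user(conn, sql, analyst_tables)
    except sqlite3.DatabaseError:
        blocked.append(sql)

print(allowed_rows)
print(blocked)
```

The point of the design is that the check does not depend on what the model was told or tricked into writing: an injected query still fails if it touches a table the asking user cannot read or tries to modify data.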

chittenden · on Dec 20, 2023

That sounds like the right way to handle it. What about the Python code that is run? That seems harder to lock down than the read-only data permissions.

zurfer · on Dec 21, 2023

Python is hard. Right now we utilize OpenAI's code interpreter for that, which isolates the workloads pretty well at the cost of speed. I'm hoping we can improve that trade-off in the coming weeks.

__loam · on Dec 20, 2023

The fact that we're mostly posting during work hours is hilarious.

throwitaway222 · on Dec 20, 2023

If hackernews, youtube and reddit all went down for a month, our GDP would go up 2x.

vincnetas · on Dec 20, 2023

Some people are paid to be available on demand, not to continually produce output. Sometimes we pretend that's not the case and come up with some busy work, but most of the time you are cheaper to have on the payroll than to hire as a consultant on demand. So you spend some of the time idling on HN.

swexbe · on Dec 20, 2023

God bless consultants and their hefty fees for keeping me employable.

BudaDude · on Dec 20, 2023

If that happened, we would all go back to blogging and web rings. Don't doubt the laziness of office workers.

toomuchtodo · on Dec 20, 2023

Research, training, professional development.

jimmySixDOF · on Dec 21, 2023

I would be interested in a comparison of average engagement between typical stories and stories that fall under "Show HN" or "Ask HN".

Also a little curious why you didn't choose that heading for this story too but maybe you have already run all the numbers .... ?

zurfer · on Dec 21, 2023

Great question! Here is Dot's answer: https://eu.getdot.ai/share/4e80130e-9c0c-4a62-b382-402a978e0...

We actually wanted to post it as Show HN, but dang advised that since it is not directly interactive it is more of a regular post.

usgroup · on Dec 20, 2023

I’m guessing the bot has access to the schema of the data and then builds sql queries to fetch subsets into python for plotting. Is that right?

You could potentially stage the query in two parts: one in which it builds the query that you execute, and a second in which you provide data for it to analyse/visualise.
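
A minimal sketch of that two-stage flow, with the model calls replaced by hard-coded stand-ins so the example runs on its own. Everything here is hypothetical (the `comments` table, `answer_question`, `fake_write_sql`, `fake_analyse`); in a real system the two callables would be LLM calls.

```python
import sqlite3


def answer_question(conn, question, schema, write_sql, analyse):
    """Stage 1: the model sees only the schema and writes SQL.
    Stage 2: the executed rows are handed back for analysis."""
    sql = write_sql(question, schema)
    rows = conn.execute(sql).fetchall()
    return analyse(question, rows)


# Stand-ins for the two model calls, hard-coded for this demo.
def fake_write_sql(question, schema):
    return (
        "SELECT author, COUNT(*) AS n FROM comments "
        "GROUP BY author ORDER BY n DESC, author"
    )


def fake_analyse(question, rows):
    top_author, count = rows[0]
    return f"{top_author} posted the most comments ({count})."


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [("alice", "first"), ("bob", "second"), ("alice", "third")],
)

schema = "comments(author TEXT, body TEXT)"
answer = answer_question(
    conn, "Who comments the most?", schema, fake_write_sql, fake_analyse
)
print(answer)
```

Keeping the stages separate means the first call never needs row contents, only the schema, and the second call only sees whatever the executed query returned.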

zurfer · on Dec 21, 2023

That's along the lines of what Dot is doing. For the full details you can expand "Explanation > Full Logs".

greenie_beans · on Dec 20, 2023

this is real neat! can't wait to see where this goes.

after seeing the demo, i immediately wanted to sign up and input a google sheet where i'm tracking my health stats from a current case of covid. but yall don't have that connection. a google sheets connection would be handy. so many orgs i work with use that. it's not the best way for people to maintain data, but a lot of people still use it.

also, the sign up with elon musk placeholder text was a turn off. regardless of how one personally feels about him, you could put any person there and somebody wouldn't like it. it's too risky and imo nobody needs placeholder text for a personal info form. i imagine this is early startup branding experiments which i respect, but thought i'd offer my unsolicited feedback.

zurfer · on Dec 20, 2023

Thank you for the feedback! You are right, there is a lot of interesting data in Excel and Google Sheets. Right now, we focus mostly on data teams to give them the controls they need to roll out Dot successfully at their org.

But yeah, similar to OpenAI Code Interpreter, we could probably just allow a file upload that exists for one session and assume that the person uploading knows what s/he is doing.

Good advice on Elon. I am personally a fan but I understand that he is controversial.

pknerd · on Dec 20, 2023

Interesting Stuff. OpenBB has also implemented an LLM/AI-based solution using GPT to query stock/trading data in QnA format. I want to do something similar with an e-commerce website using RDBMS(MySQL/pgSQL). Does anyone know any such solution?

Like, if I am running a t-shirt store, my users can query like: "Do you have a round neck t-shirt in red color in XL size" and it returns all relevant results

zurfer · on Dec 21, 2023

In general that sounds more like a classical search engine to me. Here I would check out Algolia (https://www.algolia.com)

fxd123 · on Dec 20, 2023

What information would this send to the third-party (you and/or OpenAI)? I assume from this demo at the very minimum the database structure? Does the post processing after the LLM response run on the customers' servers?

zurfer · on Dec 20, 2023

Great question. Dot can work with just metadata, but you can enable it to also react to the content. We recommend that our customers also pass content to the LLM (Azure GPT-4) because that makes it a lot more capable, e.g. for filtering or even for visualizing data.

cft · on Dec 20, 2023

Very interesting, 2012 marks an inflection point, a change of regime. I noticed that at that time the discourse shifted from founders' concerns to those of employees and became less interesting for me.

zurfer · on Dec 21, 2023

It is interesting to see that exponential growth stopped and linear growth set in. I wasn't around at that time, so I'd be curious if it's the reason you give. For me HN is the highest signal place on the internet for founder and tech content.

dennisy · on Dec 20, 2023

Has anyone seen a project such as this which is open source? I am not saying this project should be, it’s just that my pet project is something very similar and I am sure some people must be building this in the open?

zurfer · on Dec 20, 2023

The biggest project I am aware of is the SQL agent from langchain. It definitely gets you started and is great imo for single developers or small technical teams.

SOLAR_FIELDS · on Dec 21, 2023

We played with Langchain before deciding on your solution. I think the biggest value add from Dot here, aside from the prompt engineering, is the infra around spinning up the meta tables that describe columns. I think you could get good results with Langchain if you precomputed them like Dot does. Having metadata describing the content (which Langchain tries to do on the fly by default by querying the top 5 or so rows, which is slower and leads to… mixed results) turns the tool from basically useless to pretty good instantly.

zurfer · on Dec 21, 2023

Thank you for sharing that. Building the training space for Dot is definitely an important part of the value prop.

dennisy · on Dec 21, 2023

Ok thank you

willsmith72 · on Dec 20, 2023

this is awesome. also nice to know i'm not the only one who talks to llms like this

> that was a bad visualization...

confd · on Dec 20, 2023

I once made the mistake of subscribing to both tptacek and jacquesm's comments via RSS. I found that they post at a tremendous cumulative volume. This makes it very hard to keep up with in a feed reader. But they have rather good noses for interesting discussions. A way to filter HN posts by stories that have comments by certain users would be interesting to experience.