Why You Can’t Just Rely on AI

You can’t simply throw a bunch of posts into an AI and trust the output. If you just feed a sample of tweets into ChatGPT and ask for a summary, you have no way of knowing if the results are truly relevant or representative of the whole dataset.

Here’s an example: if I take a random sample of tweets about September 10 and ask ChatGPT for a summary…

chatgpt-10septembre

So, beyond the obvious “what’s trending” question, we often need answers like:

  • How many tweets express anger at the government?
  • How many talk about clashes with the police?
  • How many accuse the far-left of trying to take over?
  • …and so on.

Sure, I could try to write a better prompt so that ChatGPT gives me a more detailed answer, but it still wouldn’t work.

Why? Because there’s simply no way ChatGPT (or any AI model like it) will process 2 million tweets at once. And even if it did, I would never be able to tell whether its response is relevant, biased, or just one big hallucination.

That means I’d have to work with a sample. But a sample of 1,000 tweets out of 2 million will never be representative, especially if I’m trying to measure proportions.

For example: You can’t ask ChatGPT to tell you the percentage of far-left vs. far-right tweets if you can’t even be sure both are present in your sample.
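
As a rough, hypothetical illustration of the problem: if a faction accounts for only 0.2% of 2 million tweets, there is a real chance that a 1,000-tweet random sample contains none of its tweets at all.

```python
# Hypothetical illustration: how likely is a small faction to be
# missing entirely from a 1,000-tweet random sample?
faction_share = 0.002   # assume the faction posts 0.2% of all tweets
sample_size = 1_000

# Probability that a random sample contains zero tweets from that faction.
p_absent = (1 - faction_share) ** sample_size
print(f"Chance the faction is absent from the sample: {p_absent:.1%}")
# ≈ 13.5%, and even when it does show up, a handful of tweets is far
# too few to estimate its real share of the conversation.
```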

 

The Missing Step

Before using AI to summarize or analyze, you need one crucial step: You must use the entire dataset to cluster the tweets first.

This way, you group similar tweets together across all 2 million posts, not just a random slice. Once you have these clusters, you can summarize, label, or analyze them with AI much more reliably.

Here, I’m not going to dive into how to build a full social network map; that’s a topic for another post.

But there’s one step from that process that’s incredibly valuable, even if you never build the map itself: clustering.

When you make a network map, you end up with multiple clusters: groups of accounts that are tightly connected.
They talk to each other, retweet each other, and form their own little ecosystem.

Most of the time on Twitter/X, these clusters represent factions, sometimes even political factions when the topic is a political event.

On my maps, these clusters are usually represented by different colors, like this:
map-10september
And here’s why this matters: these political factions are the perfect clusters to use before sampling.

  • You’ll know exactly how many accounts belong to each community.
  • You can then study each community separately, drawing a sample from each, to discover exactly which themes dominate each group.

This approach gives you a far more reliable and nuanced picture than throwing a random sample at an AI and hoping for the best.

How to Do It Without Code or Excel

The good news is that you don’t need to code or even use Excel to make this work. All you need is one free, open-source tool: Gephi. You can download it here: https://gephi.org/

You’ll also need two files describing your topic (in my case, exported from Visibrain):

visibrain-10septembre-x

The CSV file with the raw data of the tweets you want to study

visibrain-10septembre-x-2

The GEXF file, representing the links between the accounts you want to study (their conversations, retweets, mentions, basically, the network of interactions around your topic).

Importing Your Data into Gephi

Once you have your CSV and GEXF files, it’s time to open them in Gephi:

  1. Install Gephi: Download and install Gephi from gephi.org
  2. Open Gephi: Launch the program.
  3. Import the GEXF file:

  • Go to File → Open
  • Select the GEXF file you downloaded from Visibrain
  • Click OK to confirm the import

This will load the network of accounts and their interactions into Gephi, allowing you to visualize how the conversation is structured.

gephi1
gephi2
gephi3
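
If you ever want to script this step instead, a minimal sketch with Python and networkx (the file name is a placeholder for whatever you downloaded) would look like this:

```python
# Minimal sketch: load the interaction network from a GEXF file.
# "network.gexf" is a placeholder for the file exported from Visibrain.
import networkx as nx

G = nx.read_gexf("network.gexf")
print(f"{G.number_of_nodes()} accounts, {G.number_of_edges()} interactions")
```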

Computing the Clusters

Once the GEXF file is imported into Gephi, you can compute the clusters right away. The GEXF contains all the interactions between users on your topic (retweets, replies, mentions), so Gephi can use this network to group accounts into communities.

Here’s how to do it:

  1. In Gephi, open the Statistics panel.
  2. Find the option for Modularity (this is Gephi’s algorithm for detecting communities).
  3. Click Run, then press OK when prompted.

Gephi will analyze the network and automatically group accounts into clusters. Each cluster represents a community of accounts that interact frequently, often corresponding to factions or groups involved in the conversation.

gephi4
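
For reference, Gephi’s Modularity report uses the Louvain method under the hood. A rough Python equivalent with networkx, assuming the same GEXF file as above, could look like this (a sketch, not the exact Gephi implementation):

```python
# Rough equivalent of Gephi's Modularity step: Louvain community detection.
import networkx as nx

G = nx.read_gexf("network.gexf")   # placeholder file name
simple = nx.Graph(G)               # flatten to a simple undirected graph
communities = nx.community.louvain_communities(simple, seed=42)

# Store the community index on each node, like Gephi's modularity_class.
for class_id, members in enumerate(communities):
    for node in members:
        G.nodes[node]["modularity_class"] = class_id

print(f"{len(communities)} communities found")
```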

Once it’s done, you will see the results in the Appearance view. Click on the palette, then Partition, then modularity_class.
gephi5
In my results, you can see that I have 3 main communities: blue, green, and orange, plus some smaller ones.
But now, the important part is knowing what each community is saying. For that, I need to change one thing in the CSV containing my tweets: with Notepad (or your favorite text editor), I will rename "screen name" to "Id".
export

By changing "screen name" to "Id", I can now import my tweets into Gephi too! This isn’t the usual way Gephi is meant to be used (it’s not really designed to import content), but with this little hack, Gephi will know which account posted which tweet and, more importantly, which tweets belong to which community.
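
If you prefer not to edit the file by hand, the same rename takes a few lines of pandas (file names are placeholders, and your CSV may need a different separator):

```python
# Minimal sketch: rename the "screen name" column to "Id" so Gephi
# can match each tweet to the right account.
import pandas as pd

tweets = pd.read_csv("tweets.csv")                     # placeholder file name
tweets = tweets.rename(columns={"screen name": "Id"})
tweets.to_csv("tweets_for_gephi.csv", index=False)
```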

To import the tweets, go to File → Import Spreadsheet → Select your CSV → OK/Finish

Be very careful to check the option "Append to existing workspace" at the end of the import. This ensures your tweets are added to your existing project, rather than replacing it.
gephi6
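
In code, this "Append to existing workspace" step boils down to a join between the tweets and the community each account belongs to. Here is a sketch, reusing the graph G (with its modularity_class attribute) from the clustering sketch above and the renamed CSV:

```python
# Minimal sketch: attach each account's community to its tweets.
# Assumes G (with "modularity_class" on each node) from the sketch above.
import pandas as pd

tweets = pd.read_csv("tweets_for_gephi.csv")

membership = pd.DataFrame(
    [(node, data["modularity_class"]) for node, data in G.nodes(data=True)],
    columns=["Id", "modularity_class"],
)

tweets_with_communities = tweets.merge(membership, on="Id", how="inner")
tweets_with_communities.to_csv("tweets_with_communities.csv", index=False)
print(tweets_with_communities["modularity_class"].value_counts())
```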

Now you have:

  • Your accounts and the links between them
  • Your clusters
  • A tweet associated with each account in each cluster

And… it’s almost over! Just two more things to do. To make good samples of tweets for each community, you’ll want to export the tweets community by community, in separate files. For that, you need to add a filter: by adding a partition filter (like below), you will be able to choose which community you export.
gephi7
When the filter is added, I click on the first community (the green one in my example) and then click "Play", so that only the green community is selected. You can see in my Context window, at the top, that only 34% of my nodes (the accounts) are selected, because that is the share of accounts in my green community.
gephi8
And the last thing to do in Gephi: the final export!
To export the tweets made by the green accounts, I go to the Data Laboratory, click Export Table, check "Visible only", then go to Options. There, I select "Nodes" and "Attributes", and unselect all attributes except "Text".
gephi9
With this, you now have a new CSV containing only the text of the tweets made by the green community! I can feed this CSV directly into ChatGPT to get a much more precise output, even with a very basic prompt like: "This is a sample of tweets about a French political event. Summarize in 5 points maximum. In English."
chatgpt-2
Here, my green community, representing exactly 4,427 accounts, is mostly talking about Manon Aubry, comparing the September 10 French demonstration with the one in London.
Each of my other communities is probably talking about something else, still related to September 10 (since that’s the topic of my Visibrain project), but with their own unique perspective.
To see the differences between each community, I just have to change my filter, re-export my data into another CSV, and ask ChatGPT to summarize this new sample.
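
If you script the workflow, this filter-and-export loop becomes a few lines that write one file of tweet text per community (a sketch, reading the tweets_with_communities.csv file produced above; adjust "text" to the column name in your own export):

```python
# Minimal sketch: export one CSV of tweet text per community,
# instead of changing the Gephi filter by hand each time.
import pandas as pd

tweets = pd.read_csv("tweets_with_communities.csv")

for class_id, group in tweets.groupby("modularity_class"):
    group[["text"]].to_csv(f"community_{class_id}_tweets.csv", index=False)
    print(f"community {class_id}: {len(group)} tweets exported")
```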

Advantages

This technique is not limited by the scale of data. I’ve used a pretty similar process with 2 million tweets for the September 10 analysis.

Sometimes, we do the same thing at Agoratlas on other sources: 50 million TikTok comments, thousands of video transcripts, or LinkedIn biographies. The only difference is that we don’t take just one sample, but multiple samples for each community, based on multiple criteria.

We also provide the ID of each post, so the AI can include specific examples in its response, letting us verify its answers as quickly as possible.
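
A per-community sample that keeps the post IDs can be drawn in a few lines; in this sketch, "id" and "text" are assumed column names and 1,000 is an arbitrary sample size:

```python
# Minimal sketch: draw up to 1,000 posts per community, keeping the
# post ID so the AI's examples can be verified afterwards.
import pandas as pd

tweets = pd.read_csv("tweets_with_communities.csv")

samples = (
    tweets.groupby("modularity_class", group_keys=False)
    .apply(lambda g: g.sample(n=min(1000, len(g)), random_state=42))
)
samples[["id", "modularity_class", "text"]].to_csv("samples.csv", index=False)
```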

It doesn’t replace the human eye, but it provides a truly unique perspective, zooming in multiple times on multiple communities and topics, and processing thousands of posts at each step.

Limitations

This whole technique is essentially a hack of Gephi. We could have done the whole process with just 10 lines of code, automatically clustering, exporting, and even calling ChatGPT.

If you plan to do this kind of analysis on a daily basis, I strongly recommend having someone on your team learn basic Python to automate the workflow without relying on Gephi. If you’re interested, I’d be happy to write a more "dev-oriented" blog post to show how easily this can be done with code (just let me know in the LinkedIn comments!).
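
To give a flavour of that dev-oriented version, here is what the last step, sending each community’s sample to a model, could look like with the OpenAI Python client. This is only a sketch: the model name, the file layout, and the naive truncation are assumptions, not part of the workflow described above.

```python
# Minimal sketch: summarize each community's tweets with the OpenAI API.
# Assumes OPENAI_API_KEY is set and community_*_tweets.csv files exist.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

for path in sorted(Path(".").glob("community_*_tweets.csv")):
    # Naive cut to stay within the model's context window.
    tweets_text = path.read_text(encoding="utf-8")[:200_000]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, pick your own
        messages=[{
            "role": "user",
            "content": "This is a sample of tweets about a French political "
                       "event. Summarize in 5 points maximum. In English.\n\n"
                       + tweets_text,
        }],
    )
    print(f"--- {path.name} ---")
    print(response.choices[0].message.content)
```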

And remember: this is still just an AI. It can lie. It can hallucinate. It can be biased. But the simpler the task you give it, the less likely it is to fail.

Summarizing a set of tweets that are likely about the same topic (because they belong to the same community) is much easier for an AI than trying to summarize everything the entire network is saying at once.

Finally, this method makes it much easier for you to verify its answers:
If, in my example, the AI mentions Manon Aubry, I can simply search for "Manon Aubry" in my tweets on Visibrain to check whether that conversation is really happening.
Verification becomes much easier when the AI’s responses are precise, and since you can do this community by community, you can build confidence in the results step by step.

At Agoratlas, the final results never come straight from an LLM. But it is part of our toolkit, and it can be pretty useful when backed up by other tools pointing in the same direction.

And if you’re particularly interested in the September 10 demonstrations in France, you can find our study here.