Readers of this blog may remember my encounter with Google’s Gemini AI software last fall. I had been told that it was useful for data analysis, so I decided to try it out with some voter registration data. It was a disappointing (and somewhat disturbing) experience. I fed Gemini a data file and began asking questions about the data.
Gemini would provide an authoritative-sounding answer but, when I checked the data, it would turn out that the answer was erroneous. When I pointed out the error, Gemini would apologize profusely, admit the error, and then come up with another erroneous answer. It became clear that Gemini was “hallucinating” (as AI researchers say) and was not actually analyzing my data.
You can read the entire dispiriting dialog with Gemini, word for word, in that post.
Lately, I’ve been hearing that Anthropic’s Claude software is actually useful in the (limited) domain of generating computer code for data analysis, so I decided it was time for another test. Claude is offered in free and paid versions. Most of the features of Claude that interest me are available in the free version, so that’s what I have used. You have to create a login, but there is no fee.
I decided to test Claude with some of the same data I used in my previous test with Google Gemini. The file was a list of registered voters in Chester County, from which I had removed the identifying information (names and street addresses). The file contains data for 381,486 registered voters.
A size limitation. Claude took in the data without complaint but then refused to answer questions. It said that it could only deal with files of 31MB or less. That’s a severe constraint when you’re dealing with the kinds of voter files, census files, and energy-use files that I tend to work with. The voter registration file for Chester County is about 130MB, more than four times bigger than Claude’s limit.
I began ruthlessly cutting columns from the data. In addition to the names and addresses, I cut information about past elections people had voted in, changes in voting status, and so on. Eventually, I had a file that was smaller than 31MB but retained a handful of interesting fields (age, gender, party, town, precinct) for each of the 381,486 voters. Claude was now willing to answer my questions.
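The same kind of column trimming can be scripted rather than done by hand. What follows is only a minimal Python sketch: the column names (Age, Gender, Party, Town, Precinct, and the two dropped ones) and the two sample rows are made up for illustration and are not the actual headers or contents of the county file.

```python
import csv
import io

# Hypothetical column names; the real county file uses its own headers.
KEEP = ["Age", "Gender", "Party", "Town", "Precinct"]

def trim_columns(src, dst, keep=KEEP):
    """Copy only the columns in `keep` from src to dst; return the row count."""
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column not listed in `keep`.
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows

# Made-up sample standing in for the full voter file.
sample = io.StringIO(
    "Age,Gender,Party,Town,Precinct,LastVoted,StatusChange\n"
    "67,F,D,Kennett Square,101,2024-11-05,2019-03-01\n"
    "34,M,R,West Chester,205,2022-11-08,2015-06-12\n"
)
trimmed = io.StringIO()
row_count = trim_columns(sample, trimmed)
print(trimmed.getvalue())
```

With real files, the two in-memory buffers would be replaced by file handles opened with `newline=""`, which is what Python's csv module recommends.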
Reasonable answers. I asked a series of questions about the data and (unlike Gemini) Claude provided reasonable answers. I did some manual cross-checking and found that the answers I checked were correct. That provided some reassurance that Claude was actually processing my data and was doing so accurately.
It was also reassuring that Claude generated JavaScript code (a fairly human-readable, relatively simple programming language) for some of the work it did, and I could download and examine the code. I admit I only gave a cursory look at the JavaScript, without trying to figure out how it worked, but I was glad to have a way to check Claude’s work if it turned out to be wrong.
Graphics too. In addition to answers about percentages, ages, and the like, Claude can provide tables of data. It also produces charts and graphs. The chart below is an example.

Each dot represents a precinct (there are about 230 voting precincts in Chester County). The vertical axis is the percentage of registered Democrats in the precinct (higher=more Democrats). The horizontal axis is the percent of voters in the precinct who are 55 or older (farther to the right=more seniors). The orange circles are the 10 precincts with the most voters over 55.
Although the quality of this image isn’t great (I could have worked with Claude to improve it), you can probably see that there are only two precincts in the county with more than 50% of voters over 55 AND more than 50% registered as Democrats (upper right quadrant of the graph).
You probably already guessed that those two precincts are the ones that incorporate Kendal and Crosslands.
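The two numbers behind each dot are simple to compute. Here is a minimal Python sketch of the idea, not the code Claude actually generated: the record layout (precinct, party, age), the precinct labels, and the nine sample voters are all invented for illustration.

```python
from collections import defaultdict

# Made-up records: (precinct, party, age). The field layout is an assumption.
voters = [
    ("P1", "D", 72), ("P1", "D", 60), ("P1", "R", 41),
    ("P2", "R", 58), ("P2", "R", 30), ("P2", "D", 25),
    ("P3", "D", 33), ("P3", "D", 29), ("P3", "R", 64),
]

def precinct_shares(records, senior_age=55):
    """Return {precinct: (pct_democrat, pct_senior)} as percentages."""
    totals = defaultdict(lambda: [0, 0, 0])  # [voters, democrats, seniors]
    for precinct, party, age in records:
        t = totals[precinct]
        t[0] += 1
        t[1] += party == "D"          # True counts as 1
        t[2] += age >= senior_age     # "55 or older"
    return {p: (100 * d / n, 100 * s / n) for p, (n, d, s) in totals.items()}

shares = precinct_shares(voters)
# Upper right quadrant: majority Democrat AND majority 55 or older.
upper_right = sorted(p for p, (dem, old) in shares.items() if dem > 50 and old > 50)
print(upper_right)
```

Plotting `shares` with one dot per precinct would give a scatter chart like the one above; the `upper_right` list is the quadrant test done without a chart.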
Identifying data-entry problems. In any dataset the size and age of these voter registration files, there are bound to be data-entry problems. This became obvious when I asked Claude to calculate the age at which voters had registered by subtracting their date of birth from their registration date.
After processing for a few minutes, Claude responded with:
“Done! The file has a new “age at registration” column added at the end, calculated as years (rounded to 1 decimal) between DOB and Registration date, across all 381,486 rows. Note there are a small number of negative values in the data, which would indicate registration dates recorded before the birth date — worth a look if data quality is a concern.”
That certainly did sound like a data problem, and I assumed there might be problems beyond just registration before birth. I responded, “What are the characteristics of those with age at registration less than 17?”
Claude processed for a few more minutes, then responded with the following four paragraphs:
“Here’s what the data shows for the 1,464 registrants with an age at registration under 17:
Two distinct groups exist — this is almost certainly a data quality issue, not real underage registrants.
Group 1: Negative ages (75 records) These have registration dates recorded before their birth date — clearly data entry errors. Ages range as low as -184 years, which are almost certainly cases where birth year and registration year were swapped or otherwise garbled.
Group 2: Ages 0–16 (1,389 records) The vast majority of these (1,290 out of 1,389) were registered in 1972, which is a strong red flag — it suggests a batch data entry issue where a default or placeholder registration date was applied….”
I checked the file for underage and negative-age entries directly in Excel, and Claude was correct: these problematic records really did exist in the source data, and they were all (or at least mostly) data-entry problems. I found 1,463 impossible registrations (age at the time of registration negative or less than 17), one fewer than Claude’s count of 1,464. Of these, 1,312 were registered on 1/1/1972, so I have to assume a large batch of records was entered with that date, perhaps because they were converted from paper records (many such conversions were done in the 1970s) that did not contain a registration date. Another 58 were registered on 2/5/1998 (again, probably some kind of conversion process), and 16 had a registration date of 1/1/1800 (same idea). The rest were a handful of miscellaneous problems, such as the registration date being the same as the birth date.
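The check itself (compute age at registration, flag anything under 17, then tally the flagged records by registration date) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the column names `DOB` and `RegDate` are hypothetical, the six rows are made up, and dividing by 365.25 days is my own choice of how to turn days into years, not necessarily what Claude did.

```python
import csv
import io
from collections import Counter
from datetime import date

def parse_mdy(text):
    """Parse an M/D/YYYY date string."""
    month, day, year = (int(part) for part in text.split("/"))
    return date(year, month, day)

def age_at_registration(dob, registered):
    """Years between two dates, rounded to 1 decimal (365.25-day years)."""
    return round((registered - dob).days / 365.25, 1)

# Made-up rows; "DOB" and "RegDate" are hypothetical column names.
sample = io.StringIO(
    "DOB,RegDate\n"
    "3/14/1950,1/1/1972\n"    # adult at registration: plausible
    "6/2/1960,1/1/1972\n"     # child at registration: suspect
    "9/9/1965,1/1/1972\n"     # child at registration: suspect
    "5/5/1985,5/5/1985\n"     # registered on the birth date: suspect
    "7/20/1984,1/1/1800\n"    # registered before birth: suspect
    "2/1/1980,10/10/2000\n"   # adult at registration: plausible
)

suspect_dates = Counter()
for row in csv.DictReader(sample):
    age = age_at_registration(parse_mdy(row["DOB"]), parse_mdy(row["RegDate"]))
    if age < 17:  # also catches negative ages
        suspect_dates[row["RegDate"]] += 1

# Registration dates that account for the most impossible records come first.
print(suspect_dates.most_common())
```

A tally like this is what makes a placeholder date stand out: one registration date accounting for most of the flagged rows points to a batch problem rather than to individual typos.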
Claude was helpful in pointing out these data problems, which I might have otherwise missed, given the size of the file. I don’t think these problems will have a significant impact on my work with this file, since these 1,463 records are only about 0.4% of the 381,486 in the data, and the problem seems to be almost entirely bad registration dates (a data item that doesn’t figure strongly in the projects I do).
I will be doing some more work with this file, testing various aspects of what Claude can do. It looks promising. Asking Claude to do a particular analysis is far quicker than doing it in Excel the “manual” way. I have no doubt I will be blogging more about my results using Claude.
Stumbling blocks. There are three areas where Claude failed to do what I had hoped: file size limits, accessing web data, and unauthorized use of system information.
- File size. I suppose it is ungrateful of me to be disappointed that Claude limits files to 31MB. After all, it’s available for free. (There are paid versions of Claude, Pro for $20/month and Max for $100/month, but they have the same file-size limit.) But that limitation will keep me from using Claude for a lot of the analysis I’d like to do that is just too time-intensive using Excel. I have voter registration files for the entire state of Pennsylvania, and there are interesting questions to be asked of that data. I was hoping Claude would be helpful for that, but I’m afraid it won’t be, at least if the 31MB limit is retained.
- Accessing data from the web. I asked Claude to determine the percentage of registered voters for each zip code in the county. Claude responded “this dataset only contains registered voters — it doesn’t include the total population of each zip code, so I can’t calculate what percentage of residents are registered.” I then asked if Claude could get the population data from the census. Claude did a search and responded, “Yes! The Census Bureau has a free API with ACS 5-year estimates available by ZIP code. Let me first find the zip codes in this dataset, then pull the age population data for each one.” But then it said, “Unfortunately the network in this environment is restricted and can’t reach external APIs like the Census Bureau.” Claude then came up with a workaround that allowed me to download the relevant data to my browser, and Claude used it from there. Why was that workaround necessary?
- Unauthorized use of system data. Claude often had useful comments on the results of my data requests, but it had one comment I was put off by. It said, “Kennett Square (your area!) has two precincts in the top 5….” I asked, “Why did you say Kennett Square is ‘your area’?” and Claude responded, “I noticed from the system information that your location is set to Kennett Square, Pennsylvania! That’s how I knew. It’s just context that gets shared with me to help with location-relevant questions.” This should not happen without my permission, and my permission was never obtained. I consider this a serious bug, and one that Anthropic should rectify immediately.
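Once the population figures are in hand, the zip-code calculation from the second item above is a matter of dividing registered-voter counts by population. Here is a minimal Python sketch with entirely made-up inputs: the zip labels "A", "B", and "C" and the population numbers are placeholders, not census data, and using adult population as the denominator is my assumption.

```python
from collections import Counter

# Made-up inputs. `voter_zips` stands in for the zip column of the voter file;
# `adult_population` stands in for census counts downloaded separately.
voter_zips = ["A", "A", "A", "B", "B", "C"]
adult_population = {"A": 4, "B": 8}  # no census figure for "C"

def registration_rates(zips, population):
    """Return ({zip: percent registered}, [zips lacking population data])."""
    registered = Counter(zips)
    rates, missing = {}, []
    for z, count in sorted(registered.items()):
        if population.get(z):  # skips zips that are absent or have population 0
            rates[z] = round(100 * count / population[z], 1)
        else:
            missing.append(z)
    return rates, missing

rates, missing = registration_rates(voter_zips, adult_population)
print(rates, missing)
```

Reporting the zip codes with no population match separately, rather than dropping them silently, makes it easier to spot gaps between the two data sources.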
In future blog posts, I’ll report on my use of Claude, including workarounds and fixes for these problems.
