On one of my recent web scraping projects, I got a follow-up request from a client to add some additional data to my results. The scraped data was dumped into individual folders in PDF format, often with multiple documents per folder. The client wanted to know if I could provide them with a count of the number of PDF files in each folder – and there were thousands of folders. It wasn’t going to happen manually. And I hate VBA, so I was hoping to find a quick and dirty formula-based solution.
A quick Duck Duck Go search gave me the answer I needed – using the now-deprecated Excel “Files” formula, I was able to pull it off.
Here’s a link to a really, really helpful rundown on setting up the Files formula in the latest versions of Excel.
I had to make a few modifications to suit my unique case.
The =COUNTA(listFiles) formula worked, but only for one directory. I needed to go through thousands of directories. So I modified the listFiles formula range to include more than one row. When defining the formula name, I entered the range as:
=FILES(Sheet1!$A$2:$A$1188)
Here’s what that looks like in Excel:
My directory names were based on a value in each row, so it was easy to set up a formula that would build the directory name and swap in the correct value in each row. For example:
="C:\Users\doug\Documents\Main Folder\"&A2&"\*"
Where cell A2 contains the folder name. Drag that down and you’ve got a column of directories.
One final modification – in this case, I had a mixture of PDF files and other file types in each directory, but I only wanted to count PDF files. By modifying the directory name, my formulas would only count files ending in PDF:
="C:\Users\doug\Documents\Main Folder\"&A2&"\*pdf"
The * is a wildcard, so any characters can appear in the file name before the PDF extension.
Then all you have to do is put the =COUNTA(listFiles) formula in another column and boom, you’ve got a count of the number of PDF files in each directory.
The only drawback to this seems to be that when there are *no* PDF files in a directory, a value of “1” is still returned in the count formula. I’m not sure why that is, and frankly it didn’t matter a whole lot to us so I didn’t spend any time resolving that little oddity. But it’s good to know.
Leave a Reply