How to quickly search over a large number of files using python?

+1 vote
580 views

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? The straightforward way seems to be to loop through each of the files line by line, comparing every line against each of the 500 queries, but this seems like it would take too long (a sketch of this baseline follows).
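
In outline, the brute-force baseline looks like this (a sketch only; the query list and file pattern are hypothetical stand-ins):

import glob

queries = ["first phrase", "second phrase"]   # stand-ins for the ~500 queries

matches = {q: [] for q in queries}
for path in glob.glob('data/*.txt'):          # stand-in for the ~52000 files
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            for q in queries:
                if q in line:
                    matches[q].append((path, lineno))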

Can someone give me a suggestion as to how to minimize the search time?

posted Sep 25, 2013 by Garima Jain

2 Answers

+1 vote

Are these files text or binary? Are they in an 8-bit character set or Unicode?

Without more information about what these "queries" are, it's not even possible to say whether the approach you describe could work at all.

Please specify the nature of these queries, and whether all the queries are of the same form. For example, it may be that each of the queries is a simple search string, not containing newline or wildcard.

Or it may be that the queries are arbitrary regular expressions, with some of them potentially matching a multi-line block of text.

Have you implemented the brute-force approach you describe, and is it indeed too slow? By what factor? Does it take 1000 times as long as desired, or 5 times? What if you run just one query over those 52000 files: is it still too slow, and by what factor?

Assuming each of the queries is independent, and that none of them needs more than one line to process, it might be possible to combine some or all of those queries into a simpler filter or filters. One could then speed up the process by applying the filter to each line and, only if it triggers, checking the line against the individual queries. A minimal sketch follows.
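
For instance, if the queries are plain substrings (an assumption; the question never says), a single combined regular expression can serve as the coarse filter:

import re

# Hypothetical queries: plain substrings, each matched within a single line.
queries = ["ENQUEUEING", "timeout", "disk full"]

# One alternation acts as a cheap filter for all queries at once.
combined = re.compile("|".join(re.escape(q) for q in queries))

def matching_queries(line):
    # Cheap filter rejects most lines; only survivors get the per-query check.
    if not combined.search(line):
        return []
    return [q for q in queries if q in line]

Whether this wins depends on how selective the filter is; if most lines pass it, the per-query check still dominates and the filter only adds overhead.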

You also don't indicate whether this is a one-time query, or whether the same files might later need to be searched for a different set of queries, or whether the queries might need to be applied to a different set of files. Or whether the same search may need to be repeated on a very similar set of files, or ...

Even under the most favorable set of constraints, some queries may produce filters that are either harder to apply than the original queries, or that produce so many candidates that this process takes longer than just applying the queries brute-force.

Many times, optimization efforts focus on the wrong problem, or ignore the relative costs of programmer time and machine time. Other times, the problem being optimized is simply intractable with current technology.

answer Sep 25, 2013 by Seema Siddique
+1 vote

Before anybody can even begin to answer this question, we need to know what you mean by "search query". Are you talking about pattern matching, keyword matching, whether fuzzy hits are OK, etc.? Give us a couple of examples of the kind of searches you'd like to execute.

Also, is this a one-off thing, or are you planning to do many searches over the same collection of files? If the latter, you will want to do some sort of pre-processing or indexing to speed up search execution (a toy illustration follows). It's extremely unlikely you want to reinvent the wheel here. There are tons of search packages out there that do this sort of thing. Just a few to check out: Apache Lucene, Apache Solr, and Xapian.
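
To show the indexing idea in miniature (a sketch only; the corpus path and keyword are hypothetical, and real engines like Lucene do far more):

import glob
from collections import defaultdict

# Build the index once: word -> set of files containing it.
index = defaultdict(set)
for path in glob.glob('docs/*.txt'):
    with open(path) as fh:
        for line in fh:
            for word in line.split():
                index[word].add(path)

# Each keyword lookup is now a dictionary access, not a rescan of every file.
print(sorted(index.get('example', set())))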

answer Sep 25, 2013 by Mandeep Sehgal
Similar Questions
+1 vote

I want to do Boolean search over various sentences or documents. I do not want to use special-purpose packages like Whoosh, etc.

May I use some other parser? Any pointers would be appreciated.

+2 votes

I need to search through a directory of text files for a string. Here is a short program I made in the past to search through a single text file for a line of text.

How can I modify the code to search through a directory of files that have different filenames but the same extension? (One possible modification is sketched after the code.)

fname = raw_input("Enter file name: ")  # "*.txt"
fh = open(fname)
biglst = []
for line in fh:
    line = line.rstrip()
    biglst += line.split()      # collect every whitespace-separated word
final = []
for out in biglst:
    if out not in final:        # keep each word only once
        final.append(out)
final.sort()
print(final)
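
One possible modification (a sketch; it assumes the .txt files sit in the current directory) replaces the single raw_input name with a glob pattern:

import glob

final = []
for fname in glob.glob('*.txt'):          # every .txt file in the directory
    with open(fname) as fh:
        for line in fh:
            for word in line.rstrip().split():
                if word not in final:     # same de-duplication as before
                    final.append(word)
final.sort()
print(final)
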
0 votes

I've got a 170 MB file I want to search for lines that look like:

INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count

If I add a pre-filter before the regex, it runs in 0.78 seconds (about twice the speed!):

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count

Every line which contains 'ENQ' also matches the full regex (61425 lines match, out of 2.1 million total). I don't understand why the first way is so much slower.

Once the regex is compiled, you should have a state machine pattern matcher. It should be O(n) in the length of the input to figure out that it doesn't match as far as "ENQ". And that's exactly how long it should take for "if 'ENQ' not in line" to run as well. Why is doing twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
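
For anyone trying to reproduce the measurement, a minimal harness (a sketch; it reads the log into memory first so both passes time only the matching) might look like:

import re
import timeit

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
lines = open('error.log').readlines()    # the 170 MB log from the question

def regex_only():
    return sum(1 for line in lines if pattern.search(line))

def prefilter():
    return sum(1 for line in lines if 'ENQ' in line and pattern.search(line))

print(timeit.timeit(regex_only, number=1))   # full-regex pass
print(timeit.timeit(prefilter, number=1))    # substring pre-filter pass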

+2 votes

I am trying to extract the number from an HTML page, from a sentence like the one below:

(253 items)

I used this piece of code, but it does not work:

limit = page.search("div[class=Results]").search("div").gsub("items", "")

begin
  Integer(limit)
rescue
  return 0
end

Would you give me any suggestion on this?

+4 votes

I want to make an animated GIF from 3200+ PNGs. I searched and found http://code.google.com/p/visvis/source/browse/#hg/vvmovie and I wrote:

import glob
from PIL import Image
from visvis.vvmovie.images2gif import writeGif  # import path assumed from the link above

allPic = glob.glob('*.png')
allPic.sort()
allPic = [Image.open(i) for i in allPic]
writeGif('lala3.gif', allPic, duration=0.5, dither=0)

However, I got:

  allPic=[Image.open(i) for i in allPic]
  File "e:\prgpy\python-2.7.3\lib\site-packages\PIL\Image.py", line 1952, in open
    fp = __builtin__.open(fp, "rb")
IOError: [Errno 24] Too many open files: 'out0572.png'

Is there another Python library I could use? (A possible workaround with PIL itself is sketched below.)
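
One possible workaround within PIL (a sketch, untested against that exact setup): Image.open is lazy and keeps the file handle around, so force the pixel data to load while you control the handle, then close it before opening the next file. writeGif here is the same visvis function as in the question, with its import path assumed.

import glob
from PIL import Image
from visvis.vvmovie.images2gif import writeGif   # import path assumed

frames = []
for name in sorted(glob.glob('*.png')):
    with open(name, 'rb') as f:
        im = Image.open(f)
        im.load()                # read the pixel data while the file is open
    frames.append(im)            # the file handle is closed at this point

writeGif('lala3.gif', frames, duration=0.5, dither=0)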

...