Python: Searching through more than one file.

+2 votes
62 views

I need to search through a directory of text files for a string. Here is a short program I made in the past to search through a single text file for a line of text.

How can I modify the code to search through a directory of files that have different filenames, but the same extension?

fname = raw_input("Enter file name: ") #"*.txt"
fh = open(fname)
lst = list()
biglst=[]
for line in fh:
    line=line.rstrip()
    line=line.split()
    biglst+=line
final=[]
for out in biglst:
    if out not in final:
        final.append(out)
final.sort()
print (final)
posted Dec 28, 2014 by anonymous


1 Answer

+1 vote
 
Best answer

There are multiple ways to do this.

1) Using glob module.

import glob
import os
result_file_list=[]
search_string=raw_input("Enter the search string: ")
file_extension=raw_input("Enter file extension: ")
os.chdir("/mydir")  #Use absolute pathname of the directory in place of mydir
for file in glob.glob("*." + file_extension):
    with open(file,'r') as search_file:
        lines=search_file.readlines()
    for line in lines:
        line=line.strip()
        if search_string in line:
            result_file_list.append(file)
            break  #One match is enough; otherwise the file is listed once per matching line

print "Files containing the string being searched are: "
for file in result_file_list:
    print file
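For anyone on Python 3, here is a minimal sketch of the same glob idea wrapped in a function. The function name `files_containing` and the UTF-8 encoding are my assumptions, not part of the answer above. It builds the pattern with `os.path.join` instead of calling `os.chdir`, so the working directory is left alone.

```python
import glob
import os


def files_containing(directory, extension, search_string):
    """Return sorted paths of files in `directory` ending in .extension
    that contain `search_string` (hypothetical helper, Python 3)."""
    pattern = os.path.join(directory, "*." + extension)
    matches = []
    for path in glob.glob(pattern):
        # errors="replace" keeps the scan going if a file is not valid UTF-8
        with open(path, encoding="utf-8", errors="replace") as handle:
            # any() stops reading at the first matching line
            if any(search_string in line for line in handle):
                matches.append(path)
    return sorted(matches)
```

A call such as `files_containing("/mydir", "txt", "needle")` would return the full paths, which can be opened directly later.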

2) Using subprocess module. This is similar to using a shell script.

import subprocess
file_extension=raw_input("Enter file extension: ")
#The grep pattern is quoted and anchored so only names ending in ".<extension>" match
proc=subprocess.Popen("ls -LR " + "/MYDIRECTORYFULLPATH" + " | grep '\\." +
                      file_extension + "$'",stdout=subprocess.PIPE,shell=True)
stdout,stderr=proc.communicate()
stdout=stdout.strip().split("\n")

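Now stdout holds the names of all files with the required extension. Note that ls -R prints bare file names grouped under each directory's heading, so files in subdirectories need their directory prepended before they can be opened. Iterate through the list of files and search each one as done above, or do whatever your requirement is.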
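If you would rather not shell out, a rough pure-Python equivalent of the recursive listing is sketched below (Python 3; `find_files` is a name I made up for illustration). It uses `os.walk` and `fnmatch.filter` from the standard library and yields full paths. One difference to be aware of: `os.walk` does not follow symbolic links unless you pass `followlinks=True`, whereas `ls -L` does.

```python
import fnmatch
import os


def find_files(root, extension):
    """Yield full paths of files under `root` whose names end in .extension
    (hypothetical helper; walks subdirectories recursively)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, "*." + extension):
            yield os.path.join(dirpath, name)
```

Each yielded path can then be opened and searched line by line, the same way as in the glob version.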

answer Dec 29, 2014 by Prakash
Similar Questions
+1 vote

I want to do a Boolean search over various sentences or documents. I do not want to use special programs like Whoosh, etc.

May I use any other parser? Could anybody kindly let me know?
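One way to get basic Boolean search with nothing but the standard library is a small inverted index plus set operations. This is only a sketch in Python 3: the sample documents, the `build_index` name, and the simple `\w+` tokenizer are all assumptions made for illustration, and it does not parse query strings such as "quick AND dog".

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


# Made-up sample documents, identified by their position in the list
docs = [
    "the quick brown fox",
    "the lazy dog",
    "a quick dog",
]
index = build_index(docs)

quick_and_dog = index["quick"] & index["dog"]   # AND     -> {2}
fox_or_lazy = index["fox"] | index["lazy"]      # OR      -> {0, 1}
the_not_quick = index["the"] - index["quick"]   # AND NOT -> {1}
```

Set intersection, union and difference map directly onto AND, OR and AND NOT, so more complex queries can be built by combining them.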

+1 vote

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.

Can someone give me a suggestion as to how to minimize the search time?
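As a starting point, the sketch below (Python 3) reads each file exactly once and checks every query against its contents, so the disk is touched 52000 times rather than 500 x 52000 times. It assumes the queries are plain substrings and that each file fits in memory; `search_files` is a hypothetical name. Whether this is fast enough for your data is something you would need to measure.

```python
def search_files(paths, queries):
    """Return {query: [paths containing it]}, reading each file only once."""
    hits = {query: [] for query in queries}
    for path in paths:
        # errors="replace" keeps the scan going if a file is not valid UTF-8
        with open(path, encoding="utf-8", errors="replace") as handle:
            text = handle.read()
        for query in queries:
            if query in text:  # plain substring test, no regex involved
                hits[query].append(path)
    return hits
```

If the queries are really patterns rather than literal strings, the inner test would have to change, for example to precompiled regular expressions.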

0 votes

I've got a 170 MB file I want to search for lines that look like:

INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count

If I add a pre-filter before the regex, it runs in 0.78 seconds (about twice the speed!)

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count

Every line which contains 'ENQ' also matches the full regex (61425 lines match, out of 2.1 million total). I don't understand why the first way is so much slower.

Once the regex is compiled, you should have a state machine pattern matcher. It should be O(n) in the length of the input to figure out that it doesn't match as far as "ENQ". And that's exactly how long it should take for "if 'ENQ' not in line" to run as well. Why is doing twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
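To reproduce the comparison without the original log, here is a small timing sketch in Python 3 using `timeit`. The list of lines is synthetic (a made-up stand-in for error.log with one matching line in every 35), so the absolute numbers will differ from the ones quoted above. It only measures the two variants; it does not by itself explain the gap.

```python
import re
import timeit

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

# Synthetic stand-in for error.log: one matching line in every 35.
lines = [
    'INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one\n'
    if i % 35 == 0 else
    'INFO (6): songza.amie.history - some other message %d\n' % i
    for i in range(70000)
]


def regex_only():
    """Count matches by running the regex on every line."""
    return sum(1 for line in lines if pattern.search(line))


def prefilter_then_regex():
    """Count matches, skipping the regex when 'ENQ' is not in the line."""
    return sum(1 for line in lines if 'ENQ' in line and pattern.search(line))


print('regex only:    ', timeit.timeit(regex_only, number=3))
print('with prefilter:', timeit.timeit(prefilter_then_regex, number=3))
```

Both functions return the same count, so any difference in the printed times comes from how the non-matching lines are rejected.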

+2 votes

I'm trying to search for several strings, which I have in a .txt file line by line, on another file. So the idea is, take input.txt and search for each line in that file in another file, let's call it rules.txt.

So far, I've been able to do this, to search for individual strings:

import re
shakes = open("output.csv", "r")

for line in shakes:
    if re.match("STRING", line):
        print line,

How can I change this to input the strings to be searched from another file?
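One possible shape for this, sketched in Python 3: load the search strings from the first file into a list, then scan the second file once and keep every line that contains any of them. The helper names are hypothetical, and note one swapped-in behaviour: this uses a plain substring test (`in`), whereas `re.match` in the snippet above only matches at the start of a line.

```python
def load_patterns(path):
    """Read one search string per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def matching_lines(patterns, target_path):
    """Yield lines of `target_path` that contain any of the search strings."""
    with open(target_path, encoding="utf-8") as handle:
        for line in handle:
            if any(pattern in line for pattern in patterns):
                yield line.rstrip("\n")
```

With the file names from the question, usage would look like `for line in matching_lines(load_patterns("input.txt"), "rules.txt"): print(line)`.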

0 votes

I'm somewhat confused working with @staticmethods. My logger and configuration methods are called n times, but I have only one call.
n is the number of classes which import the logger and configuration class in the subfolder mymodule. What might be my mistake?

### __init__.py ###

from mymodule.MyLogger import MyLogger
from mymodule.MyConfig import MyConfig

##### my_test.py ##########
import time
from mymodule import MyConfig,MyLogger

#Both methods are static
key,logfile,loglevel = MyConfig().get_config('Logging')
log = MyLogger.set_logger(key,logfile,loglevel)
log.critical(time.time())

#Output
2013-05-21 17:20:37,192 - my_test - 17 - CRITICAL - 1369149637.19
2013-05-21 17:20:37,192 - my_test - 17 - CRITICAL - 1369149637.19
2013-05-21 17:20:37,192 - my_test - 17 - CRITICAL - 1369149637.19
2013-05-21 17:20:37,192 - my_test - 17 - CRITICAL - 1369149637.19
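The code for `set_logger` is not shown, so the following is only a guess at one common cause: if a function calls `addHandler` on the same named logger every time it runs, each call attaches another handler and every message is printed once per handler. A minimal Python 3 sketch of a guarded version (a hypothetical stand-in, not the poster's actual `MyLogger`) looks like this:

```python
import logging


def set_logger(name, level=logging.INFO):
    """Return the logger for `name`, attaching a handler only the first time."""
    logger = logging.getLogger(name)  # same name -> same logger object
    logger.setLevel(level)
    if not logger.handlers:  # guard: skip if a handler is already attached
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger
```

Calling it repeatedly with the same name returns the same logger with a single handler, so each message would be emitted once.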