Read all files in a directory and print the files that contain matches for certain regular expressions

Question

I am trying to read all the files in my directory and print the ones that contain matches for my regular expressions, along with what each regular expression matched in every file.

import glob
import re
import PyPDF2
#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/e136320/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)

#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')

match_list=[]

for file in folder_contents:

    if re.search(r".*(?=pdf$)",file):
        #this is pdf
        with open(file, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 
            pageObj = pdfReader.getPage(0)  
            content = pageObj.extractText()
            read_file = open(file,'rb')
            #print("{}".format(file))

    elif re.search(r".*(?=csv$)",file):
        #this is csv
        with open(file,"r+",encoding="utf-8") as csv:
            read_file = csv.read()
            #print("{}".format(file))
    elif re.search(r"/jupyter",file):
        print("wow")
    elif re.search(r"/scikit",file):
        print("wow")
    else:
        read_file = open(file, 'rb').read()
       #print("{}".format(file))
        continue
    if regex1.findall(read_file) or regex2.findall(read_file):
                print(read_file)

I managed to write the code above, but it raises the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-f614d35e0441> in <module>()
     38        #print("{}".format(file))
     39         continue
---> 40     if regex1.findall(read_file) or regex2.findall(read_file):
     41                 print(read_file)

TypeError: expected string or bytes-like object

Is there a way to make this work without the error?

python regex glob pypdf2 os.path

user10537095, Dec 3 '18 at 19:24

3 Answers

Answer 1 · user8961316 · 2018-12-03 19:31

read() only works here on a file opened with plain open(filename), that is, in text mode. Just replace that line and your problem is solved.

read_file = open(file).read()
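
To see why the type of read_file matters, here is a small sketch. The helper try_findall and the sample strings are made up for illustration and are not part of the question's code. It passes a str, a bytes object and an open file-like object to the same str pattern and records what happens in each case.

```python
import io
import re

email_re = re.compile(r'\S+@\S+')  # a str pattern, as in the question


def try_findall(value):
    """Return the matches, or the TypeError message if findall rejects the value."""
    try:
        return email_re.findall(value)
    except TypeError as exc:
        return str(exc)


# str, which is what open(file).read() returns: accepted
text_result = try_findall('mail bob@example.com')

# bytes, which is what open(file, 'rb').read() returns: rejected by a str pattern
bytes_result = try_findall(b'mail bob@example.com')

# a file-like object that was never read: rejected as well
handle_result = try_findall(io.BytesIO(b'mail bob@example.com'))

print(text_result)
print(bytes_result)
print(handle_result)
```

Only the first call returns matches. The last case is the one that produces the "expected string or bytes-like object" message from the traceback.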


Answer 2 · user9380694 · 2018-12-04 06:24

First, my apologies to the other people who answered this question, because I am going to refer to the OP's previous question.

To the OP: you should not copy code without thinking about it.

content is the page text you have already extracted, which means your code should be read_file = content. The reason I wrote read_file = ... in the first place is that I assumed you would add more code there. But it should not open the same file again.

with open(file, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    content = pageObj.extractText()
    read_file = open(file,'rb')
    #^---^---^ according to your former question, `read_file` should be `content`

There will be other problems too. You should add continue after each print("wow"),

elif re.search(r"/jupyter",file):
    print("wow")
elif re.search(r"/scikit",file):
    print("wow")

otherwise your code carries on to the check below and fails there, because nothing was read in those branches.

if regex1.findall(read_file) or regex2.findall(read_file):
    print(read_file)
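
Putting these points together, a corrected loop might look like the sketch below. It is not the OP's code: the names EMAIL_RE, PHONE_RE, read_text and scan_folder are made up for illustration, and skipping anything that is not a regular file is an assumption about what the jupyter and scikit entries are. The PDF branch keeps the legacy PyPDF2 calls from the question. Newer PyPDF2 releases rename them to PdfReader, pages and extract_text.

```python
import glob
import os
import re

EMAIL_RE = re.compile(r'\S+@\S+')
PHONE_RE = re.compile(r'\d{3}-\d{3}-\d{4}')


def read_text(path):
    """Return the text of one file as str, or None if it should be skipped."""
    if not os.path.isfile(path):
        return None  # assumption: entries such as jupyter/ or scikit/ are directories
    if path.endswith('.pdf'):
        # Imported lazily so the rest of the sketch runs without PyPDF2 installed.
        import PyPDF2
        with open(path, 'rb') as pdf_file:
            reader = PyPDF2.PdfFileReader(pdf_file)
            # Legacy PyPDF2 names, as in the question; first page only.
            return reader.getPage(0).extractText()
    # Everything else is read in text mode so the str patterns can be applied.
    with open(path, 'r', encoding='utf-8', errors='replace') as text_file:
        return text_file.read()


def scan_folder(folder_path):
    """Map each matching file path to the list of strings that matched."""
    results = {}
    for path in sorted(glob.glob(os.path.join(folder_path, '*'))):
        text = read_text(path)
        if text is None:
            continue  # nothing was read, so there is nothing to search
        matches = EMAIL_RE.findall(text) + PHONE_RE.findall(text)
        if matches:
            results[path] = matches
    return results
```

Calling scan_folder("/home/e136320/") returns a dict, so printing each file name with what matched in it is a loop over results.items().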

Answer 3 · user6109920 · 2018-12-03 19:30

Replace the file-reading code with the following:

with open(file, mode='rb') as f:
    read_file = f.read()  # bytes: decode it before matching against the str patterns
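
One caveat with reading in 'rb' mode is that read() returns bytes, while the patterns in the question are str patterns. A minimal sketch of two ways to handle that follows. The sample bytes and the variable names are made up for illustration, and UTF-8 is an assumed encoding.

```python
import re

# Stands in for the result of f.read() on a file opened with mode='rb'.
raw = b'call 555-123-4567 or mail bob@example.com'

# Option 1: decode the bytes, then reuse the str pattern from the question.
phone_re = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
decoded_matches = phone_re.findall(raw.decode('utf-8', errors='replace'))

# Option 2: compile a bytes pattern and search the raw bytes directly.
phone_bytes_re = re.compile(rb'\d\d\d[-]\d\d\d[-]\d\d\d\d')
bytes_matches = phone_bytes_re.findall(raw)

print(decoded_matches)
print(bytes_matches)
```

Option 1 yields str matches and option 2 yields bytes matches, so which one fits depends on what the results are used for afterwards.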
